[Nutch-general] strange problem while crawling

cha Wed, 09 May 2007 08:43:33 -0700

Hi,

I am facing a very strange problem while crawling. I am crawling only URLs
ending in .htm and .html, so I have set the following filters in my
regex-urlfilter.txt.


+http://([a-z0-9]*\.)*example.com/.*\.htm$
+http://([a-z0-9]*\.)*example.com/.*\.html$

While crawling, if I append $ (as in htm$) I get only 240 URLs, whereas if I
do not use $ at the end I get 850 URLs. I am interested in the result of
the latter case (it gives me all the required URLs). The only problem there
is that some URLs are duplicated: not entirely, since ( ? ) is the only
difference between them. This affects only 20-30 URLs; the remaining ones
are strictly URLs ending in .htm and .html.
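
One possible way to see the difference between the two cases is the minimal
sketch below. It assumes the filter accepts a URL when the pattern is found
anywhere inside it (a find-style match, not a full-string match); that
behaviour is an assumption here and has not been checked against the Nutch
source. The class name UrlFilterSketch, the helper accepts() and the sample
URLs are hypothetical and only serve the illustration.

```java
import java.util.regex.Pattern;

public class UrlFilterSketch {

    // Assumption: a rule accepts a URL if the regex is found anywhere in it.
    static boolean accepts(String regex, String url) {
        return Pattern.compile(regex).matcher(url).find();
    }

    public static void main(String[] args) {
        // The .htm rule from regex-urlfilter.txt, with and without the trailing $.
        String anchored = "http://([a-z0-9]*\\.)*example.com/.*\\.htm$";
        String unanchored = "http://([a-z0-9]*\\.)*example.com/.*\\.htm";

        // Hypothetical URLs: plain pages and the same pages with a query string.
        String[] urls = {
            "http://www.example.com/a/page.htm",
            "http://www.example.com/a/page.html",
            "http://www.example.com/a/page.htm?id=1",
            "http://www.example.com/a/page.html?id=1"
        };

        for (String url : urls) {
            System.out.println(url
                + "  with $: " + accepts(anchored, url)
                + "  without $: " + accepts(unanchored, url));
        }
    }
}
```

Under that assumption, the rule without $ accepts anything that merely
contains ".htm" after the host, so page.htm?id=1 and page.html?id=1 pass
as well, which would be consistent with the near-duplicate URLs that differ
only by a ?. With $, only URLs that really end in .htm pass this rule.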

Secondly, using the same configuration file gives different results in
Eclipse and in my custom Java program.

Can anybody tell me what is wrong here?

Awaiting some valuable suggestions/answers.

Cheers,
Cha



