Hi,

I am facing a strange problem while crawling. I want to crawl only URLs
ending in .htm and .html, so I have set the following filters in my
regex-urlfilter.txt:

+http://([a-z0-9]*\.)*example.com/.*\.htm$
+http://([a-z0-9]*\.)*example.com/.*\.html$

If I keep the $ at the end (htm$), the crawl gives me only 240 URLs, while
if I drop the $ it gives me 850 URLs. I am interested in the result of the
latter case, since it gives me all the required URLs. The only problem there
is that some URLs are duplicated: about 20-30 of them are not exact
duplicates but differ only by a ( ? ). The remaining ones are strictly URLs
ending in .htm and .html.
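
To illustrate what I think is happening, here is a minimal sketch in plain
java.util.regex. It assumes that a rule accepts a URL when the pattern is
found anywhere in it; I have not verified that this is how the Nutch regex
filter evaluates rules. The class name, the accepts() helper and the sample
URLs are made up for the example.

```java
import java.util.regex.Pattern;

public class UrlFilterDemo {

    // Assumption: a rule accepts a URL if the pattern is found anywhere in it
    // (find() semantics), not only if it matches the whole URL.
    static boolean accepts(String regex, String url) {
        return Pattern.compile(regex).matcher(url).find();
    }

    public static void main(String[] args) {
        // The .htm rule from regex-urlfilter.txt, with and without the trailing $.
        String anchored = "http://([a-z0-9]*\\.)*example.com/.*\\.htm$";
        String unanchored = "http://([a-z0-9]*\\.)*example.com/.*\\.htm";

        // Hypothetical sample URLs, not taken from the real crawl.
        String[] urls = {
            "http://www.example.com/a.htm",
            "http://www.example.com/a.html",
            "http://www.example.com/a.htm?x=1"
        };

        for (String url : urls) {
            System.out.println(url
                + " anchored=" + accepts(anchored, url)
                + " unanchored=" + accepts(unanchored, url));
        }
    }
}
```

Under that assumption, the rule without $ also accepts URLs that carry a
query string after .htm, which would explain the extra entries that differ
only by a ( ? ).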

Secondly, the same configuration file gives different results when run from
Eclipse and from my custom Java program.

Can anybody tell me what is wrong here?

Awaiting your suggestions and answers.

Cheers,
Cha


-- 
View this message in context: 
http://www.nabble.com/strange-problem-while-crawling-tf3716520.html#a10396878
Sent from the Nutch - User mailing list archive at Nabble.com.
