strange problem while crawling

cha Wed, 09 May 2007 08:43:23 -0700

Hi,

I am facing a very strange problem while crawling. I am crawling only URLs
ending in .htm and .html, so I have set the following filters in my
regex-urlfilter.txt:


+http://([a-z0-9]*\.)*example.com/.*\.htm$
+http://([a-z0-9]*\.)*example.com/.*\.html$

While crawling, if I append $ (as in htm$), I get only 240 URLs, while if I
don't use $ at the end I get 850 URLs. I am interested in the result of
the latter case (it gives me all the required URLs). The only problem that
arises there is that some URLs are duplicated: not entirely, a ( ? ) is the
only difference between them (only 20-30 URLs). The remaining ones are
strictly URLs ending in .htm and .html.
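
To illustrate the difference described above, here is a minimal sketch in
plain Java. It assumes the filter accepts a URL whenever a "+" rule is found
anywhere in it (a find-style match); that is an assumption about how the
rules are applied, not something confirmed here. The class name, the
accepts() helper and the sample URLs are made up for illustration only.

```java
import java.util.List;
import java.util.regex.Pattern;

class UrlFilterDemo {
    // The two rules from the post, with the trailing $ anchor.
    static final List<Pattern> ANCHORED = List.of(
        Pattern.compile("http://([a-z0-9]*\\.)*example.com/.*\\.htm$"),
        Pattern.compile("http://([a-z0-9]*\\.)*example.com/.*\\.html$"));

    // The same two rules without the trailing $ anchor.
    static final List<Pattern> UNANCHORED = List.of(
        Pattern.compile("http://([a-z0-9]*\\.)*example.com/.*\\.htm"),
        Pattern.compile("http://([a-z0-9]*\\.)*example.com/.*\\.html"));

    // Assumption: a URL is accepted if any rule is found somewhere in it.
    static boolean accepts(List<Pattern> rules, String url) {
        for (Pattern rule : rules) {
            if (rule.matcher(url).find()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical sample URLs, including two that differ only by a "?" part.
        List<String> urls = List.of(
            "http://www.example.com/a.htm",
            "http://www.example.com/a.html",
            "http://www.example.com/a.htm?x=1",
            "http://www.example.com/a.html?");
        for (String url : urls) {
            System.out.println(url
                + " anchored=" + accepts(ANCHORED, url)
                + " unanchored=" + accepts(UNANCHORED, url));
        }
    }
}
```

Under that assumption, the rules without $ also accept URLs that carry a
query string after .htm or .html, which may be where the "?" duplicates come
from. The rules with $ reject those URLs, so links that are reachable only
through such pages might never be followed.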

Secondly, using the same configuration file gives different results in Eclipse
and in my custom Java program.

Can anybody tell me what's wrong here?

Awaiting some valuable suggestions/answers.

Cheers,
Cha


-- 
View this message in context: 
http://www.nabble.com/strange-problem-while-crawling-tf3716520.html#a10396878
Sent from the Nutch - User mailing list archive at Nabble.com.
