Hi,

I am facing a very strange problem while crawling. I am crawling only URLs ending in .htm and .html, so I have set the following filters in my regex-urlfilter.txt:
+http://([a-z0-9]*\.)*example.com/.*\.htm$
+http://([a-z0-9]*\.)*example.com/.*\.html$

When I crawl with the trailing $ (as in htm$), I get only 240 URLs; if I leave the $ off, I get 850 URLs. I am interested in the result of the latter case, since it gives me all the required URLs. The only problem is that some URLs are duplicated. They are not exact duplicates: the only difference between them is a ? (this affects only 20-30 URLs). The remaining URLs strictly end in .htm or .html.

Secondly, the same configuration file gives different results in Eclipse and in my custom Java program.

Can anybody tell me what is wrong here? Awaiting your suggestions and answers.

Cheers,
Cha

--
View this message in context: http://www.nabble.com/strange-problem-while-crawling-tf3716520.html#a10396878
Sent from the Nutch - User mailing list archive at Nabble.com.
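
P.S. To make the difference between the two filter variants concrete, here is a minimal sketch in plain Java. It is not Nutch code: the class name UrlFilterSketch and the sample URLs are made up for illustration, and it assumes the filter accepts a URL whenever the pattern is found anywhere in it (java.util.regex find-style matching). Under that assumption, the variant without $ also lets through URLs that continue after .htm, such as those with a trailing ?, which may be where the near-duplicates come from.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-alone sketch; not part of Nutch.
final class UrlFilterSketch {
    // The two accept rules from the post, with the trailing $.
    static final List<Pattern> ANCHORED = List.of(
        Pattern.compile("http://([a-z0-9]*\\.)*example.com/.*\\.htm$"),
        Pattern.compile("http://([a-z0-9]*\\.)*example.com/.*\\.html$"));

    // The same two rules with the trailing $ removed.
    static final List<Pattern> UNANCHORED = List.of(
        Pattern.compile("http://([a-z0-9]*\\.)*example.com/.*\\.htm"),
        Pattern.compile("http://([a-z0-9]*\\.)*example.com/.*\\.html"));

    // Assumption: a URL is accepted if any rule is found somewhere in it.
    static boolean accepts(List<Pattern> rules, String url) {
        for (Pattern rule : rules) {
            if (rule.matcher(url).find()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Made-up sample URLs covering the cases described in the post.
        List<String> urls = List.of(
            "http://www.example.com/a/page.htm",
            "http://www.example.com/a/page.htm?",
            "http://www.example.com/a/page.htm?id=3",
            "http://www.example.com/a/page.html",
            "http://www.example.com/a/");
        for (String url : urls) {
            System.out.printf("%-42s with$=%b without$=%b%n",
                url, accepts(ANCHORED, url), accepts(UNANCHORED, url));
        }
    }
}
```

With these sample URLs, page.htm? and page.htm?id=3 pass only the rules without $, while the directory URL ending in / is rejected by both sets.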
