I searched through the code, and the problem is that the URL returned for the meta-refresh looks like this:

http://www.oneforever.com/tohomepage.do;jsessionid=F3C8BBAC224990A9214A1785E5001AFD

This matches the RegexURLFilter for this pattern: [EMAIL PROTECTED] (because of the = sign).

So my question is: should the URL be cleaned up inside HttpBase, where it is grabbed from the page content, or would it be better to add a URL filter that matches before the URL gets eliminated by the filter above?

Dennis

-----Original Message-----
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
Sent: Tuesday, April 04, 2006 9:56 AM
To: [email protected]
Subject: Re: Meta-Refresh Question

Dennis Kubes wrote:
> Silly question but nutch won't follow meta-refreshes will it?
>

It should have, parse-html has support for this (ParseStatus.SUCCESS_REDIRECT), and it did work in 0.7, but now I can see that one of the necessary pieces (in Fetcher) didn't make it to 0.8. Please create a JIRA issue so that it doesn't escape our attention. Thank you!

--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
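
To illustrate the first option raised in the question above (cleaning up the URL before it reaches the URL filter), here is a minimal sketch in plain Java. It is not Nutch code: the class `SessionIdStripper` and its `strip` method are hypothetical names, and where such a step would actually be hooked in (HttpBase, the parser, or elsewhere) is left open. It only shows removing a `;jsessionid=...` path parameter so that the resulting URL no longer contains the = sign identified as the trigger.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Nutch.
public class SessionIdStripper {

    // Matches ";jsessionid=<value>", where the value runs until the next
    // ';', '?', '#', or the end of the string.
    private static final Pattern JSESSIONID =
        Pattern.compile(";jsessionid=[^;?#]*", Pattern.CASE_INSENSITIVE);

    // Returns the URL with any jsessionid path parameter removed.
    public static String strip(String url) {
        if (url == null) {
            return null;
        }
        return JSESSIONID.matcher(url).replaceAll("");
    }

    public static void main(String[] args) {
        String raw = "http://www.oneforever.com/tohomepage.do"
            + ";jsessionid=F3C8BBAC224990A9214A1785E5001AFD";
        // Prints: http://www.oneforever.com/tohomepage.do
        System.out.println(strip(raw));
    }
}
```

The alternative raised in the question, an extra filter rule placed ahead of the rejecting one, would let the URL through unchanged, so each new session id would still produce a distinct URL for the same page. Stripping the parameter avoids that, assuming the site does not require the session id in the URL.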
