[Nutch-general] RE: Meta-Refresh Question

Dennis Kubes Tue, 04 Apr 2006 10:33:54 -0700

I searched through the code, and the problem is that the URL returned for the
meta-refresh looks like this:

http://www.oneforever.com/tohomepage.do;jsessionid=F3C8BBAC224990A9214A1785E5001AFD

This matches the following RegexURLFilter pattern:

[EMAIL PROTECTED] (because of the = sign)

So my question is: should the URL be cleaned up inside HttpBase, where it is
grabbed from the page content, or would it be better to add a URL filter that
matches it before it gets eliminated by the filter above?
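
To make the first option concrete, here is a rough sketch of stripping the
session id before the URL reaches the filter. It is not Nutch code: the class
SessionIdStripper, its method names, and the EXCLUDE pattern are all
hypothetical, and the pattern only assumes a character class that contains
the = sign.

```java
import java.util.regex.Pattern;

public class SessionIdStripper {
    // Assumption: the exclusion rule rejects any URL containing one of these
    // characters. It stands in for the filter pattern discussed above.
    private static final Pattern EXCLUDE = Pattern.compile("[?*!@=]");

    // Matches a ";jsessionid=<token>" path parameter, case-insensitively,
    // up to the next ';', '?', '#', or the end of the string.
    private static final Pattern JSESSIONID =
        Pattern.compile(";jsessionid=[^;?#]*", Pattern.CASE_INSENSITIVE);

    // Returns the URL with any jsessionid path parameter removed.
    public static String strip(String url) {
        return JSESSIONID.matcher(url).replaceAll("");
    }

    // Returns true if the assumed exclusion rule would drop this URL.
    public static boolean rejected(String url) {
        return EXCLUDE.matcher(url).find();
    }

    public static void main(String[] args) {
        String raw = "http://www.oneforever.com/tohomepage.do"
            + ";jsessionid=F3C8BBAC224990A9214A1785E5001AFD";
        String clean = strip(raw);
        System.out.println(rejected(raw) + " " + raw);
        System.out.println(rejected(clean) + " " + clean);
    }
}
```

Cleaning the URL first means the existing filter can stay as it is, but every
place that extracts redirect targets would need to call the cleanup.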

Dennis

-----Original Message-----
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
Sent: Tuesday, April 04, 2006 9:56 AM
To: [email protected]
Subject: Re: Meta-Refresh Question

Dennis Kubes wrote:
> Silly question but nutch won't follow meta-refreshes will it?
>

It should; parse-html has support for this
(ParseStatus.SUCCESS_REDIRECT), and it did work in 0.7, but now I can see
that one of the necessary pieces (in Fetcher) didn't make it to 0.8.
Please create a JIRA issue so that it doesn't escape our attention.
Thank you!

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

