Yes, now that I understand what's going on, I believe Nutch is doing the right thing too - no bug :)
You got me thinking about the base URL tag, and I think the problem is that the page should either have an explicit base tag pointing to http://www.xxxxx.com/xxxxx.cgi, or all of its relative URLs should start with xxxxx.cgi; then crawlers could resolve the URLs correctly. Somehow my browser resolves it, but I don't know how.

In any case, I figure a workaround would be to do a regex substitution in the regex-normalize.xml file that reconstructs the correct URL.

Thanks for the help,
raymond

--- [EMAIL PROTECTED] wrote:

> I think Nutch is behaving correctly.
> Maybe that page has a BASE URL (view source, look at the HEAD elements)
> that throws off one or the other.
>
> Otis
>
> --- Raymond Creel <[EMAIL PROTECTED]> wrote:
>
> > Has anyone experienced a problem with the way the standard HTML
> > parser plugin handles relative URLs?
> >
> > There is a site where the home page is something like
> >
> > http://www.xxxxx.com/xxxxx.cgi
> >
> > and when browsing a link with its href set to
> >
> > '?paramname=paramvalue'
> >
> > a browser will naturally take you to
> >
> > http://www.xxxxx.com/xxxxx.cgi?paramname=paramvalue
> >
> > However, in Nutch, when the outlinks are parsed from the page, the
> > link ends up being
> >
> > http://www.xxxxx.com/?paramname=paramvalue
> >
> > which of course is broken. So why is the xxxxx.cgi gone? Is this a
> > bug or am I missing something?
> >
> > Thanks
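For anyone following the thread, here is a minimal sketch of the two behaviours discussed above. It is in Python rather than Nutch's Java, purely for illustration. The first part uses `urllib.parse.urljoin`, which follows RFC 3986 reference resolution: a query-only href keeps the base path, which matches what the browser did. The second part is a hypothetical rewrite rule of the kind proposed as a workaround; the `repair` helper, `PATTERN`, and `SUBSTITUTION` are assumptions of mine and not anything shipped with Nutch.

```python
import re
from urllib.parse import urljoin

# Placeholder values taken from the thread.
BASE = "http://www.xxxxx.com/xxxxx.cgi"
HREF = "?paramname=paramvalue"

# RFC 3986-style resolution: a query-only reference keeps the base path
# and only replaces the query string.
resolved = urljoin(BASE, HREF)

# Hypothetical rewrite rule (an assumption, not Nutch's own configuration):
# when a query string directly follows the site root, re-insert the
# script name that was dropped during resolution.
PATTERN = re.compile(r"^(http://www\.xxxxx\.com/)\?")
SUBSTITUTION = r"\g<1>xxxxx.cgi?"


def repair(url: str) -> str:
    """Return url with xxxxx.cgi restored if it was dropped before the query."""
    return PATTERN.sub(SUBSTITUTION, url)


if __name__ == "__main__":
    print(resolved)
    print(repair("http://www.xxxxx.com/?paramname=paramvalue"))
```

The same pattern and substitution would still need translating into whatever syntax regex-normalize.xml expects; Java regexes refer to the captured group as `$1` rather than `\g<1>`. URLs that already contain xxxxx.cgi, or that have no query string, pass through `repair` unchanged.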
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
