Yes, now that I understand what's going on, I believe
Nutch is doing the right thing too - no bug :)

You got me thinking about the base url tag and I think
the problem is that the page either should have an
explicit base tag to http://www.xxxxx.com/xxxxx.cgi or
all their relative urls should start with xxxxx.cgi,
then crawlers could resolve the url correctly.
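For comparison, here is a minimal sketch of how one resolver treats the two kinds of href discussed above. It uses Python's standard urllib.parse.urljoin purely as an illustration. It is not Nutch's html parser, and the host and script name are the placeholders from this thread.

```python
from urllib.parse import urljoin

# Placeholder page URL from this thread (not a real site).
base = "http://www.xxxxx.com/xxxxx.cgi"

# Query-only href, as it appears on the page.
query_only = urljoin(base, "?paramname=paramvalue")

# Href that names the script explicitly, as suggested above.
explicit = urljoin(base, "xxxxx.cgi?paramname=paramvalue")

print(query_only)
print(explicit)
```

With this particular resolver both forms keep xxxxx.cgi. An href that names the script explicitly does not depend on how a resolver treats a query-only reference, which is why it is the safer form for crawlers.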

Somehow my browser resolves it, but I don't know how.
In any case I figure a workaround would be to just do
a regex substitution in the regex-normalize.xml file
which will reconstruct the correct url.
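A rough sketch of the kind of substitution I have in mind, written in Python only to check the pattern. The pattern and replacement are my own guess for this one site, and the names PATTERN, SUBSTITUTION and normalize are hypothetical. Nutch's regex normalizer uses Java regexes, so the group reference would be written $1 rather than \1 there.

```python
import re

# Hypothetical pattern/substitution pair for this one site; the host and
# script name are the placeholders used in this thread.
PATTERN = re.compile(r"^(http://www\.xxxxx\.com/)\?")
SUBSTITUTION = r"\1xxxxx.cgi?"

def normalize(url: str) -> str:
    """Re-insert the missing script name into a broken outlink."""
    return PATTERN.sub(SUBSTITUTION, url)

broken = "http://www.xxxxx.com/?paramname=paramvalue"
fixed = normalize(broken)
print(fixed)
```

The pattern is anchored to the host followed directly by '?', so URLs that already contain xxxxx.cgi, or that point at other paths on the site, pass through unchanged.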

thanks for the help
raymond

--- [EMAIL PROTECTED] wrote:

> I think Nutch is behaving correctly.
> Maybe that page has a BASE URL (view source, look at the HEAD
> elements) that throws off one or the other.
> 
> Otis
> 
> 
> --- Raymond Creel <[EMAIL PROTECTED]> wrote:
> 
> > Has anyone experienced a problem with the way the
> > standard html parser plugin handles relative urls?
> > 
> > There is a site where the home page is something like
> > 
> > http://www.xxxxx.com/xxxxx.cgi
> > 
> > and when browsing a link with its href set to
> > 
> > '?paramname=paramvalue'
> > 
> > a browser will naturally take you to
> > 
> > http://www.xxxxx.com/xxxxx.cgi?paramname=paramvalue
> > 
> > However, in nutch when the outlinks are parsed from
> > the page the link ends up being
> > 
> > http://www.xxxxx.com/?paramname=paramvalue
> > 
> > which of course is broken.  So why is the xxxxx.cgi
> > gone?  Is this a bug or am I missing something?
> > 
> > Thanks




-------------------------------------------------------
SF.Net email is Sponsored by the Better Software Conference & EXPO September
19-22, 2005 * San Francisco, CA * Development Lifecycle Practices
Agile & Plan-Driven Development * Managing Projects & Teams * Testing & QA
Security * Process Improvement & Measurement * http://www.sqe.com/bsce5sf
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
