Three bugs stood in the way of this functionality.  Two were simple
(char 0 mishandled in HtIs...WordChar; removeIndex not being called
in normalizePath).  For the third, some mumbling (better communicated
than just thought) may be in order.

There was a bug in the parsing of URLs before calling
Retriever::got_href.  I believe that URL::parse should reset its
contents (the member variables) before extracting the
different parts.

Without such a reset, the _normal member, for example, was kept when
parsing new contents into an old instance or when using a "parent"
URL, effectively disabling the remove_index functionality.  This is
most obvious when using external parsers, but the same effect
was present when using HTML.cc.
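
To make the failure mode concrete, here is a stripped-down,
compilable stand-in (not the actual ht://Dig class; only the _normal
name is real) showing how reusing an instance leaks the flag:

  #include <cstdio>
  #include <string>

  // Toy stand-in for URL; only _normal matches a real member name.
  struct URL
  {
      std::string _path;
      int         _normal = 0;

      void parse(const char *ref) { _path = ref; }  // note: no reset of _normal
      void normalize()            { _normal = 1; }  // e.g. after fixing the server
  };

  int main()
  {
      URL u;
      u.parse("http://example.com/a/");
      u.normalize();                               // _normal is now 1
      u.parse("http://example.com/b/index.html");  // reuse the instance
      // _normal is STILL 1, so code consulting it will skip the
      // remove_index-style fixup for the new contents.
      std::printf("_normal after reuse: %d\n", u._normal);  // prints 1
      return 0;
  }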

Although it is unclear what the "_normal" member flag is
supposed to indicate (normalized server?  normalized path?
both?), I believe it should indicate a URL that has been
inspected and completely fixed with regards to *all*
attributes; otherwise another flag is needed.  At present, my
changes reflect that the server has been "normalized", which seemed
to be the original intent.  Changing it to mean "all normalized"
would mean gethostbyname calls for documents with relative URLs
on the same server, which is costly.
 The "_normal" member is set to 1 only in URL::normalize now.
It also used to be set in URL::parse if the URL ended after the
first ":" (or is empty), or did not contain a host part
(contained a "//"); and then URL::parse *returned*.  This was
obviously buggy and/or incomplete since that can only work for
some cases if URL::parse was called from URL::URL(char *ref, URL
&parent), where the URL gets "reconstructed" the same way that
URL::parse would do later.  In no case was the URL "normal".
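
Schematically, the rule now is something like this (a sketch with
invented members apart from _normal):

  #include <string>

  class URL
  {
  public:
      void normalize();
  private:
      std::string _host, _path;   // placeholders for the real members
      int         _normal = 0;
  };

  void URL::normalize()
  {
      if (_normal)
          return;                 // already inspected and fixed
      // ... canonicalize _host (the gethostbyname work), tidy _path ...
      _normal = 1;                // set here, and nowhere else
  }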

I believe URL::parse and URL::URL(char *ref, URL &parent) should
be unified; setting defaults and then calling a common parse method
would clean things up, along the lines of the sketch below.
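
One possible shape (a sketch; every name not already mentioned in
this mail is hypothetical):

  #include <string>

  class URL
  {
  public:
      URL() = default;
      explicit URL(const char *ref) { parse(ref); }
      URL(const char *ref, const URL &parent)
      {
          *this = parent;          // inherit service/host/port defaults
          _normal = 0;             // but never inherit the parent's flag
          parseRef(ref);
      }
      void parse(const char *ref)
      {
          *this = URL();           // reset every member first
          parseRef(ref);
      }
  private:
      // Shared extraction; fills in only what ref actually specifies.
      void parseRef(const char *ref) { (void) ref; /* ... */ }

      std::string _service, _host, _path;
      int         _port   = 0;
      int         _normal = 0;
  };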

I changed (in CVS now) the URL::parse() method to reset most
members before commencing with parsing.

I also changed ExternalParser to set the URL _hopcount member
separately; it was previously not handled, so all hopcounts were 0.
I believe it is debatable whether _hopcount should be part of the
URL class at all; a URL does not intuitively have a "hopcount"
attribute, IMHO.

This means that external parsers may break if they do not do what
the documentation says: provide a *complete absolute* URL.  If they
provided a relative URL, they may have worked by accident before.

If support for relative URLs in the 'u' field is wanted, then I
guess the documentation needs fixing, and the ExternalParser
code needs to be changed to use URL::URL(char *ref, URL
&parent), similar to what is done in HTML.cc; a sketch follows below.

My changes may also have uncovered other bugs related to the
handling of URLs, but now people (hopefully) have a better clue
if/when that happens.

brgds, H-P
