According to [EMAIL PROTECTED]:
> I have never had a problem with htdig and spaces in filenames, but in trying
> to index a set of html files on our intranet site, I encountered some
> problems.
> It appears as though htdig has removed the white spaces in the filenames
> that I am trying to index. I have never seen this happen before and am
> unable to figure out exactly why it is happening. I have tried several
> different solutions and none seem to change anything. I am using version
> 3.1.5
>
> Here is a section of what is going on using -vvv.
>
> +A tag: pos = 2, position = ="Lee County.html">
> href: http://intranet/locality_test/LeeCounty.html (Lee County)
> resolving 'http://intranet/locality_test/LeeCounty.html'
>
> pushing http://intranet/locality_test/LeeCounty.html

Hmm. This could be seen as a bug, or it may be that htdig correctly
implements the standard whereas many HTML parsers and HTML generators
don't. Normally, embedded spaces in URLs should be encoded as %20,
and that would avoid this problem.
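
To illustrate the %20 encoding, here is a minimal sketch of what a page
generator could do before writing out an href. The helper name
encode_spaces is hypothetical (it is not part of htdig), and it only
handles literal spaces, not the other unsafe characters the RFC lists.

```cpp
#include <string>

// Hypothetical helper, not part of htdig: replace each literal space
// in a URL with its percent-encoded form so that the space survives
// any later whitespace stripping.
std::string encode_spaces(const std::string &url)
{
    std::string out;
    out.reserve(url.size());
    for (char c : url) {
        if (c == ' ')
            out += "%20";   // encoded form of the space character
        else
            out += c;       // everything else is copied unchanged
    }
    return out;
}
```

With this, "http://intranet/locality_test/Lee County.html" would be written
as "http://intranet/locality_test/Lee%20County.html", which has no white
space left for a parser to strip.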

RFC 1738, which defines the standard URL format, says this:

   Characters can be unsafe for a number of reasons. The space
   character is unsafe because significant spaces may disappear and
   insignificant spaces may be introduced when URLs are transcribed or
   typeset or subjected to the treatment of word-processing programs.

and

   In some cases, extra whitespace (spaces, linebreaks, tabs, etc.) may
   need to be added to break long URLs across lines. The whitespace
   should be ignored when extracting the URL.

The newer RFC 2396 (URI Generic Syntax) says pretty much the same thing,
so according to a strict interpretation of the standard, htdig does the
right thing. The HTML 2.0 and 4.01 standards simply refer to these RFCs
when it comes to defining the format of the HREF attribute in <A>
and <LINK> tags.

However, as far as I know, URLs with embedded spaces are still treated
as valid by many applications, so from a pragmatic standpoint what htdig
is doing could be seen as wrong.

The source of the problem would appear to be in the URL class, in
htlib/URL.cc (or htcommon/URL.cc in 3.2 betas). The URL::URL(url, parent)
constructor and the URL::parse(u) method both have String::remove()
calls in them that strip out all white space characters, i.e.:

    temp.remove(" \r\n\t");         (in URL::URL(url, parent))

and

    temp.remove(" \t\r\n");         (in URL::parse(u))

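
To make the effect of those calls concrete, here is a small sketch using
std::string as a stand-in for htdig's String class. The function
remove_chars is hypothetical; I'm assuming it mirrors what
String::remove() does, based on the behaviour seen in the -vvv output.

```cpp
#include <algorithm>
#include <string>

// Stand-in for htdig's String::remove() (assumed semantics): delete
// every character of `s` that also appears in `chars`, wherever it
// occurs in the string.
std::string remove_chars(std::string s, const std::string &chars)
{
    s.erase(std::remove_if(s.begin(), s.end(),
                           [&chars](char c) {
                               return chars.find(c) != std::string::npos;
                           }),
            s.end());
    return s;
}
```

Calling remove_chars("Lee County.html", " \r\n\t") yields "LeeCounty.html",
which matches the href that htdig reported for that link.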
These have been there since 3.0.8b2, which is the oldest version for
which I have source, so we can probably assume they've been there
from the beginning, and that Andrew followed the RFCs to the letter.

The question is, can they be safely taken out? At the very least, you
could take the space out of the string in those two remove calls and see
if that solves the problem. If you still want to remove trailing spaces,
then maybe the remove call without a space in the string can be followed
by a chop call, e.g., in both methods do this:

    temp.remove("\r\n\t");
    temp.chop(' ');

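
Here is a rough sketch of what that two-step change would do, again with
std::string standing in for htdig's String. The function name
strip_keep_embedded_spaces is hypothetical, and I'm assuming that
chop(' ') drops every trailing space rather than just one.

```cpp
#include <algorithm>
#include <string>

// Sketch of the proposed behaviour (not htdig code): remove CR, LF and
// tab anywhere in the string, then trim trailing spaces only, so that
// embedded spaces are preserved.
std::string strip_keep_embedded_spaces(std::string s)
{
    const std::string drop = "\r\n\t";

    // Equivalent of temp.remove("\r\n\t"), under the assumed semantics.
    s.erase(std::remove_if(s.begin(), s.end(),
                           [&drop](char c) {
                               return drop.find(c) != std::string::npos;
                           }),
            s.end());

    // Equivalent of temp.chop(' '), assuming it strips all trailing spaces.
    while (!s.empty() && s.back() == ' ')
        s.pop_back();

    return s;
}
```

Under these assumptions "Lee County.html" followed by trailing spaces and
a line break comes back as "Lee County.html", with the embedded space
intact. Leading spaces are left alone, since chop only works on the end
of the string.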
However, the RFC does seem to be addressing any embedded white space,
not just leading or trailing space. Does anyone else have any thoughts
about how htdig ought to deal with these non-conforming URLs?
--
Gilles R. Detillieux E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930