According to [EMAIL PROTECTED]:
> I have never had a problem with htdig and spaces in filenames, but in trying
> to index a set of html files on our intranet site, I encountered some
> problems.  
> It appears as though htdig has removed the white spaces in the filenames
> that I am trying to index.  I have never seen this happen before and am
> unable to figure out exactly why it is happening.  I have tried several
> different solutions and none seem to change anything.  I am using version
> 3.1.5
>  
> Here is a section of what is going on using -vvv.
>  
> +A tag: pos = 2, position = ="Lee County.html">
> href: http://intranet/locality_test/LeeCounty.html (Lee County)
> resolving 'http://intranet/locality_test/LeeCounty.html'
>  
>    pushing http://intranet/locality_test/LeeCounty.html

Hmm.  This could be seen as a bug, or it may be that htdig correctly
implements the standard whereas many HTML parsers and HTML generators
don't.  Normally, embedded spaces in URLs should be encoded as %20,
and that would avoid this problem.
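
For illustration, here is a minimal sketch of that encoding step as it
might be applied by whatever generates the HTML.  The encode_spaces()
helper is hypothetical and is not part of htdig; it uses std::string
rather than htdig's own String class, and it only handles the space
character, not the other characters the RFC lists as unsafe.

```cpp
#include <string>

// Hypothetical helper (not part of htdig): replace each literal space
// in a URL with its %20 escape and copy every other character as-is.
std::string encode_spaces(const std::string &url)
{
    std::string out;
    out.reserve(url.size());
    for (std::string::size_type i = 0; i < url.size(); ++i) {
        if (url[i] == ' ')
            out += "%20";
        else
            out += url[i];
    }
    return out;
}
```

With links written this way, the href for the file in the log above
would read Lee%20County.html, so there would be no literal space left
for htdig to strip.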

RFC 1738, which defines the standard URL format, says this:

   Characters can be unsafe for a number of reasons.  The space
   character is unsafe because significant spaces may disappear and
   insignificant spaces may be introduced when URLs are transcribed or
   typeset or subjected to the treatment of word-processing programs.

and

   In some cases, extra whitespace (spaces, linebreaks, tabs, etc.) may
   need to be added to break long URLs across lines.  The whitespace
   should be ignored when extracting the URL.

The newer RFC 2396 (URI Generic Syntax) says pretty much the same thing,
so according to a strict interpretation of the standard, htdig does the
right thing.  The HTML 2.0 and 4.01 standards simply refer to these RFCs
when it comes to defining the format of the HREF attribute in <A>
and <LINK> tags.

However, as far as I know, URLs with embedded spaces are still treated
as valid by many applications, so from a pragmatic standpoint what htdig
is doing could be seen as wrong.

The source of the problem would appear to be in the URL class, in
htlib/URL.cc (or htcommon/URL.cc in 3.2 betas).  The URL::URL(url, parent)
constructor and the URL::parse(u) method both have String::remove()
calls in them that strip out all white space characters, i.e.:

    temp.remove(" \r\n\t");             (in URL::URL(url, parent))

and

    temp.remove(" \t\r\n");             (in URL::parse(u))

These have been there since 3.0.8b2, which is the oldest version for
which I have source, so we can probably assume they've been there
from the beginning, and that Andrew followed the RFCs to the letter.
The question is, can they be safely taken out?  At the very least, you
could take the space out of the string in those two remove calls and see
if that solves the problem.  If you still want to remove trailing spaces,
then maybe the remove call without a space in the string can be followed
by a chop call, e.g., in both methods do this:

    temp.remove("\r\n\t");
    temp.chop(' ');
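
To make the difference concrete, here is a rough standalone sketch of
the two behaviours using std::string instead of htdig's String class.
The remove_chars() and chop_trailing() functions are stand-ins I made
up for this illustration; in particular, chop_trailing() assumes that
chop(' ') strips trailing spaces, which should be checked against the
String class in the version you are running.

```cpp
#include <string>

// Stand-in for String::remove(): drop every character of s that
// appears anywhere in chars.
std::string remove_chars(const std::string &s, const std::string &chars)
{
    std::string out;
    for (std::string::size_type i = 0; i < s.size(); ++i)
        if (chars.find(s[i]) == std::string::npos)
            out += s[i];
    return out;
}

// Stand-in for the chop(' ') step, under the assumption that it
// removes all trailing occurrences of ch.
std::string chop_trailing(const std::string &s, char ch)
{
    std::string::size_type end = s.size();
    while (end > 0 && s[end - 1] == ch)
        --end;
    return s.substr(0, end);
}

// What the existing remove calls do: all white space is stripped,
// including embedded spaces.
std::string current_cleanup(const std::string &href)
{
    return remove_chars(href, " \r\n\t");
}

// The suggested change: strip only CR, LF and tab, then trim
// trailing spaces.  Embedded and leading spaces are kept.
std::string proposed_cleanup(const std::string &href)
{
    return chop_trailing(remove_chars(href, "\r\n\t"), ' ');
}
```

Given "Lee County.html", current_cleanup() yields "LeeCounty.html" as
in the log above, while proposed_cleanup() leaves the embedded space
alone.  Note that this sketch does not trim leading spaces.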

However, the RFC does seem to be addressing any embedded white space,
not just leading or trailing space.  Does anyone else have any thoughts
about how htdig ought to deal with these non-conforming URLs?

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a 
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html
