According to Joe R. Jah:
> On Sat, 9 Mar 2002, Geoff Hutchison wrote:
> > On Friday, March 8, 2002, at 01:51  PM, Joe R. Jah wrote:
> > > Unfortunately htdig removes the space, looks for "filename.html", and
> > > reports:
> > >
> > > Not found: http://domain.com/some/path/filename.html Ref: 
> > > http://domain.com/some/path/file.html
> > 
> > Joe, I think you should understand that this isn't much help as a bug 
> > report. Do you see this in 3.1.x, 3.2.0bX, both, etc.? When does the 
> > space seem to "disappear?" Is it when it first encounters the link 
> > (parser error), as it normalizes and accepts/rejects the URL (retriever 
> > or URL parser error) or as it tries to fetch it?
> > 
> > A bit more feedback would go a long way towards debugging this.
> 
> OK, I run 3.1.6; rundig -vvvvv produces the following for one link in one
> file:
> ----------------------------------8<-------------------------------
> 0:0:0:http://domain.com/Path/To/: Trying local files
>   tried local file /domain.com/Path/To/index.html
>   tried local file /domain.com/Path/To/index.shtml
>   found existing file /domain.com/Path/To/index.htm
> Read 5785 from document
> Read a total of 5785 bytes
> Tag: <html>, matched -1
> Tag: <head>, matched -1
> Tag: <title>, matched 0
> word: Handouts@7
> Tag: </title>, matched 1
> title: Handouts
> Tag: <a href="fa01HP2-Basic Unix Commands.htm">, matched 2
> word: Basic@696
> word: UNIX@698
> word: Commands@700
> Tag: </a>, matched 3
> href: http://domain.com/Path/To/fa01HP2-BasicUnixCommands.htm (Basic UNIX
> Commands)
> resolving 'http://domain.com/Path/To/fa01HP2-BasicUnixCommands.htm'
>    pushing http://domain.com/Path/To/fa01HP2-BasicUnixCommands.htm
> ----------------------------------8<-------------------------------
> ...
> ----------------------------------8<-------------------------------
> 14:14:1:http://domain.com/Path/To/fa01HP2-BasicUnixCommands.htm: Trying local files
>   tried local file /domain.com/Path/To/fa01HP2-BasicUnixCommands.htm
> Local retrieval failed, trying HTTP
> Retrieval command for http://domain.com/Path/To/fa01HP2-BasicUnixCommands.htm: GET /Path/To/fa01HP2-BasicUnixCommands.htm HTTP/1.0
> User-Agent: htdig/3.1.6 ([EMAIL PROTECTED])
> Referer: http://domain.com/Path/To/
> Host: domain.com   
> 
> Header line: HTTP/1.1 404 Not Found
> Header line: Date: Sun, 10 Mar 2002 08:03:36 GMT
> ----------------------------------8<-------------------------------
> 
> And it reports:
> ----------------------------------8<-------------------------------
> Not found: http://domain.com/Path/To/fa01HP2-BasicUnixCommands.htm Ref: http://domain.com/Path/To/
> ----------------------------------8<-------------------------------

Accepting unencoded spaces within URLs, as most browsers do, is a violation
of RFC 1738 and RFC 2396.  htdig does the correct thing, even if it is not
what some users would prefer.  You can of course patch the URL class to
leave the spaces in there, in violation of the standards, to conform with
the incorrect behaviour of most browsers and, apparently, of some really
bad HTML code generators.  That would save you from having to fix all the
bad HTML code you're indexing.  Spaces within URLs should always be
encoded as %20.

See http://www.geocrawler.com/archives/3/8822/2002/1/300/7455555/
and http://www.geocrawler.com/archives/3/8822/2002/1/250/7495651/
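
To illustrate the %20 encoding recommended above, here is a minimal Python
sketch of how a link like the one in the log could be fixed before it is
resolved against its base URL.  This is not htdig code (htdig is written in
C++); the function name encode_href and the choice of "safe" characters are
assumptions made for the example.

```python
from urllib.parse import quote, urljoin

# Characters left untouched: URL delimiters plus "%", so that links which
# are already encoded (e.g. "a%20b.htm") are not encoded a second time.
# This set is an assumption for the sketch, not something htdig defines.
SAFE_CHARS = "/:?#[]@!$&'()*+,;=%~"


def encode_href(base_url, href):
    """Percent-encode spaces (and other unsafe characters) in href,
    then resolve it against base_url.  Hypothetical helper."""
    # Leading and trailing whitespace is not part of the link; drop it.
    cleaned = href.strip()
    # Embedded spaces become %20 instead of being removed.
    encoded = quote(cleaned, safe=SAFE_CHARS)
    return urljoin(base_url, encoded)


if __name__ == "__main__":
    print(encode_href("http://domain.com/Path/To/",
                      "fa01HP2-Basic Unix Commands.htm"))
```

Run against the href from the log above, this prints
http://domain.com/Path/To/fa01HP2-Basic%20Unix%20Commands.htm, which is the
URL the HTML should have contained in the first place.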

My recommendation, if you have a choice, is to avoid spaces in filenames
altogether, because they cause all sorts of grief.  Some caching proxy
servers mess up URLs with spaces, even if the space is properly encoded
as %20.
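
If you want to find the offending files before renaming them, something
along these lines would list every file under a document root whose name
contains a space.  It is only a sketch: the function name files_with_spaces
and the example path are made up for illustration, and it only reports the
files, it does not rename anything.

```python
import os


def files_with_spaces(root):
    """Yield the path of every file under root whose name contains a space."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if " " in name:
                yield os.path.join(dirpath, name)


if __name__ == "__main__":
    # "/path/to/docroot" is a placeholder; substitute your own document root.
    for path in files_with_spaces("/path/to/docroot"):
        print(path)
```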

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
