According to Joe R. Jah:
> On Sat, 9 Mar 2002, Geoff Hutchison wrote:
> > On Friday, March 8, 2002, at 01:51 PM, Joe R. Jah wrote:
> > > Unfortunately htdig removes the space. and looks for "filename.html" and
> > > reports:
> > >
> > > Not found: http://domain.com/some/path/filename.html Ref:
> > > http://domain.com/some/path/file.html
> >
> > Joe, I think you should understand that this isn't much help as a bug
> > report. Do you see this in 3.1.x, 3.2.0bX, both, etc.? When does the
> > space seem to "disappear?" Is it when it first encounters the link
> > (parser error), as it normalizes and accepts/rejects the URL (retriever
> > or URL parser error) or as it tries to fetch it?
> >
> > A bit more feedback would go a long way towards debugging this.
>
> Ok, I run 3.1.6, rundig -vvvvv results the following for one link in one
> file:
> ----------------------------------8<-------------------------------
> 0:0:0:http://domain.com/Path/To/: Trying local files
> tried local file /domain.com/Path/To/index.html
> tried local file /domain.com/Path/To/index.shtml
> found existing file /domain.com/Path/To/index.htm
> Read 5785 from document
> Read a total of 5785 bytes
> Tag: <html>, matched -1
> Tag: <head>, matched -1
> Tag: <title>, matched 0
> word: Handouts@7
> Tag: </title>, matched 1
> title: Handouts
> Tag: <a href="fa01HP2-Basic Unix Commands.htm">, matched 2
> word: Basic@696
> word: UNIX@698
> word: Commands@700
> Tag: </a>, matched 3
> href: http://domain.com/Path/To/fa01HP2-BasicUnixCommands.htm (Basic UNIX
> Commands)
> resolving 'http://domain.com/Path/To/fa01HP2-BasicUnixCommands.htm'
> pushing http://domain.com/Path/To/fa01HP2-BasicUnixCommands.htm
> ----------------------------------8<-------------------------------
> ...
> ----------------------------------8<-------------------------------
> 14:14:1:http://domain.com/Path/To/fa01HP2-BasicUnixCommands.htm: Trying local files
> tried local file /domain.com/Path/To/fa01HP2-BasicUnixCommands.htm
> Local retrieval failed, trying HTTP
> Retrieval command for http://domain.com/Path/To/fa01HP2-BasicUnixCommands.htm: GET /Path/To/fa01HP2-BasicUnixCommands.htm HTTP/1.0
> User-Agent: htdig/3.1.6 ([EMAIL PROTECTED])
> Referer: http://domain.com/Path/To/
> Host: domain.com
>
> Header line: HTTP/1.1 404 Not Found
> Header line: Date: Sun, 10 Mar 2002 08:03:36 GMT
> ----------------------------------8<-------------------------------
>
> And it reports:
> ----------------------------------8<-------------------------------
> Not found: http://domain.com/Path/To/fa01HP2-BasicUnixCommands.htm Ref: http://domain.com/Path/To/
> ----------------------------------8<-------------------------------

What most browsers do with unencoded spaces within URLs is a violation
of RFC 1738 and RFC 2396. htdig does the correct thing, even if it is
not what some users would prefer.

You can of course patch the URL class to leave the spaces in, in
violation of the standard, to conform with the incorrect behaviour of
most browsers and, apparently, some really bad HTML code generators.
That would save you from having to fix all the bad HTML code you're
indexing. Spaces within URLs should always be encoded as %20. See

http://www.geocrawler.com/archives/3/8822/2002/1/300/7455555/
and
http://www.geocrawler.com/archives/3/8822/2002/1/250/7495651/

My recommendation, if you have a choice, is to avoid spaces in
filenames altogether, because they cause all sorts of grief. Some
caching proxy servers mess up URLs with spaces, even if the space is
properly encoded as %20.

--
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
FAQ: http://htdig.sourceforge.net/FAQ.html
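
As an illustration of the %20 encoding recommended above, here is a
minimal Python sketch that percent-encodes the spaces in a link before
resolving it against the referring page. It is not htdig code and says
nothing about how htdig's URL class works; the helper name resolve_href
is hypothetical, and the base URL and href are the ones from the quoted
log. It assumes the href is either raw or already correctly escaped.

```python
from urllib.parse import quote, urljoin


def resolve_href(base_url: str, href: str) -> str:
    """Percent-encode unsafe characters in href, then resolve it against base_url.

    Hypothetical helper for illustration only; not part of htdig.
    """
    # Leave URL-reserved characters and "%" alone so that existing
    # escapes such as %20 are not encoded a second time.
    encoded = quote(href, safe="/:?#[]@!$&'()*+,;=%")
    return urljoin(base_url, encoded)


if __name__ == "__main__":
    print(resolve_href("http://domain.com/Path/To/",
                       "fa01HP2-Basic Unix Commands.htm"))
    # prints http://domain.com/Path/To/fa01HP2-Basic%20Unix%20Commands.htm
```

Writing the link in the HTML source in this encoded form
(fa01HP2-Basic%20Unix%20Commands.htm) avoids the ambiguity entirely,
though renaming the file to remove the spaces is still the simpler fix.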

