Emma Jane Hogbin's bits of Thu, 7 Mar 2002 translated to:

>>OK, but what exactly is on the page? It certainly didn't find anything
>>significant to index or links other than the images you pointed out.
>
>The page has four bread crumb items, a bunch of image navigation buttons,
>eight left nav text links, and over 20 text links (in a list). None of the
>words on the page are getting put into the word db. i.e. the page has a
>list of Colleges and none of the names of the colleges show up when I do a
>search.
>
>>Either the HTML parser is missing a lot, or there isn't much on the page
>>to index.
>
>I think it's the first option, which scares me. :(

It does look like there is a problem with the parser. If a '<' occurs
inside a script element, the parser appears to lose track of the rest of
the document. For example,

    <head>
    <title>Title</title>
    <script language="javascript">
    var i;
    for ( i = 0; i < 5; i++ ) {}
    </script>
    </head>

results in the parser missing all remaining links on the page. If the
'<' is removed or replaced (e.g. with a '>'), the page is properly
indexed.

This occurs with 3.1.6; I haven't tried it with a 3.2.0b4 snapshot.

Assuming that this is in fact a bug rather than a misunderstanding of
expected functionality, and the cause of the problem is not obvious, I
would be willing to do a bit of debugging.

Jim
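
P.S. For comparison, here is a minimal sketch of the behaviour I would
expect, written against Python's standard html.parser rather than the
ht://Dig parser (so it says nothing about where the ht://Dig bug lives).
That module treats the contents of a script element as raw text, so a
bare '<' inside the script should not hide later links. The body link
"colleges.html" and the names LinkCollector and extract_links are
made up for illustration; only the head section comes from the test case
above.

```python
from html.parser import HTMLParser

# Test page modelled on the example above, plus one hypothetical link
# in the body so there is something to find after the script element.
PAGE = """<html>
<head>
<title>Title</title>
<script language="javascript">
var i;
for ( i = 0; i < 5; i++ ) {}
</script>
</head>
<body>
<a href="colleges.html">Colleges</a>
</body>
</html>"""


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return every href found in the given HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    # Same page with and without the '<' in the script body.
    print(extract_links(PAGE))
    print(extract_links(PAGE.replace("i < 5", "i > 5")))
```

Both variants should report the same link. Running the same two pages
through htdig and comparing the indexed links would be one way to
confirm that the '<' in the script is the trigger.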