Emma Jane Hogbin's bits of Thu, 7 Mar 2002 translated to:

>>OK, but what exactly is on the page? It certainly didn't find anything
>>significant to index or links other than the images you pointed out.
>
>The page has four bread crumb items, a bunch of image navigation buttons,
>eight left nav text links, and over 20 text links (in a list). None of the
>words on the page are getting put into the word db. i.e. the page has a
>list of Colleges and none of the names of the colleges show up when I do a
>search.
>
>>Either the HTML parser is missing a lot, or there isn't much on the page
>>to index.
>
>I think it's the first option, which scares me. :(

It does look like there is a problem with the parser. If a '<'
occurs inside a script element, the parser appears to lose track
of the rest of the document. For example:

<head>
<title>Title</title>
<script language="javascript">
var i;
for ( i = 0; i < 5; i++ ) {}
</script>
</head>

results in the parser missing all remaining links on the page. If
the '<' is removed or replaced (e.g. with a '>'), the page is
properly indexed. This occurs with 3.1.6; I haven't tried it with
a 3.2.0b4 snapshot.
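
To make the expected behaviour concrete, here is a minimal sketch of a
link scanner that treats the body of a script element as opaque text, so
a '<' inside the script is never taken for the start of a tag. It is
written in Python for brevity and is not ht://Dig code; extract_links,
PAGE and the two href values are hypothetical names made up for the
illustration.

```python
import re

# Hypothetical illustration, not ht://Dig code: collect href values from
# anchor tags while skipping the body of any <script> element.
SCRIPT_CLOSE = re.compile(r"</script\s*>", re.IGNORECASE)
HREF = re.compile(r"""href\s*=\s*["']?([^"'\s>]+)""", re.IGNORECASE)


def extract_links(html):
    """Return the href values of <a> tags found outside script elements."""
    links = []
    pos = 0
    while True:
        lt = html.find("<", pos)
        if lt == -1:
            break
        gt = html.find(">", lt + 1)
        if gt == -1:
            break
        tag = html[lt + 1:gt]
        name = tag.split(None, 1)[0].lower() if tag.strip() else ""
        pos = gt + 1
        if name == "script":
            # Jump straight to the closing </script>, so "i < 5" in the
            # script body is never examined as markup.
            close = SCRIPT_CLOSE.search(html, pos)
            if close is None:
                break  # unterminated script: nothing more to scan
            pos = close.end()
        elif name == "a":
            match = HREF.search(tag)
            if match:
                links.append(match.group(1))
    return links


# The head section from the example above, followed by two made-up links.
PAGE = """<head>
<title>Title</title>
<script language="javascript">
var i;
for ( i = 0; i < 5; i++ ) {}
</script>
</head>
<body>
<a href="colleges.html">Colleges</a>
<a href="about.html">About</a>
</body>"""

print(extract_links(PAGE))
```

With the script body skipped, both links after the script element are
still reported, which is what I would expect the indexer to do as well.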

Assuming that this is in fact a bug rather than a misunderstanding
of expected functionality, and the cause of the problem is not
obvious, I would be willing to do a bit of debugging.

Jim


_______________________________________________
htdig-dev mailing list
https://lists.sourceforge.net/lists/listinfo/htdig-dev
