Tatu,

Thanks for the reply. See below for comments.

> just ignore everything inside of <script> tags.  It appears that the parser
> is ignoring text inside script tags, but it seems like it needs to be a bit
> smarter (or maybe dumber) about how it deals with this (so it doesn't get

I would guess that often ignoring stuff in <script> (for indexing purposes)
makes sense; exception being if someone wants to create HTML site creation
IDE (like specifically wants to search for stuff in javascript sections?).
Nonetheless HTML parser has to be able to handle these I think.

Fortunately, the sole purpose of the parser that ships with Lucene is indexing HTML documents. As such, I see no reason to worry about functionality for other use cases (i.e. IDE development). There are plenty of other parsers out there that try to be complete. It would be great if this one was optimized for the task at hand (and thus can ignore text inside <script> tags).


> confused by such occurrences).  I see a bug has been filed regarding
> trouble parsing JavaScript, has anyone given it thought?

If anyone would be interested I could give the source code and/or (if I have
time) to implement efficient fault-tolerant indexer.
Like I said this also works equally well for well-formed XML, but that's
nothing special.

I'd definitely be interested to see what you did. My application needs to index "public" documents as users submit requests (eventually 1000's per day), so I don't have control over the HTML (i.e. it needs to be fault tolerant) and it needs to be efficient. Parsing a big page (i.e. http://www.mysql.com/documentation/mysql/bychapter/manual_Reference.html) is another good way to stress the basic parsers (some are frighteningly CPU intensive).


Even though I think a solid HTML parser that is optimized for the task of indexing is actually quite important to Lucene, we can take any further discussions off-line as they are probably not deemed relevant to the Lucene list.

-Mike




--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]



Reply via email to