Re: HTMLParser

solprovider Thu, 12 May 2005 15:48:27 -0700

On 5/12/05, Robert Goene <[EMAIL PROTECTED]> wrote:
> >> I am trying to extend the current HTMLParser of lenya 1.2.1 to support
> >> keywords.
> Is there an xml parser for lucene somewhere? Should be fairly easy. The
> documents that i am indexing are xhtml, so there is no need for a parser
> that can handle those illegal html files.


I am trying to understand the purpose of this, so let me know if this
answer is completely off-topic.  I believe your issue can be solved
without touching Java.

I do not think Lucene cares whether data is HTML or XML; it treats it
all as XML.  I have not tried it with poorly written HTML, since Lenya
always closes tags in the correct order, and I have only used Lucene
with Lenya.

Lucene can index data (removing all tags) into several fields, each of
which can be searched.  The default is to crawl a website for all HTML
pages, then index the entire page into a "content" field.  My version
of search indexes the XML documents in {pub}/content/live, keeps the
"content" field, and adds fields for "language", "title", and
"description".  Each field is configured using an XPath expression.
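
To illustrate the idea of one XPath expression per field, here is a
minimal sketch that uses only the JDK's XPath API.  It is not Lenya's
or Lucene's actual indexing code; the class name, the sample document,
and the XPath expressions are all made up for the example.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: maps field names to values pulled out by XPath.
public class FieldExtractor {

    // Evaluates one XPath expression per field against a well-formed XML string.
    // The string value of an element is its text with all tags removed.
    public static Map<String, String> extractFields(String xml,
                                                    Map<String, String> fieldXPaths)
            throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : fieldXPaths.entrySet()) {
            fields.put(entry.getKey(), xpath.evaluate(entry.getValue(), doc).trim());
        }
        return fields;
    }

    public static void main(String[] args) throws Exception {
        // Sample document (assumed structure, no namespace declaration).
        String xml = "<html lang=\"en\"><head><title>Home</title>"
                + "<meta name=\"description\" content=\"Welcome page\"/></head>"
                + "<body><p>Hello <b>world</b></p></body></html>";

        // Field name -> XPath expression (assumed, for illustration only).
        Map<String, String> paths = new LinkedHashMap<>();
        paths.put("language", "/html/@lang");
        paths.put("title", "/html/head/title");
        paths.put("description", "/html/head/meta[@name='description']/@content");
        paths.put("content", "/html/body");

        for (Map.Entry<String, String> f : extractFields(xml, paths).entrySet()) {
            System.out.println(f.getKey() + " = " + f.getValue());
        }
    }
}
```

A real XHTML document declares a namespace, so the expressions would
need adjusting for a namespace-aware parser.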

So the easy answer should be:
1. Decide whether to index Lenya's XML or HTML.  If HTML, make certain
the keywords are displayed in the header so they can be accessed using
XPath.
2. Configure Lucene to add keywords to a new field.  Create the index.
3. Change the Search page to allow selection by keywords.
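
As a rough sketch of step 1 for the HTML case: if the keywords sit in
a meta element in the header, one XPath expression reaches them, and
the value can then be split into separate terms.  The meta name
"keywords", the comma separator, and the class name below are my
assumptions, not something Lenya prescribes.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: reads keywords from the page header via XPath.
public class KeywordExtractor {

    // Assumed location of the keywords in the generated page.
    private static final String KEYWORDS_XPATH =
            "/html/head/meta[@name='keywords']/@content";

    // Returns the keywords as trimmed, lower-cased terms; empty list if none.
    public static List<String> extractKeywords(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        String raw = XPathFactory.newInstance().newXPath()
                .evaluate(KEYWORDS_XPATH, doc);
        List<String> keywords = new ArrayList<>();
        for (String part : raw.split(",")) {
            String keyword = part.trim().toLowerCase(Locale.ROOT);
            if (!keyword.isEmpty()) {
                keywords.add(keyword);
            }
        }
        return keywords;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><head><title>News</title>"
                + "<meta name=\"keywords\" content=\"Lenya, Lucene , Search\"/>"
                + "</head><body><p>Text</p></body></html>";
        System.out.println(extractKeywords(page));
    }
}
```

The resulting terms are what would go into the new "keywords" field in
step 2, and what the Search page in step 3 would offer for selection.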

solprovider
