Re: use Lucene to index sentences

Marc Hadfield Mon, 06 Feb 2006 15:09:48 -0800


Hi AJ -

Performance would depend on the kind of queries you are going to performagainst sentences. If you are going to be querying for phrases(multi-token), want to make use of stemming, or any kind of termexpansion (wildcare, synonyms, etc), I imagine lucene would be muchsuperior, but I don't have experience with mysql full text search.

---marc


AJ Chen wrote:

Hi Marc,
Thanks for your suggestions. Marking sentences in documents and using span
query is a good approach.  How do you compare its performance to a database
approach? For example, sentences can be stored in mysql, one sentence per
row, and they can be searched by mysql's full text search feature. Using
database, it will be also easy to tell which document the matched sentence
belongs to.

AJ

On 2/6/06, Marc Hadfield <[EMAIL PROTECTED]> wrote:

Hi AJ -

Depending on your need, you could create a lucene document for each
sentence (in which case searching and returning sentences is trivial),
or create a lucene document for each of your documents, with embedded
sentence start/stop markers (as a special symbol).  or, instead of a
special symbol, you can increase the token count after each
end-of-sentence so that there is a large gap inbetween sentences -- this
will give higher scores to intra-sentence matches.

if you insert special sentence marker symbols, then you could use a span
search to guarantee that a phrase happens inside a sentence.  when a
match occurs, you can use the document's termpositionvector object to
re-create the original sentence, or alternatively, use the embedded
sentence number in lucene (perhaps symbols like "__sentence_start" and
"__sentence_num_20") to grab the original sentence from a file
containing the full text with sentence markers (perhaps xml tags:
"<sentence num=20>").

I use the techniques such as the above for a very large lucene index of
documents with embedded sentence markers.  There are various trade-offs
in terms of index size (how much info to keep in index), expected query
performance, and so on.

---marc hadfield



AJ Chen wrote:

I'll appreciate any advice on whether Lucene is appropriate for

index/search

sentences.  I have millions of documents broken down into millions of
sentences. Each sentence does not exist as a document.  All these

sentences

are in a small number of big files. How can I use Lucene to index/search

the

sentences? Search will return which sentence matches the query.  If

Lucene

does not do it, any better approach besides using mysql database?

Thanks,
AJ

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: use Lucene to index sentences

Reply via email to