Re: Advice regarding fuzzy phrase searching

Jose Luna Wed, 12 Dec 2007 08:15:56 -0800

Mark, Russ, thanks for the replies.

Mark, this looks great; I think it's exactly what I was looking for. I think this should definitely be added to Lucene when it is stable enough. I suspect there are others who would find it useful.


JLuna

Mark Miller wrote:

Take a look at: https://issues.apache.org/jira/browse/LUCENE-794
This is an extension to the Highlighter that highlights span and proximity queries. If you rewrite the query, it will also handle fuzzy queries. I am sure you can easily steal some of the code to do what you want.
Keep in mind that, because of how Lucene's SpanQuery works, if you ask it to find 'mark within 4 of ball', Lucene will not find all occurrences. For example, given 'mark close to ball ball', even if you ask for mark within 20 of ball, a span query will only find the first occurrence of ball, even though both occurrences are within 20. If ball were on both sides of mark, both would match, but after finding the first ball within 20 of mark, the span query doesn't continue looking for another.
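If every occurrence matters, one workaround is to enumerate the pairs yourself from the term positions. The following is only a minimal sketch in plain Python, not Lucene code and not how SpanQuery is implemented: the helper names `positions_by_term` and `pairs_within` are hypothetical, and "distance" is assumed to mean the difference between token offsets, which may not coincide with Lucene's definition of slop.

```python
from collections import defaultdict


def positions_by_term(tokens):
    """Map each term to the list of token offsets where it occurs."""
    index = defaultdict(list)
    for offset, term in enumerate(tokens):
        index[term].append(offset)
    return index


def pairs_within(index, first, second, max_distance):
    """Return every (first_pos, second_pos) pair at most max_distance tokens apart."""
    return [
        (a, b)
        for a in index.get(first, [])
        for b in index.get(second, [])
        if abs(a - b) <= max_distance
    ]


# The example from the message: both occurrences of "ball" are reported.
tokens = "mark close to ball ball".split()
index = positions_by_term(tokens)
print(pairs_within(index, "mark", "ball", 20))  # [(0, 3), (0, 4)]
```

The nested loop is quadratic in the number of occurrences, which is usually fine for a single document but could be replaced by a merge over the two sorted position lists if the terms are very frequent.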
- Mark

Jose Luna wrote:
Hello,
I am looking for some advice regarding which tools I might use to solve my problem. I apologize ahead of time for the long explanation.
Problem Description: I would like to index a set of very large HTML documents. I would then like to run two different kinds of queries: proximity queries, and fuzzy phrase queries. I would like to get the exact positions of the matching results from the query (I need to modify the original documents at these positions). I will only need to search one document at a time, i.e., I already know which document I'll be looking in, so what's important is finding the positions of the hits within that document.
For example, for a fuzzy search, I may want to search for "arterial oxygen saturation". I would want this to match "arterial oxygen saturate", and I would want to get the position of where it matches. I would also like to do proximity searches, with these broken into separate terms. So, I may be searching for "arterial", "oxygen", and "saturate" all within 10 terms of each other, and get the positions of the cases that match.
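To make the fuzzy phrase requirement concrete, here is a minimal sketch in plain Python that scans one tokenized document and reports the start offset of each fuzzy match. It is an illustration under stated assumptions, not Lucene's FuzzyQuery: "fuzzy" is assumed to mean a per-term Levenshtein edit distance, and the function names and the `max_edits` threshold are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


def fuzzy_phrase_positions(tokens, phrase, max_edits):
    """Return start offsets where every phrase term is within max_edits
    of the document token at the same relative offset."""
    n = len(phrase)
    return [
        start
        for start in range(len(tokens) - n + 1)
        if all(edit_distance(tokens[start + k], phrase[k]) <= max_edits
               for k in range(n))
    ]


doc = "the arterial oxygen saturate reading fell while oxygen stayed low".split()
query = ["arterial", "oxygen", "saturation"]
print(fuzzy_phrase_positions(doc, query, max_edits=3))  # [1]
```

"saturate" and "saturation" are three edits apart, so the match disappears with `max_edits=2`. A stemmer applied at index and query time would be another way to make those two forms match; this sketch does not attempt that.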
To the best of my understanding, Lucene is not a good choice for solving this problem (please correct me if I'm wrong). As far as I can tell, Lucene breaks up a document into a set of terms, and indexes these in some sort of structure. My guess is a B+ tree, but I'm curious to learn more about it -- I couldn't find much in the documentation about the underlying index structure. Anyway, this means that the key->pointer pairs in the index are basically term->documentID pairs. So this isn't very suitable for my problem. I already know which document I want to search; I'm interested in the position of hits. If I were to search for the phrase "arterial oxygen saturation", this would be broken into terms and I could iterate through all of the TermPositions for a given term in the document, and try to find out where these terms are adjacent in the document. Considering that my document is very large, the phrases can be 10+ terms, and I need to do this hundreds of times, this doesn't sound like a very good solution. If we introduce the idea of fuzzy matches and proximity searches, it seems like this task of iterating through TermPositions becomes very complicated.
I've spent time reading the docs, creating a test program, and reading the mailing list. As far as I can tell, Lucene is geared towards document-based queries, and isn't the ideal tool for my problem. I think an index based on a suffix tree (or a variation of one) would better meet my needs, but I'm not sure how well these perform with fuzzy and proximity searches. I've looked around, and I can't seem to find a good open-source indexing framework like Lucene that's based on a suffix tree. Are there any suggestions for tools that would help with this problem? Does anyone have any suggestions on how I might bend Lucene to meet my needs?
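For reference, the idea of walking term positions to find where the phrase terms are adjacent can be sketched in a few lines. This is a plain-Python stand-in under stated assumptions, not Lucene's TermPositions API: the index is just a dictionary from term to a set of token offsets for one document, and `build_positions` and `phrase_positions` are hypothetical names.

```python
from collections import defaultdict


def build_positions(tokens):
    """Positional index for a single document: term -> set of token offsets."""
    index = defaultdict(set)
    for offset, term in enumerate(tokens):
        index[term].add(offset)
    return index


def phrase_positions(index, phrase):
    """Start offsets where the phrase terms occupy consecutive positions."""
    if not phrase:
        return []
    # Only occurrences of the first term are candidates; each one is
    # confirmed with one set lookup per remaining phrase term.
    candidates = index.get(phrase[0], set())
    return sorted(
        start
        for start in candidates
        if all(start + k in index.get(term, set())
               for k, term in enumerate(phrase))
    )


doc = "arterial oxygen saturation fell but arterial oxygen saturation recovered".split()
index = build_positions(doc)
print(phrase_positions(index, ["arterial", "oxygen", "saturation"]))  # [0, 5]
```

In this sketch the work per query grows with the number of occurrences of the first phrase term times the phrase length, not with the size of the document, so a 10+ term phrase run hundreds of times may be less costly than it first sounds. Whether that holds for a real index depends on how positions are stored and read.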
Thanks in advance,

JLuna
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

