Mark, Russ, thanks for the replies.
Mark, this looks great; I think it's exactly what I was looking for. I
think this should definitely be added to Lucene once it is stable
enough. I suspect there are others who would find it useful.
JLuna
Mark Miller wrote:
Take a look at: https://issues.apache.org/jira/browse/LUCENE-794
This is an extension to the Highlighter that highlights span and
proximity queries. If you rewrite the query it will also do fuzzy
queries. I am sure you can easily steal some of the code to do what
you want.
Keep in mind, because of how Lucene's SpanQuery works, if you say to
find 'mark within 4 of ball', Lucene will not find all occurrences.
For example, given 'mark close to ball ball', even if you say find mark
within 20 of ball, a Span query will only find the first occurrence of
ball, even though both occurrences are within 20. If ball were on both
sides of mark, both would match, but after finding the first ball
within 20 of mark, Span doesn't continue looking for another.
- Mark
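If every occurrence is needed, one option is to enumerate the pairs
directly from the term positions. The sketch below is plain Python, not
Lucene code; the function names, the whitespace tokenizer, and the
definition of "within N" as a difference in token positions are all
assumptions made for illustration, and they are not the same thing as
Lucene's slop.

```python
from bisect import bisect_left, bisect_right


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; a real analyzer does more)."""
    return text.lower().split()


def positions(tokens, term):
    """Sorted token positions at which `term` occurs."""
    return [i for i, tok in enumerate(tokens) if tok == term]


def all_pairs_within(tokens, a, b, distance):
    """Every (pos_a, pos_b) pair whose positions differ by at most `distance`."""
    pos_a = positions(tokens, a)
    pos_b = positions(tokens, b)
    pairs = []
    for pa in pos_a:
        # Binary search for the slice of b positions inside [pa - distance, pa + distance].
        lo = bisect_left(pos_b, pa - distance)
        hi = bisect_right(pos_b, pa + distance)
        pairs.extend((pa, pb) for pb in pos_b[lo:hi])
    return pairs


tokens = tokenize("mark close to ball ball")
print(all_pairs_within(tokens, "mark", "ball", 20))  # [(0, 3), (0, 4)]
```

Both occurrences of ball are reported here, because each position of
mark is checked against every position of ball inside the window rather
than stopping at the first one.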
Jose Luna wrote:
Hello,
I am looking for some advice regarding which tools I might use to
solve my problem. I apologize ahead of time for the long explanation.
Problem Description: I would like to index a set of very large HTML
documents. I would then like to be able to run two different kinds of
queries: proximity queries, and fuzzy phrase queries. I would like
to get the exact positions of the matching results from the query (I
need to modify the original documents at these positions.) I will
only need to search one document at a time, i.e., I already know
which document I'll be looking in, so what's important is finding the
positions of the hits within that document.
For example, for a fuzzy search, I may want to search for "arterial
oxygen saturation". I would want this to match "arterial oxygen
saturate", and I would want to get the position of where it matches.
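To make the fuzzy phrase requirement concrete, here is a minimal sketch
in plain Python. It is only one possible reading of "fuzzy": each phrase
term must be similar to the document token at the same offset, with
similarity taken as 1 - edit_distance / longer_length. The function
names and the 0.6 threshold are assumptions for illustration, not
anything defined by Lucene.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (two-row dynamic programme)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similar(a, b, min_similarity):
    """True if the normalised edit similarity of a and b reaches the threshold."""
    longest = max(len(a), len(b))
    if longest == 0:
        return True
    return 1 - edit_distance(a, b) / longest >= min_similarity


def fuzzy_phrase_positions(tokens, phrase, min_similarity=0.6):
    """Start positions where every phrase term fuzzily matches the token at the same offset."""
    n = len(phrase)
    if n == 0:
        return []
    return [start for start in range(len(tokens) - n + 1)
            if all(similar(tokens[start + k], phrase[k], min_similarity)
                   for k in range(n))]


doc = "the arterial oxygen saturate value was low".split()
print(fuzzy_phrase_positions(doc, ["arterial", "oxygen", "saturation"]))  # [1]
```

"saturation" and "saturate" are three edits apart over ten characters,
a similarity of 0.7, so the phrase matches at token position 1. This
scans the whole token list for each query, so it shows the matching rule
rather than an efficient index.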
I would also like to do proximity searches, with these broken into
separate terms. So, I may be searching for "arterial", "oxygen", and
"saturate" all within 10 terms of each other, and get the positions
of the cases that match.
To the best of my understanding, Lucene is not a good choice to solve
this problem (please correct me if I'm wrong). As far as I can
tell, Lucene breaks up a document into a set of terms, and indexes
these in some sort of structure. My guess is a B+ tree, but I'm
curious to learn more about it -- I couldn't find much in the
documentation about the underlying index structure. Anyway, this
means that the key->pointer pairs in the index are basically
term->documentID pairs. So this isn't very suitable for my problem: I
already know which document I want to search, and I'm interested in the
positions of hits. If I were to search for the phrase "arterial
oxygen saturation", this would be broken into terms and I could
iterate through all of the TermPositions for a given term in the
document, and try to find out where these terms are adjacent in the
document. Considering that my document is very large, the phrases
can be 10+ terms, and I need to do this hundreds of times, this
doesn't sound like a very good solution. If we introduce the idea of
fuzzy matches and proximity searches, it seems like this task of
iterating through TermPositions becomes very complicated.
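For a sense of what the position-iteration approach costs, here is a
rough sketch in plain Python of a per-document positional index and an
exact phrase lookup against it. It illustrates the idea only and is not
Lucene's implementation; build_positions and phrase_positions are
made-up names.

```python
from collections import defaultdict


def build_positions(tokens):
    """Map each term to the sorted list of token positions where it occurs."""
    index = defaultdict(list)
    for pos, tok in enumerate(tokens):
        index[tok].append(pos)
    return index


def phrase_positions(index, phrase):
    """Start positions at which the terms of `phrase` occur consecutively."""
    if not phrase:
        return []
    # Position sets for the second and later terms, for O(1) membership tests.
    later = [set(index.get(term, ())) for term in phrase[1:]]
    return [p for p in index.get(phrase[0], ())
            if all(p + k in s for k, s in enumerate(later, start=1))]


doc = "low arterial oxygen saturation and arterial oxygen saturation again".split()
index = build_positions(doc)
print(phrase_positions(index, ["arterial", "oxygen", "saturation"]))  # [1, 5]
```

The index is built once per document. Each phrase lookup then does work
proportional to the number of occurrences of the first term times the
phrase length, not to the size of the document, so hundreds of queries
against one large document need not be expensive.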
I've spent time reading the docs, creating a test program, and
reading the mailing list. As far as I can tell, Lucene is geared
towards document based queries, and isn't the ideal tool for my
problem. I think an index based on a suffix tree (or a variation of
one) would better meet my needs, but I'm not sure how well these perform
with fuzzy and proximity searches. I've looked around, and I can't
seem to find a good open-source indexing framework like Lucene that's
based on a suffix tree. Are there any suggestions for tools that
would help with this problem? Does anyone have any suggestions on
how I might bend Lucene to meet my needs?
Thanks in advance,
JLuna
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]