Hi Hoss,

Thanks for the answer, and yes, you have described the problem perfectly. I think you are right that Lucene is in fact not the best way of solving it. I decided to simply build a letter trie of all concepts and then search the query document against that trie. On the one hand this yields exact matches only (and that's exactly what I need), and on the other it even yields matches for concepts that appear in plural form in the query document, so "von Willebrands" will still yield "von Willebrand".
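In outline, the lookup works something like the sketch below (much simplified, not the actual code: everything is lowercased, and word boundaries are ignored, which a real implementation would have to handle):

```java
import java.util.*;

// Minimal character trie: each node maps a character to a child node and
// optionally marks the end of a stored concept phrase.
class TrieNode {
    Map<Character, TrieNode> children = new HashMap<Character, TrieNode>();
    String concept; // non-null if a concept ends at this node
}

public class ConceptTrie {
    private final TrieNode root = new TrieNode();

    // Insert a concept phrase character by character.
    public void add(String concept) {
        TrieNode node = root;
        for (char c : concept.toLowerCase().toCharArray()) {
            TrieNode child = node.children.get(c);
            if (child == null) {
                child = new TrieNode();
                node.children.put(c, child);
            }
            node = child;
        }
        node.concept = concept;
    }

    // Scan the document: at every position, follow the trie as far as the
    // text allows and report the longest concept ending there. Because the
    // stored phrase only has to be a prefix of the text at that position,
    // "von Willebrands" in the document still matches the stored
    // "von Willebrand".
    public List<String> findConcepts(String document) {
        List<String> matches = new ArrayList<String>();
        String text = document.toLowerCase();
        for (int start = 0; start < text.length(); start++) {
            TrieNode node = root;
            String longest = null;
            for (int i = start; i < text.length(); i++) {
                node = node.children.get(text.charAt(i));
                if (node == null) break;
                if (node.concept != null) longest = node.concept;
            }
            if (longest != null) matches.add(longest);
        }
        return matches;
    }

    public static void main(String[] args) {
        ConceptTrie trie = new ConceptTrie();
        trie.add("von Willebrand");
        System.out.println(trie.findConcepts("Patient shows signs of von Willebrands disease."));
        // prints [von Willebrand]
    }
}
```

The plural behaviour falls out of the prefix walk for free; the price is that very short concepts could also match inside longer words, so checking word boundaries at the start and end of a match is worth adding.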
Thanks for your efforts,
Sven

--- Original Message ---
Date: 15.01.2006 22:14
From: java-user@lucene.apache.org
To: java-user@lucene.apache.org
Subject: Re: AW: Part-Of Match

> : >>von Willebrand<< is not the query but a document in the index.... The task
> : is to detect exact matches of phrases inside a query (large document) with
> : these phrases stored in the index.
>
> Lemme see if I can restate your problem...
>
> You want to build a data repository in which you insert a large magnitude
> of "concepts", where a concept is a short phrase consisting of a few words
> (possibly just one word). The words in any given concept phrase may
> overlap with (or be a superset of) the words in other concepts.
>
> Once this concept repository is built, you want to build a black box
> around it, such that people can hand your black box a "document"
> (i.e. a research paper, a newspaper article, a short story, ...
> some text consisting of many many sentences) and you want your black box
> to then return the list of concepts that match the input document, such
> that the concepts with the highest score are concepts whose phrase appears
> exactly in the input document. Concepts whose phrase doesn't appear
> exactly in the document should still be returned, but with a lower score
> based on how many words in the concept's phrase are found in the input
> document.
>
> (Have I adequately described your problem?)
>
> It's an interesting idea. Can it be done with Lucene? ... I can think of
> one kludgy mechanism for doing it, but I'd be very surprised if there isn't
> a better way (or if there is some other software library out there that
> would be more suited).
>
> Build a permanent index in which each concept is a Lucene Document.
> These documents really only need one stored/tokenized/indexed field
> containing the phrase (if you want other payload fields, that's up to you).
>
> Each time you are asked to analyze a text sample and return matching
> phrases, run the text through your analyzer to get back a token stream, and
> for each of those tokens, use a TermDocs iterator to find out if any
> phrase in your concept index contains that term, and if so which ones.
> (You could also do this by building a boolean OR query out of all the
> words in your input document -- but that may run into performance
> limitations if your input docs are too big, and it will try to score each
> concept, which isn't necessary, so even for short input text it's less
> efficient.)
>
> Now you have an (unordered) list of concepts that have something to do
> with your input text.
>
> Next, build a RAMDirectory-based index consisting of exactly one document
> which you build from the input text. Loop over that list of concepts you
> got, and build a boolean query out of each one along the lines that
> Daniel described: a phrase query on the whole concept phrase along with
> term queries for each individual word -- all optional. Run each of these
> boolean queries against your one-document RAMDirectory. The higher the
> score, the better that concept applies to your input text.
>
> -Hoss
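For reference, the two-index scheme Hoss describes might look roughly like the following, written against the Lucene API of that era (1.9/2.x style). The class name ConceptMatcher and the field names "phrase" and "body" are illustrative, not from the thread, and the concept-phrase tokenization is simplified to whitespace splitting:

```java
import java.io.StringReader;
import java.util.*;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.*;
import org.apache.lucene.store.RAMDirectory;

public class ConceptMatcher {

    private final Analyzer analyzer = new StandardAnalyzer();

    // Permanent concept index: one Lucene Document per concept, with a
    // single stored/tokenized/indexed "phrase" field.
    void addConcept(IndexWriter conceptWriter, String phrase) throws Exception {
        Document doc = new Document();
        doc.add(new Field("phrase", phrase, Field.Store.YES, Field.Index.TOKENIZED));
        conceptWriter.addDocument(doc);
    }

    // Step 1: collect the ids of all concepts that share at least one term
    // with the input text, using a TermDocs iterator per input token.
    Set<Integer> candidateConcepts(IndexReader conceptReader, String text) throws Exception {
        Set<Integer> candidates = new HashSet<Integer>();
        TokenStream ts = analyzer.tokenStream("phrase", new StringReader(text));
        for (Token tok = ts.next(); tok != null; tok = ts.next()) {
            TermDocs td = conceptReader.termDocs(new Term("phrase", tok.termText()));
            while (td.next()) {
                candidates.add(Integer.valueOf(td.doc()));
            }
            td.close();
        }
        return candidates;
    }

    // Step 2: index the input text as a single document in a RAMDirectory ...
    RAMDirectory indexInputText(String text) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, analyzer, true);
        Document doc = new Document();
        doc.add(new Field("body", text, Field.Store.NO, Field.Index.TOKENIZED));
        writer.addDocument(doc);
        writer.close();
        return dir;
    }

    // Step 3: ... and score each candidate concept against it with a boolean
    // query: a phrase query on the whole concept plus a term query per word,
    // all optional, as described above.
    float scoreConcept(IndexSearcher searcher, String conceptPhrase) throws Exception {
        BooleanQuery bq = new BooleanQuery();
        PhraseQuery pq = new PhraseQuery();
        for (StringTokenizer st = new StringTokenizer(conceptPhrase.toLowerCase()); st.hasMoreTokens(); ) {
            String word = st.nextToken();
            pq.add(new Term("body", word));
            bq.add(new TermQuery(new Term("body", word)), BooleanClause.Occur.SHOULD);
        }
        bq.add(pq, BooleanClause.Occur.SHOULD);
        Hits hits = searcher.search(bq);
        return hits.length() > 0 ? hits.score(0) : 0.0f;
    }
}
```

The TermDocs prefilter in step 1 is what keeps this workable for large documents: only concepts that share at least one term with the input ever get a scoring query run against the one-document RAMDirectory.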