Antonio Calò wrote:
Hi Ted.
Thank you very much for your feedback.
I can see the term frequency for each single term, but not for pairs or
longer sequences of terms together.
An example: "the quick brown fox jumps over the lazy dog. But the big dog
was sleeping. So the lazy dog didn't see the fox"
So, with your suggestion I'm able to find that tf("dog") = 3,
tf("fox") = 2, ... (each term is just a single word).
But it seems that TermFreqVector cannot answer this:
tf("lazy dog") = 2, tf("quick brown") = 1.
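For reference, the counts being asked for can be pinned down with a few lines of plain Java. This is only a baseline sketch, not Lucene code: the class PhraseFrequency, its methods, and the tokenization rule (lowercase, split on anything that is not a letter, digit, or apostrophe) are assumptions made for illustration, and a real Analyzer may tokenize differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: counts phrase occurrences by scanning tokens.
final class PhraseFrequency {

    // Lowercase the text and split on any run of characters outside a-z, 0-9, apostrophe.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9']+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Number of positions in the text where the words of the phrase appear consecutively.
    static int tf(String text, String phrase) {
        List<String> doc = tokenize(text);
        List<String> words = tokenize(phrase);
        if (words.isEmpty()) {
            return 0;
        }
        int count = 0;
        for (int i = 0; i + words.size() <= doc.size(); i++) {
            if (doc.subList(i, i + words.size()).equals(words)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "the quick brown fox jumps over the lazy dog. But the big dog "
                + "was sleeping. So the lazy dog didn't see the fox";
        System.out.println(tf(text, "lazy dog"));    // 2
        System.out.println(tf(text, "quick brown")); // 1
        System.out.println(tf(text, "dog"));         // 3
        System.out.println(tf(text, "fox"));         // 2
    }
}
```

A scan like this is linear in the document length for every phrase, so it is useful mainly as a correctness check for whatever faster approach is chosen.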
Unfortunately, I've been asked to retrieve the occurrences of a set of
concepts in a document, and I was trying to use Lucene because my simple
mapping algorithm is too slow :(.
I'll try to see if I can do something with TermFreqVector or with the
Analyzer. Otherwise I'll look for another way :)
Antonio
2009/12/16 Ted Dunning wrote:
You need the term frequency vector.
See here
http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/index/IndexReader.html#getTermFreqVector%28int,%20java.lang.String%29
This is compatible in 3.0 as well:
http://lucene.apache.org/java/3_0_0/api/core/org/apache/lucene/index/IndexReader.html#getTermFreqVector%28int,%20java.lang.String%29
Note the package change.
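To make the shape of a term frequency vector concrete, here is a plain-Java model of one, offered as a sketch rather than as Lucene code. In the Lucene versions linked above, TermFreqVector exposes getTerms(), getTermFrequencies() and indexOf(String), and the vector is only available if term vectors were stored for the field at index time. The class TermVectorSketch and its tokenization below are hypothetical stand-ins that mimic that layout: a sorted array of distinct terms plus a parallel array of counts.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of a per-document term frequency vector (not the Lucene class).
final class TermVectorSketch {
    final String[] terms; // distinct terms in sorted order
    final int[] freqs;    // freqs[i] is the number of occurrences of terms[i]

    TermVectorSketch(String text) {
        // TreeMap keeps the terms sorted, which is what binarySearch needs later.
        Map<String, Integer> counts = new TreeMap<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9']+")) {
            if (!t.isEmpty()) {
                counts.merge(t, 1, Integer::sum);
            }
        }
        terms = counts.keySet().toArray(new String[0]);
        freqs = new int[terms.length];
        int i = 0;
        for (int c : counts.values()) {
            freqs[i++] = c;
        }
    }

    // Frequency of a single term in the document, or 0 if the term is absent.
    int tf(String term) {
        int idx = Arrays.binarySearch(terms, term);
        return idx >= 0 ? freqs[idx] : 0;
    }
}
```

Because the vector is keyed by individual terms, a lookup such as tf("lazy dog") finds nothing in this model: the two words are stored as separate entries and their adjacency is not recorded.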
On Wed, Dec 16, 2009 at 7:34 AM, Antonio Calò wrote:
Hi all,
I hope that you can help me with this.
I'm looking for a fast way to obtain, for a given word, its term
frequency (I mean how many times it occurs in a single doc). I've been
looking through the mail archive and the LIA (Lucene in Action) book and
I found something like this:
IndexSearcher index = new IndexSearcher(invertedIndexinRam);
Term term = new Term("doc", "quick");
int occurrence = index.docFreq(term);
OK, occurrence contains the number of occurrences of the word "quick" in
the index (in my case the index will contain only one document, for
example "the quick brown fox jumps over the lazy dog"). In this case the
occurrence count will be 1.
:)
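A hedged aside on the snippet above: docFreq reports how many documents contain the term, not how often the term occurs inside one document, so with a one-document index it is 1 for any term that is present. The plain-Java sketch below illustrates the difference. The class DocFreqVsTermFreq and its methods are hypothetical and make no Lucene calls.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration of document frequency versus within-document term frequency.
final class DocFreqVsTermFreq {

    // How many times the term occurs in one document.
    static int termFreq(String doc, String term) {
        int count = 0;
        for (String t : doc.toLowerCase(Locale.ROOT).split("[^a-z0-9']+")) {
            if (t.equals(term)) {
                count++;
            }
        }
        return count;
    }

    // How many documents contain the term at least once.
    static int docFreq(List<String> docs, String term) {
        int count = 0;
        for (String doc : docs) {
            if (termFreq(doc, term) > 0) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> index = Arrays.asList("the quick brown fox jumps over the lazy dog");
        System.out.println(docFreq(index, "the"));         // 1: one document contains "the"
        System.out.println(termFreq(index.get(0), "the")); // 2: "the" occurs twice in it
    }
}
```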
But now I need to retrieve the number of occurrences of a multi-word term,
for example "quick brown fox", and I'm having trouble working out how to
do this.
I haven't even really started to use Lucene yet, but I follow this list.
So just an unqualified idea :
- assuming each word is indexed, along with its position in each item
- assuming that you kept all the words, and did not strip out "stop words"
- assuming that you have the list of items which contain all of the
words composing your multi-word term
- then you should be able to determine which items contain
    word 1 of your term in position n
    word 2 of your term in position n+1
    etc.
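The idea in the list above can be sketched in plain Java as a small positional index: record the positions of every word, then count the start positions n where word k of the term sits at position n+k. This is an illustration under the stated assumptions (every word indexed with its position, no stop words removed), not Lucene code. The class PositionalPhraseCount, its methods, and the tokenization are hypothetical. In Lucene itself, position information of this kind is what PhraseQuery and TermPositionVector rely on.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical positional index over a single document.
final class PositionalPhraseCount {

    // Map each word to the ascending list of positions at which it occurs.
    static Map<String, List<Integer>> index(String text) {
        Map<String, List<Integer>> positions = new HashMap<>();
        int pos = 0;
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9']+")) {
            if (t.isEmpty()) {
                continue;
            }
            positions.computeIfAbsent(t, k -> new ArrayList<>()).add(pos++);
        }
        return positions;
    }

    // Count start positions n where words[k] occurs at position n + k for every k.
    // The words are expected in lowercase, matching the index.
    static int phraseFreq(Map<String, List<Integer>> positions, String... words) {
        if (words.length == 0) {
            return 0;
        }
        List<Integer> starts = positions.getOrDefault(words[0], Collections.emptyList());
        int count = 0;
        for (int n : starts) {
            boolean match = true;
            for (int k = 1; k < words.length && match; k++) {
                List<Integer> p = positions.getOrDefault(words[k], Collections.emptyList());
                // Position lists are sorted, so a binary search is enough.
                match = Collections.binarySearch(p, n + k) >= 0;
            }
            if (match) {
                count++;
            }
        }
        return count;
    }
}
```

Building the index once and then looking up many multi-word terms avoids rescanning the whole document for each term, which is the usual motivation for keeping positions.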