Hi Antonio,

One simple way would be to generate the n-grams of the text and store them as is.

For example:

"the quick brown fox jumps over the lazy dog. But the big dog was sleeping.So The lazy dog didn't see the fox"

Decide the maximum concept length your system should support, say 3 words, and generate the n-grams of the text up to that length. The output would be something like this:

"the", "the quick", "the quick brown", and so on.

Then use a keyword analyzer for this field, so that each n-gram is kept as a single term, and store all these values in it. Then you can ask for the term frequency vector of that field (getTermFreqVector) and read the frequency of each n-gram from it.

Hope this helps.

--Thanks and Regards
Vaijanath N. Rao

-----Original Message-----
From: Antonio Calò [mailto:[email protected]]
Sent: Thursday, December 17, 2009 4:25 PM
To: [email protected]
Subject: Re: Frequency Term of Composite words

Hi Ted.

Thank you very much for your feedback. I can see the term frequency for each single term, but not for pairs or longer sequences of terms together.

An example:

"the quick brown fox jumps over the lazy dog. But the big dog was sleeping.So The lazy dog didn't see the fox"

So, with your suggestion I'm able to find that tf("dog") = 3, tf("fox") = 2, ... (each term is just one word). But it seems that TermFreqVector cannot answer this: tf("lazy dog") = 2, tf("quick brown") = 1.

Unfortunately, I've been asked to retrieve the occurrences of a set of concepts in a document, and I was trying to use Lucene because my simple mapping algorithm is too slow :(.

I'll try to see if I can do something with TermFreqVector or with the Analyzer. Or I'll look for another way :)

Antonio

2009/12/16 Ted Dunning <[email protected]>

> You need the term frequency vector.
>
> See here
>
> http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/index/IndexReader.html#getTermFreqVector%28int,%20java.lang.String%29
>
> This is compatible in 3.0 as well:
>
> http://lucene.apache.org/java/3_0_0/api/core/org/apache/lucene/index/IndexReader.html#getTermFreqVector%28int,%20java.lang.String%29
>
> Note the package change.
>
>
> On Wed, Dec 16, 2009 at 7:34 AM, Antonio Calò <[email protected]> wrote:
>
> > Hi all,
> >
> > I hope that you can help me with this.
> >
> > I'm looking for a fast way to obtain, for a given word, its term frequency
> > (I mean how many times it occurs in a single doc). I've been looking into
> > the mail archive and the LIA (Lucene In Action) book and I found something
> > like this:
> >
> > IndexSearcher index = new IndexSearcher(invertedIndexinRam);
> > Term term = new Term("doc", "quick");
> > int occurrence = index.docFreq(term);
> >
> > OK, occurrence contains the occurrences of the word "quick" in the index
> > (in my case the index will contain only one document, for example "the
> > quick brown fox jumps over the lazy dog"). In this case the occurrence
> > will be 1. :)
> >
> > But now I need to retrieve the occurrences of a composite word, for
> > example "quick brown fox", but I'm having trouble working out how to do
> > this.
> >
> > Thanks in advance for your help.
> >
> > Best Regards.
> >
> > Antonio
> >
> > --
> > Antonio Calò
> > ------------------------------------------
> > Software Developer Engineer
> > @ Intellisemantic
> > Mail [email protected]
> > Tel. 011-56.90.429
> > ------------------------------------------
>
> --
> Ted Dunning, CTO
> DeepDyve
>

--
Antonio Calò
------------------------------------------
Software Developer Engineer
@ Intellisemantic
Mail [email protected]
Tel. 011-56.90.429
------------------------------------------
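
For reference, the n-gram suggestion at the top of this thread could be sketched roughly as follows. This is plain Java with no Lucene dependency: the class name NGramCounter and its methods are hypothetical, the tokenizer is a deliberately naive lowercase split, and n-grams are allowed to run across sentence boundaries. In a Lucene setup, the strings generated here would be the values stored in the keyword-analyzed field; the map only illustrates the counts one would expect to read back.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: counts word n-grams of length 1..maxLen.
class NGramCounter {

    // Naive tokenizer (an assumption): lowercase, then split on anything
    // that is not a letter, a digit or an apostrophe.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^\\p{L}\\p{N}']+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // For every start position, build the 1-gram, 2-gram, ... up to maxLen
    // words and increment its count. N-grams may span sentence boundaries.
    static Map<String, Integer> countNGrams(String text, int maxLen) {
        List<String> tokens = tokenize(text);
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int start = 0; start < tokens.size(); start++) {
            StringBuilder gram = new StringBuilder();
            for (int len = 1; len <= maxLen && start + len <= tokens.size(); len++) {
                if (len > 1) {
                    gram.append(' ');
                }
                gram.append(tokens.get(start + len - 1));
                counts.merge(gram.toString(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String text = "the quick brown fox jumps over the lazy dog. "
                + "But the big dog was sleeping.So The lazy dog didn't see the fox";
        Map<String, Integer> counts = countNGrams(text, 3);
        System.out.println("lazy dog    -> " + counts.get("lazy dog"));
        System.out.println("quick brown -> " + counts.get("quick brown"));
    }
}
```

Looking up a composite term is then a single map access, for example countNGrams(text, 3).get("lazy dog"). The cost is that the number of stored terms grows roughly linearly with the maximum n-gram length, so the limit of 3 is worth keeping as small as the concepts allow.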
