Hi Rao, thanks for your suggestion. I'll try this way too!
Regards
Antonio

2009/12/17 Antonio Calò <[email protected]>

> Hi Ted, yes, your assumptions are correct.
>
> If Lucene saves positions and offsets, I should be able to find a way to
> get the occurrences of a multiword term. I'll let you know. I'll write
> some code to understand if this is the optimum way.
>
> Many thanks & regards
>
> Antonio
>
> 2009/12/17 André Warnier <[email protected]>
>
>> Antonio Calò wrote:
>>
>>> Hi Ted.
>>>
>>> Thank you very much for your feedback.
>>>
>>> I can see the term frequency for each term, but not for pairs or
>>> longer sequences of terms together.
>>>
>>> An example: "the quick brown fox jumps over the lazy dog. But the big
>>> dog was sleeping.So The lazy dog didn't see the fox"
>>>
>>> So, with your suggestion I'm able to find that tf("dog") = 3,
>>> tf("fox") = 2, ... (the terms are composed of just one word).
>>>
>>> But it seems that TermFreqVector cannot answer this:
>>> tf("lazy dog") = 2, tf("quick brown") = 1.
>>>
>>> Unfortunately I've been asked to retrieve the occurrences of a set of
>>> concepts in a document, and I was trying to use Lucene because my
>>> simple mapping algorithm is too slow :(.
>>>
>>> I'll try to see if I can do something with TermFreqVector, or with the
>>> Analyzer. Or I'll go and look for another way :)
>>>
>>> Antonio
>>>
>>> 2009/12/16 Ted Dunning <[email protected]>
>>>
>>>> You need the term frequency vector.
>>>>
>>>> See here
>>>>
>>>> http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/index/IndexReader.html#getTermFreqVector%28int,%20java.lang.String%29
>>>>
>>>> This is compatible in 3.0 as well:
>>>>
>>>> http://lucene.apache.org/java/3_0_0/api/core/org/apache/lucene/index/IndexReader.html#getTermFreqVector%28int,%20java.lang.String%29
>>>>
>>>> Note the package change.
>>>>
>>>> On Wed, Dec 16, 2009 at 7:34 AM, Antonio Calò <[email protected]> wrote:
>>>>
>>>>> Hi All
>>>>>
>>>>> I hope that you can help me on this.
>>>>>
>>>>> I'm looking for a fast way to obtain, for a given word, its term
>>>>> frequency (I mean how many times it occurs in a single doc). I've
>>>>> been looking into the mail archive and the LIA (Lucene In Action)
>>>>> book and I found something like this:
>>>>>
>>>>> IndexSearcher index = new IndexSearcher(invertedIndexinRam);
>>>>> Term term = new Term("doc", "quick");
>>>>> int occurrence = index.docFreq(term);
>>>>>
>>>>> OK, occurrence contains the occurrences of the word "quick" in the
>>>>> index (in my case the index will contain only one document, for
>>>>> example "the quick brown fox jumps over the lazy dog"). In this case
>>>>> the occurrence will be 1. :)
>>>>>
>>>>> But now I need to retrieve the occurrences of a composite word, for
>>>>> example "quick brown fox", and I'm having trouble working out how to
>>>>> do this.
>>
>> I haven't even really started to use Lucene yet, but I follow this list.
>> So just an unqualified idea:
>> - assuming each word is indexed, along with its position in each item
>> - assuming that you kept all the words, and did not strip out "stop words"
>> - assuming that you have the list of items which contain all of the words
>>   composing your multi-word term
>> - then you should be able to determine which items contain
>>     word 1 of your term in position n
>>     word 2 of your term in position n+1
>>     etc.

--
Antonio Calò
------------------------------------------
Software Developer Engineer
@ Intellisemantic
Mail [email protected]
Tel. 011-56.90.429
------------------------------------------

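For what it's worth, the position-based idea quoted above (word 1 of the term at position n, word 2 at position n+1, and so on) can be tried out in plain Java before wiring it into Lucene. What follows is only a minimal sketch under assumptions: it does not use any Lucene API, the class name PhraseCounter and its two methods are made up for illustration, the tokenizer is a naive lowercase split that keeps every word (no stop-word removal), and sentence boundaries are ignored, so a phrase spanning a full stop would still be counted.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Lucene: counts multiword terms
// from a simple in-memory map of word -> positions.
class PhraseCounter {

    // Build word -> list of 0-based token positions, keeping every word.
    static Map<String, List<Integer>> positions(String text) {
        Map<String, List<Integer>> index = new HashMap<>();
        int pos = 0;
        // Naive tokenizer (assumption): lowercase, split on anything
        // that is not a letter, a digit or an apostrophe.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9']+")) {
            if (token.isEmpty()) {
                continue; // leading delimiter produces an empty token
            }
            index.computeIfAbsent(token, k -> new ArrayList<>()).add(pos++);
        }
        return index;
    }

    // Count the positions n where word 1 is at n, word 2 at n+1, etc.
    static int phraseFrequency(Map<String, List<Integer>> index, String phrase) {
        String[] words = phrase.toLowerCase(Locale.ROOT).trim().split("\\s+");
        List<Integer> starts = index.get(words[0]);
        if (starts == null) {
            return 0; // first word never occurs
        }
        int count = 0;
        for (int n : starts) {
            boolean match = true;
            for (int i = 1; i < words.length && match; i++) {
                List<Integer> p = index.get(words[i]);
                match = p != null && p.contains(n + i);
            }
            if (match) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String doc = "the quick brown fox jumps over the lazy dog. But the big dog "
                + "was sleeping.So The lazy dog didn't see the fox";
        Map<String, List<Integer>> index = positions(doc);
        System.out.println(phraseFrequency(index, "lazy dog"));    // prints 2
        System.out.println(phraseFrequency(index, "quick brown")); // prints 1
    }
}
```

A single-word "phrase" falls out of the same loop, since it just counts the positions of that word. The lookup with contains() is linear in the position list, which is fine for one short document; for larger inputs a sorted list with binary search or a set per word would be the obvious change.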