It does. Look at TermPositionVector. However, it is usually much more efficient to count word sequences at index time.
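
For what it's worth, the position-based counting discussed in this thread can be sketched in plain Java. This is only an illustration under assumptions: the class `PhraseCounter` and its methods `positionsOf` and `countPhrase` are hypothetical names, the tokenizer is a crude lowercase split rather than a Lucene analyzer, and the in-memory map merely stands in for the per-term positions that a term vector stored with positions would supply.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Lucene: counts multiword terms from token positions.
final class PhraseCounter {

    // Map each token to the set of positions where it occurs (0-based).
    // Assumption: lowercase and split on anything that is not a letter, digit or apostrophe.
    static Map<String, Set<Integer>> positionsOf(String text) {
        Map<String, Set<Integer>> positions = new HashMap<>();
        int pos = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9']+")) {
            if (token.isEmpty()) {
                continue; // split() can yield a leading empty string
            }
            positions.computeIfAbsent(token, k -> new HashSet<>()).add(pos++);
        }
        return positions;
    }

    // Count occurrences of the word sequence: word i must sit at position start + i.
    static int countPhrase(Map<String, Set<Integer>> positions, String... words) {
        if (words.length == 0) {
            return 0;
        }
        Set<Integer> starts = positions.get(words[0]);
        if (starts == null) {
            return 0;
        }
        int count = 0;
        for (int start : starts) {
            boolean match = true;
            for (int i = 1; i < words.length && match; i++) {
                Set<Integer> p = positions.get(words[i]);
                match = p != null && p.contains(start + i);
            }
            if (match) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String doc = "the quick brown fox jumps over the lazy dog. But the big dog "
                + "was sleeping.So The lazy dog didn't see the fox";
        Map<String, Set<Integer>> positions = positionsOf(doc);
        System.out.println("lazy dog: " + countPhrase(positions, "lazy", "dog"));
        System.out.println("quick brown: " + countPhrase(positions, "quick", "brown"));
    }
}
```

On the example sentence from the thread this counts "lazy dog" twice and "quick brown" once. Note that it only works if stop words are kept and positions are contiguous, as assumed further down in the thread.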

On Thu, Dec 17, 2009 at 3:42 AM, Antonio Calò <[email protected]> wrote:

> Hi Ted. Yes, your assumptions are correct.
>
> If Lucene saves position and offset, I should find a way to get the
> occurrences of a multiword term. I'll let you know. I'll write some code
> to understand if this is the optimum way.
>
> Many thanks & regards
>
> Antonio
>
> 2009/12/17 André Warnier <[email protected]>
>
> > Antonio Calò wrote:
> >
> >> Hi Ted.
> >>
> >> Thank you very much for your feedback.
> >>
> >> I can see the term frequency for each term, but not for couples or
> >> more terms together.
> >>
> >> An example: "the quick brown fox jumps over the lazy dog. But the big
> >> dog was sleeping.So The lazy dog didn't see the fox"
> >>
> >> So, with your suggestion I'm able to find that tf("dog") = 3,
> >> tf("fox") = 2, ... (the terms are composed of just one word).
> >>
> >> But it seems that TermFreqVector cannot answer this:
> >> tf("lazy dog") = 2, tf("quick brown") = 1.
> >>
> >> Unfortunately, I've been asked to retrieve the occurrences of a set of
> >> concepts in a document, and I was trying to use Lucene because my
> >> simple mapping algorithm is too slow :(.
> >>
> >> I'll try to see if I can do something with TermFreqVector, or with the
> >> Analyzer. Or I'll go and look for another way :)
> >>
> >> Antonio
> >>
> >> 2009/12/16 Ted Dunning <[email protected]>
> >>
> >>> You need the term frequency vector.
> >>>
> >>> See here
> >>>
> >>> http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/index/IndexReader.html#getTermFreqVector%28int,%20java.lang.String%29
> >>>
> >>> This is compatible in 3.0 as well:
> >>>
> >>> http://lucene.apache.org/java/3_0_0/api/core/org/apache/lucene/index/IndexReader.html#getTermFreqVector%28int,%20java.lang.String%29
> >>>
> >>> Note the package change.
> >>>
> >>> On Wed, Dec 16, 2009 at 7:34 AM, Antonio Calò <[email protected]> wrote:
> >>>
> >>>> Hi all,
> >>>>
> >>>> I hope that you can help me with this.
> >>>>
> >>>> I'm looking for a fast way to obtain, for a given word, its term
> >>>> frequency (I mean how many times it appears in a single doc). I've
> >>>> been looking into the mail archive and the LIA (Lucene In Action)
> >>>> book and I found something like this:
> >>>>
> >>>> IndexSearcher index = new IndexSearcher(invertedIndexinRam);
> >>>> Term term = new Term("doc", "quick");
> >>>> int occurrence = index.docFreq(term);
> >>>>
> >>>> OK, occurrence contains the occurrences of the word "quick" in the
> >>>> index (in my case the index will contain only one document, for
> >>>> example "the quick brown fox jumps over the lazy dog"). In this case
> >>>> the occurrence will be 1. :)
> >>>>
> >>>> But now I need to retrieve the occurrences of a composite word, for
> >>>> example "quick brown fox", and I'm having trouble working out how to
> >>>> do this.
> >
> > I haven't even really started to use Lucene yet, but I follow this list.
> > So just an unqualified idea:
> > - assuming each word is indexed, along with its position in each item
> > - assuming that you kept all the words, and did not strip out "stop words"
> > - assuming that you have the list of items which contain all of the words
> >   composing your multi-word term
> > - then you should be able to determine which items contain
> >     word 1 of your term in position n
> >     word 2 of your term in position n+1
> >     etc.
>
> --
> Antonio Calò
> ------------------------------------------
> Software Developer Engineer
> @ Intellisemantic
> Mail [email protected]
> Tel. 011-56.90.429
> ------------------------------------------

--
Ted Dunning, CTO
DeepDyve
