RE: Frequency Term of Composite words

Rao, Vaijanath Thu, 17 Dec 2009 03:19:38 -0800

Hi Antonio,

One simple way would be to generate the n-grams of the text and store them 
as is.

For example: "the quick brown fox jumps over the lazy dog. But the big dog was 
sleeping.So The lazy dog didn't see the fox"
Say you decide your system supports concepts up to a length of 3 words; then 
generate the n-grams of the text up to that length.
The output would be something like this:
the, the quick, the quick brown, and so on.

Then create a keyword analyzer for this field and store all these values as 
part of it. Then you can read the term frequency vector (getTermFreqVector) 
for that field.
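
As a rough sketch of the n-gram step only (plain Java, no Lucene involved): 
the class name NgramCounter and its methods are made up for illustration, and 
the tokenization rule (lowercase, split on anything that is not a letter or 
an apostrophe) is a simplifying assumption, not what any Lucene analyzer does.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Lucene: counts word n-grams in a text.
class NgramCounter {

    // Assumed tokenization: lowercase, then split on any run of characters
    // that are not ASCII letters or apostrophes.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z']+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Count every n-gram made of 1..maxLen consecutive tokens.
    // In this sketch n-grams may cross sentence boundaries.
    static Map<String, Integer> count(String text, int maxLen) {
        List<String> tokens = tokenize(text);
        Map<String, Integer> freq = new LinkedHashMap<>();
        for (int start = 0; start < tokens.size(); start++) {
            StringBuilder gram = new StringBuilder();
            for (int n = 0; n < maxLen && start + n < tokens.size(); n++) {
                if (n > 0) {
                    gram.append(' ');
                }
                gram.append(tokens.get(start + n));
                freq.merge(gram.toString(), 1, Integer::sum);
            }
        }
        return freq;
    }

    public static void main(String[] args) {
        String text = "the quick brown fox jumps over the lazy dog. But the big dog was "
                + "sleeping.So The lazy dog didn't see the fox";
        Map<String, Integer> freq = count(text, 3);
        System.out.println("lazy dog    -> " + freq.get("lazy dog"));
        System.out.println("quick brown -> " + freq.get("quick brown"));
    }
}
```

With the example text this gives 2 for "lazy dog" and 1 for "quick brown". 
In the Lucene setup, each of these n-gram strings would be stored as one 
untokenized value of the field instead of being counted by hand.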

Hope this helps 

--Thanks and Regards
Vaijanath N. Rao

-----Original Message-----
From: Antonio Calò [mailto:[email protected]] 
Sent: Thursday, December 17, 2009 4:25 PM
To: [email protected]
Subject: Re: Frequency Term of Composite words

Hi Ted.

Thank you very much for your feedback.

I can see the term frequency for each single term, but not for pairs or 
longer sequences of terms together.

An example: "the quick brown fox jumps over the lazy dog. But the big dog was 
sleeping.So The lazy dog didn't see the fox"

So, with your suggestion I'm able to find that tf("dog") = 3, tf("fox") = 2, ... 
(each term is just a single word).

But it seems that TermFreqVector cannot answer this: tf("lazy dog") = 2, 
tf("quick brown") = 1.

Unfortunately, I've been asked to retrieve the occurrences of a set of 
concepts in a document, and I was trying to use Lucene because my simple 
mapping algorithm is too slow :(.

I'll try to see if I can do something with TermFreqVector, or with the 
Analyzer. Or I'll look for another way :)

Antonio

2009/12/16 Ted Dunning <[email protected]>

> You need the term frequency vector.
>
> See here
>
> http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/index/IndexReader.html#getTermFreqVector%28int,%20java.lang.String%29
>
> This is compatible in 3.0 as well:
>
> http://lucene.apache.org/java/3_0_0/api/core/org/apache/lucene/index/IndexReader.html#getTermFreqVector%28int,%20java.lang.String%29
>
> Note the package change.
>
>
> On Wed, Dec 16, 2009 at 7:34 AM, Antonio Calò <[email protected]>
> wrote:
>
> > Hi all,
> >
> > I hope that you can help me with this.
> >
> > I'm looking for a fast way to obtain, for a given word, its term
> > frequency (I mean how many times it occurs in a single doc). I've been
> > looking through the mail archive and the LIA (Lucene in Action) book
> > and I found something like this:
> >
> > IndexSearcher index = new IndexSearcher(invertedIndexinRam);
> > Term term = new Term("doc", "quick");
> > int occurrence = index.docFreq(term);
> >
> > OK, occurrence holds the number of documents in the index that contain
> > the word "quick" (in my case the index will contain only one document,
> > for example "the quick brown fox jumps over the lazy dog"). In this
> > case occurrence will be 1. :)
> >
> > But now I need to retrieve the occurrences of a composite word, for
> > example "quick brown fox", but I'm not sure how to do this.
> >
> > Thanks in advance for your help.
> >
> > Best Regards.
> >
> > Antonio
> >
> >
> >
> > --
> > Antonio Calò
> > ------------------------------------------
> > Software Developer Engineer
> > @ Intellisemantic
> > Mail [email protected]
> > Tel. 011-56.90.429
> > ------------------------------------------
> >
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>

--
Antonio Calò
------------------------------------------
Software Developer Engineer
@ Intellisemantic
Mail [email protected]
Tel. 011-56.90.429
------------------------------------------
