How is the term frequency calculated if I have to add a user-generated document?

Gaurav Ranjan Thu, 18 Apr 2013 23:12:31 -0700

I am a student studying the functionality of Lucene for my project work.


If I have to add a new user-generated document to Lucene in which a term has
a particular frequency, just as it would in any text file, how do I do it?
For example, say I have to add the following documents, produced by analyzing an image:

doc1 =
{ contents field:
{"red (X15 times) blue(X10 times)"} ,
  name field:
{"doc1"}
}

doc2 =
{ contents field:
{"red (X10 times) blue(X18 times)"} ,
  name field:
{"doc2"}
}

Now, when indexing, I should get a term frequency for "red" of 15 for doc1 and 10
for doc2.
The documents doc1 and doc2 could be indexed along with the normal text files
if only the frequencies could be set manually. I need the frequencies to be
indexed as well
(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS).
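
One simple way to get a chosen term frequency, sketched here under assumptions, is to repeat each term in the field text as many times as it should count. The class name `TermFreqText` and the method `build` are hypothetical and not part of Lucene. The sketch also assumes the resulting string is indexed as an ordinary analyzed text field whose analyzer splits on whitespace and keeps every token.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TermFreqText {
    // Hypothetical helper: builds field text in which each term
    // appears exactly as many times as its requested frequency.
    static String build(Map<String, Integer> counts) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            for (int i = 0; i < e.getValue(); i++) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(e.getKey());
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // doc1 from the example: "red" 15 times, "blue" 10 times.
        Map<String, Integer> doc1 = new LinkedHashMap<>();
        doc1.put("red", 15);
        doc1.put("blue", 10);
        String contents = build(doc1);
        // Number of whitespace-separated tokens in the generated text.
        System.out.println(contents.split(" ").length); // prints 25
    }
}
```

The generated string would then be used as the value of the contents field. With positions indexed, the repeated tokens occupy consecutive positions, which may or may not matter for a given project.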


The DocDelta example provided in the Lucene40PostingsFormat documentation (
http://lucene.apache.org/core/4_2_0/core/org/apache/lucene/codecs/lucene40/Lucene40PostingsFormat.html?is-external=true)
says:

FreqFile (.frq) --> Header, <TermFreqs, SkipData>^TermCount
Header --> CodecHeader
TermFreqs --> <TermFreq>^DocFreq
TermFreq --> DocDelta[, Freq?]
SkipData --> <<SkipLevelLength, SkipLevel>^(NumSkipLevels-1), SkipLevel> <SkipDatum>
SkipLevel --> <SkipDatum>^(DocFreq/(SkipInterval^(Level + 1)))
SkipDatum --> DocSkip,PayloadLength?,OffsetLength?,FreqSkip,ProxSkip,SkipChildLevelPointer?
DocDelta,Freq,DocSkip,PayloadLength,OffsetLength,FreqSkip,ProxSkip --> VInt
SkipChildLevelPointer --> VLong


"For example, the TermFreqs for a term which occurs once in document seven
and three times in document eleven, with frequencies indexed, would be the
following sequence of VInts:

15, 8, 3

If frequencies were omitted (FieldInfo.IndexOptions.DOCS_ONLY) it would be
this sequence of VInts instead:

7,4"
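
The arithmetic behind the quoted sequences can be sketched as follows. As the linked page describes it, when frequencies are indexed, DocDelta carries the gap from the previous document number shifted left by one bit. The low bit is set when the frequency is exactly one. Otherwise the low bit is clear and the frequency follows as a separate VInt. The class `TermFreqsSketch` and its method are hypothetical names for illustration. The document numbers 0 and 1 for doc1 and doc2 are an assumption, since Lucene assigns document numbers itself. These values are written by the codec at index time, so the sketch only reproduces the arithmetic and writes no index file.

```java
import java.util.ArrayList;
import java.util.List;

public class TermFreqsSketch {
    // Reproduces the TermFreqs VInt values for one term when
    // frequencies are indexed. docIds must be in ascending order.
    static List<Integer> encodeWithFreqs(int[] docIds, int[] freqs) {
        List<Integer> out = new ArrayList<>();
        int prev = 0; // the first gap is measured from zero
        for (int i = 0; i < docIds.length; i++) {
            int gap = docIds[i] - prev;
            prev = docIds[i];
            if (freqs[i] == 1) {
                out.add(gap * 2 + 1); // odd DocDelta: frequency is one
            } else {
                out.add(gap * 2);     // even DocDelta: Freq follows
                out.add(freqs[i]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Quoted example: once in document 7, three times in document 11.
        System.out.println(encodeWithFreqs(new int[] {7, 11}, new int[] {1, 3}));
        // prints [15, 8, 3]

        // "red", ASSUMING doc1 is document 0 and doc2 is document 1.
        System.out.println(encodeWithFreqs(new int[] {0, 1}, new int[] {15, 10}));
        // prints [0, 15, 2, 10]
    }
}
```

With DOCS_ONLY the values are just the gaps themselves, which is why the quoted example gives 7 and 4 for documents seven and eleven.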

So what should the DocDelta values be for doc1 and doc2, and how are they
derived? Please point me to any other useful links.

Thanks.
