I am a student studying the functionality of Lucene for my project work.
If I need to add a new user-generated document to Lucene in which a term has
a particular frequency, just as it would in any text file, how do I do it?
For example, say I have to add the following documents, analyzed from an image:
doc1 =
{ contents field:
{"red (X15 times) blue(X10 times)"} ,
name field:
{"doc1"}
}
doc2 =
{ contents field:
{"red (X10 times) blue(X18 times)"} ,
name field:
{"doc2"}
}
After indexing, the term frequency for "red" should be 15 in doc1 and 10
in doc2.
doc1 and doc2 could be indexed along with the normal text files if only
the frequencies could be set manually. I need the frequencies to be
indexed as well
(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS).
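One workaround I can think of, sketched here only as an assumption and not as a confirmed Lucene technique, is to expand each count into repeated tokens before indexing, so that a whitespace-style analyzer would see "red" 15 times. The class and method names below (TermCountExpander, expand) are hypothetical helpers of my own and not part of Lucene:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not a Lucene class): turns term counts into
// a plain string in which each term is repeated "count" times.
class TermCountExpander {

    static String expand(Map<String, Integer> counts) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            for (int i = 0; i < e.getValue(); i++) {
                if (sb.length() > 0) {
                    sb.append(' '); // single space between tokens
                }
                sb.append(e.getKey());
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // doc1 from the example above: red x15, blue x10
        Map<String, Integer> doc1 = new LinkedHashMap<>();
        doc1.put("red", 15);
        doc1.put("blue", 10);
        // The resulting string would be the value of the contents field.
        System.out.println(expand(doc1));
    }
}
```

The expanded string could then be indexed like the contents of any text file, but this looks wasteful for large counts, which is why I am asking whether the frequency can be set directly.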
The DocDelta example given at
http://lucene.apache.org/core/4_2_0/core/org/apache/lucene/codecs/lucene40/Lucene40PostingsFormat.html?is-external=true
says:
FreqFile (.frq) --> Header, <TermFreqs, SkipData>^TermCount
Header --> CodecHeader
TermFreqs --> <TermFreq>^DocFreq
TermFreq --> DocDelta[, Freq?]
SkipData --> <<SkipLevelLength, SkipLevel>^(NumSkipLevels-1), SkipLevel> <SkipDatum>
SkipLevel --> <SkipDatum>^(DocFreq/(SkipInterval^(Level + 1)))
SkipDatum --> DocSkip,PayloadLength?,OffsetLength?,FreqSkip,ProxSkip,SkipChildLevelPointer?
DocDelta,Freq,DocSkip,PayloadLength,OffsetLength,FreqSkip,ProxSkip --> VInt
SkipChildLevelPointer --> VLong
"For example, the TermFreqs for a term which occurs once in document seven
and three times in document eleven, with frequencies indexed, would be the
following sequence of VInts:
15, 8, 3
If frequencies were omitted (FieldInfo.IndexOptions.DOCS_ONLY) it would be
this sequence of VInts instead:
7,4"
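To check my reading of that example, here is a minimal sketch of the rule as I understand it from the same page: with frequencies indexed, DocDelta holds twice the gap to the previous document number, plus one when the frequency is exactly 1; otherwise the frequency follows as a separate VInt. The class name TermFreqsSketch and its encode method are my own, not Lucene API, and the sketch produces plain integers rather than real VInt bytes:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration (not Lucene code) of the TermFreqs
// sequence for one term, as plain integers instead of VInt bytes.
class TermFreqsSketch {

    // docIds must be ascending; freqs[i] is the term frequency in docIds[i].
    static List<Integer> encode(int[] docIds, int[] freqs, boolean withFreqs) {
        List<Integer> out = new ArrayList<>();
        int prev = 0; // the first delta is taken relative to zero
        for (int i = 0; i < docIds.length; i++) {
            int delta = docIds[i] - prev;
            prev = docIds[i];
            if (!withFreqs) {
                out.add(delta);            // DOCS_ONLY: just the gap
            } else if (freqs[i] == 1) {
                out.add((delta << 1) | 1); // odd value: frequency is 1
            } else {
                out.add(delta << 1);       // even value: frequency follows
                out.add(freqs[i]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] docs = {7, 11};
        int[] freqs = {1, 3};
        System.out.println(encode(docs, freqs, true));  // prints [15, 8, 3]
        System.out.println(encode(docs, freqs, false)); // prints [7, 4]
    }
}
```

Under the same reading, and assuming purely for illustration that doc1 and doc2 received document numbers 0 and 1, the term "red" would come out as 0, 15, 2, 10. I do not know whether that is correct, or how to get Lucene to write such values for my documents.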
So what should the DocDelta values be for doc1 and doc2, and how are they
derived? Please point me to any other useful links.
Thanks.