Grant Ingersoll wrote:
On Jun 28, 2007, at 5:29 AM, Samuel LEMOINE wrote:
Thanks for the resources about payloads, I'll have a look at them.
About the positions/offsets in .tvf, please tell me if I have understood
correctly:
The .tvd provides the needed information about the occurrences
of each term in documents, and thanks to this information, Lucene
is able to determine how many documents contain the term "foo".
Not exactly. Term Vectors can only tell you how many times foo
occurs in a particular, known document. If you are looking for
general information on a Term and the documents it occurs in (i.e. the
inverted index), have a look at TermEnum and TermDocs.
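
To make the distinction concrete, here is a minimal sketch of the two views over the same toy data. The class name TwoViewsSketch and the in-memory maps are purely illustrative assumptions; they are not Lucene's API or its on-disk structures.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only: hypothetical names, not Lucene classes or file layouts.
final class TwoViewsSketch {

    // Term-vector view: document number -> (term -> frequency in that document).
    static Map<Integer, Map<String, Integer>> termVectors(Map<Integer, String[]> docs) {
        Map<Integer, Map<String, Integer>> vectors = new TreeMap<>();
        for (Map.Entry<Integer, String[]> doc : docs.entrySet()) {
            Map<String, Integer> freqs = new TreeMap<>();
            for (String term : doc.getValue()) {
                freqs.merge(term, 1, Integer::sum);
            }
            vectors.put(doc.getKey(), freqs);
        }
        return vectors;
    }

    // Inverted-index view: term -> numbers of the documents that contain it.
    static Map<String, List<Integer>> invertedIndex(Map<Integer, String[]> docs) {
        Map<String, List<Integer>> postings = new TreeMap<>();
        for (Map.Entry<Integer, String[]> doc : docs.entrySet()) {
            int docNum = doc.getKey();
            for (String term : doc.getValue()) {
                List<Integer> list = postings.computeIfAbsent(term, t -> new ArrayList<>());
                if (!list.contains(docNum)) {
                    list.add(docNum);
                }
            }
        }
        return postings;
    }

    public static void main(String[] args) {
        Map<Integer, String[]> docs = new TreeMap<>();
        docs.put(0, new String[] {"foo", "bar", "foo"});
        docs.put(2, new String[] {"foo"});
        // Answers "how often does each term occur in doc 0?"
        System.out.println(termVectors(docs).get(0));      // {bar=1, foo=2}
        // Answers "which documents contain foo?"
        System.out.println(invertedIndex(docs).get("foo")); // [0, 2]
    }
}
```

The first lookup starts from a known document, which is the term-vector use case; the second starts from a term, which is what the inverted index is for.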
So can the position/offset data contained in .tvf simply consist of a
concatenated list of positions in the different documents containing "foo"?
I mean, if foo appears at positions 1, 30, 65 in doc 0
and at positions 27 and 52 in doc 2, will the .tvf give "1 30 65 27 52"
(or rather "1 29 35 27 25", since these are delta-positions),
with Lucene relying on .tvd to determine which positions belong to
which document?
No, using Term Vectors you could only find out about doc 0 or doc 2
separately.
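
As a rough illustration of the delta-positions in this example, the sketch below encodes each document's position list on its own, with the running total starting again from zero for every document. The class and method names are hypothetical, and plain int arrays stand in for whatever is actually stored on disk.

```java
import java.util.Arrays;

// Illustrative only: models delta-encoded positions for one document at a time.
final class DeltaPositionsSketch {

    // Each position is stored as its difference from the previous position
    // (the first one is relative to 0).
    static int[] toDeltas(int[] positions) {
        int[] deltas = new int[positions.length];
        int previous = 0;
        for (int i = 0; i < positions.length; i++) {
            deltas[i] = positions[i] - previous;
            previous = positions[i];
        }
        return deltas;
    }

    // Rebuilds absolute positions by accumulating the deltas.
    static int[] fromDeltas(int[] deltas) {
        int[] positions = new int[deltas.length];
        int running = 0;
        for (int i = 0; i < deltas.length; i++) {
            running += deltas[i];
            positions[i] = running;
        }
        return positions;
    }

    public static void main(String[] args) {
        int[] doc0 = {1, 30, 65};
        int[] doc2 = {27, 52};
        System.out.println(Arrays.toString(toDeltas(doc0))); // [1, 29, 35]
        System.out.println(Arrays.toString(toDeltas(doc2))); // [27, 25]
    }
}
```

Because each document is decoded independently, the two delta lists never need to be told apart inside one concatenated run.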
HTH,
Grant
Well, I really don't get it :) This file structure is driving me crazy!
I'll quote the doc I'm relying on and comment on the points that are giving me
trouble:
(http://lucene.apache.org/java/2_2_0/fileformats.html#Term%20Vectors)
(quote)
Field (.tvf) --> TVFVersion<NumTerms, Position/Offset, TermFreqs>^NumFields
// this structure is repeated for each Field
TVFVersion --> Int
NumTerms --> VInt
Position/Offset --> Byte
TermFreqs --> <TermText, TermFreq, Positions?, Offsets?> ^NumTerms
//this structure is repeated for each Term of each Field
TermText --> <PrefixLength, Suffix>
PrefixLength --> VInt
Suffix --> String
TermFreq --> VInt
Positions --> <VInt>^TermFreq
// this "Position" data appears once per occurrence of each Term of each
// Field... but as far as I know, TermFreq is the number of occurrences of a
// Term in all documents, however many there are (I'm not sure of that, actually)
Offsets --> <VInt, VInt>^TermFreq
(/quote)
I doubt that the "TermFreq" found in this description is the same as
the one found in the Frequencies section
(http://lucene.apache.org/java/2_2_0/fileformats.html#Frequencies):
(quote)
TermFreq --> DocDelta, Freq?
TermFreq entries are ordered by increasing document number.
DocDelta determines both the document number and the frequency. In
particular, DocDelta/2 is the difference between this document number
and the previous document number (or zero when this is the first
document in a TermFreqs). When DocDelta is odd, the frequency is one.
When DocDelta is even, the frequency is read as another VInt.
(/quote)
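
To check my reading of that rule, here is a small sketch that decodes a sequence of already-read VInt values into (document number, frequency) pairs. The class name DocDeltaSketch is made up, and the input is a plain int array rather than real .frq bytes, so take it as an interpretation of the quoted text rather than Lucene's actual reader.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the quoted DocDelta rule; hypothetical helper, not Lucene code.
final class DocDeltaSketch {

    // Returns {docNumber, frequency} pairs decoded from a run of VInt values.
    static List<int[]> decode(int[] vints) {
        List<int[]> postings = new ArrayList<>();
        int doc = 0;
        int i = 0;
        while (i < vints.length) {
            int docDelta = vints[i++];
            // DocDelta/2 is the gap from the previous document number.
            doc += docDelta >>> 1;
            // Odd DocDelta: frequency is one. Even: frequency is the next VInt.
            int freq = (docDelta & 1) == 1 ? 1 : vints[i++];
            postings.add(new int[] {doc, freq});
        }
        return postings;
    }

    public static void main(String[] args) {
        // 15 = 7*2 + 1 -> doc 7, freq 1; 8 = (11-7)*2 -> doc 11, freq read as 3.
        for (int[] p : decode(new int[] {15, 8, 3})) {
            System.out.println("doc " + p[0] + " freq " + p[1]);
        }
    }
}
```

Read this way, a TermFreq entry in the Frequencies section is a (document, count) pair, not a bare count.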
I don't think this is the same type of TermFreq, because the one described
in the Frequencies section would make no sense as an exponent in
"Positions --> <VInt>^TermFreq", since this notation just means that
the VInt is repeated TermFreq times.
So, to fit with what I have been told, I assume that TermFreq is only the
number of occurrences of the Term in *one* document... but in that case,
there should be one .tvf per document, which I really doubt is the case.
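
Under that assumption (TermFreq counts occurrences within a single document's field), the "<VInt>^TermFreq" notation would be read roughly as in the sketch below: one VInt for the count, then that many delta-encoded positions. The class VIntPositionsSketch is hypothetical and skips TermText and Offsets entirely; only the VInt byte layout (seven data bits per byte, low-order bits first, high bit set when more bytes follow) is taken from the file-format documentation.

```java
import java.util.Arrays;

// Simplified cursor over a byte array; an illustration, not TermVectorsReader.
final class VIntPositionsSketch {
    private final byte[] data;
    private int pos = 0;

    VIntPositionsSketch(byte[] data) {
        this.data = data;
    }

    // Reads one variable-length int: 7 bits per byte, high bit = "more follows".
    int readVInt() {
        byte b = data[pos++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = data[pos++];
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    // Reads TermFreq, then TermFreq position deltas, returning absolute positions.
    int[] readPositions() {
        int termFreq = readVInt();
        int[] positions = new int[termFreq];
        int previous = 0;
        for (int i = 0; i < termFreq; i++) {
            previous += readVInt();
            positions[i] = previous;
        }
        return positions;
    }

    public static void main(String[] args) {
        // TermFreq = 3, followed by the deltas 1, 29, 35.
        byte[] entry = {3, 1, 29, 35};
        System.out.println(Arrays.toString(new VIntPositionsSketch(entry).readPositions()));
    }
}
```

Whether one file can hold such entries for many documents is a separate question from how a single entry is read, and this sketch says nothing about it.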
I'd add that I've glanced at the TermVectorsReader.java source code, but
it didn't help me understand how it's supposed to work (I'm not a
great Java programmer, actually).
Maybe the documentation at
http://lucene.apache.org/java/2_2_0/fileformats.html contains a typo;
in any case I don't find it very clear on this point, and it's really
turning my brain upside down.
Thanks a lot to anyone who can help me find some rest :)
Cordially,
Samuel