Grant Ingersoll wrote:

On Jun 28, 2007, at 5:29 AM, Samuel LEMOINE wrote:
Thanks for the resources about payloads, I'll have a look over it.
About the positions/offsets in .tvf, please tell me if I've understood correctly: the .tvd provides the necessary information about the occurrences of each term in documents, and thanks to this information, Lucene is able to determine how many documents contain the term "foo".

Not exactly. Term Vectors can only tell you how many times foo occurs in a particular, known document. If you are looking for general information on a Term and the documents it occurs in (i.e. the inverted index), have a look at TermEnum and TermDocs.

Thus the position/offset data contained in .tvf can just consist of a concatenated list of positions in the different documents containing "foo"? I mean, if foo appears at positions 1, 30, 65 in doc 0, and positions 27 and 52 in doc 2, the .tvf will give "1 30 65 27 52" and Lucene relies on .tvd to determine which positions belong to which document? (Or rather "1 29 35 27 25", as these are delta-positions.)

No, you can only find out about doc 0 or doc 2 separately using TermVectors.
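
To make this concrete with the numbers from the question, here is a minimal sketch in plain Java. It does not use the Lucene API; the class and method names (PerDocPositions, toDeltas) are made up for illustration. It assumes each document's term vector keeps its own position list for "foo", delta-coded from zero within that document, so the deltas never run across a document boundary.

```java
import java.util.Arrays;

public class PerDocPositions {
    // Delta-code one document's positions; the first delta is taken from 0.
    // Hypothetical helper for illustration, not part of Lucene.
    static int[] toDeltas(int[] positions) {
        int[] deltas = new int[positions.length];
        int prev = 0;
        for (int i = 0; i < positions.length; i++) {
            deltas[i] = positions[i] - prev;
            prev = positions[i];
        }
        return deltas;
    }

    public static void main(String[] args) {
        int[] doc0 = {1, 30, 65}; // positions of "foo" in doc 0
        int[] doc2 = {27, 52};    // positions of "foo" in doc 2
        System.out.println("doc 0: " + Arrays.toString(toDeltas(doc0))); // doc 0: [1, 29, 35]
        System.out.println("doc 2: " + Arrays.toString(toDeltas(doc2))); // doc 2: [27, 25]
    }
}
```

The second list restarts at 27 rather than continuing from 65, which is why the two documents have to be read separately.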

HTH,
Grant


Well, I really don't get it :) This file structure is driving me crazy!
I'll quote the doc I'm relying on, and comment on the points that puzzle me: (http://lucene.apache.org/java/2_2_0/fileformats.html#Term%20Vectors)

(quote)
Field (.tvf) --> TVFVersion<NumTerms, Position/Offset, TermFreqs> ^NumFields // this structure is repeated for each Field
TVFVersion --> Int
NumTerms --> VInt
Position/Offset --> Byte
TermFreqs --> <TermText, TermFreq, Positions?, Offsets?> ^NumTerms //this structure is repeated for each Term of each Field
TermText --> <PrefixLength, Suffix>
PrefixLength --> VInt
Suffix --> String
TermFreq --> VInt
Positions --> <VInt>^TermFreq // this "Positions" data appears once per occurrence of each Term of each Field... but as far as I know, TermFreq is the number of occurrences of a Term across all documents, however many there are (not sure of that, actually)
Offsets --> <VInt, VInt>^TermFreq
(/quote)
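
One possible reading of the "TermFreq, Positions" pair, sketched in plain Java without the Lucene API. The names (TvfPositionsSketch, readVInt, readPositions) are hypothetical. The sketch assumes that TermFreq is the count within a single document (as Grant's reply suggests), that positions are delta-coded, and that a VInt stores 7 data bits per byte, low-order group first, with the high bit set when more bytes follow.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class TvfPositionsSketch {
    // Decode one VInt: 7 data bits per byte, low-order group first;
    // a set high bit means another byte follows.
    static int readVInt(ByteBuffer in) {
        byte b = in.get();
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.get();
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    // Read "TermFreq, Positions" for one term of one document:
    // a count, followed by exactly that many delta-coded positions.
    static int[] readPositions(ByteBuffer in) {
        int termFreq = readVInt(in);
        int[] positions = new int[termFreq];
        int prev = 0;
        for (int i = 0; i < termFreq; i++) {
            prev += readVInt(in); // <VInt>^TermFreq: one VInt per occurrence
            positions[i] = prev;
        }
        return positions;
    }

    public static void main(String[] args) {
        // TermFreq = 3, then deltas 1, 29, 35 (each fits in one VInt byte).
        byte[] bytes = {3, 1, 29, 35};
        System.out.println(Arrays.toString(readPositions(ByteBuffer.wrap(bytes)))); // [1, 30, 65]
    }
}
```

Under this reading the exponent is simply a loop count: the reader stops after TermFreq values and never needs to know about other documents.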

I doubt that the "TermFreq" found in this description is the same as the one found in the Frequencies section (http://lucene.apache.org/java/2_2_0/fileformats.html#Frequencies):

(quote)
TermFreq --> DocDelta, Freq?
TermFreq entries are ordered by increasing document number.
DocDelta determines both the document number and the frequency. In particular, DocDelta/2 is the difference between this document number and the previous document number (or zero when this is the first document in a TermFreqs). When DocDelta is odd, the frequency is one. When DocDelta is even, the frequency is read as another VInt.
(/quote)
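
The DocDelta rule quoted above can be sketched as follows, in plain Java and without the Lucene API. The class and method names (FrqDecoder, decode) are hypothetical, and the input is assumed to be an array of VInts that have already been decoded to ints. The sample sequence 15, 8, 3 is worked out from the rule itself: one occurrence in document 7, then three occurrences in document 11.

```java
import java.util.ArrayList;
import java.util.List;

public class FrqDecoder {
    // Decode a run of TermFreq entries into {docNumber, freq} pairs.
    // Hypothetical helper; input values are already-decoded VInts.
    static List<int[]> decode(int[] vints) {
        List<int[]> entries = new ArrayList<>();
        int doc = 0; // the "previous document number" starts at zero
        int i = 0;
        while (i < vints.length) {
            int docDelta = vints[i++];
            doc += docDelta >>> 1; // DocDelta/2 is the gap to the previous doc
            // Odd DocDelta: frequency is one. Even: frequency is the next VInt.
            int freq = (docDelta & 1) != 0 ? 1 : vints[i++];
            entries.add(new int[] {doc, freq});
        }
        return entries;
    }

    public static void main(String[] args) {
        for (int[] e : decode(new int[] {15, 8, 3})) {
            System.out.println("doc " + e[0] + ", freq " + e[1]);
        }
        // prints "doc 7, freq 1" then "doc 11, freq 3"
    }
}
```

Here each TermFreq entry describes one document in a term's posting list, which is a different thing from a plain occurrence count used as a repetition factor.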

I don't think this is the same kind of TermFreq, because the one described in the Frequencies section would make no sense as an exponent in "Positions --> <VInt>^TermFreq": that notation just means that the VInt is repeated TermFreq times. So, to fit with what I have been told, I assume that TermFreq is only the number of occurrences of the Term in *one* document... but in that case there should be one .tvf per document, which I really doubt is the case. I'd add that I've glanced at the TermVectorsReader.java source code, but it didn't help me understand how it's supposed to work (I'm not much of a Java programmer, actually). Maybe the documentation at http://lucene.apache.org/java/2_2_0/fileformats.html contains a typo; in any case I don't find it very clear on this point, and it's really turning my brain upside down.
Thanks a lot to anyone who can help me find some peace :)

Cordially,

Samuel
