Re: several existential issues about Lucene's filesystem

Samuel LEMOINE Thu, 28 Jun 2007 06:06:40 -0700

Grant Ingersoll wrote:

On Jun 28, 2007, at 5:29 AM, Samuel LEMOINE wrote:
Thanks for the resources about payloads, I'll have a look over it.
About the positions/offsets in .tvf, please tell me if I've understood correctly: the .tvd provides the needed information about the occurrences of each term in documents, and thanks to this information, Lucene is able to determine how many documents contain the term "foo".
Not exactly, Term Vectors could only tell you how many times foo occurs in a particular, known document. If you are looking for general information on a Term and the documents it occurs in (i.e. the inverted index) have a look at the TermEnum and TermDocs.
So the position/offset data contained in .tvf could just consist of a concatenated list of positions in the different documents containing "foo"? I mean, if foo appears at positions 1, 30, 65 in doc 0, and positions 27 & 52 in doc 2, the .tvf will give "1 30 65 27 52" and Lucene relies on .tvd to determine which positions belong to which document? (or rather "1 29 35 27 25", as these are delta-positions)
No, you could only find out about doc 0 or doc 2 separately using TermVectors.
HTH,
Grant


Well, I really don't get it :) This file structure is driving me crazy!

I'm quoting the doc I'm relying on, and commenting on the points that give me trouble (http://lucene.apache.org/java/2_2_0/fileformats.html#Term%20Vectors):


(quote)

Field (.tvf) --> TVFVersion<NumTerms, Position/Offset, TermFreqs>^NumFields // this structure is repeated for each Field

TVFVersion --> Int
NumTerms --> VInt
Position/Offset --> Byte

TermFreqs --> <TermText, TermFreq, Positions?, Offsets?>^NumTerms // this structure is repeated for each Term of each Field

TermText --> <PrefixLength, Suffix>
PrefixLength --> VInt
Suffix --> String
TermFreq --> VInt

Positions --> <VInt>^TermFreq // this "Positions" data appears once per occurrence of each Term of each Field... but as far as I know, TermFreq is the number of occurrences of a Term across all documents, however many there are (not sure about that, actually)

Offsets --> <VInt, VInt>^TermFreq
(/quote)
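
To check my reading of this grammar, here is a minimal sketch of a reader for one field's <NumTerms, Position/Offset, TermFreqs> block. It is my own illustration in Python, not the code from TermVectorsReader.java. The assumptions: a VInt is 7 data bits per byte with the low-order group first, a String is a VInt length followed by that many characters (ASCII only here), bit 0x1 of the Position/Offset byte means "positions stored" and bit 0x2 means "offsets stored", and positions are stored as deltas. The names read_vint and read_field are hypothetical.

```python
import io

def read_vint(buf):
    """Read one VInt: 7 data bits per byte, low-order group first;
    the high bit is set on every byte except the last (assumed encoding)."""
    value, shift = 0, 0
    while True:
        b = buf.read(1)[0]
        value |= (b & 0x7F) << shift
        if b & 0x80 == 0:
            return value
        shift += 7

def read_field(buf):
    """Read <NumTerms, Position/Offset, TermFreqs> for a single field."""
    num_terms = read_vint(buf)
    flags = buf.read(1)[0]
    has_positions = bool(flags & 0x1)  # assumed flag bit
    has_offsets = bool(flags & 0x2)    # assumed flag bit
    terms, prev_text = [], ""
    for _ in range(num_terms):
        # TermText: reuse PrefixLength chars of the previous term, then Suffix
        prefix_len = read_vint(buf)
        suffix_len = read_vint(buf)
        text = prev_text[:prefix_len] + buf.read(suffix_len).decode("ascii")
        freq = read_vint(buf)
        positions, pos = [], 0
        if has_positions:
            for _ in range(freq):      # <VInt>^TermFreq
                pos += read_vint(buf)  # assumed delta-encoded
                positions.append(pos)
        offsets = []
        if has_offsets:
            for _ in range(freq):      # <VInt, VInt>^TermFreq, kept raw
                offsets.append((read_vint(buf), read_vint(buf)))
        terms.append((text, freq, positions, offsets))
        prev_text = text
    return terms

# Two terms, positions only: "foo" (3 occurrences) and "food" (1 occurrence)
data = bytes([2, 1, 0, 3]) + b"foo" + bytes([3, 1, 29, 35, 3, 1]) + b"d" + bytes([1, 4])
print(read_field(io.BytesIO(data)))
```

With these assumptions, each term carries exactly TermFreq position entries, so the reader never needs to know about any other document to parse the block.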

I doubt that the "TermFreq" found in this description is the same as the one found in the Frequencies section (http://lucene.apache.org/java/2_2_0/fileformats.html#Frequencies):


(quote)
TermFreq --> DocDelta, Freq?
TermFreq entries are ordered by increasing document number.

DocDelta determines both the document number and the frequency. In particular, DocDelta/2 is the difference between this document number and the previous document number (or zero when this is the first document in a TermFreqs). When DocDelta is odd, the frequency is one. When DocDelta is even, the frequency is read as another VInt.

(/quote)
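
Here is how I read that rule, as a small Python sketch of my own (the function names are hypothetical, and I work on already-decoded VInt values rather than raw bytes). I also assume the optional Freq is always available, i.e. frequencies are stored for the field.

```python
def encode_term_freqs(pairs):
    """Encode (doc_number, freq) pairs, ordered by increasing doc number."""
    out, prev_doc = [], 0
    for doc, freq in pairs:
        delta = (doc - prev_doc) * 2
        if freq == 1:
            out.append(delta + 1)      # odd DocDelta: frequency is one
        else:
            out.extend([delta, freq])  # even DocDelta: frequency follows
        prev_doc = doc
    return out

def decode_term_freqs(vints):
    """Decode a flat VInt sequence back into (doc_number, freq) pairs."""
    pairs, doc = [], 0
    it = iter(vints)
    for doc_delta in it:
        doc += doc_delta // 2          # DocDelta/2 is the doc number gap
        freq = 1 if doc_delta % 2 == 1 else next(it)
        pairs.append((doc, freq))
    return pairs

# A term occurring once in doc 7 and three times in doc 11
encoded = encode_term_freqs([(7, 1), (11, 3)])
print(encoded)                     # [15, 8, 3]
print(decode_term_freqs(encoded))  # [(7, 1), (11, 3)]
```

So here a "TermFreq" entry is a (document, frequency) pair from the inverted index, which is a different thing from a plain repeat count.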

I don't think this is the same kind of TermFreq, because the one described in the Frequencies section would make no sense as an exponent in "Positions --> <VInt>^TermFreq": that notation just means that the VInt is repeated TermFreq times.

So, to fit with what I have been told, I assume that TermFreq is only the number of occurrences of the Term in *one* document... but in that case there should be one .tvf per document, which I really doubt is the case.

I'd add that I've glanced at the TermVectorsReader.java source code, but it didn't help me understand how it's supposed to work (I'm not much of a Java programmer, actually).

Maybe the documentation at http://lucene.apache.org/java/2_2_0/fileformats.html contains a typo; in any case I don't find it very clear on this point, and it's really turning my brain upside down.
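
To make the "one document at a time" reading concrete, here is a tiny Python sketch using my "foo" example from earlier in the thread. It is only an illustration under that assumption: each document gets its own TermFreq and its own delta-encoded position list, and the helper names are hypothetical. It says nothing about how Lucene actually lays these blocks out on disk.

```python
def to_deltas(positions):
    """Store each position as the difference from the previous one."""
    deltas, prev = [], 0
    for p in positions:
        deltas.append(p - prev)
        prev = p
    return deltas

def from_deltas(deltas):
    """Rebuild absolute positions by summing the deltas."""
    positions, pos = [], 0
    for d in deltas:
        pos += d
        positions.append(pos)
    return positions

# Positions of "foo", kept separately per document (doc 0 and doc 2)
per_doc = {0: [1, 30, 65], 2: [27, 52]}

# For each document: (TermFreq, delta-encoded positions)
encoded = {doc: (len(pos), to_deltas(pos)) for doc, pos in per_doc.items()}
print(encoded)  # {0: (3, [1, 29, 35]), 2: (2, [27, 25])}
```

Under this reading the delta sequence restarts for every document, which matches the "1 29 35" and "27 25" I guessed above, but with each group belonging to a separate per-document entry.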

Thanks a lot to anyone who could help me find some peace of mind :)

Cordially,

Samuel
