Hi everyone!
I'm doing a literature review on Lucene as an intern at Lingway
(which uses Lucene in its main product), and I'm currently studying
Lucene's index file formats.
There are several things I don't understand about them, and I thought
this was the right place to ask (I hope that's actually the case).
The main resource I used is this document:
http://lucene.apache.org/java/2_1_0/fileformats.html
- In the .tvf file (Term Vector Fields file) in Lucene 2.2.0, positions
and offsets can optionally be stored in the term vector. I don't
understand how this works, since there is only one .tvf file per segment
(as far as I've understood), and the layout described gives no
information about which document each term stored in the term vector
appears in (I assume the document-related information is in the .tvd
file). The position/offset information seems to be simply a list of
addresses, but how can one tell which document it refers to? Or is
there one .tvf file per document?
- In the .prx file (positions file), payloads are mentioned as a way to
attach metadata. What is the purpose of such data? Does Lucene itself
use it for something specific, or is it only there for the user's own
purposes?
- Many addresses in many files are stored as deltas. Doesn't that slow
down searching the index? I mean, when a keyword is looked up, in order
to find its position in the right file, Lucene must find the address of
the previous term and add the delta to it. But the previous term's
address is also stored as a delta, and so on, so as far as I understand
it, the whole file must be walked back, recursively computing the
address of each term. I assume I've misunderstood something, but I
don't know what.
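To make this last question concrete, here is how I picture the delta
scheme. This is only a toy sketch of my own understanding, not Lucene's
actual code: the class DeltaSketch and its methods are made up for
illustration, and I'm assuming the first delta is relative to 0.

```java
import java.util.Arrays;

// Hypothetical illustration only; none of these names exist in Lucene.
class DeltaSketch {
    // Store each address as the difference from the previous one
    // (assumption: the first delta is taken relative to 0).
    static long[] encode(long[] addresses) {
        long[] deltas = new long[addresses.length];
        long previous = 0;
        for (int i = 0; i < addresses.length; i++) {
            deltas[i] = addresses[i] - previous;
            previous = addresses[i];
        }
        return deltas;
    }

    // Recover the address of entry `index` by summing every delta up to it.
    static long addressOf(long[] deltas, int index) {
        long address = 0;
        for (int i = 0; i <= index; i++) {
            address += deltas[i];
        }
        return address;
    }

    public static void main(String[] args) {
        long[] deltas = encode(new long[] {100, 130, 175});
        System.out.println(Arrays.toString(deltas)); // prints [100, 30, 45]
        System.out.println(addressOf(deltas, 2));    // prints 175
    }
}
```

In this picture, reading the address of entry n costs n + 1 additions,
which is exactly the linear walk I'm worried about.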
I apologize for the length of this mail and for my approximate English.
Thanks a lot for reading, and even more for answering ^^
Samuel