On Jun 27, 2007, at 8:51 AM, Samuel LEMOINE wrote:
Hi everyone!
I'm doing bibliographical research on Lucene as an intern at
Lingway (which uses Lucene in its main product), and I'm currently
studying Lucene's index file formats.
There are several things I don't understand about them, and I
thought this was the right place to ask my questions (I hope
that's the case).
The main resource I used is this document:
http://lucene.apache.org/java/2_1_0/fileformats.html
-in the .tvf file (Term Vector file) in Lucene 2.2.0, positions and
offsets can optionally be stored in the term vector... I don't
understand how this works, since there's only one .tvf file per
segment (as far as I understand), and the format described gives no
information about which documents each term stored in the term
vector appears in (I assume the document-related information is in
the .tvd file). The position/offset information seems to be simply a
list of addresses, but how is the document it refers to known? Or is
there one .tvf file per document?
Yes, offsets and positions can be associated with a term vector.
When you ask the IndexReader for a term vector, you give it the
document number and, optionally, a field. It uses the document number
to look up, in the tvx file, that document's location in the tvd
file. The tvd entry then points to the specific information in the
tvf file. Have a look at the TermVectorsReader for details on the
implementation.
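As a rough illustration of that lookup chain, here is a toy in-memory
model in Java. It is not Lucene's TermVectorsReader or its on-disk
format: the class TermVectorLookupSketch, its methods, and the lists
named tvx, tvd* and tvf are hypothetical stand-ins named after the
files, and it assumes the per-document index lives in the tvx file as
the file formats document describes. The point it tries to show is that
the term blocks never store a document number; the document is known
only from the path taken to reach the block.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model only: hypothetical names, not Lucene's reader or file layout. */
final class TermVectorLookupSketch {

    /** One term of one field in one document, with its token positions. */
    static final class TermEntry {
        final String term;
        final int[] positions;

        TermEntry(String term, int... positions) {
            this.term = term;
            this.positions = positions;
        }
    }

    // "tvx": document number -> index of that document's first record in "tvd".
    private final List<Integer> tvx = new ArrayList<Integer>();
    // "tvd": flat list of (field name, pointer into "tvf") records.
    private final List<String> tvdFieldNames = new ArrayList<String>();
    private final List<Integer> tvdPointers = new ArrayList<Integer>();
    // "tvf": per-field term blocks. No block stores a document number.
    private final List<List<TermEntry>> tvf = new ArrayList<List<TermEntry>>();

    /** Appends a document; its number is the order in which it was added. */
    int addDocument(Map<String, List<TermEntry>> fields) {
        int docNumber = tvx.size();
        tvx.add(tvdFieldNames.size());
        for (Map.Entry<String, List<TermEntry>> field : fields.entrySet()) {
            tvdFieldNames.add(field.getKey());
            tvdPointers.add(tvf.size());
            tvf.add(field.getValue());
        }
        return docNumber;
    }

    /** Follows tvx -> tvd -> tvf; returns an empty list if the field has no vector. */
    List<TermEntry> termVector(int docNumber, String field) {
        int start = tvx.get(docNumber);
        int end = docNumber + 1 < tvx.size() ? tvx.get(docNumber + 1) : tvdFieldNames.size();
        for (int i = start; i < end; i++) {
            if (tvdFieldNames.get(i).equals(field)) {
                return tvf.get(tvdPointers.get(i));
            }
        }
        return Collections.emptyList();
    }

    /** Builds a small two-document example (made-up data). */
    static TermVectorLookupSketch demo() {
        TermVectorLookupSketch index = new TermVectorLookupSketch();
        Map<String, List<TermEntry>> doc0 = new LinkedHashMap<String, List<TermEntry>>();
        doc0.put("body", Arrays.asList(new TermEntry("lucene", 0, 4)));
        index.addDocument(doc0);
        Map<String, List<TermEntry>> doc1 = new LinkedHashMap<String, List<TermEntry>>();
        doc1.put("title", Arrays.asList(new TermEntry("index", 1)));
        doc1.put("body", Arrays.asList(new TermEntry("index", 2), new TermEntry("lucene", 0, 7)));
        index.addDocument(doc1);
        return index;
    }

    public static void main(String[] args) {
        TermVectorLookupSketch index = demo();
        // Print each term of document 1's "body" field with its positions.
        for (TermEntry entry : index.termVector(1, "body")) {
            System.out.println(entry.term + " " + Arrays.toString(entry.positions));
        }
    }
}
```

Both documents contain the term "lucene" in "body", yet the two blocks
stay separate because each is reached through its own document entry.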
-in the .prx file (positions file), payloads are mentioned as a way
to attach metadata... what's the purpose of such data? Is there a
specific use, or is it only data for the user's own use?
Payloads have a variety of uses. Search the java-dev archive for the
word Payload and you will find lots of discussion. I also have a few
slides on it in my ApacheCon Europe presentation at
http://cnlp.org/presentations/slides/AdvancedLuceneEU.pdf
See also http://wiki.apache.org/jakarta-lucene/Payload_Planning
Essentially, a payload can be used to store information at a
term-by-term level: things like font weight, the enclosing XML tag, or
part of speech. The sky really is the limit (that and your disk space)
on what can be stored in a payload.
-many addresses in many files are stored as deltas... Doesn't that
slow down searching the index? I mean, when a keyword is looked up,
in order to find its position in the right file, Lucene must find the
address of the previous term and add the "delta" address... but the
previous term's address is also given as a delta, and so on, so that
as far as I understand it, the whole file must be walked back,
recursively finding the address of each term... I assume I've
misunderstood something, but I don't know what.
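Not quite sure what you are asking, but I will take a stab at it.
Have a look at the section on the Term Dictionary, specifically the
relationship between the tis file and the tii file. The tii file
holds every IndexInterval-th entry of the tis file along with its
location there, and it is read entirely into memory. A lookup
therefore only has to seek to the nearest indexed term and scan
forward a bounded number of entries, so it is easy to find where the
keyword is in the file and then look up the rest of the information.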
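To make the delta question concrete, here is a minimal sketch in Java
of a delta-encoded term list paired with a sparse in-memory index. It
is only a toy under stated assumptions, not Lucene's actual tis/tii
encoding or reader: the class DeltaIndexSketch, its fields, the sample
terms and pointers, and the interval of 3 are all made up for
illustration. It shows that a lookup sums deltas only from the nearest
sampled term, never from the start of the file.

```java
/** Toy model only: hypothetical names, not Lucene's tis/tii format or reader. */
final class DeltaIndexSketch {
    private final String[] terms;        // "tis": all terms, sorted
    private final long[] deltas;         // "tis": pointer minus the previous term's pointer
    private final int interval;          // every interval-th term is sampled
    private final String[] indexTerms;   // "tii": sampled terms, held in memory
    private final long[] indexPointers;  // "tii": absolute pointers of the sampled terms
    int lastScanLength;                  // entries read by the most recent lookup

    DeltaIndexSketch(String[] sortedTerms, long[] absolutePointers, int interval) {
        this.terms = sortedTerms.clone();
        this.interval = interval;
        this.deltas = new long[terms.length];
        long previous = 0;
        for (int i = 0; i < terms.length; i++) {
            deltas[i] = absolutePointers[i] - previous;
            previous = absolutePointers[i];
        }
        int samples = (terms.length + interval - 1) / interval;
        this.indexTerms = new String[samples];
        this.indexPointers = new long[samples];
        for (int j = 0; j < samples; j++) {
            indexTerms[j] = terms[j * interval];
            indexPointers[j] = absolutePointers[j * interval];
        }
    }

    /** Returns the absolute pointer for the term, or -1 if it is absent. */
    long lookup(String term) {
        lastScanLength = 0;
        // Binary search the in-memory index for the last sampled term <= term.
        int lo = 0, hi = indexTerms.length - 1, block = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (indexTerms[mid].compareTo(term) <= 0) {
                block = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        if (block < 0) {
            return -1; // term sorts before the first term
        }
        // Scan at most `interval` entries, adding deltas to the sampled pointer.
        int start = block * interval;
        int end = Math.min(start + interval, terms.length);
        long pointer = indexPointers[block];
        for (int i = start; i < end; i++) {
            if (i > start) {
                pointer += deltas[i];
            }
            lastScanLength++;
            int cmp = terms[i].compareTo(term);
            if (cmp == 0) {
                return pointer;
            }
            if (cmp > 0) {
                break; // passed the place where the term would be
            }
        }
        return -1;
    }

    /** Builds a seven-term example with made-up pointers. */
    static DeltaIndexSketch demo() {
        String[] terms = {"apple", "banana", "cherry", "date", "elder", "fig", "grape"};
        long[] pointers = {0, 12, 30, 31, 50, 77, 90};
        return new DeltaIndexSketch(terms, pointers, 3);
    }

    public static void main(String[] args) {
        DeltaIndexSketch dictionary = demo();
        System.out.println(dictionary.lookup("fig"));  // 77
        System.out.println(dictionary.lastScanLength); // 3
    }
}
```

In this sketch the cost of a lookup is a binary search over the samples
plus at most `interval` sequential reads, regardless of how many terms
the list holds.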
HTH,
Grant
--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp
Read the Lucene Java FAQ at http://wiki.apache.org/lucene-java/LuceneFAQ