Dmitry,

Wow! This looks great!
I was preparing a response to your questions of last weekend, but it seems like you figured out a lot of it on your own. I've attached that response anyway, in case you're still interested.

Once we get 1.2 out the door I'd like to make you a committer (provided others approve) so that you can commit these changes yourself. I'd also still like to review them a bit more, but some of that can happen after they are committed.

My biggest question is about the field orientation of this. I had imagined this to be more document oriented: that there would be a single TermFreqVector per document, rather than one per field. That would simplify things a bit, and make it a bit more efficient. Of course, one could always construct the full-document freq vector by combining the field vectors, but the question is, do folks need the field-specific vectors?

Overall, Bravo!

Doug

> -----Original Message-----
> From: Dmitry Serebrennikov [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, October 18, 2001 1:56 PM
> To: [EMAIL PROTECTED]
> Subject: TermVector support - first release
>
> Greetings, everyone!
>
> I have the first version of the term vector support ready to go. I'm
> attaching a file with release notes that explain briefly what the new
> capabilities are and what the changes were to make them happen. There
> are some limitations that are also described. The zip file contains new
> files, to be added. The txt file is the result of cvs diff -u against
> the current CVS repository.
>
> I am really interested in feedback. First, do the APIs work for your
> needs? Also, does everything work? What kind of performance are you
> seeing? Are there things that could be done better (especially in terms
> of file structures and the reading of those files; I think this is where
> the next layer of optimizations should come from).
>
> In terms of riskiness, these changes are pretty risky, so I don't think
> they should go into 1.2.
> But I've been using them for the past few
> days and I didn't have to touch the files at all, so I think they are
> pretty stable.
>
> Have fun, everyone.
> Dmitry.
> From: Dmitry Serebrennikov [mailto:[EMAIL PROTECTED]]
>
> Is there any particular reason why the "tokenized" bit is
> stored in the fdt file, while the "indexed" bit is stored
> in the fnm? Why not put both in fnm?

A given field is permitted to be tokenized in some documents in an index and not tokenized in others. However, all instances of a field in an index must be either indexed or not. [[ Explain why... ]]

> I'm planning to add another bit: "storeTermVector" (better
> name, anyone?), which will indicate that the field's term
> vector will need to be stored.

I like the option of storing vectors for all indexed fields. Perhaps IndexWriter's constructor could take another optional flag that determines whether it stores vectors. It would be illegal to specify this except when the 'create' flag is true. Does that sound reasonable?

> The term vector, as I understand it, is a list of unique
> terms that occur in a given field. They will be stored by
> term id (in ascending order of ids, not terms).

I think term id order and term order should in fact be the same. Terms can be assigned ids based on their position in the TermEnum. In Lucene there are only two index-creating operations: create an index for a single document, and merge N indexes into a single index. If these can both be done efficiently, then indexing is efficient. Creating a vector index for a single document with term numbers that correspond to positions in the TermEnum is easy enough. Efficient merging is also possible, using the technique I mentioned in a previous message. So I don't see why these shouldn't be used for term ids.

> In addition to the terms, I'm planning to
> store the frequency of the term (the number of times it occurs in the
> field). This, together with the total number of terms in the field,
> should be enough to compute the term's weight, right? My application
> doesn't need these weights, so I'm not sure what people need in this
> regard. Please advise.
IndexReader.docFreq(Term) and IndexReader.maxDoc() are sufficient to compute IDF weights, the standard used by Lucene.

> In addition to the terms and frequencies, I will also
> store positions in which these terms occur in the
> field. Actually, this is already stored (used by the
> TermPositions functionality), so I will only store
> pointers into the prx file. This may not be needed for
> clustering, but I need this for my application. Some of
> the text processing that we do is based on relative
> positioning of terms in a document.

Since positions are the highest-frequency data, this will have a major impact on index size and performance. I suggest that this data be stored at least at the end of the vector data for a document, and perhaps even in a separate file. It would probably also be good to have a flag to disable this, for folks who just want vectors.

> Between the term vector and the positions, it will be
> possible to recreate the contents of a field except for
> word breaks, so I considered using "stored" +
> "tokenized" to mean that a term vector should be stored and
> only storing the information in this way, instead of
> essentially storing it twice. However, at present, I think
> that it is useful to store the original content, breaks
> and all. Reactions, suggestions?

Stored should mean stored literally, not reconstructable.

> Speaking of the stored fields, someone suggested adding binary
> storage to documents so that serialized objects can be
> stored.

I don't have strong feelings about this. I do worry some that it is the sort of feature that people will abuse, and want to extend in crazy directions.

> These are the files I'm planning to add to each segment:
>
> "fvx" file - Field Vector Index. Modeled on the fdx file. Has a
> fixed-length, 8-byte record per document in a segment. The 8 bytes
> store a long pointer into the "fvt" file where the record for this
> document begins.

Sounds good. You can seek based on docId*8.
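A minimal sketch of that fixed-width lookup (class, method, and file names here are hypothetical; the idf formula shown is the classic variant computable from just docFreq and maxDoc, as noted above, and is an illustration rather than anything in the patch):

```java
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;

// Sketch of the proposed "fvx" (Field Vector Index) file: one fixed-width
// 8-byte pointer per document, so the record for a document is found with
// a single seek at docId * 8 -- no in-memory index needed.
public class FvxSketch {

    // A classic idf variant, computable from docFreq and maxDoc alone.
    static double idf(int docFreq, int numDocs) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    // Write one long pointer per document, in docId order.
    static void writeFvx(String path, long[] pointers) throws IOException {
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(path))) {
            for (long p : pointers) out.writeLong(p);
        }
    }

    // Random access: seek directly to the fixed-width record for docId.
    static long readPointer(String path, int docId) throws IOException {
        try (RandomAccessFile in = new RandomAccessFile(path, "r")) {
            in.seek(docId * 8L);
            return in.readLong();
        }
    }

    public static void main(String[] args) throws IOException {
        String path = java.io.File.createTempFile("seg", ".fvx").getPath();
        writeFvx(path, new long[] {0L, 173L, 4096L});
        System.out.println(readPointer(path, 2));  // prints 4096
    }
}
```

Because every record is exactly 8 bytes, merging segments is just concatenation with re-based pointers, which is what makes this layout attractive.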
Easy to write, easy to merge.

> "fvt" file - Field Vector Table. Modeled in part on the "fdt" and in
> part on the "tis" file. Each document record in this file looks like this:
>
> document_record :
> [VInt] - number of fields (only fields with the storeTermVector flag set)
> { field_record, ... } - field records, as many as specified above

I don't see why you need to organize this by field. Why can't the document_record just be:

  [VInt] - number of terms
  { term_record, ... }

> field_record :
> [VInt] - field number, just like in the "fdt" file
> [byte] - flags, don't know if we are going to need any,
> but seems like we might?

Leave 'em out if you don't have a use.

> [VInt] - maxTerm, 1+numberOfTerms - just like maxDoc. Used for
> array allocation and term weight calculations?

I don't get what this is for.

> [VInt] - numTerms, count of unique terms in the vector,
> number of term records that follow
> { term_record, ... } - term records, as many as
> specified above; represent unique terms in the field
>
> term_record:
> [VInt] - term id increment, restarts from 0 for each field
> [VInt] - term frequency in this field, used for weight
> calculations and for the count of positions in the "prx" file
> [VInt] - "prx" pointer increment, restarts from 0 for each field

As mentioned above, I think the "prx" data should at least be moved to the end of the document_record. I am also not sure what the increment is relative to. I assume it is a pointer into the "prx" file's data for a term. The index already stores (in the "tis" file) a pointer to the start of the term's data in the "prx" file. Is this relative to that? Since a vector only has one entry per term, there's nothing else in this vector that it could be relative to.

> A couple of questions on the file formats that I would
> really like feedback on:
>
> Specifically, I'm trying to identify what makes access to
> the document fields ("fdx" and "fdt" files) slow, and make
> sure I avoid those problems.
> From what I can tell, the
> only thing that makes that access slow is the size of the
> document data, in which case we have nothing to worry
> about. Is that right?

Why do you say they're slow? Because access is discouraged from the inner search loop? Currently the inner search loop only needs to read a VInt or two per document scored. These are primarily sequential reads from a file. If you add to that two random-access seeks plus reading and constructing a Document object, then search gets a *lot* slower. So restricting calls to IndexReader.doc() to documents that are to be displayed, and not consulting document field values during the selection of documents, makes search a lot faster.

> *) I don't see any place to apply the trick used in the "tii"
> and "tis" files - namely loading every 128th element into memory
> and using that as an index into a larger file. I don't think this
> can be applied because we are really not "searching" for anything,
> we just do direct access by document id. Am I missing anything?

Nope.

> Finally, users are likely to access term vectors from a given field
> only. This may be a good reason to optimize access to each
> field_record in the proposed "fvt" file.

Perhaps. I had imagined that folks would usually want the term freq vector for the whole document, rather than field by field. If you did things this way it would remove the additional redirection.

Doug
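The delta-encoded term_record layout discussed above can be sketched concretely. This is a hypothetical illustration (class and method names invented here); the VInt layout it uses — seven data bits per byte, with the high bit set when more bytes follow — matches Lucene's index file format, and the point is that with ascending term ids most increments fit in a single byte:

```java
import java.io.ByteArrayOutputStream;

// Sketch of the proposed term_record encoding: term ids are written as
// deltas (increments) from the previous id, and every value is written
// as a Lucene-style VInt, so small values cost one byte each.
public class TermRecordSketch {

    // Lucene-style VInt: low-order 7 bits per byte, high bit = continuation.
    static void writeVInt(ByteArrayOutputStream out, int i) {
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80);
            i >>>= 7;
        }
        out.write(i);
    }

    // Encode ascending term ids as VInt deltas, with frequencies interleaved.
    static byte[] encode(int[] termIds, int[] freqs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int last = 0;
        for (int k = 0; k < termIds.length; k++) {
            writeVInt(out, termIds[k] - last);  // term id increment
            writeVInt(out, freqs[k]);           // term frequency
            last = termIds[k];
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Three terms with ids 3, 5, 260: every value but the final
        // increment (255) fits in a single byte, so 6 values take 7 bytes.
        byte[] bytes = encode(new int[] {3, 5, 260}, new int[] {1, 2, 1});
        System.out.println(bytes.length);  // prints 7
    }
}
```

This also shows why assigning ids by TermEnum position matters: ids handed out in term order keep the vector's increments small and the VInts short.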