Dmitry,

Wow! This looks great!
I was preparing a response to your questions of last weekend, but it seems like you figured out a lot of it on your own. I've attached that response anyway, in case you're still interested.

Once we get 1.2 out the door I'd like to make you a committer (provided others approve) so that you can commit these changes yourself. I'd also still like to review them a bit more, but some of that can happen after they are committed.

My biggest question is about the field orientation of this. I had imagined this to be more document oriented: that there would be a single TermFreqVector per document, rather than one per field. That would simplify things a bit, and make it a bit more efficient. Of course, one could always construct the full-document freq vector by combining the field vectors, but the question is, do folks need the field-specific vectors?

Overall, Bravo!

Doug

> -----Original Message-----
> From: Dmitry Serebrennikov [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, October 18, 2001 1:56 PM
> To: [EMAIL PROTECTED]
> Subject: TermVector support - first release
>
> Greetings, everyone!
>
> I have the first version of the term vector support ready to go. I'm
> attaching a file with release notes that explain briefly what the new
> capabilities are and what the changes were to make them happen. There
> are some limitations that are also described. The zip file contains new
> files, to be added. The txt file is the result of cvs diff -u against
> the current CVS repository.
>
> I am really interested in feedback. First, do the APIs work for your
> needs? Also, does everything work? What kind of performance are you
> seeing? Are there things that could be done better (especially in terms
> of file structures and the reading of those files; I think this is where
> the next layer of optimizations should come from).
>
> In terms of riskiness, these changes are pretty risky, so I don't think
> they should go into 1.2.
> But I've been using them for the past few
> days and I didn't have to touch the files at all, so I think they are
> pretty stable.
>
> Have fun, everyone.
> Dmitry.
> From: Dmitry Serebrennikov [mailto:[EMAIL PROTECTED]]
>
> Is there any particular reason why the "tokenized" bit is
> stored in the fdt file, while the "indexed" bit is stored
> in the fnm? Why not put both in fnm?

A given field is permitted to be tokenized in some documents in an index and not tokenized in others. However, all instances of a field in an index must be either indexed or not. [[ Explain why... ]]

> I'm planning to add another bit: "storeTermVector" (better
> name, anyone?), which will indicate that the field's term
> vector will need to be stored.

I like the option of storing vectors for all indexed fields. Perhaps IndexWriter's constructor could take another optional flag that determines whether it stores vectors. It would be illegal to specify this except when the 'create' flag is true. Does that sound reasonable?

> The term vector, as I understand it, is a list of unique
> terms that occur in a given field. They will be stored by
> term id (in ascending order of ids, not terms).

I think term id order and term order should in fact be the same. Terms can be assigned ids based on their position in the TermEnum. In Lucene there are only two index-creating operations: create an index for a single document, and merge N indexes into a single index. If these can both be done efficiently, then indexing is efficient. Creating a vector index for a single document with term numbers that correspond to positions in the TermEnum is easy enough. Efficient merging is also possible, using the technique I mentioned in a previous message. So I don't see why these shouldn't be used for term ids.

> In addition to the terms, I'm planning to
> store the frequency of the term (the number of times it occurs in the
> field). This, together with the total number of terms in the field,
> should be enough to compute the term's weight, right? My application
> doesn't need these weights, so I'm not sure what people need in this
> regard. Please advise.
IndexReader.docFreq(Term) and IndexReader.maxDoc() are sufficient to compute IDF weights, the standard used by Lucene.

> In addition to the terms and frequencies, I will also
> store positions in which these terms occur in the
> field. Actually, this is already stored (used by the
> TermPositions functionality), so I will only store
> pointers into the prx file. This may not be needed for
> clustering, but I need this for my application. Some of
> the text processing that we do is based on relative
> positioning of terms in a document.

Since positions are the highest-frequency data, this will have a major impact on index size and performance. I suggest that this data be stored at least at the end of the vector data for a document, and perhaps even in a separate file. It would probably also be good to have a flag to disable this, for folks who just want vectors.

> Between the term vector and the positions, it will be
> possible to recreate the contents of a field except for
> word breaks, so I considered using "stored" +
> "tokenized" to mean that a term vector should be stored and
> only storing the information in this way, instead of
> essentially storing it twice. However, at present, I think
> that it is useful to store the original content, breaks
> and all. Reactions, suggestions?

Stored should mean stored literally, not reconstructable.

> Speaking of the stored fields, someone suggested adding binary
> storage to documents so that serialized objects can be
> stored.

I don't have strong feelings about this. I do worry some that it is the sort of feature that people will abuse, and want to extend in crazy directions.

> These are the files I'm planning to add to each segment:
>
> "fvx" file - Field Vector Index. Modeled on the fdx file. Has a
> fixed-length, 8-byte record per document in a segment. The 8 bytes
> store a long pointer into the "fvt" file where the record for this
> document begins.

Sounds good. You can seek based on docId*8.
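A minimal sketch of that fixed-width lookup (class, method, and file names here are hypothetical; the idf formula shown is the classic variant computable from just docFreq and maxDoc, as noted above, and is an illustration rather than anything in the patch):

```java
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;

// Sketch of the proposed "fvx" (Field Vector Index) file: one fixed-width
// 8-byte pointer per document, so the record for a document is found with
// a single seek at docId * 8 -- no in-memory index needed.
public class FvxSketch {

    // A classic idf variant, computable from docFreq and maxDoc alone.
    static double idf(int docFreq, int numDocs) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    // Write one long pointer per document, in docId order.
    static void writeFvx(String path, long[] pointers) throws IOException {
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(path))) {
            for (long p : pointers) out.writeLong(p);
        }
    }

    // Random access: seek directly to the fixed-width record for docId.
    static long readPointer(String path, int docId) throws IOException {
        try (RandomAccessFile in = new RandomAccessFile(path, "r")) {
            in.seek(docId * 8L);
            return in.readLong();
        }
    }

    public static void main(String[] args) throws IOException {
        String path = java.io.File.createTempFile("seg", ".fvx").getPath();
        writeFvx(path, new long[] {0L, 173L, 4096L});
        System.out.println(readPointer(path, 2));  // prints 4096
    }
}
```

Because every record is exactly 8 bytes, merging segments is just concatenation with re-based pointers, which is what makes this layout attractive.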
Easy to write, easy to merge.

> "fvt" file - Field Vector Table. Modeled in part on the "fdt" and in
> part on the "tis" file. Each document record in this file looks like this:
>
> document_record :
> [VInt] - number of fields (only fields with the storeTermVector flag set)
> { field_record, ... } - field records, as many as specified above

I don't see why you need to organize this by field. Why can't the document_record just be:

  [VInt] - number of terms
  { term_record, ... }

> field_record :
> [VInt] - field number, just like in the "fdt" file
> [byte] - flags, don't know if we are going to need any,
> but seems like we might?

Leave 'em out if you don't have a use.

> [VInt] - maxTerm, 1+numberOfTerms - just like maxDoc. Used for
> array allocation and term weight calculations?

I don't get what this is for.

> [VInt] - numTerms, count of unique terms in the vector,
> number of term records that follow
> { term_record, ... } - term records, as many as
> specified above; represent unique terms in the field
>
> term_record:
> [VInt] - term id increment, restarts from 0 for each field
> [VInt] - term frequency in this field, used for weight
> calculations and for the count of positions in the "prx" file
> [VInt] - "prx" pointer increment, restarts from 0 for each field

As mentioned above, I think the "prx" data should at least be moved to the end of the document_record. I am also not sure what the increment is relative to. I assume it is a pointer into the "prx" file's data for a term. The index already stores (in the "tis" file) a pointer to the start of the term's data in the "prx" file. Is this relative to that? Since a vector only has one entry per term, there's nothing else in this vector that it could be relative to.

> A couple of questions on the file formats that I would
> really like feedback on:
>
> Specifically, I'm trying to identify what makes access to
> the document fields ("fdx" and "fdt" files) slow, and make
> sure I avoid those problems.
> From what I can tell, the
> only thing that makes that access slow is the size of the
> document data, in which case we have nothing to worry
> about. Is that right?

Why do you say they're slow? Because access is discouraged from the inner search loop? Currently the inner search loop only needs to read a VInt or two per document scored. These are primarily sequential reads from a file. If you add to that two random-access seeks plus reading and constructing a Document object, then search gets a *lot* slower. So restricting calls to IndexReader.doc() to documents that are to be displayed, and not consulting document field values during the selection of documents, makes search a lot faster.

> *) I don't see any place to apply the trick used in the "tii"
> and "tis" files - namely loading every 128th element into memory
> and using that as an index into a larger file. I don't think this
> can be applied because we are really not "searching" for anything,
> we just do direct access by document id. Am I missing anything?

Nope.

> Finally, users are likely to access term vectors from a given field
> only. This may be a good reason to optimize access to each
> field_record in the proposed "fvt" file.

Perhaps. I had imagined that folks would usually want the term freq vector for the whole document, rather than field by field. If you did things this way it would remove the additional redirection.

Doug
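The delta-encoded term_record layout discussed above can be sketched concretely. This is a hypothetical illustration (class and method names invented here); the VInt layout it uses — seven data bits per byte, with the high bit set when more bytes follow — matches Lucene's index file format, and the point is that with ascending term ids most increments fit in a single byte:

```java
import java.io.ByteArrayOutputStream;

// Sketch of the proposed term_record encoding: term ids are written as
// deltas (increments) from the previous id, and every value is written
// as a Lucene-style VInt, so small values cost one byte each.
public class TermRecordSketch {

    // Lucene-style VInt: low-order 7 bits per byte, high bit = continuation.
    static void writeVInt(ByteArrayOutputStream out, int i) {
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80);
            i >>>= 7;
        }
        out.write(i);
    }

    // Encode ascending term ids as VInt deltas, with frequencies interleaved.
    static byte[] encode(int[] termIds, int[] freqs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int last = 0;
        for (int k = 0; k < termIds.length; k++) {
            writeVInt(out, termIds[k] - last);  // term id increment
            writeVInt(out, freqs[k]);           // term frequency
            last = termIds[k];
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Three terms with ids 3, 5, 260: every value but the final
        // increment (255) fits in a single byte, so 6 values take 7 bytes.
        byte[] bytes = encode(new int[] {3, 5, 260}, new int[] {1, 2, 1});
        System.out.println(bytes.length);  // prints 7
    }
}
```

This also shows why assigning ids by TermEnum position matters: ids handed out in term order keep the vector's increments small and the VInts short.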