http://nagoya.apache.org/bugzilla/show_bug.cgi?id=18927

[PATCH] Term Vector support





------- Additional Comments From [EMAIL PROTECTED]  2004-02-06 21:25 -------
Wow!

I think the idea of removing the Term->int mapping is probably a good one, since
it makes vectors available for all indexes, not just optimized ones, and that's
really a requirement.  It makes things bigger and slower (e.g., a vector
dot-product will have to do string compares) but I think that's probably worth it.
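
To make the cost concrete, here is a minimal sketch of a dot-product over two
vectors keyed by term text instead of term number.  The class name and the
parallel-array layout (terms sorted by text, one frequency per term) are
assumptions for illustration, not the patch's actual API.

```java
final class TermVectorDot {
    /**
     * Dot product of two sparse vectors whose terms are sorted by text.
     * termsX[i] pairs with freqsX[i] (hypothetical layout).
     */
    static long dot(String[] termsA, int[] freqsA, String[] termsB, int[] freqsB) {
        long sum = 0;
        int i = 0, j = 0;
        // Merge-style walk: advance whichever side holds the smaller term.
        while (i < termsA.length && j < termsB.length) {
            // This string compare replaces what would be an int compare
            // if terms were mapped to numbers.
            int cmp = termsA[i].compareTo(termsB[j]);
            if (cmp == 0) {
                sum += (long) freqsA[i] * freqsB[j];
                i++;
                j++;
            } else if (cmp < 0) {
                i++;
            } else {
                j++;
            }
        }
        return sum;
    }
}
```

The walk is still linear in the combined length of the two vectors; only the
per-step comparison gets more expensive.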

Dmitry, others: what do you think of this approach?

Note that, since the vectors are sorted by term text, you can write them more
compactly by sharing string prefixes.  See SegmentTermEnum.readTerm() for an
example of how this can be done.
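
A rough sketch of prefix sharing follows.  Each term is stored as the number of
leading characters it shares with the previous term, plus the remaining suffix.
The class name and the use of java.io writeInt/writeUTF are assumptions chosen
to keep the sketch self-contained; this is not the encoding that
SegmentTermEnum itself uses.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class PrefixCodedTerms {
    /** Writes sorted terms as (shared prefix length, suffix) pairs. */
    static byte[] write(String[] sortedTerms) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(sortedTerms.length);
            String prev = "";
            for (String term : sortedTerms) {
                // Count the leading characters shared with the previous term.
                int shared = 0;
                int max = Math.min(prev.length(), term.length());
                while (shared < max && prev.charAt(shared) == term.charAt(shared)) {
                    shared++;
                }
                out.writeInt(shared);
                out.writeUTF(term.substring(shared)); // only the new suffix
                prev = term;
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Rebuilds each term from the previous term's prefix plus the stored suffix. */
    static String[] read(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            String[] terms = new String[in.readInt()];
            String prev = "";
            for (int i = 0; i < terms.length; i++) {
                int shared = in.readInt();
                terms[i] = prev.substring(0, shared) + in.readUTF();
                prev = terms[i];
            }
            return terms;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because adjacent terms in sorted order tend to share long prefixes, most
suffixes are short.  A variable-length integer for the shared length would
shrink the output further.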

It would be best to include a format version number as the first four bytes of
each file.  I'm trying to add that as we introduce new files or change the
format of existing files.  This will make it much easier to compatibly evolve
the file format.
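
As a sketch of the idea, the writer below emits a four-byte version number
before anything else, and the reader checks it before touching the rest of the
file.  The class name, the version value, and the string payload are
hypothetical placeholders.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class VersionedFormat {
    static final int FORMAT_VERSION = 1; // hypothetical value

    static byte[] write(String payload) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(FORMAT_VERSION); // always the first four bytes
            out.writeUTF(payload);        // stand-in for the real file body
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static String read(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            int version = in.readInt();
            // A later reader could branch on older versions here
            // instead of rejecting them.
            if (version != FORMAT_VERSION) {
                throw new IllegalStateException("Unknown format version: " + version);
            }
            return in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```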

A description of the new file formats will also be required before we make a
1.4 release.  Can you draft something up about this?

I haven't actually applied the patch or tried to run this yet.  One thing I
note, in glancing at the code, is that it looks like you read the positions even
when they're not asked for.  (Or did I miss something?)  It would be best if
this could be avoided as it adds file i/o and increases the in-memory size of
vectors.  Lots of vector-based computations don't care about positions.
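
One way to sketch this: let the caller say whether positions are wanted, and
skip over them otherwise.  The layout assumed here (a term count, then for each
term a frequency followed by that many positions) and all names are
hypothetical, not the patch's actual format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;

final class LazyPositions {
    /** Hypothetical in-memory vector; positions stays null unless requested. */
    static final class Vector {
        int[] freqs;
        int[][] positions;
    }

    // Assumed layout: termCount, then per term: freq followed by freq positions.
    static byte[] write(int[][] positionsPerTerm) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(positionsPerTerm.length);
            for (int[] positions : positionsPerTerm) {
                out.writeInt(positions.length);
                for (int p : positions) {
                    out.writeInt(p);
                }
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static Vector read(byte[] data, boolean withPositions) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            int n = in.readInt();
            Vector v = new Vector();
            v.freqs = new int[n];
            if (withPositions) {
                v.positions = new int[n][];
            }
            for (int i = 0; i < n; i++) {
                int freq = in.readInt();
                v.freqs[i] = freq;
                if (withPositions) {
                    v.positions[i] = new int[freq];
                    for (int k = 0; k < freq; k++) {
                        v.positions[i][k] = in.readInt();
                    }
                } else {
                    // Step over the positions without decoding or storing them.
                    int toSkip = 4 * freq;
                    while (toSkip > 0) {
                        int skipped = in.skipBytes(toSkip);
                        if (skipped <= 0) {
                            throw new EOFException();
                        }
                        toSkip -= skipped;
                    }
                }
            }
            return v;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Skipping this way saves the decoding and the memory, but the stream still
passes over the bytes.  Storing positions in a separate region or file, reached
through an offset, would let a reader avoid that i/o as well.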

Thanks!
