http://nagoya.apache.org/bugzilla/show_bug.cgi?id=18927

[PATCH] Term Vector support

------- Additional Comments From [EMAIL PROTECTED] 2004-02-06 21:25 -------

Wow!

I think the idea of removing the Term->int mapping is probably a good one, since it makes vectors available for all indexes, not just optimized ones, and that's really a requirement. It makes things bigger and slower (e.g., a vector dot-product will have to do string compares) but I think that's probably worth it. Dmitry, others: what do you think of this approach?

Note that, since the vectors are sorted by term text, you can write them in a more compact manner by sharing string prefixes. See SegmentTermEnum.readTerm() for an example of how this can be done.

It would be best to include a format version number as the first four bytes of each file. I'm trying to add that as we introduce new files or change the format of existing files. This will make it much easier to compatibly evolve the file format.

A description of the new file formats will also be required before we make a 1.4 release. Can you draft something up about this?

I haven't actually applied the patch or tried to run this yet. One thing I note, in glancing at the code, is that it looks like you read the positions even when they're not asked for. (Or did I miss something?) It would be best if this could be avoided, as it adds file i/o and increases the in-memory size of vectors. Lots of vector-based computations don't care about positions.

Thanks!
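To make the prefix-sharing and version-number suggestions concrete, here is a rough sketch of how a sorted term list could be written with a four-byte format version up front and each term stored as "number of characters shared with the previous term, plus the new suffix". This is not the code from the patch and not Lucene's actual on-disk encoding: the class name PrefixCodedTerms, the FORMAT_VERSION value, and the use of java.io.DataOutputStream with fixed-width ints are all assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of Lucene or of the patch. */
final class PrefixCodedTerms {
    /** Assumed version constant; a real format would pick its own value. */
    static final int FORMAT_VERSION = 1;

    private PrefixCodedTerms() {}

    /** Encodes terms that are already sorted by term text. */
    static byte[] encode(List<String> sortedTerms) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(FORMAT_VERSION);   // first four bytes: format version
            out.writeInt(sortedTerms.size());
            String previous = "";
            for (String term : sortedTerms) {
                int shared = sharedPrefixLength(previous, term);
                out.writeInt(shared);                  // chars reused from previous term
                out.writeUTF(term.substring(shared));  // only the differing suffix
                previous = term;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Decodes what encode() wrote, rejecting unknown format versions. */
    static List<String> decode(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int version = in.readInt();
            if (version != FORMAT_VERSION) {
                throw new IllegalStateException("Unsupported format version: " + version);
            }
            int count = in.readInt();
            List<String> terms = new ArrayList<>(count);
            String previous = "";
            for (int i = 0; i < count; i++) {
                int shared = in.readInt();
                // Rebuild the term from the previous term's prefix plus the stored suffix.
                String term = previous.substring(0, shared) + in.readUTF();
                terms.add(term);
                previous = term;
            }
            return terms;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Length of the common leading run of characters in a and b. */
    private static int sharedPrefixLength(String a, String b) {
        int limit = Math.min(a.length(), b.length());
        int i = 0;
        while (i < limit && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        return i;
    }
}
```

With terms such as "apple", "apply", "apricot", the second and third entries store only "y" and "ricot" plus a shared-prefix count. The version check on read is what would let a later format change be detected instead of silently misparsed. A real implementation would probably use variable-length ints rather than four-byte counts.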