Marvin Humphrey wrote:
> On Apr 4, 2006, at 10:23 AM, Tatu Saloranta wrote:
>> So in this case, what would give more comparable results (assuming
>> you are interested in measuring likely server-side
>> usage scenario, which is usually what Lucene is used for)
>
> Actually, I think the benchmark results illustrate that everyone
> should be at least mildly concerned about where the Term Vector data
> gets stored. KinoSearch only writes that data once. Lucene, however,
> has to read/write that data during each merge, and the more streams
> you have, the more complex the merge. It stands to reason that
> storing term vector data with the stored fields data would speed up
> the merge process.
This seems like a good idea, especially combined with the lazy
loading/retrieve-specified-fields approach that we are proposing, so
that we aren't getting the term vector every time we retrieve a
document. We could deprecate the IndexReader.getTermVector methods and
move term vector access to the Field. I'm not completely sure what the
issues are, but it makes sense, since the TV data does not change.
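
To put a rough number on the merge cost described above, here is a toy
model. It is not Lucene code: the class and method names are made up for
illustration, and it assumes a simplified policy in which every
mergeFactor equal-sized segments are merged into one larger segment. It
counts how many times per-document data (such as term vectors) gets
copied by merges, on top of the initial write.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model only: hypothetical names, simplified merge policy (assumption).
class MergeCostSketch {

    // Number of per-document copies performed by merges while indexing
    // numDocs documents, merging every mergeFactor equal-sized segments.
    static long rewrites(int numDocs, int mergeFactor) {
        Deque<Integer> segments = new ArrayDeque<>(); // segment sizes, newest first
        long copied = 0;
        for (int d = 0; d < numDocs; d++) {
            segments.push(1); // each new document starts as a one-doc segment
            while (newestRunLength(segments) >= mergeFactor) {
                int mergedSize = 0;
                for (int i = 0; i < mergeFactor; i++) {
                    mergedSize += segments.pop();
                }
                copied += mergedSize; // every doc in the merge is read and rewritten
                segments.push(mergedSize);
            }
        }
        return copied;
    }

    // How many of the newest segments share the size of the newest one.
    static int newestRunLength(Deque<Integer> segments) {
        if (segments.isEmpty()) {
            return 0;
        }
        int newest = segments.peek();
        int run = 0;
        for (int size : segments) { // iterates newest to oldest
            if (size != newest) {
                break;
            }
            run++;
        }
        return run;
    }

    public static void main(String[] args) {
        // 1000 docs, merge factor 10: three merge levels, 3000 copies.
        System.out.println(rewrites(1000, 10));
    }
}
```

Under these assumptions each document's data is rewritten about
log(numDocs) / log(mergeFactor) times, whereas a write-once layout
touches it a single time.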
> Are there any other significant applications?
Clustering. Corpora analysis/browsing. Most likely others.
--
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
School of Information Studies
335 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org
Voice: 315-443-5484
Fax: 315-443-6886