Vectors are totally per-document; it's hard to do anything smarter with them. What I mean is: IMO vectors aren't going to get better until the semantics around them improve. From the original file formats, I get the impression they were modeled a lot after stored fields, and I think that's why they will be as slow as stored fields until things are fixed.
* removing the embedded per-document schema of vectors. I can't imagine a use case for this: in general you either have vectors for docs in a given field X or you do not.
* removing the ability to store broken offsets (going backward, etc.) in vectors.
* removing the ability to store offsets without positions. Why?

As far as the current impl goes, it has fallen behind stored fields, which got a lot of improvements for 5.0. We at least gave it a little love: it has a super-fast bulk merge when no deletions are present (dirtyChunks, totalChunks, etc.), but in all other cases it is very expensive. Compression block sizes, etc. should be tuned. It should implement getMergeInstance() and keep state to avoid huge numbers of redundant decompressions on merge. Maybe a high-compression option should be looked at, though getMergeInstance() should be a prerequisite for that or it will be too slow. When the format matches between readers (typically the case, except when upgrading from older versions, etc.), it should avoid deserialization overhead if that is costly (still the case for stored fields). Fixing some of the big problems with vectors (lots of metadata/complexity needed for embedded schema info, negative numbers where there should not be any) would also enable better compression, maybe even underneath LZ4, like stored fields got in 5.0 too.

On Thu, Apr 2, 2015 at 2:51 PM, david.w.smi...@gmail.com
<david.w.smi...@gmail.com> wrote:
> I was looking at a JIRA issue someone posted pertaining to optimizing
> highlighting for when there are term vectors (SOLR-5855). I dug into the
> details a bit and learned something unexpected:
> CompressingTermVectorsReader.get(docId) fully loads all term vectors for the
> document. The client/user consuming code in question might just want the
> term vectors for a subset of all fields that have term vectors. Was this
> overlooked or are there benefits to the current approach?
> I can’t think of
> any except that perhaps there’s better compression over all the data versus
> in smaller per-field chunks; although I’d trade that any day over being able
> to just get a subset of fields. I could imagine it being useful to ask for
> some fields or all — in much the same way we handle stored field data.
>
> ~ David Smiley
> Freelance Apache Lucene/Solr Search Consultant/Developer
> http://www.linkedin.com/in/davidwsmiley
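To illustrate the getMergeInstance() point above, here's a rough sketch of the idea. This is a hypothetical toy using java.util.zip, not Lucene's actual CompressingTermVectorsReader; ToyVectorsReader, BLOCK_SIZE, and the fixed-size per-doc payloads are all invented for illustration. The stateless get() decompresses a whole block on every call, while a merge instance can safely cache the last-decoded block because merges read documents in order:

```java
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Toy "vectors" store: fixed-size per-doc payloads, compressed in blocks. */
class ToyVectorsReader {
  static final int BLOCK_SIZE = 16; // docs per compressed block (made up)
  final byte[][] blocks;            // one compressed blob per block
  final int docLen;                 // bytes per document payload
  int decompressions = 0;           // instrumentation: how often we inflate

  ToyVectorsReader(byte[][] blocks, int docLen) {
    this.blocks = blocks;
    this.docLen = docLen;
  }

  /** Decompress one block in full. */
  byte[] decode(int block) {
    try {
      Inflater inf = new Inflater();
      inf.setInput(blocks[block]);
      byte[] out = new byte[BLOCK_SIZE * docLen];
      inf.inflate(out);
      inf.end();
      decompressions++;
      return out;
    } catch (DataFormatException e) {
      throw new RuntimeException(e);
    }
  }

  /** Stateless random access: pays a full block decompression per call. */
  byte[] get(int docId) {
    byte[] block = decode(docId / BLOCK_SIZE);
    int off = (docId % BLOCK_SIZE) * docLen;
    return Arrays.copyOfRange(block, off, off + docLen);
  }

  /** Merge instance: keeping state is safe, since merges read docs in order. */
  ToyVectorsReader getMergeInstance() {
    return new ToyVectorsReader(blocks, docLen) {
      int cachedBlock = -1;
      byte[] cached;

      @Override
      byte[] get(int docId) {
        int b = docId / BLOCK_SIZE;
        if (b != cachedBlock) { // decode each block once, not once per doc
          cached = decode(b);
          cachedBlock = b;
        }
        int off = (docId % BLOCK_SIZE) * docLen;
        return Arrays.copyOfRange(cached, off, off + docLen);
      }
    };
  }

  /** Build a toy store from raw per-doc payloads (zero-pads the last block). */
  static ToyVectorsReader create(byte[][] docs, int docLen) {
    int nBlocks = (docs.length + BLOCK_SIZE - 1) / BLOCK_SIZE;
    byte[][] blocks = new byte[nBlocks][];
    for (int b = 0; b < nBlocks; b++) {
      byte[] raw = new byte[BLOCK_SIZE * docLen];
      int last = Math.min(docs.length, (b + 1) * BLOCK_SIZE);
      for (int d = b * BLOCK_SIZE; d < last; d++) {
        System.arraycopy(docs[d], 0, raw, (d % BLOCK_SIZE) * docLen, docLen);
      }
      Deflater def = new Deflater();
      def.setInput(raw);
      def.finish();
      byte[] buf = new byte[raw.length + 64];
      int n = def.deflate(buf);
      def.end();
      blocks[b] = Arrays.copyOf(buf, n);
    }
    return new ToyVectorsReader(blocks, docLen);
  }
}
```

Reading 40 docs sequentially, the stateless reader inflates 40 times; the merge instance inflates once per block (3 times for 40 docs at 16 per block). That's the whole point: the default reader stays stateless for random access, and the merge-time reader trades a little memory for skipping almost all the decompression work.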