[ https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13527180#comment-13527180 ]
Michael McCandless commented on LUCENE-4599:
--------------------------------------------
bq. Does it make sense to put this in an FST where the key is the term bytes
and the value is what you're doing now for the positions, offsets, and payloads
in a byte array?
That's a neat idea :) We should [almost] just be able to use
MemoryPostingsFormat, since it already stores all postings in an FST.
bq. I think a FST would not compress as much as what LZ4 or Deflate can do? But
maybe it could speed up TermsEnum.seekCeil on large documents so it might be an
interesting idea regarding random access speed?
Likely it would not compress as well: LZ4/Deflate can also share common infix
fragments, whereas an FST only shares prefixes and suffixes. It'd be
interesting to test ... but I think we should explore this (FST-backed
TermVectorsFormat) in a new issue ... this issue seems awesome enough
already :)
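
To make the infix point concrete, here is a rough sketch using only the JDK's
java.util.zip.Deflater (no Lucene code). The class name, the synthetic terms
and the "compression" infix are all made up for illustration. It only shows
the Deflate side of the comparison: terms that differ in both prefix and
suffix but share a middle fragment still compress well, which is the kind of
redundancy an FST would not capture.

```java
import java.nio.charset.StandardCharsets;
import java.util.Random;
import java.util.zip.Deflater;

public class InfixSharingDemo {

    /** Returns the number of bytes Deflate produces for the given input. */
    static int deflatedSize(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[4096];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    /**
     * Builds 200 newline-separated synthetic terms of the form
     * [2 varying chars][11-char infix][2 varying chars].
     * With sharedInfix, every term uses the same infix; otherwise each
     * term gets 11 random lowercase letters (fixed seed, so repeatable).
     */
    static byte[] buildTerms(boolean sharedInfix) {
        Random random = new Random(42);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 200; i++) {
            // Varying prefix, so terms do not share a common prefix.
            sb.append((char) ('a' + i % 26)).append((char) ('a' + i / 26));
            if (sharedInfix) {
                sb.append("compression");
            } else {
                for (int j = 0; j < 11; j++) {
                    sb.append((char) ('a' + random.nextInt(26)));
                }
            }
            // Varying suffix, so terms do not share a common suffix either.
            sb.append((char) ('a' + (i * 7) % 26)).append((char) ('a' + (i * 11) % 26));
            sb.append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] shared = buildTerms(true);
        byte[] control = buildTerms(false);
        System.out.println("raw bytes:            " + shared.length);
        System.out.println("shared infix deflate: " + deflatedSize(shared));
        System.out.println("random infix deflate: " + deflatedSize(control));
    }
}
```

This says nothing about how an actual FST-backed TermVectorsFormat would
compare; that would still need to be measured on real term vectors.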
bq. Or... can we simply reference the terms by ord (an int) instead of writing
each term bytes?
Using ords matching the main terms dict is a neat idea too! It would be much
more compact ... but when reading the term vectors we'd need to resolve each
ord against the main terms dictionary. Not all postings formats support that
(it's optional, and e.g. our default PF doesn't), and the lookup would likely
be slower than today.
bq. Is that information available somewhere when writing/merging term vectors?
Unfortunately, no. We only assign ords when it's time to flush the segment ...
but we write term vectors "live" as we index each document. If we changed
that, e.g. by buffering up term vectors until flush, then we could get the
ords when we wrote them.
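
A minimal sketch of the buffering idea, in plain Java rather than against the
real indexing chain: BufferedTermVectors, addDocument and flush are
hypothetical names, not Lucene APIs. It assumes each document's terms are held
in memory until flush, at which point the unique terms are sorted, ords are
assigned in sorted order, and each buffered vector is rewritten as ords.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class BufferedTermVectors {

    // Per-document term lists, held in memory until flush (hypothetical).
    private final List<List<String>> bufferedDocs = new ArrayList<>();

    /** Buffers one document's terms instead of writing them "live". */
    void addDocument(List<String> terms) {
        bufferedDocs.add(new ArrayList<>(terms));
    }

    /**
     * At flush time: sort the unique terms, assign each an ord (its rank
     * in sorted order), and return every buffered document as ords.
     * Note: TreeMap orders Strings by UTF-16 code unit, which is only an
     * approximation of the byte order a real terms dictionary would use.
     */
    List<int[]> flush() {
        TreeMap<String, Integer> ords = new TreeMap<>();
        for (List<String> doc : bufferedDocs) {
            for (String term : doc) {
                ords.put(term, 0); // placeholder, real ord assigned below
            }
        }
        int nextOrd = 0;
        for (Map.Entry<String, Integer> entry : ords.entrySet()) {
            entry.setValue(nextOrd++);
        }
        List<int[]> result = new ArrayList<>();
        for (List<String> doc : bufferedDocs) {
            int[] docOrds = new int[doc.size()];
            for (int i = 0; i < docOrds.length; i++) {
                docOrds[i] = ords.get(doc.get(i));
            }
            result.add(docOrds);
        }
        bufferedDocs.clear();
        return result;
    }
}
```

The obvious cost is the extra memory for the buffered vectors, and readers
would still need a way to map an ord back to its term bytes.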
> Compressed term vectors
> -----------------------
>
> Key: LUCENE-4599
> URL: https://issues.apache.org/jira/browse/LUCENE-4599
> Project: Lucene - Core
> Issue Type: Task
> Components: core/codecs, core/termvectors
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Priority: Minor
> Fix For: 4.1
>
> Attachments: LUCENE-4599.patch
>
>
> We should have codec-compressed term vectors similarly to what we have with
> stored fields.