[
https://issues.apache.org/jira/browse/LUCENE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231040#comment-14231040
]
Robert Muir commented on LUCENE-5914:
-------------------------------------
I opened LUCENE-6085 for the SI.attributes, which should help with cleanup.
I ran some benchmarks on various datasets to get an idea of where this stands,
and the results are disappointing. For geonames, the new format increases the
size of the stored fields by 50%; for Apache HTTP Server logs, it doubles the
size. Indexing time is also significantly slower for every dataset I test:
there must be bugs in the lz4+shared-dictionary code?
||impl||size (bytes)||index time (ms)||force merge time (ms)||
|trunk|372,845,278|101,745|15,976|
|patch (BEST_SPEED)|780,861,727|141,699|60,114|
|patch (BEST_COMPRESSION)|265,063,340|132,238|53,561|
To confirm it's a bug and not just the cost of additional I/O (due to weaker
compression with shared dictionaries), I set the deflate level to 0 and indexed
with the BEST_COMPRESSION layout to really jack up the size. Sure enough, it
created a 1.8GB stored fields file, but did so in 126,093ms with 44,377ms of
merging. That is faster than both options in the patch...
Anyway, this leads to more questions:
* Do we really need a completely separate lz4 impl for shared-dictionary
support? It's tough to understand, e.g., why it reimplements the hash table
differently, and so on.
* Do we really need to share code between different stored fields impls that
have different use cases and goals? I think the patch currently overshares
here, and the additional abstractions make it hard to work with.
* Along with the sharing concern above: we can still reuse code between
formats. For example, the document<->byte conversion could be shared static
methods. I would just avoid subclassing and interfaces, because I get lost in
the patch too easily. And we need to be careful that any shared code is simple
and clear, because we have to assume the formats will evolve over time.
* We shouldn't wrap the deflate case with the zlib header/footer. This saves a
few bytes per chunk; see the sketch after this list.
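To illustrate the zlib point: java.util.zip.Deflater takes a nowrap flag that
emits raw DEFLATE data without the 2-byte zlib header and 4-byte Adler-32
trailer. A minimal, self-contained sketch (the class name and sample input are
mine, not from the patch):
{code:java}
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

public class RawDeflateSketch {
  public static void main(String[] args) {
    byte[] doc = "some stored field bytes".getBytes(StandardCharsets.UTF_8);

    // nowrap=true drops the zlib wrapper: 6 bytes saved per compressed chunk,
    // and the Adler-32 checksum is redundant given Lucene's own file checksums.
    Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION, /*nowrap=*/ true);
    deflater.setInput(doc);
    deflater.finish();

    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[8192];
    while (!deflater.finished()) {
      out.write(buf, 0, deflater.deflate(buf));
    }
    deflater.end();
    System.out.println("compressed " + doc.length + " bytes to " + out.size());
  }
}
{code}
The matching Inflater then has to be constructed with nowrap=true as well, or
decompression will fail.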
About the oversharing issue: I really think the separate formats should just be
separate formats; it will make life easier. It's more than just a difference in
compression algorithm, and we shouldn't try to structure things so that one can
simply be swapped in for the other: I think that's not the right tradeoff.
For example, with high compression it's more important to lay the data out in a
way where bulk merge doesn't cause re-compression, even if that causes
'temporary' waste along segment boundaries. This matters because compression
here gets very costly, and for e.g. the "archiving" case bulk merge should be
potent, as there shouldn't be many deletions: we shouldn't bear the cost of
re-compressing over and over. This gets much, much worse if you try to use
something "better" than gzip, too.
On the other hand, with low compression we should ensure merging stays fast
even in the presence of deletions. The shared-dictionary approach is one way;
another is to have at least the getMergeInstance() reader remember the current
block and implement a "seek within block" optimization, which is probably
simpler and better than what trunk does today (sketch below).
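A hypothetical sketch of that optimization; BlockStore and its methods are
invented for illustration, standing in for the real chunk format:
{code:java}
import java.io.IOException;

class CachingBlockReader {
  interface BlockStore {
    int blockOf(int docID);                             // which block holds docID
    byte[] decompressBlock(int block) throws IOException;
    byte[] sliceDocument(byte[] blockBytes, int docID); // seek within the block
  }

  private final BlockStore store;
  private int currentBlock = -1;
  private byte[] currentBlockBytes;

  CachingBlockReader(BlockStore store) {
    this.store = store;
  }

  // During a merge docIDs arrive in order, so each block is decompressed at
  // most once, even when deletions make the sequence non-contiguous.
  byte[] document(int docID) throws IOException {
    int block = store.blockOf(docID);
    if (block != currentBlock) {
      currentBlockBytes = store.decompressBlock(block);
      currentBlock = block;
    }
    return store.sliceDocument(currentBlockBytes, docID);
  }
}
{code}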
> More options for stored fields compression
> ------------------------------------------
>
> Key: LUCENE-5914
> URL: https://issues.apache.org/jira/browse/LUCENE-5914
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Fix For: 5.0
>
> Attachments: LUCENE-5914.patch, LUCENE-5914.patch, LUCENE-5914.patch,
> LUCENE-5914.patch, LUCENE-5914.patch
>
>
> Since we added codec-level compression in Lucene 4.1, I have had about the
> same number of users complain that compression was too aggressive as complain
> that it was too light.
> I think this is because users do very different things with Lucene. For
> example, if you have a small index that fits in the filesystem cache (or
> close to it), then you might never pay for actual disk seeks, and in such a
> case the fact that the current stored fields format needs to over-decompress
> data can noticeably slow down cheap queries.
> On the other hand, it is more and more common to use Lucene for things like
> log analytics, and in that case you have huge amounts of data for which you
> don't care much about stored fields performance. However, it is very
> frustrating to notice that your data takes several times less space when you
> gzip it than it does in your index, even though Lucene claims to compress
> stored fields.
> For that reason, I think it would be nice to have some kind of option that
> allows trading speed for compression in the default codec.