[jira] [Commented] (LUCENE-5914) More options for stored fields compression

2014-12-05 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14235404#comment-14235404
 ] 

Adrien Grand commented on LUCENE-5914:
--

+1 to the patch

Looks like my comment made you angry, but it was really just something I was 
wondering about. I'm fine with keeping things this way.

 More options for stored fields compression
 --

 Key: LUCENE-5914
 URL: https://issues.apache.org/jira/browse/LUCENE-5914
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
 Fix For: 5.0

 Attachments: LUCENE-5914.patch, LUCENE-5914.patch, LUCENE-5914.patch, 
 LUCENE-5914.patch, LUCENE-5914.patch, LUCENE-5914.patch, LUCENE-5914.patch, 
 LUCENE-5914.patch


 Since we added codec-level compression in Lucene 4.1, I have received about as 
 many complaints that compression was too aggressive as that it was too light.
 I think this is because users do very different things with Lucene. For 
 example, if you have a small index that fits in the filesystem cache (or 
 nearly does), you might never pay for actual disk seeks, and in such a case 
 the fact that the current stored fields format needs to over-decompress data 
 can noticeably slow down search on cheap queries.
 On the other hand, it is more and more common to use Lucene for things like 
 log analytics, where you have huge amounts of data and don't care much about 
 stored fields performance. However, it is very frustrating to notice that the 
 data you store takes several times less space when you gzip it than it does 
 in your index, although Lucene claims to compress stored fields.
 For that reason, I think it would be nice to have options in the default 
 codec that allow trading speed for compression.






[jira] [Commented] (LUCENE-5914) More options for stored fields compression

2014-12-05 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14236556#comment-14236556
 ] 

ASF subversion and git services commented on LUCENE-5914:
-

Commit 1643490 from [~rcmuir] in branch 'dev/trunk'
[ https://svn.apache.org/r1643490 ]

LUCENE-5914: More options for stored fields compression




[jira] [Commented] (LUCENE-5914) More options for stored fields compression

2014-12-05 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14236574#comment-14236574
 ] 

ASF subversion and git services commented on LUCENE-5914:
-

Commit 1643491 from [~rcmuir] in branch 'dev/branches/branch_5x'
[ https://svn.apache.org/r1643491 ]

LUCENE-5914: More options for stored fields compression




[jira] [Commented] (LUCENE-5914) More options for stored fields compression

2014-12-04 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234059#comment-14234059
 ] 

Adrien Grand commented on LUCENE-5914:
--

+1, thanks Robert. I am good with a constructor argument; the only reason I 
initially added it as a protected method was to be consistent with the postings 
and doc values formats.

Regarding the patch, it just feels weird to me to have this 
{{Objects.requireNonNull(mode)}} validation in the {{Lucene50Codec(Mode)}} 
constructor; I would have expected it to be solely the responsibility of the 
{{Lucene50StoredFieldsFormat}} constructor?
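
For illustration, a minimal standalone sketch of the placement being suggested 
(the class and enum names mirror the patch, everything else here is simplified 
and hypothetical):

{code:java}
import java.util.Objects;

enum Mode { BEST_SPEED, BEST_COMPRESSION }

class Lucene50StoredFieldsFormat {
  final Mode mode;

  Lucene50StoredFieldsFormat(Mode mode) {
    // single point of validation, in the class that actually owns the mode
    this.mode = Objects.requireNonNull(mode, "mode");
  }
}

class Lucene50Codec {
  final Lucene50StoredFieldsFormat storedFields;

  Lucene50Codec(Mode mode) {
    // no duplicate requireNonNull here: the format constructor rejects null
    this.storedFields = new Lucene50StoredFieldsFormat(mode);
  }
}
{code}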




[jira] [Commented] (LUCENE-5914) More options for stored fields compression

2014-12-04 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234141#comment-14234141
 ] 

Robert Muir commented on LUCENE-5914:
-

I want the null check to be robust to code changes. Both classes are public, 
and I am afraid someone will remove the checks thinking it makes something faster.

We should add tests for this too.
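
A sketch of the kind of test that could pin this down (plain JUnit style; 
Lucene's actual test framework would look slightly different):

{code:java}
import static org.junit.Assert.fail;

import org.junit.Test;

public class TestStoredFieldsModeNullChecks {
  @Test
  public void testNullModeRejected() {
    try {
      // the codec (via the stored fields format) must reject a null mode,
      // even if someone later removes one of the requireNonNull calls
      new Lucene50Codec(null);
      fail("expected NullPointerException");
    } catch (NullPointerException expected) {
      // ok
    }
  }
}
{code}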




[jira] [Commented] (LUCENE-5914) More options for stored fields compression

2014-12-02 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231144#comment-14231144
 ] 

Adrien Grand commented on LUCENE-5914:
--

For the benchmarks, I think there are two reasons:
 - merging is not optimized at the byte level at all (yet?)
 - making each document compressible on its own makes the compression more 
complicated (hence slower)

Let's move forward with your idea to reuse the LZ4 HC format as a first step. I 
will try to improve the shared-dictionary approach in a different issue.




[jira] [Commented] (LUCENE-5914) More options for stored fields compression

2014-12-02 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231163#comment-14231163
 ] 

Robert Muir commented on LUCENE-5914:
-

The benchmarks I did here are for HighCompressingCodec, aka 
CompressionMode.HIGH_COMPRESSION, which is actually Deflate, not LZ4 HC. I 
benchmarked LZ4 HC as well, but it didn't seem like the right tradeoff: slow 
writes for not much benefit.




[jira] [Commented] (LUCENE-5914) More options for stored fields compression

2014-12-02 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232189#comment-14232189
 ] 

Adrien Grand commented on LUCENE-5914:
--

This looks like a good start! Nice that we can use segment attributes now!




[jira] [Commented] (LUCENE-5914) More options for stored fields compression

2014-12-01 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14229698#comment-14229698
 ] 

Adrien Grand commented on LUCENE-5914:
--

bq. It's also a bit weird to be slurping in 3 read-once metadata files here. 
This adds complexity at read time... the current format is simpler here with 
just a single file. Can we avoid this?

I tried to look into it, but it's not easy. Lucene41 has its own custom stored 
fields index, which is mostly the same thing as MonotonicBlockPackReader, so 
for this new codec I wanted to move the index to MonotonicBlockPackReader.

The index for stored fields basically stores two pieces of information: the 
first doc ID of each block and the start pointer of each block. In Lucene41 the 
two were interleaved, but that is not something the MonotonicBlockPackWriter 
allows, which is why there are two files: one for doc IDs and one for start 
pointers. The second limitation is that, at read time, you need to know up 
front how many values the MonotonicBlockPackReader stores in order to decode 
it. This is why we have a third file for metadata that stores the number of 
documents in the segment when StoredFieldsWriter.finish is called.

I agree that having 3 read-once files might look strange, but it's probably 
better than having to re-implement specialized monotonic encoding?
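
An illustrative sketch of that layout (the class and method names here are 
simplified stand-ins, not the patch's actual APIs):

{code:java}
import java.util.ArrayList;
import java.util.List;

// Two monotonically increasing sequences go to separate "files" because the
// monotonic writer cannot interleave them, and the count goes to a third
// metadata "file" because a monotonic block-packed stream can only be
// decoded when the reader knows the number of values up front.
class StoredFieldsIndexSketch {
  final List<Integer> firstDocIds = new ArrayList<>();  // doc-id file
  final List<Long> startPointers = new ArrayList<>();   // pointer file
  long numValues;                                       // metadata file

  void writeBlock(int firstDocId, long startPointer) {
    firstDocIds.add(firstDocId);      // monotonic: doc IDs only grow
    startPointers.add(startPointer);  // monotonic: file pointers only grow
    numValues++;
  }

  void finish() {
    // persist numValues so a reader can size its decoder before decoding
  }
}
{code}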




[jira] [Commented] (LUCENE-5914) More options for stored fields compression

2014-12-01 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14229708#comment-14229708
 ] 

Robert Muir commented on LUCENE-5914:
-

{quote}
The second limitation is that, at read time, you need to know up front how many 
values the MonotonicBlockPackReader stores in order to decode it. This is why 
we have a third file for metadata that stores the number of documents in the 
segment when StoredFieldsWriter.finish is called.
{quote}

But this is redundant with SegmentInfo?




[jira] [Commented] (LUCENE-5914) More options for stored fields compression

2014-12-01 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14229716#comment-14229716
 ] 

Adrien Grand commented on LUCENE-5914:
--

Sorry, I meant number of _blocks_ instead of number of _documents_.




[jira] [Commented] (LUCENE-5914) More options for stored fields compression

2014-12-01 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231040#comment-14231040
 ] 

Robert Muir commented on LUCENE-5914:
-

I opened LUCENE-6085 for the SI.attributes, which should help with cleanup.

I ran some benchmarks on various datasets to get an idea of where this is at; 
the results are disappointing. For geonames, the new format increases the size 
of the stored fields by 50%; for Apache HTTP server logs, it doubles the size. 
Indexing time is significantly slower for every dataset I test as well: there 
must be bugs in the LZ4 + shared dictionary?

||impl||size (bytes)||index time (ms)||force merge time (ms)||
|trunk|372,845,278|101,745|15,976|
|patch(BEST_SPEED)|780,861,727|141,699|60,114|
|patch(BEST_COMPRESSION)|265,063,340|132,238|53,561|

To confirm it's a bug and not just the cost of additional I/O (due to less 
compression with shared dictionaries), I set the deflate level to 0 and indexed 
with the BEST_COMPRESSION layout to really jack up the size. Sure, it created a 
1.8GB stored fields file, but in 126,093ms with 44,377ms of merging. This is 
faster than both of the options in the patch...

Anyway, this leads to more questions:
* Do we really need a completely separate lz4 impl for the shared-dictionaries 
support? It's tough to understand, e.g., why it reimplements the hashtable 
differently and so on.
* Do we really need to share code between different stored fields impls that 
have different use cases and goals? I think the patch currently overshares 
here, and the additional abstractions make it hard to work with.
* Along with the sharing approach above: we can still reuse code between 
formats, though. For example, the document-to-bytes handling could be shared 
static methods. I would just avoid subclassing and interfaces because I get 
lost in the patch too easily. And we need to be careful that any shared code is 
simple and clear, because we have to assume the formats will evolve over time.
* We shouldn't wrap the deflate case with a zlib header/footer. Skipping them 
saves a little bit.

About the oversharing issue: I really think the separate formats should just be 
separate formats; it will make life easier. It's more than just a difference in 
compression algorithm, and we shouldn't try to structure things so that one can 
just be swapped in; I think that's not the right tradeoff.

For example, with high compression it is more important to lay things out in a 
way where bulk merge doesn't cause re-compressions, even if it causes 
'temporary' waste along segment boundaries. This matters because compression 
here gets very costly, and for e.g. the archiving case, bulk merge should be 
potent since there shouldn't be many deletions: we shouldn't bear the cost of 
re-compressing over and over. This gets much, much worse if you try to use 
something better than gzip, too.

On the other hand, with low compression we should ensure merging is still fast 
even in the presence of deletions: the shared-dictionary approach is one way; 
another is to just have at least the getMergeInstance() remember the current 
block and optimize seeks within the block, which is probably simpler and better 
than what trunk does today.
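
A rough, self-contained model of that "remember the current block" idea 
(illustrative only, not the real stored fields format):

{code:java}
// Documents live in fixed-size blocks; decompressing a block is the
// expensive step, so the merge instance caches the last block it decoded.
// DOCS_PER_BLOCK and the String payload are stand-ins for the real format.
class BlockCachingMergeReader {
  static final int DOCS_PER_BLOCK = 128;

  private int cachedBlock = -1;
  int decompressions = 0;  // counts the expensive operation

  String document(int docID) {
    int block = docID / DOCS_PER_BLOCK;
    if (block != cachedBlock) {
      decompressions++;    // stand-in for decompressing the whole block
      cachedBlock = block;
    }
    // the real reader would now just seek within the cached block
    return "doc-" + docID;
  }
}
{code}

Since merges consume doc IDs mostly in order, reading 10,000 documents this way 
costs on the order of 79 block decompressions instead of 10,000.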



[jira] [Commented] (LUCENE-5914) More options for stored fields compression

2014-12-01 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231066#comment-14231066
 ] 

Robert Muir commented on LUCENE-5914:
-

I also want to propose a new way to proceed here. In my opinion this issue 
tries to do a lot at once:

* make changes to the default codec
* support high and low compression options in the default codec with backwards 
compatibility
* provide some easy way to choose between supported options without having to 
use FilterCodec
* new lz4 implementation
* new deflate implementation

I think it's too scary to do all of this at once. I would prefer we start by 
exposing the current CompressionMode.HIGH_COMPRESSION as the high-compression 
option. At least for that one test dataset I used above (2GB of highly 
compressible Apache server logs), it is reasonably competitive with the deflate 
option on this issue:
||impl||size (bytes)||index time (ms)||force merge time (ms)||
|trunk_HC|275,262,504|143,264|49,030|

But more importantly, HighCompressingCodec has been baking in our test suite 
for years, with the scary bugs knocked out of it, etc.
I think we should first figure out the plumbing to expose that; it's something 
we could realistically do for Lucene 5.0 and have confidence in. There is still 
plenty to do to make that option work: exposing the configuration option, 
addressing concerns about back-compat testing (we should generate back-compat 
indexes both ways), and so on. But at least there is a huge head start on 
testing, code correctness, etc.: it's baked.

For newly proposed formats (LZ4 with shared dictionary, deflate, whatever), I 
think we should address each one individually, adding it to the codecs/ package 
first, getting it into tests, and letting it bake in a similar way... it 
doesn't need to be years, but we should split these concerns.




[jira] [Commented] (LUCENE-5914) More options for stored fields compression

2014-12-01 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231118#comment-14231118
 ] 

Robert Muir commented on LUCENE-5914:
-

Applying the tuning parameters proposed for the deflate case here to the trunk 
code is a trivial and safe patch, and more compelling:

||impl||size (bytes)||index time (ms)||force merge time (ms)||
|trunk_HC_level3_24576|269,857,651|118,150|28,313|
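
For readers wondering what applying such tuning could look like, a rough sketch 
of a FilterCodec that swaps in the existing high-compression format with a 
24576-byte chunk size (the constructor signatures here are from memory and may 
not match exactly; the deflate-level tweak itself would additionally require a 
custom CompressionMode implementation):

{code:java}
import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.StoredFieldsFormat;
import org.apache.lucene.codecs.compressing.CompressingStoredFieldsFormat;
import org.apache.lucene.codecs.compressing.CompressionMode;
import org.apache.lucene.codecs.lucene410.Lucene410Codec;

// Hypothetical: roughly the "trunk_HC_level3_24576" configuration above,
// minus the deflate level, which HIGH_COMPRESSION fixes internally.
public class TunedHighCompressionCodec extends FilterCodec {
  private final StoredFieldsFormat storedFields =
      new CompressingStoredFieldsFormat("TunedHC", CompressionMode.HIGH_COMPRESSION, 24576);

  public TunedHighCompressionCodec() {
    super("TunedHighCompressionCodec", new Lucene410Codec());
  }

  @Override
  public StoredFieldsFormat storedFieldsFormat() {
    return storedFields;
  }
}
{code}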





[jira] [Commented] (LUCENE-5914) More options for stored fields compression

2014-11-30 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14229170#comment-14229170
 ] 

Robert Muir commented on LUCENE-5914:
-

I will think more on the header check logic. On one hand it would be nice if we 
could figure out an idiom for {{CodecUtil.checkFooter(Throwable, Input...)}} 
that would be safe, but every option I think of is crazier than the current 
stuff.

It's also a bit weird to be slurping in 3 read-once metadata files here. This 
adds complexity at read time... the current format is simpler here with just a 
single file. Can we avoid this?




[jira] [Commented] (LUCENE-5914) More options for stored fields compression

2014-09-04 Thread Erick Erickson (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121459#comment-14121459
 ] 

Erick Erickson commented on LUCENE-5914:


Here's an example of what I was remembering. I haven't verified it; it could be 
totally unrelated to compression. The use case seems to be very large documents 
(up to 50M), with indexing time going from 3 hours to 3 days. There may very 
well be stuff going on here that is unrelated, but...

Skipping the discussion about what the point of 50M docs is ;)

http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201409.mbox/%3C27589A9FF2E7CC42B56140FD01CEC3A2EF794C56%40clrc-nt-mbx01p.corp.org.local%3E




[jira] [Commented] (LUCENE-5914) More options for stored fields compression

2014-09-04 Thread Ryan Ernst (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121721#comment-14121721
 ] 

Ryan Ernst commented on LUCENE-5914:


[~erickerickson] That example doesn't appear to even have stored fields enabled?

{quote}
<field name="majorTextSignalStem" type="text_stem" indexed="true" 
*stored="false"* multiValued="true" omitNorms="false"/>
{quote}




[jira] [Commented] (LUCENE-5914) More options for stored fields compression

2014-09-04 Thread Erick Erickson (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121916#comment-14121916
 ] 

Erick Erickson commented on LUCENE-5914:


[~rjernst] Right, that particular field doesn't; I was assuming that other 
fields did.

Hmmm, let me ping him back and we'll see if we can get him to experiment for us.






[jira] [Commented] (LUCENE-5914) More options for stored fields compression

2014-09-03 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14119918#comment-14119918
 ] 

Adrien Grand commented on LUCENE-5914:
--

bq. Ideally we would be able to use a biggish chunk_size during compression to 
improve the compression ratio, yet decompress only a single document 
(independent of chunk_size), just like you proposed here (if I figured it out 
correctly?)

Exactly, this is one of the two proposed options. The only overhead would be 
that you would need to read the shared dictionary and have it in memory (but 
that is a single call to readBytes and its size can be controlled so that 
should be no issue).

bq. Usually, such data is highly compressible (imagine all these low-cardinality 
fields like the color of something...) and even basic compression does the 
magic.

Agreed, this is the reason why I'd prefer the low-overhead option to be 
something cheap rather than no compression at all: data usually has lots of 
patterns and even something as simple as LZ4 manages to reach interesting 
compression ratios.

{quote}
Conclusion: compression is great, and anything that helps tweak this balance 
(CPU effort vs. I/O effort) smoothly across the indexing and retrieving phases 
makes Lucene's use-case coverage broader (e.g. I want to afford more CPU during 
indexing and less CPU during retrieval, a static coder being the extreme case 
of this...).

I am not sure I figured out exactly if and how this patch is going to help in 
such cases (how do we achieve reasonable compression if we do per-document 
compression of small documents? Reusing dictionaries from previous chunks? 
Static dictionaries?).
{quote}

The trade-off that this patch makes is:
 - keep indexing fast enough in all cases
 - allow trading random-access speed to documents for index compression

The current patch provides two options:
  * either we compress documents in blocks like today but with Deflate instead 
of LZ4, which provides good compression ratios but makes random access quite 
slow, since you need to decompress a whole block of documents every time you 
want to access a single document,
  * or we still group documents into blocks but compress them individually, 
using the compressed representation of the previous documents as a dictionary.

I'll try to explain the 2nd option better. It works well because lz4 mostly 
deduplicates sequences of bytes in a stream. So imagine that you have the 
following 3 documents in a block:
 1. abcdefghabcdwxyz
 2. abcdefghijkl
 3. abcdijklmnop

We will first compress document 1. Given that it is the first document in the 
block, there is no shared dictionary, so the compressed representation looks 
like this ({{literals}} means that bytes are copied as-is, and {{ref}} is a 
reference to a previous sequence of bytes; this is how lz4 works, it just 
replaces existing sequences of bytes with references to previous occurrences of 
the same bytes, and the more references you have and the longer they are, the 
better the compression ratio):

literals:abcdefgh ref:abcd literals:wxyz

Now we are going to compress document 2. It doesn't contain any repetition of 
bytes, so if we wanted to compress it individually, we would just have 
{{literals:abcdefghijkl}}, which doesn't compress at all (and is even slightly 
larger due to the overhead of the format). However, we are using the compressed 
representation of document 1 as a dictionary, and abcdefgh exists in its 
literals, so we can make a reference to it!

ref:abcdefgh literals:ijkl

And again for document 3, using the literals of document 1 for abcd and the 
literals of document 2 for ijkl:

ref:abcd ref:ijkl literals:mnop
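
The same shared-dictionary effect can be demonstrated with java.util.zip, which 
supports preset dictionaries for Deflate (the patch uses a custom LZ4 
implementation, so this is only an analogy):

{code:java}
import java.util.zip.Deflater;

public class SharedDictDemo {
  // compress input, optionally priming the compressor with a dictionary
  static int compressedSize(byte[] input, byte[] dictionary) {
    Deflater d = new Deflater(Deflater.BEST_COMPRESSION);
    if (dictionary != null) {
      d.setDictionary(dictionary);  // prior document(s) act as the dictionary
    }
    d.setInput(input);
    d.finish();
    byte[] out = new byte[256];
    int len = d.deflate(out);
    d.end();
    return len;
  }

  public static void main(String[] args) {
    byte[] doc1 = "abcdefghabcdwxyz".getBytes();
    byte[] doc2 = "abcdefghijkl".getBytes();
    // doc2 alone has no internal repetition, but with doc1 as a preset
    // dictionary its leading "abcdefgh" becomes a back-reference
    System.out.println("doc2 alone:     " + compressedSize(doc2, null));
    System.out.println("doc2 with dict: " + compressedSize(doc2, doc1));
  }
}
{code}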


[jira] [Commented] (LUCENE-5914) More options for stored fields compression

2014-09-03 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14119977#comment-14119977
 ] 

Eks Dev commented on LUCENE-5914:
-

Lovely, thanks for explaining. I expected something like this but was not 100% 
sure without looking into the code.
Simply, I see absolutely nothing one might wish for from general, OOTB 
compression support...

In theory...
The only meaningful enhancements beyond the standard approach can come from 
modelling the semantics of the data (the user must know quite a bit about its 
distribution) to improve compression/speed, but this cannot be provided by the 
core (Lucene is rightly content-agnostic); at most the core APIs might make it 
more or less comfortable, but IMO nothing more.

For example (contrived, as LZ4 would deal with this quite well, just to 
illustrate): if I know that my field contains at most 5 distinct string values, 
I might add simple dictionary coding to use at most one byte per value, without 
even going down to the codec level, as in the sketch below. The only place 
where I see a theoretical need to go down and dirty is if I wanted to reach 
sub-byte representations (3 bits per value in this example), but that is rarely 
needed, hard to beat with default LZ4/Deflate, and even harder not to make 
slow. At the end of the day, someone who needs this kind of specialisation 
should be able to write their own per-field codec.
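
A sketch of that kind of application-level dictionary coding (purely 
illustrative; this lives in application code, not in a codec):

{code:java}
import java.util.HashMap;
import java.util.Map;

// Maps up to 256 distinct string values to single-byte codes and back.
class StringDictionaryCoder {
  private final Map<String, Byte> codes = new HashMap<>();
  private final String[] values = new String[256];
  private int next = 0;

  byte encode(String value) {
    Byte code = codes.get(value);
    if (code == null) {
      if (next == 256) {
        throw new IllegalStateException("dictionary full");
      }
      code = (byte) next;
      values[next++] = value;
      codes.put(value, code);
    }
    return code;
  }

  String decode(byte code) {
    return values[code & 0xFF];
  }
}
{code}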

Great work, and thanks again!

 




[jira] [Commented] (LUCENE-5914) More options for stored fields compression

2014-09-02 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117986#comment-14117986
 ] 

Eks Dev commented on LUCENE-5914:
-

bq. Do you have pointers to emails/irc logs describing such issues?

I do not know what the gold-standard Lucene usage is, but there is at least one 
use case I can describe; maybe it helps. I am not proposing anything here, just 
sharing experience.

Think about the (typical Lucene?) usage with structured data (e.g. indexing a 
relational DB, like a product catalog or such) with many smallish fields, and 
then retrieving 2k such documents to post-process them: classify them, cluster 
them, or whatnot (e.g. Mahout and co.).

- Default compression with CHUNK_SIZE makes it decompress 2k * CHUNK_SIZE/2 
bytes on average in order to retrieve 2k documents (each document sits at a 
random offset within its chunk, so on average half a chunk has to be 
decompressed per document).
- Reducing chunk_size helps a lot, but there is a sweet spot: if you reduce it 
too much, you will not get enough compression, your index no longer fits into 
the cache, and you get hurt on I/O.

Ideally we would be able to use a biggish chunk_size during compression to 
improve the compression ratio, yet decompress only a single document 
(independent of chunk_size), just like you proposed here (if I figured it out 
correctly?)

Usually, such data is highly compressible (imagine all these low-cardinality 
fields like the color of something...) and even basic compression does the 
magic.

What did we do?
- Reduced chunk_size.
- As a bonus to improve compression, added plain static dictionary compression 
for a few fields in the update chain (we store analysed fields).
- When applicable, we periodically pre-sort the collection before indexing (on 
low-cardinality fields first); this old DB-admin secret weapon helps a lot, as 
the sketch below illustrates.
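
A toy illustration of the pre-sorting trick (ordering by a low-cardinality 
field groups similar documents into the same compression chunks, which helps 
any chunk-based compressor):

{code:java}
import java.util.Arrays;

public class PreSortDemo {
  public static void main(String[] args) {
    // {color, productType} records; color is the low-cardinality field
    String[][] docs = {
      {"red", "shirt"}, {"blue", "pants"}, {"red", "pants"}, {"blue", "shirt"},
    };
    // sort on the low-cardinality field first, then feed to the indexer
    Arrays.sort(docs, (a, b) -> a[0].compareTo(b[0]));
    for (String[] doc : docs) {
      System.out.println(String.join(",", doc));
    }
  }
}
{code}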

Conclusion: compression is great, and anything that helps tweak this balance 
(CPU effort vs. I/O effort) smoothly across the indexing and retrieving phases 
makes Lucene's use-case coverage broader (e.g. I want to afford more CPU during 
indexing and less CPU during retrieval, a static coder being the extreme case 
of this...).

I am not sure I figured out exactly if and how this patch is going to help in 
such cases (how do we achieve reasonable compression if we do per-document 
compression of small documents? Reusing dictionaries from previous chunks? 
Static dictionaries?).

In any case, thanks for doing the heavy lifting here! I think you already did a 
really great job with compression in Lucene.

PS: Ages ago, before Lucene, when memory was really expensive, we had our own 
serialization (not in Lucene) that simply had one static Huffman coder per 
field (with byte or word symbols), with the code table populated offline. That 
was a great, simple option, as it enabled reasonable compression for slowly 
changing collections and really fast random access.
 




[jira] [Commented] (LUCENE-5914) More options for stored fields compression

2014-09-01 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14117242#comment-14117242
 ] 

Andrzej Bialecki  commented on LUCENE-5914:
---

bq. Do you have pointers to emails/irc logs describing such issues?
I've seen this issue on AWS small instances, where CPU steal time and not I/O 
can easily become the limiting factor.




[jira] [Commented] (LUCENE-5914) More options for stored fields compression

2014-09-01 Thread Erick Erickson (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14117637#comment-14117637
 ] 

Erick Erickson commented on LUCENE-5914:


These were some users-list discussions; I can't lay my hands on them right now.

Whether there were work-arounds or not, I've always been uncomfortable with not 
having the option to turn off compression. There are just too many places where 
people do non-standard things, like storing not-very-many huge documents, to 
take that option away from them unless we can _guarantee_ that trading I/O for 
CPU is A Good Thing in all configurations. And I don't believe we can make that 
guarantee.

That said, I'm not the one doing the coding, so I don't know the ins and outs 
here. If it's easy to turn compression off, then I think it's worth doing. If 
it's major surgery, OTOH, I don't have any hot complaints to point to, so 
insisting that you do the work because of a vaguely remembered users-list 
discussion (for which I can't prove there was no work-around) is just not in 
the cards ;). The response has been "write your own codec", so maybe the right 
option is to provide a non-compressing codec? Here's where you tell me there 
already is one ;)

Andrzej points out an interesting case though.






[jira] [Commented] (LUCENE-5914) More options for stored fields compression

2014-09-01 Thread Stefan Pohl (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14117726#comment-14117726
 ] 

Stefan Pohl commented on LUCENE-5914:
-

This is an interesting discussion based on awesome work, thanks Adrien!

From my experience, LZ4 can actually speed up reading (as implemented here) 
when working on byte arrays of reasonable size, say 10k. There can, however, be 
some small overhead for short stored fields (e.g. if only an int is stored). It 
remains to be seen how this implementation compares to no compression.

On AWS instances such as the ones [~ab] refers to, wouldn't switching off index 
compression be even more helpful, especially after having per-doc decompression 
as implemented here? Lucene and most other IR engines have always traded CPU 
for IO/size, because index compression is on by default.
That doesn't mean, though, that we shouldn't provide the option to users if a 
case for it can be made :)




[jira] [Commented] (LUCENE-5914) More options for stored fields compression

2014-09-01 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14117759#comment-14117759
 ] 

Robert Muir commented on LUCENE-5914:
-

{quote}
That doesn't mean, though, that we shouldn't provide the option to users if a 
case for it can be made
{quote}

It's OK to provide such options without hesitation in the codecs/ module; 
however:

We have to be careful: this issue proposes supporting such options *in the 
default codec*.

This is a completely different thing. It means we support such formats for 
years and years and years. As we speak, we are trying to release 4.10, and it 
must still be able to read the 3.0 index format from 5 years ago (and we had 
to respin another release candidate because there were bugs in that support).

So that's why I point out: if we want to add an option (and I expect we can 
feasibly add at most one option here), then trade-offs have to be made in our 
backwards-compatibility support so that we can maintain this stuff and it does 
not spin out of control.





[jira] [Commented] (LUCENE-5914) More options for stored fields compression

2014-08-31 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14116904#comment-14116904
 ] 

Adrien Grand commented on LUCENE-5914:
--

[~erickerickson] [~elyograg] Do you have pointers to emails/irc logs describing 
such issues? Maybe that is something that can be solved without disabling 
compression? By the way, the current patch already improves CPU usage in two 
cases: random access, since you can decompress a single document at a time, and 
sequential access (typically used when you export your dataset using stored 
fields, or internally by Lucene when merging mixed codecs), so maybe that would 
already help.

[~rcmuir] I will change this.
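
For readers following along, a sketch of the two access patterns mentioned 
above (the index path and doc IDs are placeholders, and deleted documents are 
ignored for brevity):

{code:java}
import java.io.File;
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.store.FSDirectory;

public class StoredFieldsAccess {
  public static void main(String[] args) throws IOException {
    DirectoryReader reader =
        DirectoryReader.open(FSDirectory.open(new File("/path/to/index")));
    try {
      // random access: per-document decompression means fetching a few
      // hits no longer decompresses their whole chunks
      for (int docId : new int[] { 3, 4211, 90210 }) {
        Document doc = reader.document(docId);
      }
      // sequential access: e.g. exporting the whole dataset in doc-id order
      for (int docId = 0; docId < reader.maxDoc(); docId++) {
        Document doc = reader.document(docId);
      }
    } finally {
      reader.close();
    }
  }
}
{code}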




[jira] [Commented] (LUCENE-5914) More options for stored fields compression

2014-08-31 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14116926#comment-14116926
 ] 

Robert Muir commented on LUCENE-5914:
-

[~jpountz] well, I was just curious about the motivation behind it. 

There are advantages and disadvantages to each approach, but as separate 
formats, each use case would really be able to proceed in its own direction in 
the future. 

With the current patch, BEST_COMPRESSION = Deflate, but what if we wanted to 
replace this with bzip later and still support the old indexes, etc. (which I 
think is a goal of this issue)? 

So I think it's better to have separate formats (if they want to share some 
code behind the scenes at the moment, that's OK). If we want to provide back 
compat on both options, then that's something we can decide to do here (IMO, 
there should be a price for the added complexity, such as moving all ancient 
codecs out of lucene/core or dropping 3.x index support, something to keep it 
all manageable).





[jira] [Commented] (LUCENE-5914) More options for stored fields compression

2014-08-29 Thread Erick Erickson (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14116025#comment-14116025
 ] 

Erick Erickson commented on LUCENE-5914:


Haven't looked at the patch, I confess, but is there a way to turn compression 
off completely? I've seen a few situations in the wild where 
compressing/decompressing takes up large amounts of CPU.

FWIW,
Erick
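
As a strawman for what "off completely" could look like with the existing 
compressing machinery, a hedged sketch against the 4.x API (this class is 
hypothetical, not part of any patch here): a pass-through CompressionMode 
whose compressor writes bytes verbatim.

{code:java}
import java.io.IOException;
import org.apache.lucene.codecs.compressing.CompressionMode;
import org.apache.lucene.codecs.compressing.Compressor;
import org.apache.lucene.codecs.compressing.Decompressor;
import org.apache.lucene.store.DataInput;
import org.apache.lucene.store.DataOutput;
import org.apache.lucene.util.ArrayUtil;
import org.apache.lucene.util.BytesRef;

// Hypothetical pass-through mode: "compression" copies bytes verbatim,
// so reads cost no CPU beyond the copy itself.
public final class NoCompressionMode extends CompressionMode {

  @Override
  public Compressor newCompressor() {
    return new Compressor() {
      @Override
      public void compress(byte[] bytes, int off, int len, DataOutput out)
          throws IOException {
        out.writeBytes(bytes, off, len); // no encoding at all
      }
    };
  }

  @Override
  public Decompressor newDecompressor() {
    return new Decompressor() {
      @Override
      public void decompress(DataInput in, int originalLength, int offset,
          int length, BytesRef bytes) throws IOException {
        // read the stored bytes back and expose the requested slice
        bytes.bytes = ArrayUtil.grow(bytes.bytes, originalLength);
        in.readBytes(bytes.bytes, 0, originalLength);
        bytes.offset = offset;
        bytes.length = length;
      }

      @Override
      public Decompressor clone() {
        return this; // stateless, safe to share
      }
    };
  }
}
{code}

Plugging this into a CompressingStoredFieldsFormat via a FilterCodec would 
give an effectively uncompressed stored-fields file.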




[jira] [Commented] (LUCENE-5914) More options for stored fields compression

2014-08-29 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14116075#comment-14116075
 ] 

Shawn Heisey commented on LUCENE-5914:
--

bq. Haven't looked at the patch, I confess, but is there a way to turn 
compression off completely?

+1.

From what I understand, this would be relatively straightforward in Lucene, 
where you can swap low-level components in and out pretty easily, but I'm 
really looking for that option to be user-configurable in Solr.  I know that 
will require a separate issue.  For my index, compression is a good thing, but 
like Erick, I've seen situations with Solr on the mailing list and IRC where it 
really hurts some people.





[jira] [Commented] (LUCENE-5914) More options for stored fields compression

2014-08-29 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14116233#comment-14116233
 ] 

Robert Muir commented on LUCENE-5914:
-

The patch adds conditional logic to the default codec, instead of different 
formats. Why this approach? 
