[jira] [Updated] (LUCENE-5688) NumericDocValues fields with sparse data can be compressed better

2014-06-06 Thread Varun Thacker (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Thacker updated LUCENE-5688:
--

Attachment: LUCENE-5688.patch

bq. Can you perhaps make the Consumer/Producer package-private? I think only 
the Format needs to be public?

Done. Note that the Lucene45DocValues producer and consumer are public. We 
could fix that in a separate issue?

In SparseDocValuesProducer I have implemented 2 ways to get numerics:
- getNumericUsingBinarySearch - uses the binary search approach
- getNumericUsingHashMap - uses a hash map based approach
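
To make the comparison concrete, here is a minimal sketch of the two lookup strategies over the same sparse data. It is not the code from the patch: the class name SparseLookupSketch and the plain in-memory arrays are assumptions for illustration, and a missing document is assumed to return the default value 0.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration class; not taken from LUCENE-5688.patch.
final class SparseLookupSketch {
    private final int[] docIds;   // sorted IDs of the documents that have a value
    private final long[] values;  // values[i] belongs to docIds[i]
    private final Map<Integer, Long> byDoc = new HashMap<>();

    SparseLookupSketch(int[] sortedDocIds, long[] values) {
        this.docIds = sortedDocIds;
        this.values = values;
        for (int i = 0; i < sortedDocIds.length; i++) {
            byDoc.put(sortedDocIds[i], values[i]);
        }
    }

    // O(log n) per lookup, no extra memory beyond the two arrays.
    long getNumericUsingBinarySearch(int docId) {
        int idx = Arrays.binarySearch(docIds, docId);
        return idx >= 0 ? values[idx] : 0L; // 0 is the assumed default
    }

    // Expected O(1) per lookup, at the cost of a boxed map on the heap.
    long getNumericUsingHashMap(int docId) {
        Long v = byDoc.get(docId);
        return v != null ? v : 0L;
    }

    public static void main(String[] args) {
        SparseLookupSketch s =
            new SparseLookupSketch(new int[] {3, 17, 42}, new long[] {7L, -5L, 99L});
        System.out.println(s.getNumericUsingBinarySearch(17));
        System.out.println(s.getNumericUsingHashMap(5));
    }
}
```

The trade-off a benchmark would measure is lookup time versus heap usage: the binary search needs nothing beyond the stored arrays, while the map trades memory for constant-time access.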

Tests pass for both, so I think we should benchmark both approaches. Based on 
the results we could pick one approach, or even keep both of them and pick the 
right strategy using data from the benchmark results.

bq. It's based on Solr's HashDocSet which I modify to act as an int-to-long 
map. I can share the code here if you want.
That would be great. We can replace the HashMap in getNumericUsingHashMap with 
this.
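
For reference, a primitive int-to-long map could look roughly like the sketch below. This is not Solr's HashDocSet or the code offered above; it is a generic open-addressing table with linear probing, and the name IntToLongMapSketch, the fixed capacity, and the use of -1 as the empty marker (doc IDs are assumed to be non-negative) are all assumptions.

```java
import java.util.Arrays;

// Hypothetical illustration class; a generic open-addressing map,
// not Solr's HashDocSet and not the code from the patch.
final class IntToLongMapSketch {
    private static final int EMPTY = -1; // doc IDs are assumed to be >= 0
    private final int[] keys;
    private final long[] vals;
    private final int mask;
    private int size;

    IntToLongMapSketch(int expectedSize) {
        if (expectedSize < 0 || expectedSize > (1 << 29)) {
            throw new IllegalArgumentException("unsupported size: " + expectedSize);
        }
        // Power-of-two capacity of at least 2x the expected size (load factor <= 0.5).
        int capacity = 2;
        while (capacity < 2L * expectedSize) {
            capacity <<= 1;
        }
        keys = new int[capacity];
        vals = new long[capacity];
        Arrays.fill(keys, EMPTY);
        mask = capacity - 1;
    }

    private int slotFor(int key) {
        int h = key * 0x9E3779B9; // multiplicative mixing to spread sequential doc IDs
        return (h ^ (h >>> 16)) & mask;
    }

    void put(int key, long value) {
        if (key < 0) throw new IllegalArgumentException("key must be >= 0");
        int slot = slotFor(key);
        while (keys[slot] != EMPTY) {
            if (keys[slot] == key) { vals[slot] = value; return; } // overwrite
            slot = (slot + 1) & mask; // linear probing
        }
        if (size >= keys.length / 2) {
            throw new IllegalStateException("sketch does not resize");
        }
        keys[slot] = key;
        vals[slot] = value;
        size++;
    }

    long get(int key, long defaultValue) {
        int slot = slotFor(key);
        while (keys[slot] != EMPTY) {
            if (keys[slot] == key) return vals[slot];
            slot = (slot + 1) & mask;
        }
        return defaultValue;
    }

    int size() { return size; }
}
```

Compared with a HashMap of boxed Integer and Long objects, a structure like this avoids boxing and per-entry objects, which is the reason for swapping it in.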

From what I understand, this is how we can run the luceneutil benchmark tests:
- python setup.py -prepareTrunk
- svn checkout https://svn.apache.org/repos/asf/lucene/dev/trunk patch  
- Apply the patch in this checkout
- We need a task file and then call searchBench.py with -index and -search


 NumericDocValues fields with sparse data can be compressed better 
 --

 Key: LUCENE-5688
 URL: https://issues.apache.org/jira/browse/LUCENE-5688
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Varun Thacker
Priority: Minor
 Attachments: LUCENE-5688.patch, LUCENE-5688.patch, LUCENE-5688.patch


 I ran into this problem where I had a dynamic field in Solr and indexed data 
 into lots of fields. For each field only a few documents had actual values, 
 and for the remaining documents the default value (0) got indexed. Now when I 
 merge segments, the index size jumps up.
 For example, I have 10 segments, each with 1 DV field. When I merge the 
 segments into 1, that segment will contain all 10 DV fields with lots of 0s. 
 This was the motivation behind trying to come up with a compression for a use 
 case like this.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-5688) NumericDocValues fields with sparse data can be compressed better

2014-05-25 Thread Varun Thacker (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Thacker updated LUCENE-5688:
--

Attachment: LUCENE-5688.patch

I created a new DocValues format: SparseDocValuesFormat.

So far I have only added Numeric support. Adding the other types should not 
take too long. I wanted to get some feedback on whether I'm on the right track.

It does a binary search on the position data which is read using 
MonotonicBlockPackedReader.
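
As a rough illustration of the idea (not the patch itself), the sketch below turns a dense per-document array into sorted positions plus values and reads a value back with a binary search. The class name SparseEncodeSketch is hypothetical, plain int[] and long[] arrays stand in for the packed on-disk structures, and the sketch assumes a value of 0 can be treated the same as "no value".

```java
import java.util.Arrays;

// Hypothetical illustration class; plain arrays stand in for the packed
// position data that the patch reads with MonotonicBlockPackedReader.
final class SparseEncodeSketch {
    final int[] positions; // sorted doc IDs with a non-default value
    final long[] values;   // values[i] belongs to positions[i]

    SparseEncodeSketch(long[] dense) {
        // First pass: count the documents that carry a non-default value.
        int count = 0;
        for (long v : dense) {
            if (v != 0L) count++;
        }
        positions = new int[count];
        values = new long[count];
        // Second pass: record position and value; doc IDs come out sorted.
        int i = 0;
        for (int doc = 0; doc < dense.length; doc++) {
            if (dense[doc] != 0L) {
                positions[i] = doc;
                values[i] = dense[doc];
                i++;
            }
        }
    }

    // Binary search on the positions; documents not found get the default 0.
    long get(int docId) {
        int idx = Arrays.binarySearch(positions, docId);
        return idx >= 0 ? values[idx] : 0L;
    }
}
```

Because the positions are strictly increasing, they suit a monotonic packed encoding, and only the documents that actually have a value take up space.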

Also, I am not too familiar with luceneutil, but is there a test which 
benchmarks DocValues read times? It would be interesting to see the read time 
difference.

 NumericDocValues fields with sparse data can be compressed better 
 --

 Key: LUCENE-5688
 URL: https://issues.apache.org/jira/browse/LUCENE-5688
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Varun Thacker
Priority: Minor
 Attachments: LUCENE-5688.patch, LUCENE-5688.patch








[jira] [Updated] (LUCENE-5688) NumericDocValues fields with sparse data can be compressed better

2014-05-20 Thread Varun Thacker (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Thacker updated LUCENE-5688:
--

Attachment: LUCENE-5688.patch

Here is a quick patch. I wanted to get some feedback on the approach.

When I run the showIndexBloat method without the SPARSE_COMPRESSED changes, 
this is the size of the docValues data:
{noformat}
-rw-r--r--  1 varun  wheel   9.9M May 20 18:28 _a_Lucene45_0.dvd
-rw-r--r--  1 varun  wheel   312B May 20 18:28 _a_Lucene45_0.dvm
{noformat}

With the SPARSE_COMPRESSED changes:
{noformat}
-rw-r--r--  1 varun  wheel   2.7M May 20 18:51 _a_Lucene45_0.dvd
-rw-r--r--  1 varun  wheel   352B May 20 18:51 _a_Lucene45_0.dvm
{noformat}

 NumericDocValues fields with sparse data can be compressed better 
 --

 Key: LUCENE-5688
 URL: https://issues.apache.org/jira/browse/LUCENE-5688
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Varun Thacker
Priority: Minor
 Attachments: LUCENE-5688.patch




