[ 
https://issues.apache.org/jira/browse/LUCENE-6779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar updated LUCENE-6779:
------------------------------------------
    Attachment: LUCENE-6779.patch

This patch is based on Robert's earlier patch, but I fall back to the 
double-pass approach only if the string's UTF-8 size is larger than 64KB. This 
patch also moves GrowableByteArrayDataOutput from the util package to the 
codecs.compressing package, as suggested.
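
To illustrate the trade-off, here is a minimal standalone sketch of the two encoding strategies. It is not the code from the patch: the class name Utf8WriteSketch, the hand-rolled buffer, and the choice to compare the worst-case size (3 bytes per UTF-16 char) against the 64KB threshold are all assumptions made for illustration, and well-formed strings (no unpaired surrogates) are assumed.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical stand-in for a growable byte output; not the Lucene class.
final class Utf8WriteSketch {
    // Assumed cut-off, mirroring the 64KB threshold described above.
    static final int DOUBLE_PASS_THRESHOLD = 64 * 1024;

    private byte[] bytes = new byte[16];
    private int length;

    private void ensureCapacity(int extra) {
        if (length + extra > bytes.length) {
            bytes = Arrays.copyOf(bytes, Math.max(bytes.length * 2, length + extra));
        }
    }

    private void writeByte(int b) {
        ensureCapacity(1);
        bytes[length++] = (byte) b;
    }

    // Variable-length int: 7 bits per byte, high bit set on all but the last byte.
    private void writeVInt(int i) {
        while ((i & ~0x7F) != 0) {
            writeByte((i & 0x7F) | 0x80);
            i >>>= 7;
        }
        writeByte(i);
    }

    // First pass: count the UTF-8 bytes without allocating anything.
    static int utf8Length(CharSequence s) {
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = Character.codePointAt(s, i);
            i += Character.charCount(cp);
            n += cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        }
        return n;
    }

    // Second pass: encode code points straight into the output buffer.
    private void encodeInto(CharSequence s, int utf8Len) {
        ensureCapacity(utf8Len);
        for (int i = 0; i < s.length(); ) {
            int cp = Character.codePointAt(s, i);
            i += Character.charCount(cp);
            if (cp < 0x80) {
                bytes[length++] = (byte) cp;
            } else if (cp < 0x800) {
                bytes[length++] = (byte) (0xC0 | (cp >> 6));
                bytes[length++] = (byte) (0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {
                bytes[length++] = (byte) (0xE0 | (cp >> 12));
                bytes[length++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
                bytes[length++] = (byte) (0x80 | (cp & 0x3F));
            } else {
                bytes[length++] = (byte) (0xF0 | (cp >> 18));
                bytes[length++] = (byte) (0x80 | ((cp >> 12) & 0x3F));
                bytes[length++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
                bytes[length++] = (byte) (0x80 | (cp & 0x3F));
            }
        }
    }

    void writeString(String s) {
        if ((long) s.length() * 3 <= DOUBLE_PASS_THRESHOLD) {
            // Single pass: fast, but allocates a temporary array per string.
            byte[] tmp = s.getBytes(StandardCharsets.UTF_8);
            writeVInt(tmp.length);
            ensureCapacity(tmp.length);
            System.arraycopy(tmp, 0, bytes, length, tmp.length);
            length += tmp.length;
        } else {
            // Double pass: scans the string twice, but needs no temporary array.
            int utf8Len = utf8Length(s);
            writeVInt(utf8Len);
            encodeInto(s, utf8Len);
        }
    }

    byte[] toByteArray() {
        return Arrays.copyOf(bytes, length);
    }
}
```

Both branches produce the same bytes (a vInt length prefix followed by the UTF-8 data); they differ only in how much temporary memory they need and how many times they scan the string.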

I benchmarked both approaches (i.e. always use double pass vs. use single pass 
below 64KB) against test data generated using 
TestUtil.randomRealisticUnicodeString (between 5 and 64 characters). For such 
short fields, double pass is approx. 30% slower. I don't think short fields 
should pay this penalty, considering they should be far more common.
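
As a quick sanity check, the approximate 30% figure can be recomputed from the two throughput scores in the short-string run reported below (testWriteString1 vs. testWriteString2). The class and method names here are made up for illustration.

```java
// Hypothetical helper; only the two scores are taken from the benchmark output.
final class SlowdownCheck {
    // Fraction of throughput lost by the slower variant relative to the faster one.
    static double throughputLoss(double slowOpsPerSec, double fastOpsPerSec) {
        return 1.0 - slowOpsPerSec / fastOpsPerSec;
    }

    public static void main(String[] args) {
        double doublePassAlways = 2916182.627; // testWriteString1, ops/s
        double singlePassBelow64K = 4226084.451; // testWriteString2, ops/s
        System.out.printf("throughput loss: %.1f%%%n",
                100 * throughputLoss(doublePassAlways, singlePassBelow64K));
    }
}
```

With these scores the loss comes out at about 31%, consistent with the rounded figure quoted above.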

{code}
testWriteString1 = Use double pass always
testWriteString2 = Use double pass if utf8 size is greater than 64KB
testWriteStringDefault = Use writeString from base DataOutput class

10K Randomly generated strings (5 <= len <= 64)
======================================
java -server -Xmx2048M -Xms2048M -Dtests.seed=18262 \
  -Dtests.datagen.path=./data.txt -Dtests.string.minlen=5 \
  -Dtests.string.maxlen=64 -Dtests.string.num=10000 -jar target/benchmarks.jar \
  -wi 5 -i 50 -gc true -f 2 -prof gc ".*GrowableByteArrayDataOutputBenchmark.*"

# Run complete. Total time: 00:06:41

Benchmark                                                                         Mode  Cnt        Score       Error   Units
GrowableByteArrayDataOutputBenchmark.testWriteString1                            thrpt  100  2916182.627 ±  5219.401   ops/s
GrowableByteArrayDataOutputBenchmark.testWriteString1:·gc.alloc.rate             thrpt  100        0.001 ±     0.001  MB/sec
GrowableByteArrayDataOutputBenchmark.testWriteString1:·gc.alloc.rate.norm        thrpt  100       ≈ 10⁻⁴                B/op
GrowableByteArrayDataOutputBenchmark.testWriteString1:·gc.count                  thrpt  100          ≈ 0              counts
GrowableByteArrayDataOutputBenchmark.testWriteString2                            thrpt  100  4226084.451 ±  7188.594   ops/s
GrowableByteArrayDataOutputBenchmark.testWriteString2:·gc.alloc.rate             thrpt  100      596.567 ±     1.016  MB/sec
GrowableByteArrayDataOutputBenchmark.testWriteString2:·gc.alloc.rate.norm        thrpt  100      148.060 ±     0.001    B/op
GrowableByteArrayDataOutputBenchmark.testWriteString2:·gc.count                  thrpt  100          ≈ 0              counts
GrowableByteArrayDataOutputBenchmark.testWriteStringDefault                      thrpt  100  4221729.873 ± 13558.316   ops/s
GrowableByteArrayDataOutputBenchmark.testWriteStringDefault:·gc.alloc.rate       thrpt  100      595.961 ±     1.916  MB/sec
GrowableByteArrayDataOutputBenchmark.testWriteStringDefault:·gc.alloc.rate.norm  thrpt  100      148.060 ±     0.001    B/op
GrowableByteArrayDataOutputBenchmark.testWriteStringDefault:·gc.count            thrpt  100        1.000              counts
GrowableByteArrayDataOutputBenchmark.testWriteStringDefault:·gc.time             thrpt  100       19.000                  ms

10MB latin-1 field
=============
java -server -Xmx2048M -Xms2048M -Dtests.seed=18262 -Dtests.string.num=0 \
  -Dtests.json.path=./input14.json -jar target/benchmarks.jar -wi 5 -i 50 \
  -gc true -f 2 -prof gc ".*GrowableByteArrayDataOutputBenchmark.*"

# Run complete. Total time: 00:06:47

Benchmark                                                                                  Mode  Cnt         Score        Error   Units
GrowableByteArrayDataOutputBenchmark.testWriteString1                                     thrpt  100        27.985 ±      0.074   ops/s
GrowableByteArrayDataOutputBenchmark.testWriteString1:·gc.alloc.rate                      thrpt  100         0.001 ±      0.001  MB/sec
GrowableByteArrayDataOutputBenchmark.testWriteString1:·gc.alloc.rate.norm                 thrpt  100        24.951 ±     20.652    B/op
GrowableByteArrayDataOutputBenchmark.testWriteString1:·gc.count                           thrpt  100           ≈ 0               counts
GrowableByteArrayDataOutputBenchmark.testWriteString2                                     thrpt  100        28.105 ±      0.090   ops/s
GrowableByteArrayDataOutputBenchmark.testWriteString2:·gc.alloc.rate                      thrpt  100         0.001 ±      0.001  MB/sec
GrowableByteArrayDataOutputBenchmark.testWriteString2:·gc.alloc.rate.norm                 thrpt  100        24.888 ±     20.655    B/op
GrowableByteArrayDataOutputBenchmark.testWriteString2:·gc.count                           thrpt  100           ≈ 0               counts
GrowableByteArrayDataOutputBenchmark.testWriteStringDefault                               thrpt  100        36.185 ±      0.099   ops/s
GrowableByteArrayDataOutputBenchmark.testWriteStringDefault:·gc.alloc.rate                thrpt  100      1123.864 ±      3.077  MB/sec
GrowableByteArrayDataOutputBenchmark.testWriteStringDefault:·gc.alloc.rate.norm           thrpt  100  32575891.405 ±     16.168    B/op
GrowableByteArrayDataOutputBenchmark.testWriteStringDefault:·gc.churn.PS_Eden_Space       thrpt  100       645.241 ±      7.098  MB/sec
GrowableByteArrayDataOutputBenchmark.testWriteStringDefault:·gc.churn.PS_Eden_Space.norm  thrpt  100  18703213.617 ± 205570.201    B/op
GrowableByteArrayDataOutputBenchmark.testWriteStringDefault:·gc.count                     thrpt  100       100.000               counts
GrowableByteArrayDataOutputBenchmark.testWriteStringDefault:·gc.time                      thrpt  100       299.000                   ms

100MB latin-1 string
================
java -server -Xmx2048M -Xms2048M -Dtests.seed=18262 -Dtests.string.num=0 \
  -Dtests.json.path=./input140.json -jar target/benchmarks.jar -wi 5 -i 50 \
  -gc true -f 2 -prof gc ".*GrowableByteArrayDataOutputBenchmark.*"

# Run complete. Total time: 00:07:14

Benchmark                                                                         Mode  Cnt          Score     Error   Units
GrowableByteArrayDataOutputBenchmark.testWriteString1                            thrpt  100          2.814 ±   0.008   ops/s
GrowableByteArrayDataOutputBenchmark.testWriteString1:·gc.alloc.rate             thrpt  100          0.001 ±   0.001  MB/sec
GrowableByteArrayDataOutputBenchmark.testWriteString1:·gc.alloc.rate.norm        thrpt  100        236.853 ± 196.100    B/op
GrowableByteArrayDataOutputBenchmark.testWriteString1:·gc.count                  thrpt  100            ≈ 0            counts
GrowableByteArrayDataOutputBenchmark.testWriteString2                            thrpt  100          2.811 ±   0.022   ops/s
GrowableByteArrayDataOutputBenchmark.testWriteString2:·gc.alloc.rate             thrpt  100          0.001 ±   0.001  MB/sec
GrowableByteArrayDataOutputBenchmark.testWriteString2:·gc.alloc.rate.norm        thrpt  100        236.853 ± 196.100    B/op
GrowableByteArrayDataOutputBenchmark.testWriteString2:·gc.count                  thrpt  100            ≈ 0            counts
GrowableByteArrayDataOutputBenchmark.testWriteStringDefault                      thrpt  100          3.617 ±   0.009   ops/s
GrowableByteArrayDataOutputBenchmark.testWriteStringDefault:·gc.alloc.rate       thrpt  100       1123.487 ±   2.667  MB/sec
GrowableByteArrayDataOutputBenchmark.testWriteStringDefault:·gc.alloc.rate.norm  thrpt  100  325758521.800 ± 147.457    B/op
GrowableByteArrayDataOutputBenchmark.testWriteStringDefault:·gc.count            thrpt  100            ≈ 0            counts
{code}

> Reduce memory allocated by CompressingStoredFieldsWriter to write large 
> strings
> -------------------------------------------------------------------------------
>
>                 Key: LUCENE-6779
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6779
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>            Reporter: Shalin Shekhar Mangar
>         Attachments: LUCENE-6779.patch, LUCENE-6779.patch, 
> LUCENE-6779_alt.patch
>
>
> In SOLR-7927, I am trying to reduce the memory required to index very large 
> documents (between 10 and 100MB), and one of the places that allocates a lot 
> of heap is the UTF-8 encoding in CompressingStoredFieldsWriter. The same 
> problem existed in JavaBinCodec, and we reduced its memory allocation by 
> falling back to a double-pass approach in SOLR-7971 when the UTF-8 size of 
> the string is greater than 64KB.
> I propose to make the same changes to CompressingStoredFieldsWriter as we 
> made to JavaBinCodec in SOLR-7971.


