[
https://issues.apache.org/jira/browse/LUCENE-6779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Shalin Shekhar Mangar updated LUCENE-6779:
------------------------------------------
Attachment: LUCENE-6779.patch
This patch is based on Robert's earlier patch, but I fall back to the
double-pass approach only if the string is larger than 64KB. This patch also
moves GrowableByteArrayDataOutput from the util package to the
codecs.compressing package, as suggested.
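
A minimal, JDK-only sketch of the threshold idea, for readers who have not
opened the patch. It is not the patch's code: the class name, the field names
and the hand-rolled encoder are all assumptions made for illustration, and the
length prefix a real DataOutput would write is left out. Short strings are
encoded once into a reusable worst-case scratch buffer; long strings are
measured first (pass 1) and then encoded straight into the destination
(pass 2), so no large intermediate buffer is needed.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative only: names and structure are assumptions, not the patch's code.
class DoublePassUtf8Sketch {
    // Hypothetical constant mirroring the 64KB cut-off described above.
    static final int DOUBLE_PASS_THRESHOLD = 64 * 1024;
    // One UTF-16 char needs at most 3 UTF-8 bytes (a surrogate pair: 4 bytes for 2 chars).
    static final int MAX_UTF8_BYTES_PER_CHAR = 3;

    private byte[] bytes = new byte[16];   // destination buffer
    private int length = 0;                // bytes written so far
    private byte[] scratch = new byte[0];  // reused for short strings

    static boolean isPair(CharSequence s, int i) {
        return Character.isHighSurrogate(s.charAt(i)) && i + 1 < s.length()
                && Character.isLowSurrogate(s.charAt(i + 1));
    }

    /** Pass 1: count the UTF-8 bytes without writing anything. */
    static int utf8Length(CharSequence s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) n += 1;
            else if (c < 0x800) n += 2;
            else if (isPair(s, i)) { n += 4; i++; }
            else n += 3;  // BMP char, or an unpaired surrogate written as U+FFFD
        }
        return n;
    }

    /** Encodes s into out starting at off and returns the offset after the last byte. */
    static int encode(CharSequence s, byte[] out, int off) {
        for (int i = 0; i < s.length(); i++) {
            int cp = s.charAt(i);
            if (isPair(s, i)) cp = Character.toCodePoint((char) cp, s.charAt(++i));
            else if (Character.isSurrogate((char) cp)) cp = 0xFFFD;
            if (cp < 0x80) {
                out[off++] = (byte) cp;
            } else if (cp < 0x800) {
                out[off++] = (byte) (0xC0 | (cp >> 6));
                out[off++] = (byte) (0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {
                out[off++] = (byte) (0xE0 | (cp >> 12));
                out[off++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
                out[off++] = (byte) (0x80 | (cp & 0x3F));
            } else {
                out[off++] = (byte) (0xF0 | (cp >> 18));
                out[off++] = (byte) (0x80 | ((cp >> 12) & 0x3F));
                out[off++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
                out[off++] = (byte) (0x80 | (cp & 0x3F));
            }
        }
        return off;
    }

    void writeString(String s) {
        long maxLen = (long) s.length() * MAX_UTF8_BYTES_PER_CHAR;
        if (maxLen <= DOUBLE_PASS_THRESHOLD) {
            // Single pass: encode into a worst-case scratch buffer, then copy.
            if (scratch.length < maxLen) scratch = new byte[(int) maxLen];
            int n = encode(s, scratch, 0);
            grow(length + n);
            System.arraycopy(scratch, 0, bytes, length, n);
            length += n;
        } else {
            // Double pass: measure first, then encode directly into the destination.
            grow(length + utf8Length(s));
            length = encode(s, bytes, length);
        }
    }

    // Grows to exactly minSize; a real buffer would likely over-allocate to amortize growth.
    private void grow(int minSize) {
        if (bytes.length < minSize) bytes = Arrays.copyOf(bytes, minSize);
    }

    byte[] toByteArray() { return Arrays.copyOf(bytes, length); }

    public static void main(String[] args) {
        DoublePassUtf8Sketch out = new DoublePassUtf8Sketch();
        String s = "h\u00e9llo";
        out.writeString(s);
        // Compare against the JDK encoder for a well-formed string.
        System.out.println(Arrays.equals(out.toByteArray(), s.getBytes(StandardCharsets.UTF_8)));
    }
}
```

The trade-off is that the long-string branch walks the characters twice, which
is why applying it to every string costs throughput on short fields.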
I benchmarked both approaches (i.e. always use double pass vs. use single pass
below 64KB) against test data generated using
TestUtil.randomRealisticUnicodeString (between 5 and 64 characters). For such
short fields, double pass is approx 30% slower. I don't think short fields
should pay this penalty, considering they should be far more common.
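
The "approx 30%" figure can be checked against the throughput scores in the
JMH output in this message. A small sketch of the arithmetic follows; the
class and method names are made up for illustration, and the scores are copied
from the benchmark results below.

```java
// Illustrative arithmetic only; not part of the patch or the benchmark harness.
class SlowdownCheck {
    /** Percentage by which the slower throughput falls short of the faster one. */
    static double percentSlower(double slowerOpsPerSec, double fasterOpsPerSec) {
        return (1.0 - slowerOpsPerSec / fasterOpsPerSec) * 100.0;
    }

    public static void main(String[] args) {
        // testWriteString1 (always double pass) vs testWriteString2 (threshold), short strings
        System.out.printf("short strings: %.1f%%%n", percentSlower(2916182.627, 4226084.451));
        // testWriteString1 vs testWriteStringDefault, 10MB field
        System.out.printf("10MB field:    %.1f%%%n", percentSlower(27.985, 36.185));
        // testWriteString1 vs testWriteStringDefault, 100MB string
        System.out.printf("100MB string:  %.1f%%%n", percentSlower(2.814, 3.617));
    }
}
```

For the short-string run this works out to roughly 31%. For the 10MB and 100MB
inputs both double-pass variants come out roughly 22 to 23% below the default
writeString in throughput, while allocating almost nothing per operation.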
{code}
testWriteString1 = Use double pass always
testWriteString2 = Use double pass if utf8 size is greater than 64KB
testWriteStringDefault = Use writeString from base DataOutput class
10K Randomly generated strings (5 <= len <= 64)
======================================
java -server -Xmx2048M -Xms2048M -Dtests.seed=18262
-Dtests.datagen.path=./data.txt -Dtests.string.minlen=5
-Dtests.string.maxlen=64 -Dtests.string.num=10000 -jar target/benchmarks.jar
-wi 5 -i 50 -gc true -f 2 -prof gc ".*GrowableByteArrayDataOutputBenchmark.*"
# Run complete. Total time: 00:06:41
Benchmark                                                                         Mode  Cnt        Score        Error  Units
GrowableByteArrayDataOutputBenchmark.testWriteString1                            thrpt  100  2916182.627 ±  5219.401  ops/s
GrowableByteArrayDataOutputBenchmark.testWriteString1:·gc.alloc.rate             thrpt  100        0.001 ±     0.001  MB/sec
GrowableByteArrayDataOutputBenchmark.testWriteString1:·gc.alloc.rate.norm        thrpt  100       ≈ 10⁻⁴              B/op
GrowableByteArrayDataOutputBenchmark.testWriteString1:·gc.count                  thrpt  100          ≈ 0              counts
GrowableByteArrayDataOutputBenchmark.testWriteString2                            thrpt  100  4226084.451 ±  7188.594  ops/s
GrowableByteArrayDataOutputBenchmark.testWriteString2:·gc.alloc.rate             thrpt  100      596.567 ±     1.016  MB/sec
GrowableByteArrayDataOutputBenchmark.testWriteString2:·gc.alloc.rate.norm        thrpt  100      148.060 ±     0.001  B/op
GrowableByteArrayDataOutputBenchmark.testWriteString2:·gc.count                  thrpt  100          ≈ 0              counts
GrowableByteArrayDataOutputBenchmark.testWriteStringDefault                      thrpt  100  4221729.873 ± 13558.316  ops/s
GrowableByteArrayDataOutputBenchmark.testWriteStringDefault:·gc.alloc.rate       thrpt  100      595.961 ±     1.916  MB/sec
GrowableByteArrayDataOutputBenchmark.testWriteStringDefault:·gc.alloc.rate.norm  thrpt  100      148.060 ±     0.001  B/op
GrowableByteArrayDataOutputBenchmark.testWriteStringDefault:·gc.count            thrpt  100        1.000              counts
GrowableByteArrayDataOutputBenchmark.testWriteStringDefault:·gc.time             thrpt  100       19.000              ms
10MB latin-1 field
=============
java -server -Xmx2048M -Xms2048M -Dtests.seed=18262 -Dtests.string.num=0
-Dtests.json.path=./input14.json -jar target/benchmarks.jar -wi 5 -i 50 -gc
true -f 2 -prof gc ".*GrowableByteArrayDataOutputBenchmark.*"
# Run complete. Total time: 00:06:47
Benchmark                                                                                  Mode  Cnt         Score         Error  Units
GrowableByteArrayDataOutputBenchmark.testWriteString1                                     thrpt  100        27.985 ±      0.074  ops/s
GrowableByteArrayDataOutputBenchmark.testWriteString1:·gc.alloc.rate                      thrpt  100         0.001 ±      0.001  MB/sec
GrowableByteArrayDataOutputBenchmark.testWriteString1:·gc.alloc.rate.norm                 thrpt  100        24.951 ±     20.652  B/op
GrowableByteArrayDataOutputBenchmark.testWriteString1:·gc.count                           thrpt  100           ≈ 0               counts
GrowableByteArrayDataOutputBenchmark.testWriteString2                                     thrpt  100        28.105 ±      0.090  ops/s
GrowableByteArrayDataOutputBenchmark.testWriteString2:·gc.alloc.rate                      thrpt  100         0.001 ±      0.001  MB/sec
GrowableByteArrayDataOutputBenchmark.testWriteString2:·gc.alloc.rate.norm                 thrpt  100        24.888 ±     20.655  B/op
GrowableByteArrayDataOutputBenchmark.testWriteString2:·gc.count                           thrpt  100           ≈ 0               counts
GrowableByteArrayDataOutputBenchmark.testWriteStringDefault                               thrpt  100        36.185 ±      0.099  ops/s
GrowableByteArrayDataOutputBenchmark.testWriteStringDefault:·gc.alloc.rate                thrpt  100      1123.864 ±      3.077  MB/sec
GrowableByteArrayDataOutputBenchmark.testWriteStringDefault:·gc.alloc.rate.norm           thrpt  100  32575891.405 ±     16.168  B/op
GrowableByteArrayDataOutputBenchmark.testWriteStringDefault:·gc.churn.PS_Eden_Space       thrpt  100       645.241 ±      7.098  MB/sec
GrowableByteArrayDataOutputBenchmark.testWriteStringDefault:·gc.churn.PS_Eden_Space.norm  thrpt  100  18703213.617 ± 205570.201  B/op
GrowableByteArrayDataOutputBenchmark.testWriteStringDefault:·gc.count                     thrpt  100       100.000               counts
GrowableByteArrayDataOutputBenchmark.testWriteStringDefault:·gc.time                      thrpt  100       299.000               ms
100MB latin-1 string
================
java -server -Xmx2048M -Xms2048M -Dtests.seed=18262 -Dtests.string.num=0
-Dtests.json.path=./input140.json -jar target/benchmarks.jar -wi 5 -i 50 -gc
true -f 2 -prof gc ".*GrowableByteArrayDataOutputBenchmark.*"
# Run complete. Total time: 00:07:14
Benchmark                                                                         Mode  Cnt          Score      Error  Units
GrowableByteArrayDataOutputBenchmark.testWriteString1                            thrpt  100          2.814 ±   0.008  ops/s
GrowableByteArrayDataOutputBenchmark.testWriteString1:·gc.alloc.rate             thrpt  100          0.001 ±   0.001  MB/sec
GrowableByteArrayDataOutputBenchmark.testWriteString1:·gc.alloc.rate.norm        thrpt  100        236.853 ± 196.100  B/op
GrowableByteArrayDataOutputBenchmark.testWriteString1:·gc.count                  thrpt  100            ≈ 0            counts
GrowableByteArrayDataOutputBenchmark.testWriteString2                            thrpt  100          2.811 ±   0.022  ops/s
GrowableByteArrayDataOutputBenchmark.testWriteString2:·gc.alloc.rate             thrpt  100          0.001 ±   0.001  MB/sec
GrowableByteArrayDataOutputBenchmark.testWriteString2:·gc.alloc.rate.norm        thrpt  100        236.853 ± 196.100  B/op
GrowableByteArrayDataOutputBenchmark.testWriteString2:·gc.count                  thrpt  100            ≈ 0            counts
GrowableByteArrayDataOutputBenchmark.testWriteStringDefault                      thrpt  100          3.617 ±   0.009  ops/s
GrowableByteArrayDataOutputBenchmark.testWriteStringDefault:·gc.alloc.rate       thrpt  100       1123.487 ±   2.667  MB/sec
GrowableByteArrayDataOutputBenchmark.testWriteStringDefault:·gc.alloc.rate.norm  thrpt  100  325758521.800 ± 147.457  B/op
GrowableByteArrayDataOutputBenchmark.testWriteStringDefault:·gc.count            thrpt  100            ≈ 0            counts
{code}
> Reduce memory allocated by CompressingStoredFieldsWriter to write large
> strings
> -------------------------------------------------------------------------------
>
> Key: LUCENE-6779
> URL: https://issues.apache.org/jira/browse/LUCENE-6779
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/codecs
> Reporter: Shalin Shekhar Mangar
> Attachments: LUCENE-6779.patch, LUCENE-6779.patch,
> LUCENE-6779_alt.patch
>
>
> In SOLR-7927, I am trying to reduce the memory required to index very large
> documents (between 10 and 100MB), and one of the places that allocates a lot
> of heap is the UTF-8 encoding in CompressingStoredFieldsWriter. The same
> problem existed in JavaBinCodec, and we reduced its memory allocation in
> SOLR-7971 by falling back to a double-pass approach when the UTF-8 size of
> the string is greater than 64KB.
> I propose to make the same changes to CompressingStoredFieldsWriter as we
> made to JavaBinCodec in SOLR-7971.