[ https://issues.apache.org/jira/browse/LUCENE-6779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14730001#comment-14730001 ]
Robert Muir commented on LUCENE-6779:
-------------------------------------

{quote}
Yes but that actually has better performance than writing bytes directly to the DataOutput. I tested this with JavaBinCodec and I don't think performance will be very different here (see JMH benchmark results in SOLR-7971). Presumably, the huge number of writeByte invocations doesn't perform well compared to setting a byte in a scratch array directly.
{quote}

It's unclear to me that the complexity is worth it. The data used in the benchmark is 100% latin-1 (completely English), which certainly isn't representative of reality, so the benchmarks don't mean anything to me.

One thing to keep in mind is that it only affects writes from string fields on flush. During merging, bulk copying can kick in, and even in the worst case where that can't happen, we really shouldn't be taking this codepath anyway: we should just do byte[] -> byte[] for string fields.

So I'm not sure the optimization (especially for 10MB documents, which is a ridiculous case) is really that powerful. We have to be extremely careful here, because with this kind of optimization, any bug -> data corruption.

> Reduce memory allocated by CompressingStoredFieldsWriter to write large strings
> -------------------------------------------------------------------------------
>
> Key: LUCENE-6779
> URL: https://issues.apache.org/jira/browse/LUCENE-6779
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/codecs
> Reporter: Shalin Shekhar Mangar
> Attachments: LUCENE-6779.patch
>
>
> In SOLR-7927, I am trying to reduce the memory required to index very large
> documents (between 10 and 100MB), and one of the places that allocates a lot
> of heap is the UTF8 encoding in CompressingStoredFieldsWriter. The same
> problem existed in JavaBinCodec, and we reduced its memory allocation by
> falling back to a double-pass approach in SOLR-7971 when the UTF8 size of the
> string is greater than 64KB.
> I propose to make the same changes to CompressingStoredFieldsWriter as we
> made to JavaBinCodec in SOLR-7971.
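For readers following the thread, here is a rough sketch of what a "double pass" UTF8 write through a small scratch array can look like. It is not the code from LUCENE-6779.patch or SOLR-7971. The class name ChunkedUtf8Writer, both method names, the scratchSize parameter, and the fixed 4-byte length prefix are all assumptions made for illustration, and it uses java.io streams rather than Lucene's DataOutput. Unpaired surrogates are replaced with U+FFFD, which is also an assumption of this sketch.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

/** Hypothetical helper for illustration only; not Lucene or Solr code. */
final class ChunkedUtf8Writer {

    /** Pass 1: count the UTF-8 bytes without allocating any output buffer. */
    static int utf8Length(CharSequence s) {
        int bytes = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = Character.codePointAt(s, i);
            i += Character.charCount(cp);
            if (cp < 0x80) {
                bytes += 1;
            } else if (cp < 0x800) {
                bytes += 2;
            } else if (cp < 0x10000) {
                bytes += 3; // includes unpaired surrogates, written as U+FFFD
            } else {
                bytes += 4;
            }
        }
        return bytes;
    }

    /**
     * Pass 2: write the length prefix, then encode through a small scratch
     * array that is flushed whenever it cannot hold one more 4-byte sequence.
     */
    static void writeString(CharSequence s, DataOutputStream out, int scratchSize)
            throws IOException {
        if (scratchSize < 4) {
            throw new IllegalArgumentException("scratch must hold a 4-byte sequence");
        }
        out.writeInt(utf8Length(s)); // assumption: fixed-width length prefix
        byte[] scratch = new byte[scratchSize];
        int upto = 0;
        for (int i = 0; i < s.length(); ) {
            if (upto > scratch.length - 4) {
                out.write(scratch, 0, upto); // one bulk write per chunk
                upto = 0;
            }
            int cp = Character.codePointAt(s, i);
            i += Character.charCount(cp);
            if (cp >= 0xD800 && cp <= 0xDFFF) {
                cp = 0xFFFD; // unpaired surrogate: substitute replacement char
            }
            if (cp < 0x80) {
                scratch[upto++] = (byte) cp;
            } else if (cp < 0x800) {
                scratch[upto++] = (byte) (0xC0 | (cp >> 6));
                scratch[upto++] = (byte) (0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {
                scratch[upto++] = (byte) (0xE0 | (cp >> 12));
                scratch[upto++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
                scratch[upto++] = (byte) (0x80 | (cp & 0x3F));
            } else {
                scratch[upto++] = (byte) (0xF0 | (cp >> 18));
                scratch[upto++] = (byte) (0x80 | ((cp >> 12) & 0x3F));
                scratch[upto++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
                scratch[upto++] = (byte) (0x80 | (cp & 0x3F));
            }
        }
        out.write(scratch, 0, upto); // flush the final partial chunk
    }

    public static void main(String[] args) throws IOException {
        String text = "caf\u00e9 \u65e5\u672c\u8a9e \uD83D\uDE00";
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        writeString(text, new DataOutputStream(sink), 16);
        System.out.println("chars=" + text.length()
                + " utf8Bytes=" + utf8Length(text)
                + " written=" + sink.size());
    }
}
```

The extra heap is bounded by scratchSize regardless of string length, at the cost of walking the string twice. Any mismatch between the two passes would produce a length prefix that disagrees with the bytes that follow, which is the kind of corruption risk raised in the comment above.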