[ 
https://issues.apache.org/jira/browse/LUCENE-6779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14730001#comment-14730001
 ] 

Robert Muir commented on LUCENE-6779:
-------------------------------------

{quote}
Yes but that actually has better performance than writing bytes directly to the 
DataOutput. I tested this with JavaBinCodec and I don't think performance will 
be very different here (see JMH benchmark results in SOLR-7971). Presumably, 
the huge amount of invocations of writeByte don't perform well compared to 
setting a byte in a scratch array directly.
{quote}

It's unclear to me that the complexity is worth it. The data used in the 
benchmark is 100% latin-1 (completely English), which certainly isn't 
representative of reality, so the benchmarks don't mean anything to me.
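
To illustrate why all-English input is a narrow test: UTF-8 uses one byte per 
char only for ASCII, so such a benchmark never exercises the 2-, 3- and 4-byte 
branches of an encoder. The sketch below is not taken from the SOLR-7971 
benchmark; the class name and sample strings are made up for the example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Lucene or Solr.
final class Utf8WidthDemo {

    /** Returns the number of bytes the JDK's UTF-8 encoder produces for s. */
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source file pure ASCII.
        String[] samples = {
            "hello",                // ASCII only: 5 chars -> 5 bytes
            "h\u00e9llo",           // one Latin-1 accent: 5 chars -> 6 bytes
            "\u65e5\u672c\u8a9e",   // three CJK chars: 3 chars -> 9 bytes
            "\ud83d\ude00"          // one surrogate pair: 2 chars -> 4 bytes
        };
        for (String s : samples) {
            System.out.println(s.length() + " chars -> " + utf8Bytes(s) + " bytes");
        }
    }
}
```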

One thing to keep in mind is that it only affects writes from string fields on 
flush. During merging, bulk copying can kick in, and even in the worst case 
where that can't happen, we really shouldn't be taking this codepath anyway: we 
should just do byte[] -> byte[] for string fields.

So I'm not sure the optimization (especially for 10MB documents, which is a 
ridiculous case) is really that powerful. We have to be extremely careful here 
because with these kinds of optimizations, any bug -> data corruption.

> Reduce memory allocated by CompressingStoredFieldsWriter to write large 
> strings
> -------------------------------------------------------------------------------
>
>                 Key: LUCENE-6779
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6779
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>            Reporter: Shalin Shekhar Mangar
>         Attachments: LUCENE-6779.patch
>
>
> In SOLR-7927, I am trying to reduce the memory required to index very large 
> documents (between 10 and 100MB), and one of the places that allocates a lot 
> of heap is the UTF-8 encoding in CompressingStoredFieldsWriter. The same 
> problem existed in JavaBinCodec, and we reduced its memory allocation by 
> falling back to a double-pass approach in SOLR-7971 when the UTF-8 size of 
> the string is greater than 64KB.
> I propose to make the same changes to CompressingStoredFieldsWriter as we 
> made to JavaBinCodec in SOLR-7971.
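
For readers unfamiliar with the double-pass idea in the issue description, here 
is a rough sketch of how it could look. It is not the code from SOLR-7971 or 
from the attached patch. It assumes a plain java.io.DataOutputStream rather 
than Lucene's DataOutput, a fixed 4-byte length prefix, an 8KB scratch buffer, 
and a 64KB threshold applied to the worst-case size (3 bytes per char). The 
class name, method names and constants are all hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not part of Lucene or Solr.
final class TwoPassUtf8Writer {
    // Both constants are illustrative assumptions.
    static final int LARGE_THRESHOLD = 64 * 1024;
    static final int SCRATCH_SIZE = 8 * 1024;

    /** Pass 1: count the UTF-8 bytes without allocating anything. */
    static int utf8Length(String s) {
        int bytes = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            bytes += cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        }
        return bytes;
    }

    /** Writes a length prefix followed by the UTF-8 bytes of s. */
    static void writeString(DataOutputStream out, String s) throws IOException {
        if ((long) s.length() * 3 <= LARGE_THRESHOLD) {
            // Small string: a single pass with a temporary array is cheap.
            // Note: getBytes replaces unpaired surrogates, the loop below
            // does not; a real implementation would have to pick one behavior.
            byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
            out.writeInt(utf8.length);
            out.write(utf8);
            return;
        }
        out.writeInt(utf8Length(s));          // pass 1: exact length up front
        byte[] scratch = new byte[SCRATCH_SIZE];
        int upto = 0;
        for (int i = 0; i < s.length(); ) {   // pass 2: encode in chunks
            if (upto > scratch.length - 4) {  // keep room for a 4-byte sequence
                out.write(scratch, 0, upto);
                upto = 0;
            }
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            if (cp < 0x80) {
                scratch[upto++] = (byte) cp;
            } else if (cp < 0x800) {
                scratch[upto++] = (byte) (0xC0 | (cp >> 6));
                scratch[upto++] = (byte) (0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {
                scratch[upto++] = (byte) (0xE0 | (cp >> 12));
                scratch[upto++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
                scratch[upto++] = (byte) (0x80 | (cp & 0x3F));
            } else {
                scratch[upto++] = (byte) (0xF0 | (cp >> 18));
                scratch[upto++] = (byte) (0x80 | ((cp >> 12) & 0x3F));
                scratch[upto++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
                scratch[upto++] = (byte) (0x80 | (cp & 0x3F));
            }
        }
        out.write(scratch, 0, upto);          // flush the last partial chunk
    }

    /** Test helper only: it buffers the whole output in memory. */
    static byte[] encode(String s) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            writeString(out, s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

The trade-off is that the string is scanned twice, in exchange for never 
holding more than the scratch buffer of encoded bytes on the heap. Each of the 
four encoding branches is a place where a bug would silently corrupt the 
written data, so any such code needs tests on non-ASCII input.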



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

