Shalin Shekhar Mangar created SOLR-7971:
-------------------------------------------
Summary: Reduce memory allocated by JavaBinCodec to encode large strings
Key: SOLR-7971
URL: https://issues.apache.org/jira/browse/SOLR-7971
Project: Solr
Issue Type: Sub-task
Components: Response Writers, SolrCloud
Reporter: Shalin Shekhar Mangar
Assignee: Shalin Shekhar Mangar
Priority: Minor
Fix For: Trunk, 5.4
As discussed in SOLR-7927, we can reduce the buffer memory allocated by
JavaBinCodec while writing large strings.
https://issues.apache.org/jira/browse/SOLR-7927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700420#comment-14700420
{quote}
The maximum Unicode code point (as of Unicode 8 anyway) is U+10FFFF
([http://www.unicode.org/glossary/#code_point]). This is encoded in UTF-16 as
surrogate pair {{\uDBFF\uDFFF}}, which takes up two Java chars, and is
represented in UTF-8 as the 4-byte sequence {{F4 8F BF BF}}. This is likely
where the mistaken 4-bytes-per-Java-char formulation came from: the maximum
number of UTF-8 bytes required to represent a Unicode *code point* is 4.
The maximum Java char is {{\uFFFF}}, which is represented in UTF-8 as the
3-byte sequence {{EF BF BF}}.
So I think it's safe to switch to using 3 bytes per Java char (the unit of
measurement returned by {{String.length()}}), like
{{CompressingStoredFieldsWriter.writeField()}} does.
{quote}
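A quick standalone demo of the bound described in the quoted comment (the class name and strings are illustrative only, not part of the patch): a single Java char never needs more than 3 UTF-8 bytes, because the 4-byte UTF-8 sequences always correspond to surrogate *pairs* (two Java chars), so a buffer of {{3 * s.length()}} bytes always suffices.

```java
import java.nio.charset.StandardCharsets;

public class Utf8WorstCase {
    public static void main(String[] args) {
        // The largest Java char, U+FFFF, encodes to 3 UTF-8 bytes: EF BF BF.
        byte[] maxChar = "\uFFFF".getBytes(StandardCharsets.UTF_8);
        System.out.println("U+FFFF -> " + maxChar.length + " bytes");

        // The largest code point, U+10FFFF, encodes to 4 UTF-8 bytes
        // (F4 8F BF BF) but occupies TWO Java chars (a surrogate pair),
        // so it averages only 2 bytes per char.
        String maxCodePoint = new String(Character.toChars(0x10FFFF));
        byte[] fourByte = maxCodePoint.getBytes(StandardCharsets.UTF_8);
        System.out.println("U+10FFFF -> " + maxCodePoint.length()
                + " chars, " + fourByte.length + " bytes");

        // Hence 3 * s.length() is a safe upper bound on the encoded size,
        // versus the 4 * s.length() previously allocated.
        String s = "example\uFFFF" + maxCodePoint;
        int worstCase = 3 * s.length();
        int actual = s.getBytes(StandardCharsets.UTF_8).length;
        System.out.println("worst case " + worstCase + " >= actual " + actual);
    }
}
```

This is the same sizing logic {{CompressingStoredFieldsWriter.writeField()}} relies on in Lucene.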
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)