[
https://issues.apache.org/jira/browse/CASSANDRA-15410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16975536#comment-16975536
]
Yifan Cai commented on CASSANDRA-15410:
---------------------------------------
Good idea! I have updated the PR accordingly. The benchmark result indicates
the new {{CBUtil.writeAsciiString}} method has a similar performance.
{code:java}
[java] Benchmark Mode Cnt
Score Error Units
[java] StringsEncodeBench.writeLongText avgt 6
570.920 ± 21.376 ns/op
[java] StringsEncodeBench.writeLongTextAsASCII avgt 6
291.466 ± 9.508 ns/op
[java] StringsEncodeBench.writeLongTextWithExactSize avgt 6
467.222 ± 25.140 ns/op
[java] StringsEncodeBench.writeLongTextWithExactSizeSkipCalc avgt 6
285.320 ± 10.883 ns/op
[java] StringsEncodeBench.writeShortText avgt 6
62.076 ± 2.107 ns/op
[java] StringsEncodeBench.writeShortTextAsASCII avgt 6
32.121 ± 0.403 ns/op
[java] StringsEncodeBench.writeShortTextWithExactSize avgt 6
41.929 ± 1.783 ns/op
[java] StringsEncodeBench.writeShortTextWithExactSizeSkipCalc avgt 6
34.638 ± 0.455 ns/op
{code}
Changes included in the latest commit:
* added {{CBUtil.writeAsciiString}}.
* updated the code to call the new method whenever applicable, i.e. the string
is certain to only contain char from (0, 127].
* added unit test cases for the new method.
One thing to note is that {{ByteBufUtil#writeAscii()}} allows the extend 8-bits
ASCII character, meanwhile, cassandra's ASCII data type follows the US-ASCII,
which is 7-bits. If the input string contains any character with value >
{{0x007F}}, {{ByteBufUtil#writeAscii()}} is going to encode it as 1 byte.
However, cassandra's encoding considers it as 2 bytes. It leads to errors when
reading the string.
Client that calls {{CBUtil.writeAsciiString}} needs to be certain that the data
type is ascii.
I tried to add the validation step to check if the characters are US-ASCII
compatible before writing, and add a boolean parameter to skip validation for
the cases that I am sure the string is compatible. It ends up with all
references of the method have the parameter set to skip, which looks slightly
messy.
I think for the places we want to use {{writeAsciiString}} over
{{writeString}}, we already know the encoding of the string. So the parameter
was omitted.
Last but not the least, the root issue is that the {{ecodeSize}} is not a
strong constraint when encoding. The encoding method cannot safely rely on the
information. One improvement could be automate the calculation of the message
size. Right now, the size was calculated manually by adding up the size of each
field and *it is really error-prone*. Surely, it is outside the scope of this
ticket.
> Avoid over-allocation of bytes for UTF8 string serialization
> -------------------------------------------------------------
>
> Key: CASSANDRA-15410
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15410
> Project: Cassandra
> Issue Type: Improvement
> Components: Messaging/Client
> Reporter: Yifan Cai
> Assignee: Yifan Cai
> Priority: Normal
> Fix For: 4.0
>
>
> In the current message encoding implementation, it first calculates the
> `encodeSize` and allocates the bytebuffer with that size.
> However, during encoding, it assumes the worst case of writing UTF8 string to
> allocate bytes, i.e. assuming each letter takes 3 bytes.
> The over-estimation further leads to resizing the underlying array and data
> copy.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]