[jira] [Commented] (CASSANDRA-15410) Avoid over-allocation of bytes for UTF8 string serialization
[ https://issues.apache.org/jira/browse/CASSANDRA-15410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16977596#comment-16977596 ]

Aleksey Yeschenko commented on CASSANDRA-15410:
-----------------------------------------------

Committed to {{trunk}} as [647bdd6a11970f80666d7f20b53af76fbda4ff14|https://github.com/apache/cassandra/commit/647bdd6a11970f80666d7f20b53af76fbda4ff14], thanks. Made just one tiny tweak: switched encoding of UDT names and field types to ASCII strings as well.

> Avoid over-allocation of bytes for UTF8 string serialization
> ------------------------------------------------------------
>
>            Key: CASSANDRA-15410
>            URL: https://issues.apache.org/jira/browse/CASSANDRA-15410
>        Project: Cassandra
>     Issue Type: Improvement
>     Components: Messaging/Client
>       Reporter: Yifan Cai
>       Assignee: Yifan Cai
>       Priority: Normal
>        Fix For: 4.0
>
> In the current message encoding implementation, it first calculates the
> `encodeSize` and allocates the bytebuffer with that size.
> However, during encoding, it assumes the worst case of writing a UTF8 string
> when allocating bytes, i.e. that each character takes 3 bytes.
> The over-estimation further leads to resizing of the underlying array and a
> data copy.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15410) Avoid over-allocation of bytes for UTF8 string serialization
[ https://issues.apache.org/jira/browse/CASSANDRA-15410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976776#comment-16976776 ]

Yifan Cai commented on CASSANDRA-15410:
---------------------------------------

Updated the PR with the {{sizeOfAsciiString()}} method. Since the input of the method is a US-ASCII string, the method simply returns {{2 (size) + str.length}}. It is therefore two orders of magnitude faster than {{sizeOfString()}}, which iterates through the string.

{code:java}
[java] Benchmark                             Mode  Cnt    Score    Error  Units
[java] StringsEncodeBench.sizeOfAsciiString  avgt    6    1.999 ±  0.153  ns/op
[java] StringsEncodeBench.sizeOfString       avgt    6  283.413 ± 24.614  ns/op
{code}
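The constant-time calculation can be sketched as below. This is a standalone sketch with a hypothetical class name, not the actual {{CBUtil}}/{{TypeSizes}} code:

```java
import java.nio.charset.StandardCharsets;

class AsciiSize
{
    // For a US-ASCII string every char encodes to exactly 1 UTF-8 byte,
    // so the wire size is just the 2-byte length prefix plus the char
    // count -- no per-character loop is needed.
    static int sizeOfAsciiString(String s)
    {
        return 2 + s.length();
    }

    public static void main(String[] args)
    {
        String name = "system_schema"; // keyspace names are ASCII-only
        int fast = sizeOfAsciiString(name);
        // what a per-character scan would arrive at for the same string
        int slow = 2 + name.getBytes(StandardCharsets.UTF_8).length;
        System.out.println(fast + " == " + slow);
    }
}
```

The two results agree precisely because, for ASCII input, the encoded byte count equals the character count.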
[jira] [Commented] (CASSANDRA-15410) Avoid over-allocation of bytes for UTF8 string serialization
[ https://issues.apache.org/jira/browse/CASSANDRA-15410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976583#comment-16976583 ]

Aleksey Yeschenko commented on CASSANDRA-15410:
-----------------------------------------------

While you are at it, maybe update the {{encodedSize()}} implementation as well to use the faster {{sizeOfAsciiString()}} - if not for performance, then for symmetry?
[jira] [Commented] (CASSANDRA-15410) Avoid over-allocation of bytes for UTF8 string serialization
[ https://issues.apache.org/jira/browse/CASSANDRA-15410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975536#comment-16975536 ]

Yifan Cai commented on CASSANDRA-15410:
---------------------------------------

Good idea! I have updated the PR accordingly. The benchmark result indicates the new {{CBUtil.writeAsciiString}} method performs on par with the best previous case:

{code:java}
[java] Benchmark                                                 Mode  Cnt    Score    Error  Units
[java] StringsEncodeBench.writeLongText                          avgt    6  570.920 ± 21.376  ns/op
[java] StringsEncodeBench.writeLongTextAsASCII                   avgt    6  291.466 ±  9.508  ns/op
[java] StringsEncodeBench.writeLongTextWithExactSize             avgt    6  467.222 ± 25.140  ns/op
[java] StringsEncodeBench.writeLongTextWithExactSizeSkipCalc     avgt    6  285.320 ± 10.883  ns/op
[java] StringsEncodeBench.writeShortText                         avgt    6   62.076 ±  2.107  ns/op
[java] StringsEncodeBench.writeShortTextAsASCII                  avgt    6   32.121 ±  0.403  ns/op
[java] StringsEncodeBench.writeShortTextWithExactSize            avgt    6   41.929 ±  1.783  ns/op
[java] StringsEncodeBench.writeShortTextWithExactSizeSkipCalc    avgt    6   34.638 ±  0.455  ns/op
{code}

Changes included in the latest commit:
* Added {{CBUtil.writeAsciiString}}.
* Updated the code to call the new method wherever applicable, i.e. wherever the string is certain to only contain characters in the range (0, 127].
* Added unit test cases for the new method.

One thing to note is that {{ByteBufUtil#writeAscii()}} allows extended 8-bit ASCII characters, while Cassandra's ASCII data type follows US-ASCII, which is 7-bit. If the input string contains any character with a value > {{0x007F}}, {{ByteBufUtil#writeAscii()}} encodes it as 1 byte, whereas Cassandra's size calculation counts it as 2 bytes. This mismatch leads to errors when reading the string back. Callers of {{CBUtil.writeAsciiString}} therefore need to be certain that the data is ASCII.

I tried adding a validation step to check that the characters are US-ASCII compatible before writing, with a boolean parameter to skip validation in the cases where I am sure the string is compatible. It ended up with every caller of the method setting the parameter to skip, which looked slightly messy. In the places where we want {{writeAsciiString}} over {{writeString}}, we already know the encoding of the string, so the parameter was omitted.

Last but not least, the root issue is that {{encodeSize}} is not a strong constraint during encoding; the encoding method cannot safely rely on that information. One improvement could be to automate the calculation of the message size. Right now the size is calculated manually by adding up the size of each field, and *that is really error-prone*. But surely that is outside the scope of this ticket.
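The 7-bit compatibility concern described above boils down to a check like the following (a hypothetical standalone helper, not the actual Cassandra validation code):

```java
class AsciiCheck
{
    // Netty's writeAscii() truncates each char to one byte, so a char above
    // 0x7F would be written as 1 byte while a UTF-8 size calculation counts
    // it as 2 -- the mismatch corrupts the stream on read. US-ASCII input
    // (all chars <= 0x7F) is the safe precondition.
    static boolean isUsAscii(String s)
    {
        for (int i = 0; i < s.length(); i++)
            if (s.charAt(i) > 0x7F)
                return false;
        return true;
    }

    public static void main(String[] args)
    {
        System.out.println(isUsAscii("system_schema")); // safe for writeAscii
        System.out.println(isUsAscii("caf\u00E9"));     // 0xE9 > 0x7F: not safe
    }
}
```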
[jira] [Commented] (CASSANDRA-15410) Avoid over-allocation of bytes for UTF8 string serialization
[ https://issues.apache.org/jira/browse/CASSANDRA-15410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975077#comment-16975077 ]

Aleksey Yeschenko commented on CASSANDRA-15410:
-----------------------------------------------

I don't want to trust {{encodedSize()}} implementations, really (we have had mismatch bugs in those calculations before).

Now, if you inspect the usages of {{CBUtil#writeString()}}, you'll notice that the vast majority of those calls are performed strictly on ASCII strings. Keyspace names, table names, type names, function names, field names, enums - those can *only* be ASCII due to our grammar limitations. Of the strings in the messages the C* server writes - the majority being in {{ResultSet.ResultMetadata}} - none ever go beyond ASCII.

So we could introduce a special-case {{CBUtil#writeASCIIString()}} that uses {{ByteBufUtil#writeAscii()}} and use it throughout; that should resolve the issue. And you'll get a further speed bump from removing the loop in the ASCII string size calculation in all the {{encodedSize()}} methods.
[jira] [Commented] (CASSANDRA-15410) Avoid over-allocation of bytes for UTF8 string serialization
[ https://issues.apache.org/jira/browse/CASSANDRA-15410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974417#comment-16974417 ]

Yifan Cai commented on CASSANDRA-15410:
---------------------------------------

[~aleksey] thanks for reviewing. Understood the concern: the change assumes the method is, and will only be, used in scenarios such as message encoding.

If {{encodeSize}} is a strong constraint and correct for message serialization, so that it can be relied on, we can add a separate method that explicitly takes the reserved size as an argument.

There is a big performance difference between calculating the size and skipping the calculation. For long strings especially, the former takes twice as long as the latter.
[jira] [Commented] (CASSANDRA-15410) Avoid over-allocation of bytes for UTF8 string serialization
[ https://issues.apache.org/jira/browse/CASSANDRA-15410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974246#comment-16974246 ]

Aleksey Yeschenko commented on CASSANDRA-15410:
-----------------------------------------------

I welcome this change in principle, but it makes me feel a little uneasy. Specifically, it relies implicitly on how the method is used two levels of abstraction up, and feels slightly brittle to me.

I wonder if you'd be open instead to a slightly less efficient but, importantly, still resize-avoiding alternative: modify {{CBUtil#writeString()}} (and {{CBUtil#writeLongString()}} while at it, for consistency's sake) to explicitly calculate the precise size of the encoded string using the {{TypeSizes#encodedUTF8Length()}} method, and then pass that value to {{ByteBufUtil#reserveAndWriteUtf8()}}.

This way, yes, you would effectively calculate the size of the string twice, but the abstraction won't leak assumptions, and you will still avoid unnecessary resizing.
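The proposed "compute the exact size first, then allocate exactly" approach can be sketched in plain Java. The names here are hypothetical; in the real code {{TypeSizes#encodedUTF8Length()}} and Netty's {{reserveAndWriteUtf8()}} play these two roles, and the real length calculation may differ in detail:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class ExactSizeWrite
{
    // First pass: exact UTF-8 byte length without materializing the bytes.
    static int encodedUtf8Length(String s)
    {
        int len = 0;
        for (int i = 0; i < s.length(); i++)
        {
            char c = s.charAt(i);
            if (c <= 0x7F)
                len += 1;
            else if (c <= 0x7FF)
                len += 2;
            else if (Character.isHighSurrogate(c))
            {
                len += 4; // the surrogate pair encodes to 4 bytes total
                i++;      // skip the low surrogate
            }
            else
                len += 3;
        }
        return len;
    }

    // Second pass: write [short length][bytes] into a buffer sized exactly,
    // so the underlying array never needs to grow and be copied.
    static ByteBuffer writeString(String s)
    {
        int size = encodedUtf8Length(s);
        ByteBuffer buf = ByteBuffer.allocate(2 + size);
        buf.putShort((short) size);
        buf.put(s.getBytes(StandardCharsets.UTF_8));
        buf.flip();
        return buf;
    }
}
```

The string is effectively scanned twice (once for the length, once to encode), which is the "slightly less efficient" trade-off the comment describes, but no resize can ever occur.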
[jira] [Commented] (CASSANDRA-15410) Avoid over-allocation of bytes for UTF8 string serialization
[ https://issues.apache.org/jira/browse/CASSANDRA-15410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16971902#comment-16971902 ]

Yifan Cai commented on CASSANDRA-15410:
---------------------------------------

||Branch||PR||Test||
|[CASSANDRA-15410|https://github.com/yifan-c/cassandra/tree/CASSANDRA-15410]|[PR|https://github.com/apache/cassandra/pull/382]|[Test|https://app.circleci.com/jobs/github/yifan-c/cassandra/56]|

Given that the {{encodeSize}} has already been calculated by the time we encode, we can leverage that size and safely reserve the remaining capacity for writing, avoiding resizing. A set of benchmarks was taken to show the difference. For the long text, the change cuts the string encoding time by more than half, from 571.9 ns to 216.1 ns; the time is almost halved for the short text as well. The improvement comes from avoiding the unnecessary resizing and data copy.

{code:java}
[java] Benchmark                                                    Mode  Cnt    Score    Error  Units
[java] Utf8StringEncodeBench.writeLongText                          avgt    6  571.949 ± 19.791  ns/op
[java] Utf8StringEncodeBench.writeLongTextWithExactSize             avgt    6  459.932 ± 27.790  ns/op
[java] Utf8StringEncodeBench.writeLongTextWithExactSizeSkipCalc     avgt    6  216.085 ±  3.480  ns/op
[java] Utf8StringEncodeBench.writeShortText                         avgt    6   62.775 ±  6.159  ns/op
[java] Utf8StringEncodeBench.writeShortTextWithExactSize            avgt    6   44.071 ±  5.645  ns/op
[java] Utf8StringEncodeBench.writeShortTextWithExactSizeSkipCalc    avgt    6   36.358 ±  5.135  ns/op
{code}

* writeLongText: the original implementation that calls {{ByteBufUtils.writeUtf8}}. It over-estimates the size of the string, which causes the buffer to be resized.
* writeLongTextWithExactSize: calls {{TypeSizes.encodedUTF8Length}} to reserve the exact number of bytes to write.
* writeLongTextWithExactSizeSkipCalc: optimizes further by removing the UTF8 length calculation. Because the {{encodeSize}} is calculated before encoding a message, the size of the final bytes is already known, and we can leverage this information to simply reserve the remaining capacity.
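The over-estimation being avoided can be illustrated with a small sketch, assuming the 3-bytes-per-character worst case that the ticket description mentions (hypothetical helper names, for illustration only):

```java
import java.nio.charset.StandardCharsets;

class OverAllocation
{
    // Worst-case reservation a generic UTF-8 writer makes up front:
    // 3 bytes per char, per the ticket description.
    static int worstCaseUtf8(String s)
    {
        return s.length() * 3;
    }

    // Bytes the same string actually needs once encoded.
    static int actualUtf8(String s)
    {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args)
    {
        String column = "an_ascii_column_name";
        // For ASCII payloads the writer reserves 3x what it ends up using;
        // that gap is what forces the later array resize and data copy.
        System.out.println(worstCaseUtf8(column) + " reserved vs "
                           + actualUtf8(column) + " used");
    }
}
```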