[jira] [Commented] (CASSANDRA-15410) Avoid over-allocation of bytes for UTF8 string serialization

2019-11-19 Thread Aleksey Yeschenko (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16977596#comment-16977596
 ] 

Aleksey Yeschenko commented on CASSANDRA-15410:
---

Committed to {{trunk}} as 
[647bdd6a11970f80666d7f20b53af76fbda4ff14|https://github.com/apache/cassandra/commit/647bdd6a11970f80666d7f20b53af76fbda4ff14],
 thanks. Made just one tiny tweak: switched encoding of UDT names and field 
types to ascii strings as well.

> Avoid over-allocation of bytes for UTF8 string serialization 
> -
>
> Key: CASSANDRA-15410
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15410
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Messaging/Client
>Reporter: Yifan Cai
>Assignee: Yifan Cai
>Priority: Normal
> Fix For: 4.0
>
>
> In the current message encoding implementation, it first calculates the 
> `encodeSize` and allocates the bytebuffer with that size. 
> However, during encoding, it assumes the worst case of writing UTF8 string to 
> allocate bytes, i.e. assuming each letter takes 3 bytes. 
> The over-estimation further leads to resizing the underlying array and data 
> copy. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15410) Avoid over-allocation of bytes for UTF8 string serialization

2019-11-18 Thread Yifan Cai (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976776#comment-16976776
 ] 

Yifan Cai commented on CASSANDRA-15410:
---

Updated the PR with the {{sizeOfAsciiString()}} method. 

Since the input of the method is a US-ASCII string, the method simply returns 
{{2 (size) + str.length}}. Therefore, 2 orders of magnitude faster than 
{{sizeOfString()}} that iterates through the string.  

{code:java}
[java] Benchmark   Mode  Cnt
ScoreError  Units
[java] StringsEncodeBench.sizeOfAsciiStringavgt6
1.999 ±  0.153  ns/op
[java] StringsEncodeBench.sizeOfString avgt6  
283.413 ± 24.614  ns/op
{code}


> Avoid over-allocation of bytes for UTF8 string serialization 
> -
>
> Key: CASSANDRA-15410
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15410
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Messaging/Client
>Reporter: Yifan Cai
>Assignee: Yifan Cai
>Priority: Normal
> Fix For: 4.0
>
>
> In the current message encoding implementation, it first calculates the 
> `encodeSize` and allocates the bytebuffer with that size. 
> However, during encoding, it assumes the worst case of writing UTF8 string to 
> allocate bytes, i.e. assuming each letter takes 3 bytes. 
> The over-estimation further leads to resizing the underlying array and data 
> copy. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15410) Avoid over-allocation of bytes for UTF8 string serialization

2019-11-18 Thread Aleksey Yeschenko (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976583#comment-16976583
 ] 

Aleksey Yeschenko commented on CASSANDRA-15410:
---

While you are at it, maybe update {{encodedSize()}} implementation as well to 
use the faster {{sizeOfAsciiString()}} - if not for performance then for 
symmetry?

> Avoid over-allocation of bytes for UTF8 string serialization 
> -
>
> Key: CASSANDRA-15410
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15410
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Messaging/Client
>Reporter: Yifan Cai
>Assignee: Yifan Cai
>Priority: Normal
> Fix For: 4.0
>
>
> In the current message encoding implementation, it first calculates the 
> `encodeSize` and allocates the bytebuffer with that size. 
> However, during encoding, it assumes the worst case of writing UTF8 string to 
> allocate bytes, i.e. assuming each letter takes 3 bytes. 
> The over-estimation further leads to resizing the underlying array and data 
> copy. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15410) Avoid over-allocation of bytes for UTF8 string serialization

2019-11-15 Thread Yifan Cai (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975536#comment-16975536
 ] 

Yifan Cai commented on CASSANDRA-15410:
---

Good idea! I have updated the PR accordingly. The benchmark result indicates 
the new {{CBUtil.writeAsciiString}} method has a similar performance. 


{code:java}
[java] Benchmark   Mode  Cnt
ScoreError  Units
[java] StringsEncodeBench.writeLongTextavgt6  
570.920 ± 21.376  ns/op
[java] StringsEncodeBench.writeLongTextAsASCII avgt6  
291.466 ±  9.508  ns/op
[java] StringsEncodeBench.writeLongTextWithExactSize   avgt6  
467.222 ± 25.140  ns/op
[java] StringsEncodeBench.writeLongTextWithExactSizeSkipCalc   avgt6  
285.320 ± 10.883  ns/op
[java] StringsEncodeBench.writeShortText   avgt6   
62.076 ±  2.107  ns/op
[java] StringsEncodeBench.writeShortTextAsASCIIavgt6   
32.121 ±  0.403  ns/op
[java] StringsEncodeBench.writeShortTextWithExactSize  avgt6   
41.929 ±  1.783  ns/op
[java] StringsEncodeBench.writeShortTextWithExactSizeSkipCalc  avgt6   
34.638 ±  0.455  ns/op
{code}

Changes included in the latest commit:
* added {{CBUtil.writeAsciiString}}.
* updated the code to call the new method whenever applicable, i.e. the string 
is certain to only contain char from (0, 127].
* added unit test cases for the new method.

One thing to note is that {{ByteBufUtil#writeAscii()}} allows the extend 8-bits 
ASCII character, meanwhile, cassandra's ASCII data type follows the US-ASCII, 
which is 7-bits. If the input string contains any character with value > 
{{0x007F}}, {{ByteBufUtil#writeAscii()}} is going to encode it as 1 byte. 
However, cassandra's encoding considers it as 2 bytes. It leads to errors when 
reading the string. 
Client that calls {{CBUtil.writeAsciiString}} needs to be certain that the data 
type is ascii. 
I tried to add the validation step to check if the characters are US-ASCII 
compatible before writing, and add a boolean parameter to skip validation for 
the cases that I am sure the string is compatible. It ends up with all 
references of the method have the parameter set to skip, which looks slightly 
messy. 
I think for the places we want to use {{writeAsciiString}} over 
{{writeString}}, we already know the encoding of the string. So the parameter 
was omitted.

Last but not the least, the root issue is that the {{ecodeSize}} is not a 
strong constraint when encoding. The encoding method cannot safely rely on the 
information. One improvement could be automate the calculation of the message 
size. Right now, the size was calculated manually by adding up the size of each 
field and *it is really error-prone*. Surely, it is outside the scope of this 
ticket. 

> Avoid over-allocation of bytes for UTF8 string serialization 
> -
>
> Key: CASSANDRA-15410
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15410
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Messaging/Client
>Reporter: Yifan Cai
>Assignee: Yifan Cai
>Priority: Normal
> Fix For: 4.0
>
>
> In the current message encoding implementation, it first calculates the 
> `encodeSize` and allocates the bytebuffer with that size. 
> However, during encoding, it assumes the worst case of writing UTF8 string to 
> allocate bytes, i.e. assuming each letter takes 3 bytes. 
> The over-estimation further leads to resizing the underlying array and data 
> copy. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15410) Avoid over-allocation of bytes for UTF8 string serialization

2019-11-15 Thread Aleksey Yeschenko (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975077#comment-16975077
 ] 

Aleksey Yeschenko commented on CASSANDRA-15410:
---

I don't want to trust {{encodedSize()}} implementations, really (we had 
mismatch bugs in those calculations before).

Now, if you inspect the usages of {{CBUtil#writeString()}} you'll notice that 
absolute most of those calls are performed strictly on ASCII strings. Keyspace 
names, table names, type names, function names, field names, enums - that can 
*only* be ASCII due to our grammar limitations.

Of the strings in messages C* server writes - majority being in 
{{ResultSet.ResultMetadata}} - none of the strings ever go beyond ASCII. So we 
could introduce a special-case {{CBUtil#writeASCIIString()}} to use 
{{ByteBufUtil#writeAscii()}} and use that throughout, that should resolve the 
issue. And you'll get a further speed bump from the loop removal in ascii 
string size calculation in all the {{encodedSize()}} methods.

> Avoid over-allocation of bytes for UTF8 string serialization 
> -
>
> Key: CASSANDRA-15410
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15410
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Messaging/Client
>Reporter: Yifan Cai
>Assignee: Yifan Cai
>Priority: Normal
> Fix For: 4.0
>
>
> In the current message encoding implementation, it first calculates the 
> `encodeSize` and allocates the bytebuffer with that size. 
> However, during encoding, it assumes the worst case of writing UTF8 string to 
> allocate bytes, i.e. assuming each letter takes 3 bytes. 
> The over-estimation further leads to resizing the underlying array and data 
> copy. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15410) Avoid over-allocation of bytes for UTF8 string serialization

2019-11-14 Thread Yifan Cai (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974417#comment-16974417
 ] 

Yifan Cai commented on CASSANDRA-15410:
---

[~aleksey] thanks for reviewing. 

Understood the concern here. It assumes the method is/will be only used in the 
scenarios such as encoding message.

We can have a separate method that explicitly takes the argument of reserved 
size, if encodeSize is a strong constraint and correct for message 
serialization, so it can be relied on. There is a big performance difference 
calculating the size and skipping calculation. Especially for the long strings, 
the time taken of the prior is 2x than the latter one.

> Avoid over-allocation of bytes for UTF8 string serialization 
> -
>
> Key: CASSANDRA-15410
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15410
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Messaging/Client
>Reporter: Yifan Cai
>Assignee: Yifan Cai
>Priority: Normal
> Fix For: 4.0
>
>
> In the current message encoding implementation, it first calculates the 
> `encodeSize` and allocates the bytebuffer with that size. 
> However, during encoding, it assumes the worst case of writing UTF8 string to 
> allocate bytes, i.e. assuming each letter takes 3 bytes. 
> The over-estimation further leads to resizing the underlying array and data 
> copy. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15410) Avoid over-allocation of bytes for UTF8 string serialization

2019-11-14 Thread Aleksey Yeschenko (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974246#comment-16974246
 ] 

Aleksey Yeschenko commented on CASSANDRA-15410:
---

I welcome this change in principle, but it makes me feel a little uneasy. 
Specifically, it relies implicitly on how the method is used two levels of 
abstraction up, and feels slightly brittle to me.

I wonder if you'd be open instead to a slightly less efficient but, 
importantly, still resize-avoiding alternative: modify {{CBUtil#writeString()}} 
(and {{CBUtil#writeLongString()}} while at it, just for consistency sake) to 
explicitly calculate the precise size of encoded string using 
{{TypeSizes#encodedUTF8Length()}} method, and then pass that value to 
{{ByteBufUtil#reserveAndWriteUtf8()}}. This way, yes, you would effectively 
calculate the size of the string twice, but the abstraction won't leak 
assumptions, and will still avoid unnecessary resizing.

> Avoid over-allocation of bytes for UTF8 string serialization 
> -
>
> Key: CASSANDRA-15410
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15410
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Messaging/Client
>Reporter: Yifan Cai
>Assignee: Yifan Cai
>Priority: Normal
> Fix For: 4.0
>
>
> In the current message encoding implementation, it first calculates the 
> `encodeSize` and allocates the bytebuffer with that size. 
> However, during encoding, it assumes the worst case of writing UTF8 string to 
> allocate bytes, i.e. assuming each letter takes 3 bytes. 
> The over-estimation further leads to resizing the underlying array and data 
> copy. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15410) Avoid over-allocation of bytes for UTF8 string serialization

2019-11-11 Thread Yifan Cai (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16971902#comment-16971902
 ] 

Yifan Cai commented on CASSANDRA-15410:
---

||Branch||PR||Test||
|[CASSANDRA-15410|https://github.com/yifan-c/cassandra/tree/CASSANDRA-15410]|[PR|https://github.com/apache/cassandra/pull/382]|[Test|https://app.circleci.com/jobs/github/yifan-c/cassandra/56]|

Given the fact that the encodeSize was calculated already when encoding, we can 
leverage the size and safely reserve the remaining capacity for writing to 
avoid resizing.

A set of benchmarks were taken to show the difference. For the long text, the 
change halves the string encoding time from 571.9 ns to 216.1 ns. The time is 
almost halves for the short text as well.

The improvement is because of avoiding the unnecessary resizing and data copy.

{code:java}
[java] Benchmark  Mode  Cnt
ScoreError  Units
[java] Utf8StringEncodeBench.writeLongTextavgt6  
571.949 ± 19.791  ns/op
[java] Utf8StringEncodeBench.writeLongTextWithExactSize   avgt6  
459.932 ± 27.790  ns/op
[java] Utf8StringEncodeBench.writeLongTextWithExactSizeSkipCalc   avgt6  
216.085 ±  3.480  ns/op
[java] Utf8StringEncodeBench.writeShortText   avgt6   
62.775 ±  6.159  ns/op
[java] Utf8StringEncodeBench.writeShortTextWithExactSize  avgt6   
44.071 ±  5.645  ns/op
[java] Utf8StringEncodeBench.writeShortTextWithExactSizeSkipCalc  avgt6   
36.358 ±  5.135  ns/op
{code}

* writeLongText: the original implementation that calls ByteBufUtils.writeUtf8. 
It over-estimates the size of string that causes resizing the buffer.
* writeLongTextWithExactSize: calls TypeSizes.encodeUTF8Length to reserve the 
exact size of bytes to write.
* writeLongTextWithExactSizeSkipCalc: optimize by removing calculating the UTF8 
length. Because we calculated the encodeSize before encode for messages. 
Therefore, the size of the final bytes is known, we can leverage this 
information to just reserve using the remaining capacity.


> Avoid over-allocation of bytes for UTF8 string serialization 
> -
>
> Key: CASSANDRA-15410
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15410
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Messaging/Client
>Reporter: Yifan Cai
>Priority: Normal
>
> In the current message encoding implementation, it first calculates the 
> `encodeSize` and allocates the bytebuffer with that size. 
> However, during encoding, it assumes the worst case of writing UTF8 string to 
> allocate bytes, i.e. assuming each letter takes 3 bytes. 
> The over-estimation further leads to resizing the underlying array and data 
> copy. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org