AHeise commented on a change in pull request #10358: [FLINK-14346] 
[serialization] faster implementation of StringValue writeString and readString
URL: https://github.com/apache/flink/pull/10358#discussion_r353351658
 
 

 ##########
 File path: flink-core/src/main/java/org/apache/flink/types/StringValue.java
 ##########
 @@ -759,56 +761,142 @@ public static String readString(DataInput in) throws 
IOException {
                        }
                        len |= curr << shift;
                }
-               
+
                // subtract one for the null length
                len -= 1;
-               
-               final char[] data = new char[len];
 
-               for (int i = 0; i < len; i++) {
-                       int c = in.readUnsignedByte();
-                       if (c < HIGH_BIT) {
-                               data[i] = (char) c;
-                       } else {
+               /* as we have no idea about byte-length of the serialized 
string, we cannot fully
+                * read it into memory buffer. But we can do it in an 
optimistic way:
+                * 1. In a happy case when the string is an us-ascii one, then 
byte_len == char_len
+                * 2. If we spot at least one character with code >= 127, then 
we reallocate the buffer
+                * to accommodate for the next characters.
+                */
+
+               // happily assume that the string is an 7 bit us-ascii one
+               byte[] buf = new byte[len];
+               in.readFully(buf);
+
+               final char[] data = new char[len];
+               int charPosition = 0;
+               int bufSize = len;
+               int bytePosition = 0;
+
+               while (charPosition < len) {
+                       // there is at least `char count - char position` bytes 
left in case if all the
+                       // remaining characters are 7 bit.
+                       int remainingBytesEstimation = len - charPosition;
 
 Review comment:
   This is less of an estimation and more of a lower bound, right? Estimation 
would be something like 
   `remainingBytes = <usedBytes> / charPosition * len` where we extrapolate the 
byte usage of the previous letters. We cannot do an estimation, however, since 
we may read smaller letters in the remaining path leading to some EOI 
exception. So the only sane way to reuse the buffer is to use the lower bound 
that assumes that all remaining letters take 1 byte. Your comment is pointing 
that perfectly out.
   So to reflect that, how about renaming the variable into `minRemainingBytes` 
or `remainingLetters`?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

Reply via email to