shuttie commented on issue #10358: [FLINK-14346] [serialization] faster implementation of StringValue writeString and readString
URL: https://github.com/apache/flink/pull/10358#issuecomment-561581049

@AHeise thanks for all the ideas, I've updated the PR with all the proposals applied.

As for the `writeString` fallback code, I've found a better way of dealing with short strings that does not require a separate code path. If you stare long enough at the jmh perfasm listing for short strings, you may notice that most of the time (compared with the original implementation) is spent in the initial buffer size computation. In the original unbuffered code there is no reason to compute it, as there is no buffer. But in this PR we need to scan a string twice: once to compute the buffer size, and then again to write the characters to the buffer.

The main idea of this PR is to leverage CPU-level parallelism, helping the CPU process multiple characters at once. But the problem with short strings is that there is nothing to parallelize, so the double-scanning overhead starts to kill performance.

The proposed fix is to over-allocate the buffer for short strings, skipping the exact buffer size computation. I've found the tipping point for this approach lying somewhere between 6 and 8 characters:
* for strings < 6 chars it's faster to over-allocate,
* for strings of 6-8 chars it's the same as exact computation,
* for strings > 8 chars it can be slower, but insignificantly. In theory, though, it may produce some GC pressure.
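To make the over-allocation idea concrete, here is a minimal sketch of it, not the code from the PR. The class name `ShortStringWriteSketch`, the threshold constant, and all helper names are hypothetical. The wire format used here (a variable-length `length + 1` prefix followed by each char encoded 7 bits per byte) is an assumption about how `StringValue` lays out strings, and null handling is left out.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class ShortStringWriteSketch {
    // Hypothetical threshold, picked from the 6-8 character tipping point.
    static final int SHORT_STRING_MAX_LENGTH = 8;
    // Worst-case sizes: a 32-bit length needs at most 5 bytes at 7 bits each,
    // a 16-bit char needs at most 3.
    static final int MAX_LENGTH_PREFIX_BYTES = 5;
    static final int MAX_BYTES_PER_CHAR = 3;
    static final int HIGH_BIT = 0x80;

    // Number of bytes a non-negative value occupies at 7 bits per byte.
    static int varIntSize(int value) {
        int size = 1;
        while (value >= HIGH_BIT) {
            value >>>= 7;
            size++;
        }
        return size;
    }

    // Exact buffer size: this is the extra scan that hurts short strings.
    static int exactSize(CharSequence cs) {
        int size = varIntSize(cs.length() + 1);
        for (int i = 0; i < cs.length(); i++) {
            size += varIntSize(cs.charAt(i));
        }
        return size;
    }

    // Writes value into buf starting at pos, returns the next free position.
    static int putVarInt(byte[] buf, int pos, int value) {
        while (value >= HIGH_BIT) {
            buf[pos++] = (byte) (value | HIGH_BIT);
            value >>>= 7;
        }
        buf[pos++] = (byte) value;
        return pos;
    }

    static void writeString(CharSequence cs, DataOutput out) throws IOException {
        int len = cs.length();
        // Short strings: skip the sizing scan and allocate for the worst case.
        // Longer strings: pay for the exact computation.
        int capacity = len <= SHORT_STRING_MAX_LENGTH
                ? MAX_LENGTH_PREFIX_BYTES + MAX_BYTES_PER_CHAR * len
                : exactSize(cs);
        byte[] buf = new byte[capacity];
        int pos = putVarInt(buf, 0, len + 1);
        for (int i = 0; i < len; i++) {
            pos = putVarInt(buf, pos, cs.charAt(i));
        }
        // Only the bytes actually used are written, so both paths emit the same output.
        out.write(buf, 0, pos);
    }

    // Convenience helper for trying the sketch out.
    static byte[] toBytes(CharSequence cs) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try {
            writeString(cs, new DataOutputStream(bytes));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

With these constants an 8-character string gets a 29-byte buffer of which 9 bytes are used for ASCII input. That unused space is the potential GC pressure mentioned above, traded for skipping one pass over the string.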
The current round of benchmarks:
```
[info] Benchmark                                       (length)  (stringType)  Mode  Cnt    Score   Error  Units
[info] StringDeserializerBenchmark.deserializeDefault         1         ascii  avgt   50   45.618 ± 0.339  ns/op
[info] StringDeserializerBenchmark.deserializeDefault         2         ascii  avgt   50   61.348 ± 0.579  ns/op
[info] StringDeserializerBenchmark.deserializeDefault         4         ascii  avgt   50   88.067 ± 1.058  ns/op
[info] StringDeserializerBenchmark.deserializeDefault         8         ascii  avgt   50  142.902 ± 1.121  ns/op
[info] StringDeserializerBenchmark.deserializeDefault        16         ascii  avgt   50  249.181 ± 1.920  ns/op
[info] StringDeserializerBenchmark.deserializeDefault        32         ascii  avgt   50  466.382 ± 1.502  ns/op
[info] StringDeserializerBenchmark.deserializeImproved        1         ascii  avgt   50   49.916 ± 0.132  ns/op
[info] StringDeserializerBenchmark.deserializeImproved        2         ascii  avgt   50   50.278 ± 0.064  ns/op
[info] StringDeserializerBenchmark.deserializeImproved        4         ascii  avgt   50   50.365 ± 0.129  ns/op
[info] StringDeserializerBenchmark.deserializeImproved        8         ascii  avgt   50   52.463 ± 0.301  ns/op
[info] StringDeserializerBenchmark.deserializeImproved       16         ascii  avgt   50   55.711 ± 0.597  ns/op
[info] StringDeserializerBenchmark.deserializeImproved       32         ascii  avgt   50   65.342 ± 0.555  ns/op
[info] StringSerializerBenchmark.serializeDefault             1         ascii  avgt   50   31.076 ± 0.192  ns/op
[info] StringSerializerBenchmark.serializeDefault             2         ascii  avgt   50   31.770 ± 1.811  ns/op
[info] StringSerializerBenchmark.serializeDefault             4         ascii  avgt   50   39.251 ± 0.189  ns/op
[info] StringSerializerBenchmark.serializeDefault             8         ascii  avgt   50   57.736 ± 0.253  ns/op
[info] StringSerializerBenchmark.serializeDefault            16         ascii  avgt   50   94.964 ± 0.514  ns/op
[info] StringSerializerBenchmark.serializeDefault            32         ascii  avgt   50  168.754 ± 1.416  ns/op
[info] StringSerializerBenchmark.serializeImproved            1         ascii  avgt   50   30.145 ± 0.156  ns/op
[info] StringSerializerBenchmark.serializeImproved            2         ascii  avgt   50   30.873 ± 0.274  ns/op
[info] StringSerializerBenchmark.serializeImproved            4         ascii  avgt   50   31.993 ± 0.276  ns/op
[info] StringSerializerBenchmark.serializeImproved            8         ascii  avgt   50   46.220 ± 0.211  ns/op
[info] StringSerializerBenchmark.serializeImproved           16         ascii  avgt   50   50.856 ± 0.826  ns/op
[info] StringSerializerBenchmark.serializeImproved           32         ascii  avgt   50   63.221 ± 1.130  ns/op
```

So for large strings the new implementation is much faster, and for short strings it's not regressing (and is even slightly faster).
