GitHub user tzulitai opened a pull request:
https://github.com/apache/flink/pull/4014
[FLINK-6763] [core] Make serialization of composite serializer configs more
efficient
This PR affects the serialization formats of configuration snapshots of
composite serializers, most notably the `PojoSerializer`, as well as others
such as `MapSerializer`, `GenericArraySerializer`, `TupleSerializer`, etc. It
also affects the serialization formats of the
`OperatorBackendSerializationProxy` and `KeyedBackendSerializationProxy`.
Prior to this PR, whenever we write a serializer and its config snapshot
into a checkpoint, we always write the start offset and end offset of the
serializer bytes, effectively indexing every serializer and its config. This
required buffering the whole list of serializer and config snapshot pairs when
writing the checkpoint.
This PR makes this more efficient by simply writing the length of the
serializer bytes before the serializer itself. This also requires less
buffering for the writes.
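The length-prefixed format described above can be sketched as follows. This is a minimal illustration, not Flink's actual code; the class and method names are hypothetical, and a plain byte array stands in for the serialized `TypeSerializer`:

```java
import java.io.*;

// Illustrative sketch of length-prefixed entries: instead of buffering all
// serializer bytes to record start/end offsets, each entry is preceded by
// its byte length, so entries can be written and read one at a time.
public class LengthPrefixedWrite {

    // Write one entry: a 4-byte length followed by the payload bytes.
    static void writeWithLength(DataOutputStream out, byte[] serializerBytes) throws IOException {
        out.writeInt(serializerBytes.length);
        out.write(serializerBytes);
    }

    // Read one entry back; the length tells the reader exactly how many
    // bytes belong to this entry, with no offset index needed.
    static byte[] readWithLength(DataInputStream in) throws IOException {
        int length = in.readInt();
        byte[] bytes = new byte[length];
        in.readFully(bytes);
        return bytes;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        writeWithLength(out, "serializerA".getBytes("UTF-8"));
        writeWithLength(out, "serializerB".getBytes("UTF-8"));

        DataInputStream in = new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()));
        System.out.println(new String(readWithLength(in), "UTF-8")); // serializerA
        System.out.println(new String(readWithLength(in), "UTF-8")); // serializerB
    }
}
```

Because each entry carries its own length, the writer never needs to hold the whole list of serializer/config pairs in memory just to compute offsets.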
## Implementation
Now, `TypeSerializerSerializationUtil` has the following methods for
writing / reading serializers:
- `writeSerializer`
- `tryReadSerializer`
- `writeSerializerWithResilience`
- `tryReadSerializerWithResilience`
The first two non-resilient variants remain as they were (they do not contain
the serializer-length logic) and need to remain untouched for backwards
compatibility, since previous checkpoints do not contain the serializer length
before the serializer bytes. They are only used in backwards-compatibility
code paths.
All composite type serializers now use the latter two `*WithResilience`
variants.
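The value of the length prefix for the `*WithResilience` variants is that a failed serializer read no longer corrupts the stream position. The sketch below illustrates the idea with hypothetical names (not the real `TypeSerializerSerializationUtil` API), using a toy decoder in place of serializer deserialization:

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

// Sketch of a resilient read: since the entry's byte length is known up
// front, the reader always consumes exactly that many bytes, so a failed
// deserialization leaves the stream aligned at the next entry.
public class ResilientRead {

    static String tryReadWithResilience(DataInputStream in) throws IOException {
        int length = in.readInt();
        byte[] bytes = new byte[length];
        in.readFully(bytes);           // always consume exactly `length` bytes
        try {
            return decode(bytes);      // may fail for an unreadable entry
        } catch (Exception e) {
            return null;               // caller can fall back; stream stays intact
        }
    }

    // Toy stand-in for serializer deserialization: rejects any entry that
    // does not start with the marker byte 'S'.
    static String decode(byte[] bytes) {
        if (bytes.length == 0 || bytes[0] != 'S') {
            throw new IllegalStateException("unreadable entry");
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        for (String entry : new String[] {"S-first", "broken", "S-third"}) {
            byte[] b = entry.getBytes(StandardCharsets.UTF_8);
            out.writeInt(b.length);
            out.write(b);
        }
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()));
        System.out.println(tryReadWithResilience(in)); // S-first
        System.out.println(tryReadWithResilience(in)); // null: unreadable, skipped
        System.out.println(tryReadWithResilience(in)); // S-third: still readable
    }
}
```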
## Effect on backwards compatibility
Backwards compatibility is still maintained for prior versions.
However, depending on whether or not this PR makes it into the 1.3.0
release, it may need further changes. The current state of the PR assumes
that it will be merged for 1.3.0.
If it misses the release, the PR needs additional changes to provide
separate code paths for the config snapshot reads of composite serializers:
one for VERSION 1, which still uses offsets, and one for an upticked
VERSION 2, which uses the new `*WithResilience` variants.
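Such a versioned restore path could dispatch on the version read from the snapshot header, roughly as sketched below. The class, method, and version labels here are illustrative assumptions, not code from the PR:

```java
// Hypothetical sketch of version-dispatched config snapshot reads, in case
// the PR misses 1.3.0: old checkpoints (VERSION 1) would be read with the
// offset-indexed logic, new ones (VERSION 2) with length-prefixed entries.
public class VersionedConfigSnapshotRead {

    static String selectReadPath(int readVersion) {
        switch (readVersion) {
            case 1:
                return "offset-indexed";   // legacy format with start/end offsets
            case 2:
                return "length-prefixed";  // new *WithResilience format
            default:
                throw new IllegalArgumentException("Unrecognized version: " + readVersion);
        }
    }

    public static void main(String[] args) {
        System.out.println(selectReadPath(1)); // offset-indexed
        System.out.println(selectReadPath(2)); // length-prefixed
    }
}
```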
## Tests
Since the tests in `TypeSerializerSerializationUtilTest` and
`PojoSerializerTest` already cover resilience to serializer read failures,
and this PR only changes the way we store information to achieve the same
functionality, no new tests are added.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/tzulitai/flink FLINK-6763
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/4014.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #4014
----
commit 61c7ba2bb2d6dbe750227815593c1c4d952556d4
Author: Tzu-Li (Gordon) Tai <[email protected]>
Date: 2017-05-29T18:07:04Z
[FLINK-6763] [core] Make serialization of composite serializer configs more
efficient
This commit affects the serialization formats of configuration snapshots
of composite serializers, most notably the PojoSerializer, as well as
others such as MapSerializer, GenericArraySerializer, TupleSerializer,
etc. It also affects the serialization formats of the
OperatorBackendSerializationProxy and KeyedBackendSerializationProxy.
Prior to this commit, whenever we write a serializer and its config
snapshot into a checkpoint, we always write the start offset and end
offset of the serializer bytes, effectively indexing every serializer
and its config. This required buffering the whole list of serializer and
config snapshot pairs when writing the checkpoint.
This commit changes this to be more efficient by just writing the length
of the serializer bytes prior to writing the serializer. This also
requires less buffering for the writes.
----