Github user fhueske commented on the pull request:
https://github.com/apache/flink/pull/983#issuecomment-127764809
I tried to write a DataSet program that processes Tuple0 records and fails
during execution. However, this is surprisingly hard. In fact, I didn't manage
to break the system.
The reason we argued that Tuple0 is unsafe is that de/serialization does not
read/write anything, so the byte stream is not advanced. Hence you could
"read" a million Tuple0 objects without advancing the stream. This is only
relevant for DataSets that consist entirely of types that have this behavior,
because as soon as there is another type with proper de/serialization the
stream is advanced.
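To make the concern concrete, here is a minimal sketch in plain Java (no Flink
classes). `Empty`, `serialize`, and `deserialize` are hypothetical stand-ins
for Tuple0 and its serializer, not the actual Flink code. It only shows that a
serializer which writes nothing produces zero bytes, and that its deserializer
happily "reads" any number of records from an empty input.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class ZeroByteSerialization {

    /** Hypothetical stand-in for a field-less record such as Tuple0. */
    static final class Empty {
        static final Empty INSTANCE = new Empty();
    }

    /** Serializing a field-less record writes nothing to the stream. */
    static void serialize(Empty record, DataOutputStream out) throws IOException {
        // intentionally empty: there are no fields to write
    }

    /** Deserializing reads nothing, so the input position never moves. */
    static Empty deserialize(DataInputStream in) throws IOException {
        return Empty.INSTANCE;
    }

    /** Serializes n records and returns the number of bytes produced. */
    static int bytesWritten(int n) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            for (int i = 0; i < n; i++) {
                serialize(Empty.INSTANCE, out);
            }
            out.flush();
            return buffer.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** "Reads" n records from an empty input and returns how many succeeded. */
    static int recordsReadFromEmptyInput(int n) {
        try {
            DataInputStream in =
                    new DataInputStream(new ByteArrayInputStream(new byte[0]));
            int count = 0;
            for (int i = 0; i < n; i++) {
                if (deserialize(in) != null) {
                    count++;
                }
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(bytesWritten(1_000_000));              // prints 0
        System.out.println(recordsReadFromEmptyInput(1_000_000)); // prints 1000000
    }
}
```

A reader that loops "until the stream is exhausted" has no way to tell how many
such records were written, which is the scenario we were worried about.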
I was surprised when @mjsax said that a job that shuffled Tuple0 records
(which means de/serialization) worked. I verified that, and my guess is that
the network stack writes the number of records into its network buffer and
stops deserializing after all records are deserialized (not when the byte
stream reaches EOF). Hence, network serialization works even if no bytes are
shipped. @uce or @StephanEwen might confirm this.
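The guess above can be illustrated with a small sketch of count-prefixed
framing. This is plain Java and only models the hypothesis; it is not taken
from Flink's network stack, and `Empty`, `writeBatch`, and `readBatch` are
made-up names. The writer puts the record count in front of the (empty)
payload, and the reader stops after that many records instead of waiting for
EOF.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class CountPrefixedFraming {

    /** Hypothetical stand-in for a field-less record such as Tuple0. */
    static final class Empty {
        static final Empty INSTANCE = new Empty();
    }

    /** Writes a 4-byte record count followed by n zero-byte records. */
    static byte[] writeBatch(int n) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeInt(n); // assumed header: number of records in this buffer
            for (int i = 0; i < n; i++) {
                // a field-less record contributes no payload bytes
            }
            out.flush();
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the count header, then deserializes exactly that many records. */
    static int readBatch(byte[] bytes) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
            int expected = in.readInt();
            int read = 0;
            for (int i = 0; i < expected; i++) {
                Empty record = Empty.INSTANCE; // "deserialize": consumes no bytes
                if (record != null) {
                    read++;
                }
            }
            return read; // terminates by count, never consults EOF
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] batch = writeBatch(5);
        System.out.println(batch.length);     // prints 4 (header only)
        System.out.println(readBatch(batch)); // prints 5
    }
}
```

If the real network stack frames records differently, the same conclusion
would only hold as long as termination does not depend on consuming payload
bytes.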
Another reason to de/serialize data is for sorting, grouping, joining, or
crossing. Most transformations that require a key (there are some
inconsistencies w.r.t. key handling in the API) do not allow Tuple0 keys and
fail with a good error message when the program is constructed (not at
runtime). The API does not allow sorting, joining, or grouping a data set on a
Tuple0. I thought that using a cross transformation would make a program fail
during execution, because it does not require a key and should serialize data;
however, it worked.
The system seems to be more robust than I expected. Nonetheless, I am still
a bit skeptical about this change and would like to learn why network transfer
and crossing worked so well.