GitHub user fhueske commented on the pull request:

    https://github.com/apache/flink/pull/983#issuecomment-127764809
  
    I tried to write a DataSet program that processes Tuple0 records and fails 
during execution. However, this is surprisingly hard. In fact, I didn't manage 
to break the system.
    
    We argued that Tuple0 is unsafe because its de/serialization does not 
read or write anything, so the byte stream is never advanced. Hence you 
could "read" a million Tuple0 objects without advancing the stream. This is 
only relevant for DataSets that consist entirely of types with this 
behavior, because as soon as there is another type with proper de/serialization 
the stream is advanced. 
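
    To make the concern concrete, here is a minimal sketch in plain Java. It 
is not Flink code: `EmptyRecord`, `serialize`, `deserialize`, and the two 
helper methods are hypothetical names that only mimic a serializer which 
neither writes nor reads bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class ZeroByteSerializerDemo {

    // Hypothetical stand-in for a record without fields (not Flink's Tuple0).
    static final class EmptyRecord {
        static final EmptyRecord INSTANCE = new EmptyRecord();
        private EmptyRecord() {}
    }

    // Assumption: serializing a field-less record writes nothing.
    static void serialize(EmptyRecord record, DataOutputStream out) throws IOException {
        // intentionally empty
    }

    // Assumption: deserializing a field-less record reads nothing.
    static EmptyRecord deserialize(DataInputStream in) throws IOException {
        return EmptyRecord.INSTANCE;
    }

    // Serializes 'count' records and returns the number of bytes produced.
    static int bytesWritten(int count) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            for (int i = 0; i < count; i++) {
                serialize(EmptyRecord.INSTANCE, out);
            }
            out.flush();
            return buffer.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // "Reads" 'count' records and returns how many bytes are still unread.
    static int bytesLeftAfterReading(byte[] data, int count) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            for (int i = 0; i < count; i++) {
                deserialize(in);
            }
            return in.available();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // A million records produce no bytes at all.
        System.out.println(bytesWritten(1_000_000));
        // A million "reads" leave all 8 input bytes untouched.
        System.out.println(bytesLeftAfterReading(new byte[8], 1_000_000));
    }
}
```

    In this sketch the stream position is no indication of how many records 
were processed, which is why a reader that relies on the stream position or 
on EOF to find record boundaries would be in trouble.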
    
    I was surprised when @mjsax said that a job that shuffled Tuple0 records 
(which involves de/serialization) worked. I verified that, and my guess is 
that the network stack writes the number of records into its network buffer and 
stops deserializing once all records have been deserialized (not when the byte 
stream reaches EOF). Hence, network serialization works even if no bytes are 
shipped. @uce or @StephanEwen might confirm this.
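
    The guess above can be illustrated with a small sketch of count-prefixed 
framing. This is only an illustration of the idea, not how Flink's network 
stack is actually implemented; `writeBatch` and `readBatch` are hypothetical 
helpers, and the record itself is assumed to occupy zero bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class CountPrefixedFramingDemo {

    // Writes a batch as: [int recordCount][record bytes ...].
    // Assumption: every record serializes to zero bytes, so only the count is written.
    static byte[] writeBatch(int recordCount) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeInt(recordCount);
            for (int i = 0; i < recordCount; i++) {
                // a zero-byte record contributes nothing to the buffer
            }
            out.flush();
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the count first, then deserializes exactly that many records.
    // The loop is bounded by the count, not by EOF, so it terminates
    // even though no record consumes any bytes.
    static int readBatch(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            int expected = in.readInt();
            int recordsRead = 0;
            for (int i = 0; i < expected; i++) {
                // a zero-byte record consumes nothing from the stream
                recordsRead++;
            }
            return recordsRead;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] batch = writeBatch(1_000_000);
        System.out.println(batch.length);     // only the 4-byte count is shipped
        System.out.println(readBatch(batch)); // all records are still recovered
    }
}
```

    A reader that instead looped until EOF would have no way to tell how many 
zero-byte records were sent, which is the failure mode we were worried about.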
    
    Another reason to de/serialize data is for sorting, grouping, joining, or 
crossing. Most transformations that require a key (there are some 
inconsistencies in key handling in the API) do not allow Tuple0 keys and fail 
with a good error message when the program is constructed (not at runtime). The 
API does not allow sorting, joining, or grouping a data set on a Tuple0. I 
expected that a cross transformation would make a program fail during 
execution, because it does not require a key and should serialize data; 
however, it worked. 
    
    The system seems to be more robust than I expected. Nonetheless, I am still 
a bit skeptical about this change and would like to learn why network transfer 
and crossing worked so well.

