[ 
https://issues.apache.org/jira/browse/FLINK-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chesnay Schepler reassigned FLINK-2501:
---------------------------------------

    Assignee: Chesnay Schepler

> [py] Remove the need to specify types for transformations
> ---------------------------------------------------------
>
>                 Key: FLINK-2501
>                 URL: https://issues.apache.org/jira/browse/FLINK-2501
>             Project: Flink
>          Issue Type: Improvement
>          Components: Python API
>            Reporter: Chesnay Schepler
>            Assignee: Chesnay Schepler
>
> Currently, users of the Python API have to provide type arguments when using 
> a UDF, like so:
> {code}
> d1.map(Mapper(), (INT, STRING))
> {code}
> Instead, it would be really convenient to be able to do this:
> {code}
> d1.map(Mapper())
> {code}
> The motivation is convenience; specifying types explicitly is also not very 
> Pythonic.
> Before I go into possible solutions, let me summarize how these type 
> arguments are currently used and how types are handled in general:
> The type argument passed is actually an object of the type it represents: 
> INT is a constant int value and STRING is a constant string value. You 
> could just as well write the following, and it would still work:
> {code}
> d1.map(Mapper(), (1, "ImNotATypInfo"))
> {code}
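
To illustrate why both calls behave the same, here is a minimal sketch. The values assigned to INT and STRING are hypothetical (the description only says that INT is an int and STRING is a string), and `describe` is an illustrative helper, not part of the Python API.

```python
# Hypothetical constant values; only their classes matter for this sketch.
INT = 0
STRING = ""


def describe(type_arg):
    """Return the class name of each element of a type argument."""
    return tuple(type(value).__name__ for value in type_arg)


# Both type arguments carry the same class information.
print(describe((INT, STRING)))
print(describe((1, "ImNotATypInfo")))
```

Any pair consisting of an int and a string yields the same class information, which is all the type argument conveys.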
> This object is transmitted to the Java side during plan binding (where it 
> becomes an actual Tuple2<Integer, String>) and is then passed to the type 
> extractor. The resulting TypeInformation is saved in the Java counterpart of 
> the UDF; all of these counterparts implement the ResultTypeQueryable 
> interface. 
> The TypeInformation object is only used by the Java API; Python never touches 
> it. Instead, at runtime, the serializers used between Python and Java check 
> the classes of the values passed and are thus generated dynamically.
> This means that if a UDF does not pass the type it claims to pass, the 
> Python API won't complain, but the underlying Java API will when its 
> serializers fail.
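
The runtime behavior described above can be sketched roughly as follows. This is an illustration only: the type tags and byte layout are assumptions made for the example, not the actual wire format used between Python and Java.

```python
import struct

# Hypothetical type tags; not the real protocol.
TYPE_INT, TYPE_STRING, TYPE_TUPLE = 1, 2, 3


def serialize(value):
    """Choose an encoding from the runtime class of `value`."""
    if isinstance(value, bool):
        # bool is a subclass of int; this sketch does not handle it.
        raise TypeError("bool is not handled in this sketch")
    if isinstance(value, int):
        # Tag byte followed by a big-endian 32-bit integer.
        return struct.pack(">bi", TYPE_INT, value)
    if isinstance(value, str):
        data = value.encode("utf-8")
        # Tag byte, 32-bit length, then the UTF-8 bytes.
        return struct.pack(">bi", TYPE_STRING, len(data)) + data
    if isinstance(value, tuple):
        body = b"".join(serialize(field) for field in value)
        # Tag byte, one-byte arity, then each field in order.
        return struct.pack(">bb", TYPE_TUPLE, len(value)) + body
    raise TypeError("unsupported type: %r" % type(value))


# A UDF declared as (INT, STRING) that actually emits (str, int) is
# serialized without complaint; nothing here checks the declared type.
declared_ok = serialize((1, "a"))
declared_wrong = serialize(("a", 1))
```

Because the encoding is chosen per value, a mismatch with the declared type would only surface later, on the side that expects the declared layout.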
> Now let's talk solutions.
> In discussions on the mailing list, essentially two proposals were made:
> # Add a way to disable/circumvent type checks during the plan phase in the 
> Java API and generate serializers dynamically.
> # Always keep objects in serialized form on the Java side, stored either in a 
> single byte array or in a Tuple2 containing a key/value pair.
> These proposals differ widely in the changes they require:
> # "How can we change the Java API to support this?"
> This proposal would hardly change the way the Python API works, or even touch 
> the related source code. It mostly deals with the Java API. Since I'm not too 
> familiar with the plan processing life cycle on the Java side, I can't assess 
> which classes would have to be changed.
> # "How can we make this work within the limits of the Java API?"
> This proposal is the exact opposite: it changes nothing in the Java API. 
> Instead, the following issues would have to be solved:
> * Alter the plan to extract keys before keyed operations, while hiding these 
> keys from the UDF. This is exactly how KeySelectors (will) work, and as such 
> is generally solved. In fact, this solution would make a few things easier 
> with regard to KeySelectors.
> * Rework all operations that currently rely on Java API functions that need 
> deserialized data, for example Projections or the upcoming Aggregations. 
> This generally means implementing them in Python, or with special Java UDFs 
> (they could de-/serialize data within the UDF call, or work on serialized 
> data).
> * Change the (de)serializers accordingly.
> * Implement a reliable sorting mechanism on the Python side that does not 
> hold all data in memory.
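
The key extraction step from the list above could look roughly like the following sketch. It assumes `pickle` as a stand-in for the API's own serializers, and the function names are hypothetical.

```python
import pickle


def to_key_value(record, key_fields):
    """Turn a record into a (key_bytes, value_bytes) pair.

    The key is extracted before the keyed operation, so the Java side
    would only need to handle a pair of byte arrays.
    """
    key = tuple(record[i] for i in key_fields)
    return (pickle.dumps(key), pickle.dumps(record))


def call_udf_hiding_key(pair, udf):
    """Deserialize only the value and pass it to the UDF; the key stays hidden."""
    _key_bytes, value_bytes = pair
    return udf(pickle.loads(value_bytes))


pair = to_key_value((1, "a"), key_fields=[0])
result = call_udf_hiding_key(pair, lambda record: record[1].upper())
```

Note that comparing pickled keys byte by byte does not preserve the ordering of the original values, so a real implementation would need an order-preserving key encoding if the Java side is to sort on it.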
> Personally, I prefer the second option, as it
> # does not modify the Java API; it works within its well-tested limits
> # requires plan changes similar to issues that are already being worked on 
> (KeySelectors)
> # needs a sorting implementation that was necessary anyway (for chained 
> reducers)
> # keeps data in serialized form, which was already being considered for 
> performance reasons
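
One common way to sort without holding all data in memory is an external merge sort. The sketch below is a generic illustration of that technique, not the implementation planned for the Python API; `external_sort`, its parameters, and the use of `pickle` and temporary files are assumptions.

```python
import heapq
import pickle
import tempfile


def _write_run(sorted_chunk):
    """Spill one sorted chunk to a temporary file and rewind it."""
    run_file = tempfile.TemporaryFile()
    for item in sorted_chunk:
        pickle.dump(item, run_file)
    run_file.seek(0)
    return run_file


def _read_run(run_file):
    """Lazily yield the items of one spilled run, then close the file."""
    try:
        while True:
            yield pickle.load(run_file)
    except EOFError:
        pass
    finally:
        run_file.close()


def external_sort(records, key=None, chunk_size=10000):
    """Sort an iterable while keeping at most `chunk_size` records in memory."""
    runs, chunk = [], []
    for record in records:
        chunk.append(record)
        if len(chunk) >= chunk_size:
            runs.append(_write_run(sorted(chunk, key=key)))
            chunk = []
    if chunk:
        runs.append(_write_run(sorted(chunk, key=key)))
    # Merge the sorted runs lazily; only one item per run is held at a time.
    return heapq.merge(*(_read_run(f) for f in runs), key=key)
```

For example, `list(external_sort(data, key=lambda r: r[0], chunk_size=1000))` sorts by the first field while spilling every 1000 records to disk.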
> While the first option could work, and would most likely require less work, I 
> feel like many of the things required for option 2 will be implemented 
> eventually anyway.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
