[ https://issues.apache.org/jira/browse/FLINK-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14681911#comment-14681911 ]
Chesnay Schepler commented on FLINK-2501:
-----------------------------------------

Yes. Modified (2) would mean that you have a Tuple2<TupleX<byte[],...>, byte[]>.

> [py] Remove the need to specify types for transformations
> ---------------------------------------------------------
>
>                 Key: FLINK-2501
>                 URL: https://issues.apache.org/jira/browse/FLINK-2501
>             Project: Flink
>          Issue Type: Improvement
>          Components: Python API
>            Reporter: Chesnay Schepler
>
> Currently, users of the Python API have to provide type arguments when using
> a UDF, like so:
> {code}
> d1.map(Mapper(), (INT, STRING))
> {code}
> Instead, it would be really convenient to be able to do this:
> {code}
> d1.map(Mapper())
> {code}
> The intention behind this issue is convenience, and it's also not really
> pythonic to specify types.
> Before I go into possible solutions, let me summarize the way these type
> arguments are currently used, and in general how types are handled:
> The type argument passed is actually an object of the type it represents:
> INT is a constant int value, and STRING is a constant string value. You
> could just as well write the following and it would still work.
> {code}
> d1.map(Mapper(), (1, "ImNotATypInfo"))
> {code}
> This object is transmitted to the Java side during the plan binding (and is
> now an actual Tuple2<Integer, String>), then passed to the type extractor,
> and the resulting TypeInformation is saved in the Java counterpart of the UDF;
> these counterparts all implement the ResultTypeQueryable interface.
> The TypeInformation object is only used by the Java API; Python never touches
> it. Instead, at runtime, the serializers used between Python and Java check
> the classes of the values passed and are thus generated dynamically.
> This means that, if a UDF does not pass the type it claims to pass, the
> Python API won't complain, but the underlying Java API will when its
> serializers fail.
> Now let's talk solutions.
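To make the type handling described above concrete, here is a minimal sketch in plain Python. It is not the Flink Python API: `INT` and `STRING` are hypothetical stand-ins for the API constants, and `same_shape`, `serialize`, and the one-byte tags are invented for illustration. It only shows why any sample value of the right class works as a type argument, and why a mismatch goes unnoticed when serializers are picked from the runtime class.

```python
import struct

# Hypothetical stand-ins for the API's type constants: each one is simply a
# sample value of the type it names.
INT = 1
STRING = "type"


def same_shape(declared, actual):
    """Compare two values by class only, recursing into tuples."""
    if isinstance(declared, tuple) and isinstance(actual, tuple):
        return len(declared) == len(actual) and all(
            same_shape(d, a) for d, a in zip(declared, actual))
    return type(declared) is type(actual)


def serialize(value):
    """Choose an encoding from the runtime class of the value.

    The one-byte tags and the layout are made up for this sketch.
    """
    if type(value) is int:
        return b"I" + struct.pack(">i", value)
    if type(value) is str:
        data = value.encode("utf-8")
        return b"S" + struct.pack(">i", len(data)) + data
    if type(value) is tuple:
        return b"T" + struct.pack(">B", len(value)) + b"".join(
            serialize(field) for field in value)
    raise TypeError("no serializer for %s" % type(value).__name__)


# Any int/str pair declares the same type as (INT, STRING).
print(same_shape((INT, STRING), (1, "ImNotATypInfo")))

# A UDF that claims (INT, STRING) but emits (STRING, INT): serialization on
# this side still succeeds, and only a check against the declaration fails.
emitted = ("oops", 2)
print(len(serialize(emitted)) > 0, same_shape((INT, STRING), emitted))
```

In this sketch the class-based check plays the role that the Java-side serializers play in the description: the sender never compares against the declared type, so the mismatch only surfaces on the side that trusts the declaration.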
> In discussions on the mailing list, essentially two proposals were made:
> # Add a way to disable/circumvent type checks during the plan phase in the
> Java API and generate serializers dynamically.
> # Have objects always in serialized form on the Java side, stored in a single
> byte array or a Tuple2 containing a key/value pair.
> These proposals vary wildly in the changes they require to the system:
> # "How can we change the Java API to support this?"
> This proposal would hardly change the way the Python API works, or even touch
> the related source code. It mostly deals with the Java API. Since I'm not too
> familiar with the plan processing life-cycle on the Java side, I can't assess
> which classes would have to be changed.
> # "How can we make this work within the limits of the Java API?"
> This proposal is the exact opposite: it changes nothing in the Java API.
> Instead, the following issues would have to be solved:
> * Alter the plan to extract keys before keyed operations, while hiding these
> keys from the UDF. This is exactly how KeySelectors (will) work, and as such
> is generally solved. In fact, this solution would make a few things easier
> with regard to KeySelectors.
> * Rework all operations that currently rely on Java API functions that need
> deserialized data, for example Projections or the upcoming Aggregations.
> This generally means implementing them in Python, or with special Java UDFs
> (they could de-/serialize data within the UDF call, or work on serialized
> data).
> * Change the (de)serializers accordingly.
> * Implement a reliable sorting mechanism on the Python side that does not
> consume all available memory.
> Personally, I prefer the second option, because:
> # It does not modify the Java API; it works within its well-tested limits.
> # The plan changes are similar to issues that are already being worked on
> (KeySelectors).
> # A sorting implementation was necessary anyway (for chained reducers).
> # Having data in serialized form was already a performance-related
> consideration.
> While the first option could work, and would most likely require less work,
> I feel like many of the things required for option 2 will be implemented
> eventually anyway.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
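For reference, the second proposal from the issue description (serialized key/value pairs, with keys extracted before a keyed operation and hidden from the UDF) could be sketched roughly as follows. This is plain Python under stated assumptions, not Flink code: `to_key_value` and `grouped_reduce` are hypothetical names, JSON and pickle are arbitrary stand-ins for the real serializers, and the in-memory `sorted` call is only a placeholder for the memory-friendly sorting mechanism the issue asks for.

```python
import json
import pickle
from itertools import groupby


def to_key_value(record, key_selector):
    """Turn a record into a (key_bytes, value_bytes) pair.

    JSON gives simple keys a stable byte encoding; the value stays opaque.
    """
    key_bytes = json.dumps(key_selector(record)).encode("utf-8")
    return key_bytes, pickle.dumps(record)


def grouped_reduce(pairs, reduce_fn):
    """Group on the opaque key bytes and pass only the values to the UDF."""
    # Placeholder sort: everything is held in memory here.
    ordered = sorted(pairs, key=lambda pair: pair[0])
    for _, group in groupby(ordered, key=lambda pair: pair[0]):
        values = [pickle.loads(value_bytes) for _, value_bytes in group]
        result = values[0]
        for value in values[1:]:
            result = reduce_fn(result, value)
        yield result


records = [("a", 1), ("b", 2), ("a", 3)]
pairs = [to_key_value(record, lambda r: r[0]) for record in records]

# The reduce UDF sees whole records and never the extracted key bytes.
summed = list(grouped_reduce(pairs, lambda x, y: (x[0], x[1] + y[1])))
print(summed)
```

The layer between the two functions only ever handles byte strings, which is the point of the proposal: nothing in the middle needs to know the record's type.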