[ https://issues.apache.org/jira/browse/FLINK-3679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15224692#comment-15224692 ]
Jamie Grier commented on FLINK-3679: ------------------------------------ I'm not sure about the locking and operator chaining issues so I would say if that's unduly complicated because of this change maybe it's not worth it. However, a DeserializationSchema with more flatMap() like semantics would certainly the better API given that bad data issues are a reality. It also seems we could provide this without breaking existing code, but certainly it would add a bit more complexity to the API (having multiple variants for this). Anyway, I agree you can work around this issue my making a special "sentinel" value and dealing with all of this is in a chained flatMap() operator. I imagine that's exactly the approach that people are already using. > DeserializationSchema should handle zero or more outputs for every input > ------------------------------------------------------------------------ > > Key: FLINK-3679 > URL: https://issues.apache.org/jira/browse/FLINK-3679 > Project: Flink > Issue Type: Bug > Components: DataStream API > Reporter: Jamie Grier > > There are a couple of issues with the DeserializationSchema API that I think > should be improved. This request has come to me via an existing Flink user. > The main issue is simply that the API assumes that there is a one-to-one > mapping between input and outputs. In reality there are scenarios where one > input message (say from Kafka) might actually map to zero or more logical > elements in the pipeline. > Particularly important here is the case where you receive a message from a > source (such as Kafka) and say the raw bytes don't deserialize properly. > Right now the only recourse is to throw IOException and therefore fail the > job. > This is definitely not good since bad data is a reality and failing the job > is not the right option. If the job fails we'll just end up replaying the > bad data and the whole thing will start again. > Instead in this case it would be best if the user could just return the empty > set. > The other case is where one input message should logically be multiple output > messages. This case is probably less important since there are other ways to > do this but in general it might be good to make the > DeserializationSchema.deserialize() method return a collection rather than a > single element. > Maybe we need to support a DeserializationSchema variant that has semantics > more like that of FlatMap. -- This message was sent by Atlassian JIRA (v6.3.4#6332)