[
https://issues.apache.org/jira/browse/FLINK-11818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16787763#comment-16787763
]
vinoyang commented on FLINK-11818:
----------------------------------
Hi [~hequn8128] , in fact my idea is not much different from Spark's current
implementation.
1) We can provide multiple overloaded methods named pipe on the DataSet
object, e.g. {{pipe(String cmd)}} / {{pipe(String cmd, Map<String, String> env)}} / ...
Flink feeds the DataSet's records to the external program and reads the
program's output back as a new DataSet. [1] [2]
2) I think the semantics would be similar to Spark's.
[1]:
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala]
[2]:
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PipedRDD.scala]
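To make the intended semantics concrete: each input record is written to the external command's stdin, and each line of the command's stdout becomes one output record, which is what Spark's PipedRDD does per partition. A minimal, Flink-independent sketch of that core loop (the class and method names here are illustrative only, not a proposed Flink API; {{sort}} is assumed to be available as on POSIX systems):

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class PipeSketch {

    // Feed input records to an external command's stdin and collect its
    // stdout lines as output records. Inside Flink this logic would run
    // per partition, e.g. inside a mapPartition function.
    public static List<String> pipe(List<String> input, List<String> command,
                                    Map<String, String> env) throws Exception {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.environment().putAll(env);
        Process proc = pb.start();

        // Writer thread: push the partition's records to the child's stdin,
        // then close it so the child sees end-of-input.
        Thread writer = new Thread(() -> {
            try (BufferedWriter out = new BufferedWriter(
                    new OutputStreamWriter(proc.getOutputStream()))) {
                for (String record : input) {
                    out.write(record);
                    out.newLine();
                }
            } catch (IOException ignored) {
                // Child may exit early; nothing more to write in that case.
            }
        });
        writer.start();

        // Reader: each stdout line of the child becomes one output record.
        List<String> result = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(proc.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                result.add(line);
            }
        }
        writer.join();
        proc.waitFor();
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Pipe three records through the external "sort" command.
        List<String> out = pipe(Arrays.asList("b", "a", "c"),
                                Arrays.asList("sort"),
                                Collections.emptyMap());
        System.out.println(out);
    }
}
```

A real DataSet.pipe would additionally need to handle serialization of non-String records, the child's stderr, and a non-zero exit code, but the record-in/line-out contract above is the Spark-compatible part.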
What do you think? cc [~fhueske] [~till.rohrmann]
> Provide pipe transformation function for DataSet API
> ----------------------------------------------------
>
> Key: FLINK-11818
> URL: https://issues.apache.org/jira/browse/FLINK-11818
> Project: Flink
> Issue Type: Improvement
> Components: API / DataSet
> Reporter: vinoyang
> Assignee: vinoyang
> Priority: Major
>
> We have some business requirements that require data processed by Flink to
> interact with external programs (such as Python/Perl/shell scripts). The
> existing DataSet API has no such function; the behavior can be implemented
> with a map function, but that is not concise. It would be helpful if we
> could provide a pipe[1] function like Spark's.
> [1]:
> https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)