[ https://issues.apache.org/jira/browse/FLINK-11818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16787763#comment-16787763 ]

vinoyang commented on FLINK-11818:
----------------------------------

Hi [~hequn8128], in fact my idea is not much different from Spark's current
implementation.

1) We can provide multiple overloaded methods named {{pipe}} on the DataSet
object, e.g. {{pipe(String cmd)}} / {{pipe(String cmd, Map<String, String> env)}}...
Flink feeds the dataset's records to the external program and reads the
program's output back as a new DataSet. [1] [2]

2) I think its semantics are similar to Spark's.
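To make the intended semantics concrete: a pipe transformation writes each input record as a line to the external process's stdin and reads the process's stdout lines back as output records (this is essentially what Spark's PipedRDD does per partition). Below is a minimal plain-Java sketch of that core loop; the {{pipe}} helper name and signature here are my own illustration, not the proposed Flink API, and the example assumes a POSIX {{cat}} command is available:

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.*;
import java.util.concurrent.*;

public class PipeSketch {

    // Hypothetical helper: feeds each input record as one line to the
    // external command's stdin and returns the command's stdout lines
    // as the output records.
    static List<String> pipe(List<String> records, List<String> command,
                             Map<String, String> env) throws Exception {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.environment().putAll(env);
        pb.redirectErrorStream(true);
        Process proc = pb.start();

        // Write input on a separate thread so we never deadlock when the
        // child's output pipe buffer fills up while we are still writing.
        ExecutorService writer = Executors.newSingleThreadExecutor();
        writer.submit(() -> {
            try (BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                    proc.getOutputStream(), StandardCharsets.UTF_8))) {
                for (String rec : records) {
                    out.write(rec);
                    out.newLine();
                }
            } catch (IOException ignored) {
                // Child may exit early without consuming all input.
            }
        });

        // Each stdout line of the child becomes one output record.
        List<String> result = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                proc.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                result.add(line);
            }
        }
        proc.waitFor();
        writer.shutdown();
        return result;
    }

    public static void main(String[] args) throws Exception {
        // "cat" simply echoes stdin to stdout, so the records pass through.
        List<String> out = pipe(
                Arrays.asList("hello", "world"),
                Arrays.asList("cat"),
                Collections.emptyMap());
        System.out.println(out);
    }
}
```

In a real DataSet integration this loop would presumably run inside a {{mapPartition}}-style operator, one external process per parallel subtask, mirroring Spark's per-partition process model.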

 

[1]: 
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala]

[2]: 
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PipedRDD.scala]

 

What do you think? cc [~fhueske] [~till.rohrmann]

 

> Provide pipe transformation function for DataSet API
> ----------------------------------------------------
>
>                 Key: FLINK-11818
>                 URL: https://issues.apache.org/jira/browse/FLINK-11818
>             Project: Flink
>          Issue Type: Improvement
>          Components: API / DataSet
>            Reporter: vinoyang
>            Assignee: vinoyang
>            Priority: Major
>
> We have some business requirements that require the data handled by Flink to
> interact with external programs (such as Python/Perl/shell scripts). The
> existing DataSet API has no such function; it can be emulated with a map
> function, but that is not concise. It would be helpful if we could provide a
> pipe[1] function like Spark's.
> [1]: 
> https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
