vinothchandar commented on pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#issuecomment-692257956


   That sounds promising. We cannot easily fix `T,R,O` though right at the 
class level. it will be different in each setting, right?  
   
   We can use generic methods though I think and what I had in mind was 
something little different
   if the method signature can be the following 
   
   ```
   public abstract <I,O> List<O> parallelDo(List<I> data, Function<I, O> func, 
int parallelism); 
   ```
   
   and for the Spark the implementation 
   
   ```
   public <I,O> List<O> parallelDo(List<I> data, Function<I, O> func, int 
parallelism) {
      return jsc.parallelize(data, parallelism).map(func).collectAsList();
   }
   ```
   
   and the invocation 
   
   ```
   engineContext.parallelDo(Arrays.asList(1,2,3), (a) -> {
         return 0;
       }, 3);
   
   ```
   
   Can you a small Flink implementation as well and confirm this approach will 
work for Flink as well. Thats another key thing to answer, before we finalize 
the approach. 
   
   
   Note that 
   
   a) some of the code that uses `rdd.mapToPair()` etc before collecting, have 
to be rewritten using Java Streams. i.e 
   
   `engineContext.parallelDo().map()` instead of 
`jsc.parallelize().map().mapToPair()` currently.  
   
   b) Also we should limit the use of parallelDo() to only cases where there 
are no grouping/aggregations involved. for e.g if we are doing 
`jsc.parallelize(files, ..).map(record).reduceByKey(..)`, collecting without 
reducing will lead to OOMs. We can preserve what you are doing currently, for 
such cases.  
   
   But the parallelDo should help up reduce the amount of code broken up (and 
thus reduce the effort for the flink engine). 
   
   
   @wangxianghu I am happy to shepherd this PR through from this point as well. 
lmk 
   
   cc @yanghua @leesf 
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to