wangxianghu commented on pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#issuecomment-692074997
> I can help address the remaining feedback. I will push a small diff
today/tmrw.
> Overall, looks like a reasonable start.
>
> The major feedback I still have is the following
>
> > would a parallelDo(func, parallelism) method in HoodieEngineContext help
us avoid a lot of base/child class duplication of logic like this?
>
> Lot of usages are like `jsc.parallelize(list, parallelism).map(func)` ,
which all require a base-child class now. I am wondering if its easier to take
those usages alone and implement as `engineContext.parallelDo(list, func,
parallelism)`. This can be the lowest common denominator across Spark/Flink
etc. We can avoid splitting a good chunk of classes if we do this IMO. If this
is interesting, and we agree, I can try to quantify.
Hi @vinothchandar, how about this demo?
`public abstract class AbstractHoodieEngineContext<T, R, O> {
public abstract O parallelDo(List<T> datas, Function<List<T>, R> function,
int parallesiom);
}`
`public class HoodieSparkEngineContext extends
AbstractHoodieEngineContext<Integer,List<String>, JavaRDD<String>> {
private static JavaSparkContext jsc;
static {
SparkConf conf = new SparkConf()
.setMaster("local[4]")
.set("spark.driver.host","localhost")
.setAppName("SparkFunction");
jsc = new JavaSparkContext(conf);
}
@Override
public JavaRDD<String> parallelDo(List<Integer> datas,
Function<List<Integer>, List<String>> function, int parallesiom) {
return jsc.parallelize(function.apply(datas), parallesiom);
}
}`
`public class App {
public static void main(String[] args) {
// define function
Function<List<Integer>, List<String>> function = integers ->
integers.stream().map(x -> x + " ####").collect(Collectors.toList());
// inputs
List<Integer> inputs = Arrays.asList(1, 2, 3, 4, 5, 6, 7);
// context
AbstractHoodieEngineContext<Integer, List<String>, JavaRDD<String>>
sparkEngineContext = new HoodieSparkEngineContext();
// execute
JavaRDD<String> stringJavaRdd = sparkEngineContext.parallelDo(inputs,
function, 3);
// print result
stringJavaRdd.foreach(x -> System.out.println(x));
}
}`
The result are as follows:
`3 ####
4 ####
1 ####
2 ####
5 ####
6 ####
7 ####`
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]