Hi,
Link: https://github.com/apache/incubator-hudi/issues/639
Sorry, I failed to open https://lists.apache.org/[email protected].
I have some follow-up questions about issue 639:
> So, the sequence of events is: we write parquet files and then upon
> successful writing of all attempted parquet files, we actually make the
> commit as completed. (i.e not inflight anymore). So this is normal. This is
> done to prevent queries from reading partially written parquet files..
>
Does that mean:
1. Some inflight commits may never reach a completed commit?
2. When an inflight commit and the parquet files it generated still
exist, the global dedup will not dedup against those files?
3. When an inflight commit and the parquet files it generated still
exist, getting the correct query result depends on the read config (I
mean mapreduce.input.pathFilter.class
in Spark SQL)?
4. Is there any way we can apply
   spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class",
     classOf[com.uber.hoodie.hadoop.HoodieROTablePathFilter],
     classOf[org.apache.hadoop.fs.PathFilter]);
in the Spark Thrift Server when starting it?
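
One idea I have (I am not sure it is correct) is to inject the Hadoop
property at launch time through Spark's spark.hadoop.* config prefix,
roughly like the sketch below. The bundle jar path is only a placeholder,
and whether this actually reaches the Thrift Server's Hadoop configuration
is exactly what I am unsure about:

  ./sbin/start-thriftserver.sh \
    --jars /path/to/hoodie-hadoop-mr-bundle.jar \
    --conf spark.hadoop.mapreduce.input.pathFilter.class=com.uber.hoodie.hadoop.HoodieROTablePathFilter

Would that be the recommended way, or is there another supported approach?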
Best,
--
Jun Zhu
Sr. Engineer I, Data