Zhujun-Vungle edited a comment on issue #639: About global dedup, find some commit who keep inflight and still generate parquet file and fail dedup
URL: https://github.com/apache/incubator-hudi/issues/639#issuecomment-484339016

Hi, sorry, I failed to open https://lists.apache.org/[email protected]. I have some follow-up questions about this issue:

> So, the sequence of events is . We write parquet files and then upon successful writing of all attempted parquet files, we actually make the commit as completed. (i.e not inflight anymore). So this is normal. This is done to prevent queries from reading partially written parquet files..

Does that mean:

1. Some inflight commits may never reach the completed state?
2. When an inflight commit and the parquet files it generated still exist, global dedup will not dedup against such files?
3. When an inflight commit and the parquet files it generated still exist, the correctness of query results is decided by the read config (I mean `mapreduce.input.pathFilter.class` in Spark SQL)?
4. Is there any way to apply `spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class", classOf[com.uber.hoodie.hadoop.HoodieROTablePathFilter], classOf[org.apache.hadoop.fs.PathFilter])` in the Spark Thrift Server when starting it?

Best,
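Regarding question 4, one possible approach (a sketch, not a confirmed Hudi recipe): Spark forwards any property prefixed with `spark.hadoop.` into the Hadoop `Configuration`, so the path filter from the question could in principle be set once at Thrift Server startup. The jar path below is a hypothetical placeholder; the class names come from the question itself.

```shell
# Assumption: the "spark.hadoop." prefix forwards the property into the
# Hadoop Configuration used by the Thrift Server's sessions, and the Hudi
# bundle jar containing HoodieROTablePathFilter is available locally.
./sbin/start-thriftserver.sh \
  --jars /path/to/hoodie-hadoop-mr-bundle.jar \
  --conf spark.hadoop.mapreduce.input.pathFilter.class=com.uber.hoodie.hadoop.HoodieROTablePathFilter
```

Whether this fully replaces the per-session `setClass(...)` call would need to be verified against the Hudi and Spark versions in use.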
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services
