Zhujun-Vungle edited a comment on issue #639: About global dedup, find some 
commit who keep inflight and still generate parquet file and fail dedup
URL: https://github.com/apache/incubator-hudi/issues/639#issuecomment-484339016
 
 
   Hi,
   Sorry, I failed to open https://lists.apache.org/[email protected].

   I have some follow-up questions about this issue:
   
   > So, the sequence of events is . We write parquet files and then upon 
successful writing of all attempted parquet files, we actually make the commit 
as completed. (i.e not inflight anymore). So this is normal. This is done to 
prevent queries from reading partially written parquet files..
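   
   For context, if I understand correctly, the commit lifecycle above shows up in the .hoodie timeline folder roughly like this (the instant time below is hypothetical):
   
   .hoodie/20190418102025.inflight   <- commit started; parquet files may be partially written
   .hoodie/20190418102025.commit     <- created only after all attempted parquet files are written
   
   Please correct me if this picture of the timeline files is wrong.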
   
   Does that mean:
   1. Some inflight commits may never complete?
   2. When an inflight commit and the parquet files it generated still exist, global dedup will not dedup against those files?
   3. When an inflight commit and the parquet files it generated still exist, the correctness of query results depends on the read configuration (I mean mapreduce.input.pathFilter.class in Spark SQL)?
   4. Is there any way to apply

   spark.sparkContext.hadoopConfiguration.setClass(
     "mapreduce.input.pathFilter.class",
     classOf[com.uber.hoodie.hadoop.HoodieROTablePathFilter],
     classOf[org.apache.hadoop.fs.PathFilter])

   in the Spark Thrift Server when starting it?
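   
   For question 4, one approach I imagine might work is to set the filter at launch time via Spark's spark.hadoop.* prefix, which copies the property into the Hadoop Configuration; this is only a sketch, and the jar path below is a hypothetical placeholder:
   
   start-thriftserver.sh \
     --jars /path/to/hoodie-hadoop-mr-bundle.jar \
     --conf spark.hadoop.mapreduce.input.pathFilter.class=com.uber.hoodie.hadoop.HoodieROTablePathFilter
   
   Would that be equivalent to calling setClass on the SparkContext's hadoopConfiguration?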
   
   Best,
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
