nandini57 edited a comment on issue #1582:
URL: https://github.com/apache/incubator-hudi/issues/1582#issuecomment-624789029


   Probably switching to the parquet format instead of hudi and doing a 
`spark.read.parquet(partitionPath).dropDuplicates()` filtered on commit_time = X 
is an option? The following works if I want to go back to commit X and get a view 
of the data. However, the same query against the hudi format doesn't give me the 
right view as of commit X.
   
    def audit(spark: SparkSession, partitionPath: String, tablePath: String, commitTime: String): Unit = {
      // Read the raw parquet files directly, bypassing Hudi's path filter
      val hoodieROViewDF = spark.read.option("inferSchema", true).parquet(tablePath + "/" + partitionPath)
      hoodieROViewDF.createOrReplaceTempView("hoodie_ro")
      // _hoodie_commit_time is a string column, so quote the literal
      spark.sql("select * from hoodie_ro where _hoodie_commit_time = '" + commitTime + "'").dropDuplicates().show()
    }
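   If staying on the hudi format is acceptable, an incremental read bounded by an end instant may give the view as of commit X without bypassing the path filter. A minimal sketch, assuming the hudi-spark datasource option names below (they have been renamed across releases, e.g. `hoodie.datasource.view.type` in 0.5.x vs `hoodie.datasource.query.type` later, so verify them against the Hudi version in use):

```scala
// Sketch: incremental query from the start of the timeline up to commit X.
// Option keys are assumptions; check them against your Hudi release.
val asOfDF = spark.read
  .format("org.apache.hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", "000")    // from the earliest commit
  .option("hoodie.datasource.read.end.instanttime", commitTime) // up to and including commit X
  .load(tablePath)
```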
   
   Did a little digging, and the following code in HoodieROTablePathFilter 
seems to take only the latest base file per file group, dropping the older 
files. The impact in my case: I get an incorrect view as of time X, because it 
reads the latest file, which has 2 records as of time X, one of which was 
upserted and got a new commit time. Is that understanding correct?
   
   How do I get around this? Can I use a custom path filter?
   
       HoodieTableMetaClient metaClient = new HoodieTableMetaClient(fs.getConf(), baseDir.toString());
       HoodieTableFileSystemView fsView = new HoodieTableFileSystemView(metaClient,
               metaClient.getActiveTimeline().getCommitsTimeline().filterCompletedInstants(),
               fs.listStatus(folder));
       // Only the latest base file per file group survives this call
       List<HoodieBaseFile> latestFiles = fsView.getLatestBaseFiles().collect(Collectors.toList());
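   If the goal is the set of base files as of commit X rather than the latest ones, the file system view also exposes a time-bounded variant. A hedged sketch (the method name `getLatestBaseFilesBeforeOrOn` comes from Hudi's `BaseFileOnlyView` interface; verify it exists with this signature in the version in use):

```scala
import java.util.stream.Collectors
import scala.collection.JavaConverters._

// Instead of getLatestBaseFiles(), bound the view by the commit of interest:
// only base files whose commit time is <= commitTime should be returned.
val filesAsOfX = fsView
  .getLatestBaseFilesBeforeOrOn(partitionPath, commitTime)
  .collect(Collectors.toList())
  .asScala
```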


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

