nandini57 edited a comment on issue #1582: URL: https://github.com/apache/incubator-hudi/issues/1582#issuecomment-624789029
Probably switching to plain parquet format instead of Hudi and doing a `spark.read.parquet(partitionPath).dropDuplicates()` filtered on `_hoodie_commit_time = X` is an option? The following works if I want to go back to commit X and get a view of the data as of that commit. However, the same approach with the Hudi format does not give me the correct view as of commit X:

```scala
def audit(spark: SparkSession, partitionPath: String, tablePath: String, commitTime: String): Unit = {
  val hoodieROViewDF = spark.read.option("inferSchema", true).parquet(tablePath + "/" + partitionPath)
  hoodieROViewDF.createOrReplaceTempView("hoodie_ro")
  spark.sql("select * from hoodie_ro where _hoodie_commit_time = " + commitTime).dropDuplicates().show()
}
```

I did a little digging, and the following code in `HoodieROTablePathFilter` seems to take only the latest base file per file group, dropping the older file versions. The impact in my case is that I get an incorrect view as of time X: the read only sees the latest file, which has 2 records as of time X, one of which was upserted later and carries a newer commit time. Is this understanding correct? How do I get around it? Can I use a custom path filter?

```java
HoodieTableMetaClient metaClient = new HoodieTableMetaClient(fs.getConf(), baseDir.toString());
HoodieTableFileSystemView fsView = new HoodieTableFileSystemView(metaClient,
    metaClient.getActiveTimeline().getCommitsTimeline().filterCompletedInstants(),
    fs.listStatus(folder));
List<HoodieBaseFile> latestFiles = fsView.getLatestBaseFiles().collect(Collectors.toList());
```
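One possible direction (a hedged sketch, not a confirmed fix for this issue): instead of filtering `_hoodie_commit_time` on the snapshot view, Hudi's incremental query can bound the read by instant time, which pulls the file versions written in that commit range rather than only the latest base files. The option keys below are assumptions based on the Hudi Spark datasource and have been renamed across Hudi versions; this also only works while the cleaner has not yet removed the older file versions.

```scala
// Sketch: point-in-time read via Hudi's incremental query view.
// Assumes `spark` is an active SparkSession, `tablePath` points at the Hudi
// table base path, and `commitTime` is the instant X we want to view as of.
// Option keys are from the Hudi Spark datasource and vary by Hudi version.
val asOfCommitX = spark.read
  .format("org.apache.hudi")
  .option("hoodie.datasource.query.type", "incremental")
  // "000" starts the range from the earliest retained commit
  .option("hoodie.datasource.read.begin.instanttime", "000")
  // upper-bound the range at commit X
  .option("hoodie.datasource.read.end.instanttime", commitTime)
  .load(tablePath)

asOfCommitX.createOrReplaceTempView("hoodie_as_of_x")
```

If this holds, it would avoid the `HoodieROTablePathFilter` behavior above, since the incremental relation selects files by commit range instead of latest-base-file-only.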