noobarcitect opened a new issue #2461: URL: https://github.com/apache/hudi/issues/2461
We are in the POC stage of adopting Apache Hudi in our existing AWS data lake and pipeline, and we are stuck on one issue:

1. We inserted a record into a Hudi table in COW (copy-on-write) mode, then ran an upsert that updated that same record.
2. The Hudi table is then crawled by an AWS Glue crawler.
3. When we query the table from Athena, we get all 3 records, but we want only the latest version of the record.
4. One reason we came across is that Glue treats the Hudi files as plain Parquet files and registers the input format as MapredParquetInputFormat rather than HoodieParquetInputFormat.

Q: Will Glue crawlers support identifying HoodieParquetInputFormat as the input format?
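To illustrate the suspected cause: in a COW table an upsert rewrites the affected base file as a new version, and the older version stays on storage until the cleaner removes it. A reader that scans every Parquet file in the partition therefore sees each version of the record, while a Hudi-aware input format exposes only the latest version per file group. The Python sketch below mimics that filtering under simplifying assumptions: base files are named `<fileId>_<writeToken>_<instantTime>.parquet`, the file names and the helpers `parse_base_file` and `latest_base_files` are hypothetical, and the check against the commit timeline that Hudi also performs is left out.

```python
def parse_base_file(name):
    """Split '<fileId>_<writeToken>_<instantTime>.parquet' into (file_id, instant_time).

    Assumes neither the file ID nor the write token contains an underscore.
    """
    stem = name.rsplit(".", 1)[0]
    file_id, _write_token, instant_time = stem.split("_")
    return file_id, instant_time


def latest_base_files(names):
    """Keep only the newest base file for each file group (file ID)."""
    latest = {}
    for name in names:
        file_id, instant = parse_base_file(name)
        # Instant times are fixed-width timestamps, so string order is time order.
        if file_id not in latest or instant > latest[file_id][0]:
            latest[file_id] = (instant, name)
    return sorted(name for _, name in latest.values())


# Hypothetical partition listing after one insert and one upsert of the same record.
files = [
    "abc-0_0-1-1_20210118100000.parquet",  # written by the insert
    "abc-0_0-2-2_20210118110000.parquet",  # rewritten by the upsert
]

# A plain Parquet reader would scan everything in `files`;
# a Hudi-aware reader would scan only `visible`.
visible = latest_base_files(files)
```

With this listing, `visible` contains only the file written by the upsert, which is the behaviour we expect from Athena once the table is registered with the Hudi input format.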
