[GitHub] [hudi] noobarcitect opened a new issue #2461: All records are present in athena query result on glue crawled Hudi tables

GitBox Tue, 19 Jan 2021 06:22:01 -0800


noobarcitect opened a new issue #2461:
URL: https://github.com/apache/hudi/issues/2461



   We are in the POC stage of adopting Apache Hudi in our existing AWS 
data lake and pipeline, and we are stuck on one issue. The steps to reproduce 
it are as follows: 
   1. We inserted a record into a Hudi copy-on-write (COW) table, and then 
ran an upsert that updated the record we had just inserted.
   2. This Hudi table is then crawled by an AWS Glue crawler.
   3. When we read the table from Athena, we get all 3 records. What we want 
is for the Athena query to return only the latest version of the record.
   4. One explanation we came across is that Glue treats the Hudi files as 
plain Parquet files and registers the input format as MapredParquetInputFormat 
rather than HoodieParquetInputFormat.
   
   Q: Will Glue crawlers add support for recognizing 
HoodieParquetInputFormat as the input format? 
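
   To illustrate point 4: on a COW table an upsert writes a new version of 
the base Parquet file and leaves the older version on disk until it is cleaned, 
so a reader that scans every Parquet file in the partition sees every version 
of the record. The sketch below is a simplified illustration and not Hudi's 
actual code. It assumes base files are named 
`<fileId>_<writeToken>_<instantTime>.parquet`; the file names and helper 
functions are hypothetical, and a real Hudi reader also consults the commit 
timeline under `.hoodie`, which this sketch ignores.

```python
def parse_base_file(name):
    """Return (file_id, instant_time) from '<fileId>_<writeToken>_<instantTime>.parquet'."""
    stem = name.rsplit(".", 1)[0]
    # Split from the right so a file id containing '_' would still parse.
    file_id, _write_token, instant_time = stem.rsplit("_", 2)
    return file_id, instant_time


def latest_base_files(names):
    """Keep only the newest base file for each file group (file id)."""
    latest = {}  # file_id -> (instant_time, file_name)
    for name in names:
        file_id, instant_time = parse_base_file(name)
        # Instant times are fixed-width timestamps, so string comparison orders them.
        if file_id not in latest or instant_time > latest[file_id][0]:
            latest[file_id] = (instant_time, name)
    return sorted(name for _, name in latest.values())


# Hypothetical partition listing: one file group written by an insert, then an upsert.
files = [
    "abc123-0_0-1-1_20210119060000.parquet",  # written by the initial insert
    "abc123-0_0-2-2_20210119061000.parquet",  # rewritten by the upsert
]

plain_parquet_view = sorted(files)        # what a plain Parquet reader scans
hudi_view = latest_base_files(files)      # what a Hudi-aware reader would keep

print(plain_parquet_view)
print(hudi_view)
```

   With a plain Parquet input format both files are scanned, so both the 
original and the updated record show up; filtering to the latest file per file 
group leaves only the updated one.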
   


