so-lazy commented on issue #2338: URL: https://github.com/apache/hudi/issues/2338#issuecomment-751363149
> @so-lazy: would you mind elaborating more on your use-case? Did you choose GLOBAL_BLOOM intentionally?
> And by this statement of yours, "i found much duplicate records," did you mean to insinuate that compaction hasn't happened and hence you found duplicates, or did you mean that your dataset has duplicates in general?
> Do you want to do dedup for your use-case in general?

This is my MySQL table:

| id | name | add_time |
| --- | --- | --- |
| 1 | "so-lazy" | 2020-12-26 |

Yeah, for my use case, I consume binlog delta data from Kafka. Those records have the primary key `id`, and I set `dt` based on the `add_time` column, in the format `yyyy-MM-dd`. What I want is: when a row with `id = 1` is first upserted by Hudi into table partition `2020-12-26`, then the next time `id = 1` arrives, that row should be updated in place.

But what I actually found is this: when I run `select * from table where id = 1` with Spark SQL against the Hive table synced by Hudi, I see many identical records. All the column values are the same, and they even come from the same parquet file. Yet when I read that parquet file directly with Spark, I find only one row with `id = 1`. Sir, do you understand what I mean? Thanks so much.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: [email protected]
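For reference, the upsert semantics being described above (a repeated record key should leave exactly one row, updated in place) can be sketched as below. This is a plain-Python illustration of the expected behavior, not Hudi's actual implementation; the `commit_time` field here is a hypothetical stand-in for ordering by Hudi's `_hoodie_commit_time` metadata column.

```python
# Illustrative sketch only: upsert-by-record-key semantics, where a later
# record with the same key replaces the earlier one instead of duplicating it.
# Field names (id, name, add_time) mirror the MySQL table in the comment;
# commit_time is a hypothetical stand-in for Hudi's _hoodie_commit_time.

def upsert(table, records):
    """Merge incoming records into `table`, keyed by 'id', keeping the latest commit."""
    for rec in records:
        existing = table.get(rec["id"])
        if existing is None or rec["commit_time"] > existing["commit_time"]:
            table[rec["id"]] = rec
    return table

table = {}
# First delivery of id=1 from the binlog stream.
upsert(table, [{"id": 1, "name": "so-lazy", "add_time": "2020-12-26", "commit_time": 1}])
# The same id arrives again later; it should update the row, not add a duplicate.
upsert(table, [{"id": 1, "name": "so-lazy (updated)", "add_time": "2020-12-26", "commit_time": 2}])

assert len(table) == 1           # id=1 appears exactly once after repeated upserts
assert table[1]["commit_time"] == 2  # and reflects the latest version
```

Seeing the same key more than once in a query, as described above, means something on the read path is returning multiple copies rather than the single merged row this sketch produces.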
