so-lazy commented on issue #2338: URL: https://github.com/apache/hudi/issues/2338#issuecomment-751363149
> @so-lazy: would you mind elaborating more on your use-case? Did you choose GLOBAL_BLOOM intentionally?
> And by this statement of yours, "i found much duplicate records," did you mean to insinuate that compaction hasn't happened and hence you found duplicates, or did you mean that your dataset has duplicates in general?
> Do you want to do dedup for your use-case in general?

This is my MySQL table:

| id | name | add_time |
| --- | --- | --- |
| 1 | "so-lazy" | 2020-12-26 |

Yeah, for my use case, I consume binlog delta data from Kafka. Those records have the primary key `id`, and I set `dt` based on the `add_time` column, in the format `yyyy-MM-dd`. What I want is: when a row with `id = 1` is first upserted by Hudi into table partition `2020-12-26`, then the next time `id = 1` arrives, that row should be updated in place.

But what I actually found is this: when I run `select * from table where id = 1` with Spark SQL against the Hive table synced by Hudi, I see many identical records. All the column values are the same, and they even come from the same parquet file. Yet when I read that parquet file directly with Spark, I find only one row with `id = 1`. Sir, do you understand what I mean? Thanks so much.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: [email protected]
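For reference, the upsert semantics being described above (a repeated record key should leave exactly one row, updated in place) can be sketched as below. This is a plain-Python illustration of the expected behavior, not Hudi's actual implementation; the `commit_time` field here is a hypothetical stand-in for ordering by Hudi's `_hoodie_commit_time` metadata column.

```python
# Illustrative sketch only: upsert-by-record-key semantics, where a later
# record with the same key replaces the earlier one instead of duplicating it.
# Field names (id, name, add_time) mirror the MySQL table in the comment;
# commit_time is a hypothetical stand-in for Hudi's _hoodie_commit_time.

def upsert(table, records):
    """Merge incoming records into `table`, keyed by 'id', keeping the latest commit."""
    for rec in records:
        existing = table.get(rec["id"])
        if existing is None or rec["commit_time"] > existing["commit_time"]:
            table[rec["id"]] = rec
    return table

table = {}
# First delivery of id=1 from the binlog stream.
upsert(table, [{"id": 1, "name": "so-lazy", "add_time": "2020-12-26", "commit_time": 1}])
# The same id arrives again later; it should update the row, not add a duplicate.
upsert(table, [{"id": 1, "name": "so-lazy (updated)", "add_time": "2020-12-26", "commit_time": 2}])

assert len(table) == 1           # id=1 appears exactly once after repeated upserts
assert table[1]["commit_time"] == 2  # and reflects the latest version
```

Seeing the same key more than once in a query, as described above, means something on the read path is returning multiple copies rather than the single merged row this sketch produces.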
