aokolnychyi commented on pull request #1947: URL: https://github.com/apache/iceberg/pull/1947#issuecomment-748944819
Is there enough consensus on making the cardinality check optional, both to match Hive and to avoid an extra inner join for merge-on-read? I think it should be enabled by default to prevent correctness problems.

I don't think we have agreed on how to implement the cardinality check. I had some thoughts in [this](https://github.com/apache/iceberg/pull/1947#issuecomment-747450897) comment. @dilipbiswal @rdblue @RussellSpitzer, what is your take on this? How do you see it being implemented?

@RussellSpitzer did mention a corner case where the accumulator approach consumes a lot of memory on the driver: if each executor has a substantially large set of unique files, and those sets are brought to the driver and merged into a single set, that basically means having many copies of the same entries. I am not sure we can overcome it, though.
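To make the driver-side memory concern with the accumulator approach concrete, here is a minimal plain-Python simulation. It does not use Spark and is not the implementation proposed in this PR. It assumes the accumulator value is a set of matched file paths reported per task, and the function name, variable names, and file paths are all hypothetical.

```python
# Plain-Python simulation of the corner case; no Spark involved.
# All names and paths below are hypothetical and chosen for illustration.

def simulate_driver_merge(per_task_files):
    """Merge per-task sets of file paths the way a set-valued
    accumulator would be merged on the driver."""
    merged = set()
    received = 0  # total entries shipped to the driver across all updates
    for files in per_task_files:
        received += len(files)
        merged |= files  # duplicates collapse only after they arrive
    return merged, received

# Four tasks that each touched the same 1000 files.
files = {f"data/file-{i}.parquet" for i in range(1000)}
per_task = [set(files) for _ in range(4)]

merged, received = simulate_driver_merge(per_task)
print(len(merged), received)  # 1000 unique paths, 4000 entries received
```

In this sketch the merged set stays at 1000 entries, but the driver still has to receive and deserialize every per-task copy, so the transferred volume grows with the number of tasks rather than with the number of unique files.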
