smaspe commented on issue #2765: URL: https://github.com/apache/iceberg/issues/2765#issuecomment-932382613
Hi @kbendick,

We've upgraded to Spark 3.1.2, but nothing has changed. The concern is that updating rows across multiple partitions of a large table is extremely slow. So far we've mitigated this by splitting queries into chunks that each target a smaller number of partitions, but that's far from perfect.

What we don't understand is: how can we tell Iceberg which (hidden) partitions to target, so that it doesn't need to scan the whole table?

What Canh tried (`ON date(T.published) IN (date '1937-01-01', ...)`) still doesn't work in Spark 3.1.2. We can use conditions like `ON T.published >= date '1937-01-01' and T.published < date '1937-01-02'`, but that isn't practical for dozens of partitions (and, if those partitions are sparse, we can't join them into a single larger range).

We suspect that it might have to do with https://issues.apache.org/jira/browse/SPARK-35245:

> This is because filtering side do not has selective predicate

However, I fail to see how we can make the predicate selective if it isn't at the moment.

Thanks in advance for any light you can shed on this!
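
To illustrate the range-based workaround for many sparse days, here is a minimal sketch of how such a condition could be generated instead of written by hand. The helper name `day_range_predicate` is hypothetical, and the column name `T.published` is taken from the examples above. It merges consecutive days into one half-open range and ORs the rest together. Whether Spark and Iceberg actually prune partitions for an OR of ranges in a `MERGE ... ON` clause is an assumption we have not verified.

```python
from datetime import date, timedelta


def day_range_predicate(column, days):
    """Build an OR of half-open day ranges on `column` (hypothetical helper).

    Consecutive days are merged into a single range so that the
    generated condition stays as short as possible.
    """
    ordered = sorted(set(days))
    if not ordered:
        raise ValueError("at least one day is required")

    one_day = timedelta(days=1)
    ranges = []
    start = prev = ordered[0]
    for d in ordered[1:]:
        if d - prev == one_day:
            # Adjacent day: extend the current range.
            prev = d
            continue
        # Gap found: close the current range (exclusive upper bound).
        ranges.append((start, prev + one_day))
        start = prev = d
    ranges.append((start, prev + one_day))

    clauses = [
        f"({column} >= date '{lo.isoformat()}' AND {column} < date '{hi.isoformat()}')"
        for lo, hi in ranges
    ]
    return "(" + " OR ".join(clauses) + ")"


if __name__ == "__main__":
    days = [date(1937, 1, 1), date(1937, 1, 2), date(1940, 5, 17)]
    print(day_range_predicate("T.published", days))
```

The resulting string could then be appended to the `ON` clause of the generated `MERGE` statement, alongside the actual join keys.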
