smaspe commented on issue #2765:
URL: https://github.com/apache/iceberg/issues/2765#issuecomment-932382613


   Hi @kbendick,
   We've upgraded to Spark 3.1.2, and the behavior is unchanged.
   The concern is that updating rows in multiple partitions in a large table is 
extremely slow. We've mitigated so far by splitting queries into chunks that 
target smaller numbers of partitions, but it's far from perfect.
   
   What we don't understand is: how can we tell Iceberg what (hidden) 
partitions to target specifically, so that it doesn't need to scan the whole 
table?
   
   What Canh tried (`ON date(T.published) IN (date '1937-01-01', ...)`) still 
doesn't work in Spark 3.1.2. We can use things like `ON T.published >= date 
'1937-01-01' and T.published < date '1937-01-02'`, but that doesn't seem very 
practical for dozens of partitions (and, if those partitions are sparse, we 
can't join them into a single larger range).
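
To illustrate what the range form looks like with many sparse days, here is a minimal sketch that generates the predicate as an OR of half-open day ranges, merging consecutive days into one range. The helper name `day_range_predicate` is hypothetical, and whether Spark 3.1.2 actually pushes an OR of ranges like this down to Iceberg inside a `MERGE INTO ... ON` clause is an assumption we have not verified.

```python
from datetime import date, timedelta


def day_range_predicate(column, days):
    """Build an OR of half-open day ranges covering `days` (hypothetical helper).

    Consecutive days are merged into a single range; sparse days each get
    their own range.
    """
    ordered = sorted(set(days))
    if not ordered:
        raise ValueError("at least one day is required")

    one_day = timedelta(days=1)
    ranges = []
    start = prev = ordered[0]
    for day in ordered[1:]:
        if day - prev > one_day:
            # Gap found: close the current range (exclusive upper bound).
            ranges.append((start, prev + one_day))
            start = day
        prev = day
    ranges.append((start, prev + one_day))

    clauses = [
        f"({column} >= date '{lo.isoformat()}' AND {column} < date '{hi.isoformat()}')"
        for lo, hi in ranges
    ]
    return "(" + " OR ".join(clauses) + ")"


# Two consecutive days plus one isolated day yield two ranges.
predicate = day_range_predicate(
    "T.published",
    [date(1937, 1, 1), date(1937, 1, 2), date(1937, 3, 5)],
)
print(predicate)
```

The resulting string could be appended to the `ON` clause of the merge statement. It keeps the query text manageable, but it does not by itself answer whether the predicate is treated as selective.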
   
   We suspect that it might have to do with 
https://issues.apache.org/jira/browse/SPARK-35245
   > This is because filtering side do not has selective predicate
   
   However, I fail to see how we can make the predicate selective if it isn't 
already.
   
   Thanks in advance for any light you can shed on this!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


