[GitHub] [iceberg] aokolnychyi commented on issue #1422: Distributing Job Planning on Spark

GitBox Thu, 03 Sep 2020 22:24:25 -0700


aokolnychyi commented on issue #1422:
URL: https://github.com/apache/iceberg/issues/1422#issuecomment-686915430



   To give a bit more background, job planning is fast even on huge tables if 
we have a partition predicate and end up processing 10-15 partitions in a 
single job. We have `RewriteManifestsAction` to rewrite metadata and align it 
with partitions on demand. That covers all common use cases.
   
   At the same time, Iceberg supports file filtering within partitions and 
opens the opportunity for efficient *full* table scans. So there are two use 
cases we want to address:
   
   - job planning for full table scans (less important)
   - job planning for queries with predicates only on sort key in partitioned 
tables (becomes common in Iceberg tables)
   
   The second use case is the primary one. It is a common pattern to partition 
tables by date and sort by key and have queries *without* partition predicates 
but with predicates on the sort key. Such use cases were not possible without 
Iceberg since we could not filter files within partitions and we ended up with 
full table scans. Right now, we can narrow down the number of matching files to 
1-2 per partition. So we can scan PB scale tables for a given key faster. If we 
do so right now, we end up spending most of the time during job planning. 
That's why it would be better to parallelize that.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [iceberg] aokolnychyi commented on issue #1422: Distributing Job Planning on Spark

Reply via email to