szehon-ho opened a new issue #2326: URL: https://github.com/apache/iceberg/issues/2326
I was struggling the other day with listing partitions of a table with 4000 manifest files: it takes ~10 minutes on S3, which is a bit of a shock for people used to the speed of querying the Hive metastore. While it is possible to optimize the metadata, I was chatting a bit with @RussellSpitzer about this issue, and it seems the way to go is making the reading of PartitionTable a proper Spark job with predicate push down (https://github.com/apache/iceberg/pull/1421 and https://github.com/apache/iceberg/issues/1552).

Going back to the common use-case of listing partitions, one quick win may be getting the min/max partition. I saw this a lot in Hive, used by schedulers like Airflow sensors to detect new data, or by retention tools to detect old data and kick in cleanup. Unfortunately, aggregate push-down does not seem to be supported in Spark DSv2, unless I am mistaken; that would be the ideal solution, since the aggregate could then be pushed down to the PartitionTable. But the information also seems to be available in the manifest list (partition boundaries), which would be really fast to read. Does it make sense to expose this just as an Iceberg API for a quick win? (A rough sketch of reading those summaries is below.)

Another workaround is to push the predicate filter down in the existing non-distributed PartitionTable TableScan, which is not done today. I think it would be a good change anyway. The query logic to find the max/min partition is a bit more complex (keep expanding the predicate from the expected latest until you hit something), but it is also useful for answering other queries, for example how many partitions exist for a given day. Is there any reason not to do this?

@rdblue @aokolnychyi do you guys have any thoughts about these ideas, or any other thoughts you had in the past about this? Thanks
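If exposing this through the Iceberg API sounds reasonable, here is a minimal sketch of what reading partition boundaries off the manifest list could look like. It is only an illustration: the class and method names are made up, it only looks at the first partition field, and depending on the Iceberg version `dataManifests()` may need `table.io()` passed in.

```java
import org.apache.iceberg.ManifestFile;
import org.apache.iceberg.PartitionField;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;
import org.apache.iceberg.types.Conversions;
import org.apache.iceberg.types.Type;

public class PartitionBoundsSketch {

  // Rough sketch: derive min/max bounds for the first partition field by reading only the
  // manifest-list summaries of the current snapshot, without opening any manifest files.
  @SuppressWarnings("unchecked")
  static void printFirstPartitionFieldBounds(Table table) {
    Snapshot snapshot = table.currentSnapshot();
    if (snapshot == null) {
      return; // empty table, nothing to report
    }

    // the bound type is the partition transform's result type, not the source column type
    PartitionField field = table.spec().fields().get(0);
    Type boundType = field.transform().getResultType(table.schema().findType(field.sourceId()));

    Object min = null;
    Object max = null;

    // NOTE: newer Iceberg versions take table.io() as an argument to dataManifests()
    for (ManifestFile manifest : snapshot.dataManifests()) {
      ManifestFile.PartitionFieldSummary summary = manifest.partitions().get(0);
      if (summary.lowerBound() == null || summary.upperBound() == null) {
        continue; // bounds can be missing, e.g. when the field only contains nulls
      }

      Object lower = Conversions.fromByteBuffer(boundType, summary.lowerBound());
      Object upper = Conversions.fromByteBuffer(boundType, summary.upperBound());

      if (min == null || ((Comparable<Object>) lower).compareTo(min) < 0) {
        min = lower;
      }
      if (max == null || ((Comparable<Object>) upper).compareTo(max) > 0) {
        max = upper;
      }
    }

    System.out.println("min partition bound: " + min + ", max partition bound: " + max);
  }
}
```

One caveat: the manifest-list summaries may also cover entries for files that are no longer live, so the bounds could be wider than the set of partitions that currently hold data.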

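For the last point, here is a rough illustration of the "keep expanding the predicate" idea against the partitions metadata table from Spark, assuming an identity partition on a date column named dt (the table name, column, and helper method are placeholders). Today this filter is not pushed into the PartitionTable scan, so each probe still reads all manifests, which is exactly what the proposed push-down would fix.

```java
import java.time.LocalDate;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class LatestPartitionProbe {

  // Hypothetical helper: widen a date window backwards from today until the partitions
  // metadata table returns something, then take the max inside that window.
  static LocalDate findLatestPartition(SparkSession spark, String tableName, int maxDaysBack) {
    LocalDate lowerBound = LocalDate.now();
    for (int i = 0; i < maxDaysBack; i++) {
      Row result = spark.sql(String.format(
          "SELECT max(partition.dt) FROM %s.partitions WHERE partition.dt >= DATE '%s'",
          tableName, lowerBound)).first();
      if (!result.isNullAt(0)) {
        return result.getDate(0).toLocalDate(); // latest partition within the current window
      }
      lowerBound = lowerBound.minusDays(1);     // nothing yet, widen the window and retry
    }
    return null; // no partitions found within maxDaysBack days
  }
}
```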