[GitHub] [iceberg] pvary opened a new pull request #1920: Hive: Serialize metadata location so split generation does not need to load the table

GitBox Fri, 11 Dec 2020 21:50:06 -0800


pvary opened a new pull request #1920:
URL: https://github.com/apache/iceberg/pull/1920



   Before this change, split generation loads the table and uses it to 
generate the scan tasks.
   This can be problematic for two reasons:
   1. Split generation happens on the TezAM, and currently there is no 
connection between the TezAMs and the HMS. Loading the table there could cause 
extra load and needs extra network configuration and traffic.
   2. Split generation happens after query planning, and the Table could 
have changed in the meantime. In the longer term we have to find a way to use 
the same snapshot throughout the planning and execution process.
   
   As a first step, this PR creates `StaticTable`, which represents a specific 
snapshot of the Table, and serializes the data required to create this table 
into the job configuration. This solves 1. and provides a way forward to solve 2.
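
   The idea can be illustrated with a minimal, self-contained sketch. It is 
not the code in this PR: the class name `StaticTableSketch`, the key 
`sketch.iceberg.metadata.location`, and the use of a plain `Map` in place of 
the real job configuration are all assumptions made for illustration.

```java
import java.util.Map;

// Illustrative sketch only: these names are hypothetical, not Iceberg's API.
final class StaticTableSketch {
    // Assumed configuration key, invented for this example.
    static final String METADATA_LOCATION_KEY = "sketch.iceberg.metadata.location";

    private final String metadataLocation;

    StaticTableSketch(String metadataLocation) {
        if (metadataLocation == null || metadataLocation.isEmpty()) {
            throw new IllegalArgumentException("metadata location is required");
        }
        this.metadataLocation = metadataLocation;
    }

    String metadataLocation() {
        return metadataLocation;
    }

    // Planning side: store what split generation will need in the job configuration.
    void writeTo(Map<String, String> jobConf) {
        jobConf.put(METADATA_LOCATION_KEY, metadataLocation);
    }

    // Split-generation side: rebuild from the configuration alone, with no catalog lookup.
    static StaticTableSketch readFrom(Map<String, String> jobConf) {
        String location = jobConf.get(METADATA_LOCATION_KEY);
        if (location == null) {
            throw new IllegalStateException("missing " + METADATA_LOCATION_KEY);
        }
        return new StaticTableSketch(location);
    }
}
```

   In this sketch the planner would call `writeTo` once, and the split 
generator would call `readFrom` on the same configuration, so both sides refer 
to the same metadata file without the second one contacting the HMS.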
   
   Since all of the InputFormat tests use the same code path, no extra 
tests are added.





