rdblue commented on issue #2068:
URL: https://github.com/apache/iceberg/issues/2068#issuecomment-773662070


   Just catching up on this thread.
   
   I agree with everything that @electrum said. Strong expectations help us. 
But we also need a way to handle existing data, and that comes up all the 
time. Most existing data tracks columns by name.
   
   To support existing data files, we built name mappings so that you can take 
table data that previously identified columns by name and attach IDs. As long 
as the files that used name-based schema evolution are properly mapped, the 
Iceberg table can carry on with ID-based resolution without problems. I think 
this is a reasonable path forward to get schema IDs.
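
   For concreteness, here's a rough sketch in Java of attaching a mapping 
derived from a table's current schema, using the `MappingUtil` and 
`NameMappingParser` helpers and the default name-mapping table property:

   ```java
   import org.apache.iceberg.Table;
   import org.apache.iceberg.TableProperties;
   import org.apache.iceberg.mapping.MappingUtil;
   import org.apache.iceberg.mapping.NameMapping;
   import org.apache.iceberg.mapping.NameMappingParser;

   public class AttachNameMapping {
     // Derive a mapping from the table's current schema (each column name
     // maps to its assigned field ID) and store it as the table's default,
     // so readers can resolve data files that carry no field IDs.
     static void setDefaultMapping(Table table) {
       NameMapping mapping = MappingUtil.create(table.schema());
       table.updateProperties()
           .set(TableProperties.DEFAULT_NAME_MAPPING, NameMappingParser.toJson(mapping))
           .commit();
     }
   }
   ```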
   
   Position-based schema evolution isn't very popular and doesn't work well 
with nested structures, so I think we should focus on name-based.
   
   I completely agree that we need to read part of each data file. For Parquet, 
we need to get column stats at a minimum, but we should also validate that at 
least one column is readable.
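
   As a sketch of the kind of check I mean, reading only the Parquet footer is 
enough to pull per-column stats and to fail fast on a file that isn't readable 
(plain parquet-mr API; the file location is a placeholder):

   ```java
   import java.io.IOException;
   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.Path;
   import org.apache.parquet.hadoop.ParquetFileReader;
   import org.apache.parquet.hadoop.metadata.BlockMetaData;
   import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
   import org.apache.parquet.hadoop.util.HadoopInputFile;

   public class ReadParquetStats {
     // Open just the footer and print min/max/null-count statistics for each
     // column chunk; an unreadable file throws here instead of at query time.
     static void printStats(String location) throws IOException {
       Configuration conf = new Configuration();
       try (ParquetFileReader reader =
           ParquetFileReader.open(HadoopInputFile.fromPath(new Path(location), conf))) {
         for (BlockMetaData block : reader.getFooter().getBlocks()) {
           for (ColumnChunkMetaData column : block.getColumns()) {
             System.out.printf("%s: %s%n", column.getPath(), column.getStatistics());
           }
         }
       }
     }
   }
   ```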
   
   I think that means that we have a few things to do to formally support this:
   1. Add name mapping to the Iceberg spec so that it is well-defined and we 
have test cases to validate
   2. Document how name mappings change when a schema evolves (allowing 
aliases to be added; see the sketch after this list)
   3. Make sure that when we import files, there is a name mapping set for the 
table
   4. Build correct metadata from imported files based on the name mapping
   5. Identify problems with the name mapping, like files with no 
readable/mapped fields or incompatible data types
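
   To illustrate point 2, here's a small sketch of a mapping where a column 
was renamed and the old name is kept as an alias; the field names and IDs are 
made up for the example:

   ```java
   import java.util.Arrays;
   import org.apache.iceberg.mapping.MappedField;
   import org.apache.iceberg.mapping.NameMapping;
   import org.apache.iceberg.mapping.NameMappingParser;

   public class AliasExample {
     public static void main(String[] args) {
       // Field 2 was renamed from "ts" to "event_time"; keeping the old name
       // as an alias lets the mapping resolve both old and new data files.
       NameMapping mapping = NameMapping.of(
           MappedField.of(1, "id"),
           MappedField.of(2, Arrays.asList("event_time", "ts")));
       System.out.println(NameMappingParser.toJson(mapping));
     }
   }
   ```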
   
   One last issue to note is how to get the partition for a data file. The 
current import code assumes that files are coming from a Hive table layout and 
converts the path to partition values as Hive would. We will need to make sure 
that an import procedure has a plan for handling this.
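
   For example, building file metadata from a Hive-style path might look 
roughly like this, assuming a spec with an identity `date` partition; the 
location, file size, and record count are placeholders:

   ```java
   import org.apache.iceberg.DataFile;
   import org.apache.iceberg.DataFiles;
   import org.apache.iceberg.FileFormat;
   import org.apache.iceberg.PartitionSpec;

   public class ImportHiveLayout {
     // withPartitionPath parses the "key=value" segment against the spec,
     // which is exactly the Hive-layout assumption described above.
     static DataFile fromHivePath(PartitionSpec spec) {
       return DataFiles.builder(spec)
           .withPath("s3://bucket/db/tbl/date=2021-02-04/part-00000.parquet")
           .withPartitionPath("date=2021-02-04")
           .withFormat(FileFormat.PARQUET)
           .withFileSizeInBytes(1024L)  // placeholder; take from file status
           .withRecordCount(100L)       // placeholder; take from footer stats
           .build();
     }
   }
   ```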
   
   And I should note that only Hive partition paths can be parsed. Iceberg 
purposely makes no guarantee that partition values can be recovered from paths 
and considers the conversion from partition values to paths to be one-way. So 
we may want a way to pass the partition tuple directly. A struct is one 
option, but that makes it especially hard to use the date transforms. Another 
option is to import everything at the top level, or to try to infer values 
from lower/upper bounds in column stats.
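
   A sketch of the struct option, assuming a spec with a day transform on a 
`ts` column (default partition field name `ts_day`); note that the caller has 
to supply the transformed value, which is exactly what makes the date 
transforms awkward here:

   ```java
   import org.apache.iceberg.DataFile;
   import org.apache.iceberg.DataFiles;
   import org.apache.iceberg.FileFormat;
   import org.apache.iceberg.PartitionSpec;
   import org.apache.iceberg.data.GenericRecord;

   public class ImportExplicitPartition {
     // Pass the partition tuple directly instead of parsing it from the path.
     static DataFile withExplicitPartition(PartitionSpec spec) {
       GenericRecord partition = GenericRecord.create(spec.partitionType());
       // A day-transformed field stores days from the epoch, not a date string.
       partition.setField("ts_day", 18662);  // 2021-02-04
       return DataFiles.builder(spec)
           .withPath("s3://bucket/db/tbl/data/part-00001.parquet")
           .withPartition(partition)
           .withFormat(FileFormat.PARQUET)
           .withFileSizeInBytes(2048L)  // placeholder
           .withRecordCount(200L)       // placeholder
           .build();
     }
   }
   ```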

