johnclara edited a comment on pull request #1844: URL: https://github.com/apache/iceberg/pull/1844#issuecomment-742112667
> though I'd like to hear thoughts from others on that, @johnclara?

I'm still catching up on all the different configuration locations/barriers. Clearly from my comments above, I'm confused about how it all fits together. Ryan's email helped me a lot: https://lists.apache.org/thread.html/r1c357c7867234fed58b12f0cba6e9d2c13ed1af94b8d922ffe3a3def%40%3Cdev.iceberg.apache.org%3E

For this PR it sounds like things are still in flux, but my understanding is (please correct whatever I have wrong):

For Spark as an example, the config barriers are (see the first sketch at the end of this comment):
- User passes:
  - command line args + Hadoop config files on disk
- Spark pushes down:
  - DataSourceV2 options + Hadoop config
- The Spark source initializes the catalog with:
  - a `Map<String, String>` argument (called properties?) + Hadoop config

Inside the catalog, the config sources seem to be:
- `Map<String, String>` arguments: system/compute constraints (like running on prem vs. on an EC2 instance)
- Hadoop config: for setting up a Hadoop `FileSystem`
- catalog properties: I'm not sure what these are? If it's Hive or Glue, would it be a `Map<String, String>` per table entry, or for the catalog as a whole? ~~In Ryan's email I think it says that it should just give the location and say that it is an Iceberg table (but what if it should also say whether it's an Iceberg table backed by AWS vs. GCP, and how to read it?)~~ [edit: I'm wrong. I don't fully understand what he's saying here:
  ```
  Config in the Hive MetaStore is only used to identify that a table is Iceberg and point to its metadata location. All other config in HMS is informational. For example, the input format is FileInputFormat so that non-Iceberg readers cannot actually instantiate the format (it's abstract) but it is available so they also don't fail trying to load the class. Table-specific config should not be stored in table or serde properties.
  ```
  ]
- table properties: table-specific information

It sounds like for tables which use S3FileIO:
- The Hadoop config should be ignored, since we're not using HadoopFileIO.
- The `Map<String, String>` arguments should give enough info to reach the catalog.
- The catalog properties should then be mixed with whichever arguments constrain where the compute is running (e.g. a proxy for on prem vs. an EC2 machine). That should give enough info to read the table's metadata JSON file.
- Then the catalog properties should be dropped, and the table properties mixed with the arguments. That should give enough information to read and write the table?

Should one FileIO be created just for the metadata JSON file, and then a new FileIO created for the rest of the table (manifest list/manifests + data files) after reading the table properties? (See the second sketch below.)

Where should the KMS key ID go? Is it a per-table thing for writing new data files (table properties), or a user/system-defined property that belongs in the arguments? (See the third sketch below.)

It would help me a ton if this were written out somewhere (it might be in the docs Jack has written up; I still have to go take a look).
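To make the "config barriers" list above concrete, here is a minimal sketch of how Spark 3's catalog plugin mechanism hands configuration to an Iceberg catalog. The catalog name `my_catalog` and the metastore URI are illustrative only; the mechanism itself (Spark stripping the `spark.sql.catalog.<name>.` prefix and passing the remaining keys to the plugin's `initialize`) is how DataSourceV2 catalogs work:

```java
import org.apache.spark.sql.SparkSession;

public class CatalogConfigExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("iceberg-config-demo")
        // Each "spark.sql.catalog.my_catalog.*" key below has its prefix
        // stripped by Spark and is handed to the catalog plugin's
        // initialize(name, options) as the Map<String, String> discussed above.
        .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.my_catalog.type", "hive")
        .config("spark.sql.catalog.my_catalog.uri", "thrift://metastore:9083")
        .getOrCreate();

    // Iceberg's SparkCatalog builds the underlying catalog from that map;
    // the Hadoop Configuration travels separately, via the Spark session.
    spark.sql("SELECT * FROM my_catalog.db.table").show();
  }
}
```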
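Second sketch, for the FileIO lifecycle question. This is not what the project does; it is just one possible resolution order written out in plain Java, to make the two-phase idea (bootstrap config before the metadata JSON is read, table-scoped config after) explicit. All names here are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

public class FileIOConfigSketch {
  // Phase 1: a "bootstrap" view of config, before table metadata is available.
  // Only the catalog-level arguments exist at this point; they must be enough
  // to locate and read the table's metadata JSON file.
  static Map<String, String> bootstrapProperties(Map<String, String> catalogArgs) {
    return new HashMap<>(catalogArgs);
  }

  // Phase 2: once the metadata JSON has been read, table properties become
  // available and (in this sketch) override catalog-level defaults. Whether a
  // second FileIO should be built from this merged map, replacing the
  // bootstrap one, is exactly the open question above.
  static Map<String, String> tableScopedProperties(
      Map<String, String> catalogArgs, Map<String, String> tableProperties) {
    Map<String, String> merged = new HashMap<>(catalogArgs);
    merged.putAll(tableProperties); // table-specific settings win in this sketch
    return merged;
  }
}
```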
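Third sketch, for the KMS key placement question, assuming the answer is "table property". `Table.updateProperties()` is the real Iceberg API; the `s3.sse.type`/`s3.sse.key` property names are the ones the AWS module later exposed, so treat them as an assumption in the context of this PR, where that split was still being decided:

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.catalog.TableIdentifier;

public class KmsKeyPlacementSketch {
  static void setTableKmsKey(Catalog catalog, String kmsKeyArn) {
    Table table = catalog.loadTable(TableIdentifier.of("db", "events"));
    table.updateProperties()
        .set("s3.sse.type", "kms")     // server-side encryption mode
        .set("s3.sse.key", kmsKeyArn)  // per-table key: new data files use it
        .commit();
  }
}
```

If instead the key is a user/system constraint (like the proxy example), it would live in the catalog-level `Map<String, String>` arguments rather than on the table.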
