johnclara edited a comment on pull request #1844: URL: https://github.com/apache/iceberg/pull/1844#issuecomment-742112667
> though I'd like to hear thoughts from others on that, @johnclara?

I'm still catching up on all the different configuration locations/barriers. Clearly from my comments above, I'm confused about how it all fits together. Ryan's email helped me a lot: https://lists.apache.org/thread.html/r1c357c7867234fed58b12f0cba6e9d2c13ed1af94b8d922ffe3a3def%40%3Cdev.iceberg.apache.org%3E

For this PR it sounds like things are still in flux, but my understanding is (please correct whatever I have wrong):

For Spark as an example, the config barriers are (see the first sketch at the end of this comment):
- User passes:
  - command line args + Hadoop config files on disk
- Spark pushes down:
  - DataSourceV2 options + Hadoop config
- The Spark source initializes the catalog with:
  - a `Map<String, String>` argument (called properties?) + Hadoop config

Inside the catalog, the config sources seem to be:
- `Map<String, String>` arguments: system/compute constraints (like running on prem vs. on an EC2 instance)
- Hadoop config: for setting up a Hadoop `FileSystem`
- catalog properties: I'm not sure what these are? If it's Hive or Glue, would it be a `Map<String, String>` per table entry, or for the catalog as a whole? ~~In Ryan's email I think it says that it should just give the location and say that it is an Iceberg table (but what if it should also say whether it's an Iceberg table backed by AWS vs. GCP, and how to read it?)~~ [edit: I'm wrong. I don't fully understand what he's saying here:
  ```
  Config in the Hive MetaStore is only used to identify that a table is Iceberg and point to its metadata location. All other config in HMS is informational. For example, the input format is FileInputFormat so that non-Iceberg readers cannot actually instantiate the format (it's abstract) but it is available so they also don't fail trying to load the class. Table-specific config should not be stored in table or serde properties.
  ```
  ]
- table properties: table-specific information

It sounds like for tables which use S3FileIO:
- The Hadoop config should be ignored, since we're not using HadoopFileIO.
- The `Map<String, String>` arguments should give enough info to reach the catalog.
- The catalog properties should then be mixed with whichever arguments constrain where the compute is running (e.g. a proxy for on prem vs. an EC2 machine). That should give enough info to read the table's metadata JSON file.
- Then the catalog properties should be dropped, and the table properties mixed with the arguments. That should give enough information to read and write the table?

Should one FileIO be created just for the metadata JSON file, and then a new FileIO created for the rest of the table (manifest list/manifests + data files) after reading the table properties? (See the second sketch below.)

Where should the KMS key ID go? Is it a per-table thing for writing new data files (table properties), or a user/system-defined property that belongs in the arguments? (See the third sketch below.)

It would help me a ton if this were written out somewhere (it might be in the docs Jack has written up; I still have to go take a look).
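To make the "config barriers" list above concrete, here is a minimal sketch of how Spark 3's catalog plugin mechanism hands configuration to an Iceberg catalog. The catalog name `my_catalog` and the metastore URI are illustrative only; the mechanism itself (Spark stripping the `spark.sql.catalog.<name>.` prefix and passing the remaining keys to the plugin's `initialize`) is how DataSourceV2 catalogs work:

```java
import org.apache.spark.sql.SparkSession;

public class CatalogConfigExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("iceberg-config-demo")
        // Each "spark.sql.catalog.my_catalog.*" key below has its prefix
        // stripped by Spark and is handed to the catalog plugin's
        // initialize(name, options) as the Map<String, String> discussed above.
        .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.my_catalog.type", "hive")
        .config("spark.sql.catalog.my_catalog.uri", "thrift://metastore:9083")
        .getOrCreate();

    // Iceberg's SparkCatalog builds the underlying catalog from that map;
    // the Hadoop Configuration travels separately, via the Spark session.
    spark.sql("SELECT * FROM my_catalog.db.table").show();
  }
}
```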
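Second sketch, for the FileIO lifecycle question. This is not what the project does; it is just one possible resolution order written out in plain Java, to make the two-phase idea (bootstrap config before the metadata JSON is read, table-scoped config after) explicit. All names here are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

public class FileIOConfigSketch {
  // Phase 1: a "bootstrap" view of config, before table metadata is available.
  // Only the catalog-level arguments exist at this point; they must be enough
  // to locate and read the table's metadata JSON file.
  static Map<String, String> bootstrapProperties(Map<String, String> catalogArgs) {
    return new HashMap<>(catalogArgs);
  }

  // Phase 2: once the metadata JSON has been read, table properties become
  // available and (in this sketch) override catalog-level defaults. Whether a
  // second FileIO should be built from this merged map, replacing the
  // bootstrap one, is exactly the open question above.
  static Map<String, String> tableScopedProperties(
      Map<String, String> catalogArgs, Map<String, String> tableProperties) {
    Map<String, String> merged = new HashMap<>(catalogArgs);
    merged.putAll(tableProperties); // table-specific settings win in this sketch
    return merged;
  }
}
```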
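Third sketch, for the KMS key placement question, assuming the answer is "table property". `Table.updateProperties()` is the real Iceberg API; the `s3.sse.type`/`s3.sse.key` property names are the ones the AWS module later exposed, so treat them as an assumption in the context of this PR, where that split was still being decided:

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.catalog.TableIdentifier;

public class KmsKeyPlacementSketch {
  static void setTableKmsKey(Catalog catalog, String kmsKeyArn) {
    Table table = catalog.loadTable(TableIdentifier.of("db", "events"));
    table.updateProperties()
        .set("s3.sse.type", "kms")     // server-side encryption mode
        .set("s3.sse.key", kmsKeyArn)  // per-table key: new data files use it
        .commit();
  }
}
```

If instead the key is a user/system constraint (like the proxy example), it would live in the catalog-level `Map<String, String>` arguments rather than on the table.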
