[GitHub] [iceberg] rdblue commented on pull request #2129: Hive: Configure catalog type on table level.

GitBox Thu, 28 Jan 2021 17:09:09 -0800


rdblue commented on pull request #2129:
URL: https://github.com/apache/iceberg/pull/2129#issuecomment-769503071



   This made me think a lot about what the behavior should be. Before diving 
further into the review, I think we should agree on that.
   
   We have 3 cases for loading tables:
   * Hadoop table identified by its location
   * Hive catalog table tracked in the current metastore
   * A table in any catalog where the metastore is a reference
   
   Ideally, a Hive table that is owned by the Hive metastore would require no 
additional property to point it to a catalog, because the metastore is unnamed 
and is basically global. A table in another catalog would have a table 
property, `iceberg.catalog`, that has the other catalog's name. That leaves 
Hadoop tables and I would use a property like `iceberg.type=hadoop` to signal 
that the table location should be used to load it. I would not use a reserved 
catalog name to identify Hadoop tables to avoid colliding with user catalog 
names (e.g., "hadoop"), but we could make it reasonably unlikely if we wanted 
to use `iceberg.catalog=location_based_table` or something.
   
   We also have existing behavior:
   * `iceberg.mr.catalog` is not defined: load all Iceberg tables as Hadoop 
tables using the location
   * `iceberg.mr.catalog=hive`: load all Iceberg tables using the table name 
from a HiveCatalog for the metastore URI
   * `iceberg.mr.catalog=hadoop`: load all Iceberg tables using the table name 
from a HadoopCatalog for the location stored in 
`iceberg.mr.catalog.hadoop.warehouse.location`
   * `iceberg.mr.catalog.loader.class`: when defined, load all Iceberg tables 
using the catalog produced by the loader class. This is primarily for direct 
use of the `InputFormat`, not for Hive.
   
   Unfortunately, the current behavior when `iceberg.mr.catalog` is not set and 
the table has no catalog set is to load the table's location. The "ideal" 
behavior I outlined above would load the table from an anonymous `HiveCatalog` 
created for the metastore URI. I don't really like the option of adding more 
table-level config if we can avoid it.
   
   Here's what I propose:
   * If a table has `iceberg.catalog` or `iceberg.type=hadoop`, then use the 
ideal behavior (no conflict with existing)
   * If a table does not have those properties, fall back to the value of 
`iceberg.mr.catalog`:
       * When not defined: load the Iceberg table from the metastore by name
       * `location`: load the Iceberg table as a Hadoop table by location
       * `hive`: use the existing behavior for backward-compatibility
       * `hadoop`: use the existing behavior for backward-compatibility
   
   I think this is a good compromise for behavior. If someone wants the default 
to be location-based table loading, then all they need to change is to set 
`iceberg.mr.catalog=location`. The other cases are unchanged, and the default 
is to load tables from the metastore (as expected). While it is a behavior 
change, I think it is a reasonable one.
   
   Miscellaneous:
   * I didn't mention a property to control table identifier above, like 
`iceberg.identifier`, but I think that would make sense. I would not support it 
for tables tracked in the metastore, though. It does not make sense to have a 
pointer to another table owned by the metastore.
   * Because the loader class is intended for the `InputFormat`, I think we 
should handle it in Hive by making sure it is not set or by using a different 
load method that ignores it. There should be no need to provide backward 
compatibility for this and we haven't documented it anywhere.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [iceberg] rdblue commented on pull request #2129: Hive: Configure catalog type on table level.

Reply via email to