rdblue commented on pull request #2129:
URL: https://github.com/apache/iceberg/pull/2129#issuecomment-769503071
This made me think a lot about what the behavior should be. Before diving
further into the review, I think we should agree on that.
We have 3 cases for loading tables:
* Hadoop table identified by its location
* Hive catalog table tracked in the current metastore
* A table in any catalog where the metastore is a reference
Ideally, a Hive table that is owned by the Hive metastore would require no
additional property to point it to a catalog, because the metastore is unnamed
and is basically global. A table in another catalog would have a table
property, `iceberg.catalog`, that has the other catalog's name. That leaves
Hadoop tables and I would use a property like `iceberg.type=hadoop` to signal
that the table location should be used to load it. I would not use a reserved
catalog name to identify Hadoop tables to avoid colliding with user catalog
names (e.g., "hadoop"), but we could make it reasonably unlikely if we wanted
to use `iceberg.catalog=location_based_table` or something.
We also have existing behavior:
* `iceberg.mr.catalog` is not defined: load all Iceberg tables as Hadoop
tables using the location
* `iceberg.mr.catalog=hive`: load all Iceberg tables using the table name
from a HiveCatalog for the metastore URI
* `iceberg.mr.catalog=hadoop`: load all Iceberg tables using the table name
from a HadoopCatalog for the location stored in
`iceberg.mr.catalog.hadoop.warehouse.location`
* `iceberg.mr.catalog.loader.class`: when defined, load all Iceberg tables
using the catalog produced by the loader class. This is primarily for direct
use of the `InputFormat`, not for Hive.
Unfortunately, the current behavior when `iceberg.mr.catalog` is not set and
the table has no catalog set is to load the table's location. The "ideal"
behavior I outlined above would load the table from an anonymous `HiveCatalog`
created for the metastore URI. I don't really like the option of adding more
table-level config if we can avoid it.
Here's what I propose:
* If a table has `iceberg.catalog` or `iceberg.type=hadoop`, then use the
ideal behavior (no conflict with existing)
* If a table does not have those properties, fall back to the value of
`iceberg.mr.catalog`:
* When not defined: load the Iceberg table from the metastore by name
* `location`: load the Iceberg table as a Hadoop table by location
* `hive`: use the existing behavior for backward-compatibility
* `hadoop`: use the existing behavior for backward-compatibility
I think this is a good compromise for behavior. If someone wants the default
to be location-based table loading, then all they need to change is to set
`iceberg.mr.catalog=location`. The other cases are unchanged, and the default
is to load tables from the metastore (as expected). While it is a behavior
change, I think it is a reasonable one.
Miscellaneous:
* I didn't mention a property to control table identifier above, like
`iceberg.identifier`, but I think that would make sense. I would not support it
for tables tracked in the metastore, though. It does not make sense to have a
pointer to another table owned by the metastore.
* Because the loader class is intended for the `InputFormat`, I think we
should handle it in Hive by making sure it is not set or by using a different
load method that ignores it. There should be no need to provide backward
compatibility for this and we haven't documented it anywhere.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]