shardulm94 commented on a change in pull request #1837: URL: https://github.com/apache/iceberg/pull/1837#discussion_r531169635
##########
File path: site/docs/hive.md
##########

@@ -84,7 +84,32 @@ You should now be able to issue Hive SQL `SELECT` queries using the above table
 SELECT * from table_b;
 ```

+#### Using Hadoop Catalog
+Iceberg tables created using `HadoopCatalog` are stored entirely in a directory in a filesystem like HDFS.
+
+##### Create an Iceberg table
+The first step is to create an Iceberg table using the Spark/Java/Python API and `HadoopCatalog`. For the purposes of this documentation, we will assume that the table is called `database_a.table_c` and that the table location is `hdfs://some_path/database_a/table_c`.
+
+##### Create a Hive table
+Now overlay a Hive table on top of this Iceberg table by issuing Hive DDL like so:
+```sql
+CREATE EXTERNAL TABLE table_c
+STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
+LOCATION 'hdfs://some_path/database_a/table_c';
+```
+
+##### Query the Iceberg table via Hive
+TODO: why does the query below work if no config settings are set in Hive, but fail if we add `set iceberg.mr.catalog=hadoop` as the code suggests we need to do?

Review comment:
   Shouldn't the catalog to use also be specified when creating the table? It seems odd that the consumer has to be aware of the catalog used to store the table. Also, what if we need to read multiple tables stored in different underlying catalogs?
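For reference, a minimal sketch of the "Create an Iceberg table" step described in the patch, using Spark SQL against a Hadoop-type Iceberg catalog. The catalog name `hadoop_cat`, the Spark configuration shown in the comments, and the column list are illustrative assumptions, not part of the patch:

```sql
-- Assumes the Spark session was started with a Hadoop-type Iceberg catalog, e.g.:
--   spark.sql.catalog.hadoop_cat=org.apache.iceberg.spark.SparkCatalog
--   spark.sql.catalog.hadoop_cat.type=hadoop
--   spark.sql.catalog.hadoop_cat.warehouse=hdfs://some_path
-- With that warehouse, the table below lands at hdfs://some_path/database_a/table_c.
CREATE TABLE hadoop_cat.database_a.table_c (
  id bigint,
  data string)
USING iceberg;
```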
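And a rough sketch of the Hive-side query step that the TODO and this comment refer to; whether the `SET` line is required (or, as the TODO suggests, actually breaks the query) is exactly the open question:

```sql
-- Hypothetical Hive session against the external table created by the DDL in the patch.
SET iceberg.mr.catalog=hadoop;

SELECT * FROM table_c;
```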