[GitHub] [iceberg] openinx commented on pull request #3539: Hive: Allow to create external table to access the iceberg table managed in hive catalog

GitBox Sun, 21 Nov 2021 23:37:24 -0800


openinx commented on pull request #3539:
URL: https://github.com/apache/iceberg/pull/3539#issuecomment-975206805



   > What is the problem with changing the serde and format properties? 
   
   I used the following commands to alter the serde and format properties of a 
non-Hive-native Iceberg table, that is, one created by the Spark/Flink engines 
without `engine.hive.enabled=true`. After that, the Hive engine can indeed 
query the original Iceberg table. 
   
   ```
   ALTER TABLE flink_local_nohive SET SERDEPROPERTIES (
       'storage_handler' = 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
   );
   ALTER TABLE flink_local_nohive SET FILEFORMAT
       INPUTFORMAT 'org.apache.iceberg.mr.hive.HiveIcebergInputFormat'
       OUTPUTFORMAT 'org.apache.iceberg.mr.hive.HiveIcebergOutputFormat'
       SERDE 'org.apache.iceberg.mr.hive.HiveIcebergSerDe';
   ```
   
   But I don't recommend this approach, because it exposes too many internal 
details to users. They need to figure out which input/output format classes and 
which storage handler class Iceberg uses, and they also need to keep the 
Iceberg properties that we set on the Hive table aligned. For example, they 
need to figure out the right value for `external.table.purge` (in this case 
it's an external table, so we don't have to set it).
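   
   To see what is currently stored on the table before touching anything, the 
metadata can be inspected from Hive. This is a minimal sketch that reuses the 
`flink_local_nohive` table from the example above; the statements are standard 
Hive DDL, but which properties actually show up depends on how the table was 
created.
   
   ```sql
   -- Show the serde, input/output format and table parameters Hive has recorded.
   DESCRIBE FORMATTED flink_local_nohive;
   -- List all table properties, including the ones written by Iceberg.
   SHOW TBLPROPERTIES flink_local_nohive;
   -- Look up a single property, e.g. the purge flag mentioned above.
   SHOW TBLPROPERTIES flink_local_nohive("external.table.purge");
   ```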
   
   In short, I don't recommend it because I am worried that users will change 
or omit some attributes defined by Iceberg, which may make the Hive table 
inaccessible to other compute engines. Therefore, I instead recommend that the 
user create another external table that refers to the original table. That way, 
at least the user does not need to change the Iceberg table in the original 
pipeline.
   
   > what exactly is this PR doing? Is it creating a second table and that's 
the source of the already exists error?
   
   Yes, this PR makes it possible to create a second table that refers to the 
original Iceberg table, without enabling the `engine.hive.enabled` switch. 
   
   This is because some users have accumulated a large amount of data in an 
Iceberg table and then found that the table cannot be accessed by the Hive 
engine. At that point they face two choices: modify the original table into a 
Hive table, or create another table that references the original one. Given the 
discussion of the first question, I recommend the second option.
   
   
   

