[GitHub] [flink] lirui-apache commented on issue #8449: [FLINK-12235][hive] Support Hive partition in HiveCatalog

GitBox Wed, 22 May 2019 20:12:26 -0700

lirui-apache commented on issue #8449: [FLINK-12235][hive] Support Hive 
partition in HiveCatalog
URL: https://github.com/apache/flink/pull/8449#issuecomment-495051154
 
 
   > @lirui-apache Thank you very much for the update!
   > 
   > I just spotted that, currently how several APIs impl work like this: 1) 
get a raw hive table 2) parse part of the raw table. The latter step actually 
duplicate with logic in `instantiateHiveCatalogTable()`. E.g. 
`ensureTableAndPartitionMatch()` parses `FLINK_PROPERTY_IS_GENERIC`, 
`instantiateHivePartition()` parses partition keys, `ensurePartitionedTable()` 
parses the raw table's partition key size, all of which we can get by just 
parsing the raw table to a `CatalogTable` thru `instantiateHiveCatalogTable()` 
in advance. The current duplication also means if we change some general logic 
in parsing a hive table, we need to change two places. Thus I wonder if it 
makes sense to just parse the raw table as whole at the beginning rather than 
having scattered places each parsing only part of it themselves. And we can 
remove util methods such as `getFieldNames()` which is only used to get the 
partition keys which is already available in `CatalogTable`.
   > 
   > For example, change
   > 
   > ```
   > public void createPartition(...) {
   >   Table hiveTable = getHiveTable(tablePath);
   >   ensureTableAndPartitionMatch(hiveTable, partition);
   >   ensurePartitionedTable(tablePath, hiveTable);
   >   try {
   >     client.add_partition(instantiateHivePartition(hiveTable, 
partitionSpec, partition));
   >   } ...
   > }
   > ```
   > 
   > to something like:
   > 
   > ```
   > public void createPartition(...) {
   >   Table hiveTable = getHiveTable(tablePath);
   >   CatalogBaseTable catalogTable = instantiateHiveCatalogTable(hiveTable);
   >   ... check whether catalogTabe and catalogPartition type matches would be 
much easier here ...
   >   ... check whether catalogTable is partitioned would be easier here ...
   >   try {
   >     client.add_partition(
   >       instantiateHivePartition(catalogTable, partitionSpec, partition, 
hiveTable.getSd()));
   >   } ...
   > }
   > ```
   
   I'm not sure how much benefit this can bring us. It might make 
`ensureTableAndPartitionMatch` a little easier -- we can check the type of 
CatalogBaseTable instead of parsing a property. But I don't think we should 
take the same approach for `ensurePartitionedTable`. `ensurePartitionedTable` 
is already simple enough. And for APIs like `listPartitions`, creating a 
CatalogBaseTable just to get num of partition cols seems an overkill to me. And 
the same applies to `getFieldNames`. E.g. I think it'll be an overkill to 
create a CatalogBaseTable to get partition col names in `dropPartition`.
   Maybe an alternative approach to the problem you mentioned is to have more 
util methods, in order to avoid duplication. For example, we should have a util 
method to decide whether a Hive table is generic. And all the APIs needing this 
logic can call the util method. What do you think?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

[GitHub] [flink] lirui-apache commented on issue #8449: [FLINK-12235][hive] Support Hive partition in HiveCatalog

Reply via email to