lirui-apache commented on issue #8449: [FLINK-12235][hive] Support Hive partition in HiveCatalog URL: https://github.com/apache/flink/pull/8449#issuecomment-495051154 > @lirui-apache Thank you very much for the update! > > I just spotted that, currently how several APIs impl work like this: 1) get a raw hive table 2) parse part of the raw table. The latter step actually duplicate with logic in `instantiateHiveCatalogTable()`. E.g. `ensureTableAndPartitionMatch()` parses `FLINK_PROPERTY_IS_GENERIC`, `instantiateHivePartition()` parses partition keys, `ensurePartitionedTable()` parses the raw table's partition key size, all of which we can get by just parsing the raw table to a `CatalogTable` thru `instantiateHiveCatalogTable()` in advance. The current duplication also means if we change some general logic in parsing a hive table, we need to change two places. Thus I wonder if it makes sense to just parse the raw table as whole at the beginning rather than having scattered places each parsing only part of it themselves. And we can remove util methods such as `getFieldNames()` which is only used to get the partition keys which is already available in `CatalogTable`. > > For example, change > > ``` > public void createPartition(...) { > Table hiveTable = getHiveTable(tablePath); > ensureTableAndPartitionMatch(hiveTable, partition); > ensurePartitionedTable(tablePath, hiveTable); > try { > client.add_partition(instantiateHivePartition(hiveTable, partitionSpec, partition)); > } ... > } > ``` > > to something like: > > ``` > public void createPartition(...) { > Table hiveTable = getHiveTable(tablePath); > CatalogBaseTable catalogTable = instantiateHiveCatalogTable(hiveTable); > ... check whether catalogTabe and catalogPartition type matches would be much easier here ... > ... check whether catalogTable is partitioned would be easier here ... > try { > client.add_partition( > instantiateHivePartition(catalogTable, partitionSpec, partition, hiveTable.getSd())); > } ... > } > ``` I'm not sure how much benefit this can bring us. It might make `ensureTableAndPartitionMatch` a little easier -- we can check the type of CatalogBaseTable instead of parsing a property. But I don't think we should take the same approach for `ensurePartitionedTable`. `ensurePartitionedTable` is already simple enough. And for APIs like `listPartitions`, creating a CatalogBaseTable just to get num of partition cols seems an overkill to me. And the same applies to `getFieldNames`. E.g. I think it'll be an overkill to create a CatalogBaseTable to get partition col names in `dropPartition`. Maybe an alternative approach to the problem you mentioned is to have more util methods, in order to avoid duplication. For example, we should have a util method to decide whether a Hive table is generic. And all the APIs needing this logic can call the util method. What do you think?
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services
