[ 
https://issues.apache.org/jira/browse/HIVE-24263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szehon Ho updated HIVE-24263:
-----------------------------
    Description: 
In our company, we have a use case that requires quickly getting a list of 
partition locations.  Currently this is done via listPartitions, which is a 
very heavy operation in terms of memory and performance.

This JIRA proposes an API: Map<String, String> listPartitionLocations(String 
db, String table, short max) that returns a map of partition names to locations.
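
To make the proposed contract concrete, here is a minimal sketch in Java. Only 
the method signature comes from this JIRA; the interface name, the in-memory 
class, and the rule that a negative max means "no limit" are assumptions made 
for illustration, not the actual metastore implementation.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical interface carrying the proposed call; not the real HMS client API. */
interface PartitionLocationLister {
    /** Returns a map of partition name to location for up to max partitions. */
    Map<String, String> listPartitionLocations(String db, String table, short max);
}

/** In-memory stand-in for the metastore, used only to illustrate the contract. */
class InMemoryPartitionLocationLister implements PartitionLocationLister {
    // Key: "db.table"; value: partition name -> location, in insertion order.
    private final Map<String, Map<String, String>> tables = new LinkedHashMap<>();

    void addPartition(String db, String table, String partName, String location) {
        tables.computeIfAbsent(db + "." + table, k -> new LinkedHashMap<>())
              .put(partName, location);
    }

    @Override
    public Map<String, String> listPartitionLocations(String db, String table, short max) {
        Map<String, String> parts =
            tables.getOrDefault(db + "." + table, Collections.emptyMap());
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : parts.entrySet()) {
            // Assumption: a negative max means "return everything".
            if (max >= 0 && result.size() >= max) {
                break;
            }
            result.put(e.getKey(), e.getValue());
        }
        return result;
    }
}
```

The point of the shape is that callers receive only names and locations, 
rather than full Partition objects as with listPartitions.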

For example, we have an integration that feeds the output of a Hive pipeline 
to Spark jobs that consume directly from HDFS.  The Spark job scheduler needs 
to know the partition paths that are available for consumption (the partition 
name is not sufficient, as its input is an HDFS path), and so we have to make 
heavy listPartitions() calls for this.

Another use case is an HDFS data removal tool that does a nightly crawl to 
check whether any Hive partitions are mapped to a given partition path.  
The nightly crawling job could be much less resource-intensive if we had 
listPartitionLocations().
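
As a rough illustration of the removal-tool check, the sketch below takes the 
name-to-location map that the proposed call would return and reports which 
crawled paths have no partition mapped to them. The class and method names are 
hypothetical, and the normalization only strips trailing slashes; a real tool 
would presumably also need to reconcile URI schemes and authorities.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper for the nightly crawl; not part of Hive. */
class UnmappedPathFinder {

    /** Returns the crawled paths that no partition location points at. */
    static List<String> findUnmappedPaths(Map<String, String> partitionLocations,
                                          List<String> crawledPaths) {
        // Build a set of known locations once so each lookup is constant time.
        Set<String> known = new HashSet<>();
        for (String location : partitionLocations.values()) {
            known.add(stripTrailingSlashes(location));
        }
        List<String> unmapped = new ArrayList<>();
        for (String path : crawledPaths) {
            if (!known.contains(stripTrailingSlashes(path))) {
                unmapped.add(path);
            }
        }
        return unmapped;
    }

    /** Simplified normalization: removes trailing '/' characters only. */
    static String stripTrailingSlashes(String path) {
        int end = path.length();
        while (end > 1 && path.charAt(end - 1) == '/') {
            end--;
        }
        return path.substring(0, end);
    }
}
```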

As the ObjectStore already has an internal method for this, used by 
dropPartitions, it is only a matter of exposing it through HiveMetaStoreClient.



  was:
In our company, we have a use-case to get quickly a list of partition 
locations.  Currently it is done via listPartitions, which is a very heavy 
operation in terms of memory and performance.

For example, we have an integration from output of a Hive pipeline to Spark 
jobs that consume directly from HDFS.  It needs to know the partition paths 
that are available for consumption, and does repeated listPartitions() for this.

As there is already an internal method in the ObjectStore for this done for 
dropPartitions, it is only a matter of exposing this API to HiveMetaStoreClient.




> Create an HMS endpoint to list partition locations
> --------------------------------------------------
>
>                 Key: HIVE-24263
>                 URL: https://issues.apache.org/jira/browse/HIVE-24263
>             Project: Hive
>          Issue Type: Improvement
>          Components: Standalone Metastore
>            Reporter: Szehon Ho
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
