[ 
https://issues.apache.org/jira/browse/HIVE-7195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14028046#comment-14028046
 ] 

Mithun Radhakrishnan commented on HIVE-7195:
--------------------------------------------

I've been trying to solve the problem from the other end in HCatalog, i.e. 
registering partitions in the metastore for data that was written to HDFS 
outside of Hive/HCatalog (e.g. through an ingestion service like Apache 
Falcon). There were several points at which I wished we had an abstraction for 
a "partition-spec" at the metastore level (if not at the ObjectStore level).

It would be cool to have parallel functions like the following in the 
HiveMetaStore(Client) interface:

{code}
public PartitionSpec listPartitions(String db_name, String tbl_name, short max_parts) throws ... ;
public int add_partitions( PartitionSpec new_parts ) throws ... ;
{code}

where the PartitionSpec looks like:

{code}
public interface PartitionSpec {
        public List<Partition> getPartitions();
        public List<String> getPartNames();
        public Iterator<Partition> getPartitionIter();
        public Iterator<String> getPartNameIter();
}
{code}

The DefaultPartitionSpec would simply wrap a List<Partition>.

An HDFSDirBasedPartitionSpec could store a root-level partition directory and 
return Partition objects via globStatus() on HDFS. I would pass this to 
addPartitions(PartitionSpec) to avoid having to specify every partition 
explicitly, which saves a lot of Thrift serialization and traffic over the 
wire.

A future PartitionSpec could choose to compose other PartitionSpecs.

HiveMetaStoreClient.listPartitions() could return a PartitionSpec that 
composes several Partition objects sharing the same StorageDescriptor 
instance, so that 10000 partitions with nearly the same SD don't each repeat 
the redundant bits.
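
To make the shared-StorageDescriptor idea concrete, here is a rough sketch, not a worked-out design. Everything in it beyond the PartitionSpec interface is an assumption for illustration: SharedSDPartitionSpec is a hypothetical name, StorageDescriptor and Partition are stripped-down stand-ins for the real metastore classes, and partition names are built as plain key=value paths without the escaping Hive applies. The spec keeps the common SD fields once, plus one list of values per partition, and only builds full Partition objects when a caller asks for them.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-ins for the metastore types, reduced to what the sketch needs.
final class StorageDescriptor {
    final String inputFormat;
    final String location;
    StorageDescriptor(String inputFormat, String location) {
        this.inputFormat = inputFormat;
        this.location = location;
    }
}

final class Partition {
    final List<String> values;
    final StorageDescriptor sd;
    Partition(List<String> values, StorageDescriptor sd) {
        this.values = values;
        this.sd = sd;
    }
}

// The interface as proposed above.
interface PartitionSpec {
    List<Partition> getPartitions();
    List<String> getPartNames();
    Iterator<Partition> getPartitionIter();
    Iterator<String> getPartNameIter();
}

// Hypothetical spec: the shared SD fields are stored once, and each partition
// contributes only its values. Locations are derived from the root directory.
final class SharedSDPartitionSpec implements PartitionSpec {
    private final List<String> partKeys;
    private final String inputFormat;
    private final String rootLocation;
    private final List<List<String>> partValues;

    SharedSDPartitionSpec(List<String> partKeys, String inputFormat,
                          String rootLocation, List<List<String>> partValues) {
        this.partKeys = partKeys;
        this.inputFormat = inputFormat;
        this.rootLocation = rootLocation;
        this.partValues = partValues;
    }

    // Simplified name format (key1=val1/key2=val2); no escaping is done here.
    private String nameOf(List<String> values) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < partKeys.size(); i++) {
            if (i > 0) sb.append('/');
            sb.append(partKeys.get(i)).append('=').append(values.get(i));
        }
        return sb.toString();
    }

    // Build a full Partition from the shared fields plus one set of values.
    private Partition materialize(List<String> values) {
        String location = rootLocation + "/" + nameOf(values);
        return new Partition(values, new StorageDescriptor(inputFormat, location));
    }

    @Override
    public Iterator<Partition> getPartitionIter() {
        final Iterator<List<String>> it = partValues.iterator();
        return new Iterator<Partition>() {
            @Override public boolean hasNext() { return it.hasNext(); }
            @Override public Partition next() { return materialize(it.next()); }
        };
    }

    @Override
    public Iterator<String> getPartNameIter() {
        final Iterator<List<String>> it = partValues.iterator();
        return new Iterator<String>() {
            @Override public boolean hasNext() { return it.hasNext(); }
            @Override public String next() { return nameOf(it.next()); }
        };
    }

    @Override
    public List<Partition> getPartitions() {
        List<Partition> out = new ArrayList<>(partValues.size());
        getPartitionIter().forEachRemaining(out::add);
        return out;
    }

    @Override
    public List<String> getPartNames() {
        List<String> out = new ArrayList<>(partValues.size());
        getPartNameIter().forEachRemaining(out::add);
        return out;
    }
}
```

With this shape, only the partition keys, the shared fields and the per-partition values would need to be serialized; the receiving side can rebuild each Partition from the iterator as it goes.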

I haven't worked out the nuts-and-bolts completely. I'll put a more complete 
proposal out on a separate JIRA. I think this will have value for both 
listPartitions() (i.e. read) and addPartitions() (i.e. write). I'd value your 
opinion on the approach.

> Improve Metastore performance
> -----------------------------
>
>                 Key: HIVE-7195
>                 URL: https://issues.apache.org/jira/browse/HIVE-7195
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Brock Noland
>            Priority: Critical
>
> Even with direct SQL, which significantly improves MS performance, some 
> operations take a considerable amount of time when there are many partitions 
> on a table. Specifically, I believe the issues are:
> * When a client gets all partitions, we do not send it an iterator; we 
> create a collection of all the data and then pass the object over the network 
> in total.
> * Operations which require looking up data on the NN can still be slow, since 
> there is no cache of information and the lookups are done serially.
> * Perhaps a tangent, but our client timeout is quite dumb. The client will 
> time out and the server has no idea the client is gone. We should use 
> deadlines, i.e. pass the timeout to the server so it can calculate that the 
> client has expired.



