[
https://issues.apache.org/jira/browse/HIVE-12285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14981551#comment-14981551
]
Sushanth Sowmyan edited comment on HIVE-12285 at 10/29/15 11:29 PM:
--------------------------------------------------------------------
Reading your comments brought on a sad smile, [~teabot] :)
As with any old project, half-implemented intentions and legacy compatibility
eventually wind up causing issues. Eugene answered most of the specific yes/no
aspects of your questions, so I'll ramble on a bit and give some historical
context. :)
HCatalog has been a loose collection of API points. The original goal behind
HCatalog was to be a metastore-based storage abstraction layer for all of
Hadoop, not just Hive. To that end, the original architectural goal was for
HCatalog to replace Hive's metastore and StorageHandler subsystems, so that
Hive would sit on top of HCatalog and HCatalog would sit on top of M/R. In
addition, the goal was to add multiple API points to HCatalog so that products
other than Hive, such as Pig or custom MapReduce programs, could share the
same backend.
Now, the way integration with Hive actually went, most of the changes that
happened in the HCat metastore wound up being contributed back into the Hive
metastore, which remained a separate entity. In addition, instead of HCat
replacing Hive's StorageHandler, since there was a lot of disagreement in the
community and cross-tool compatibility was still a primary goal, we wound up
going the route where HCat plugs into and uses Hive's StorageHandler system
(with a bit of enhancement added to it along the way). So now, instead of HCat
being a common core, it sits in parallel with Hive, using the same bits Hive
does, but in a repeat-implementation sort of manner, and its primary users are
not Hive but other tools like Pig, custom M/R jobs, etc.
WebHCat was an attempt at a gateway service that exposed various table
management functions and some minor scheduling services, and was intended to
act as a secure REST endpoint that people could use, and so it does what it
does. However, given everything it tries to do, I think that as of today Oozie
or Hue might be of more use than WebHCat.
The hcat CLI initially mimicked the hive CLI, but was additionally aware of
HCat's StorageDriver system alongside traditional IF/OF systems. With HCat
adopting Hive's StorageHandler concept and deprecating and removing
StorageDrivers, the hcat CLI became a thin dupe of the hive CLI, except for
one thing it did differently: it allowed easy blocking of non-DDL commands.
The hcat CLI would therefore always run only pure Hive code, with no
user-defined classes ever being loaded, which makes it more trustworthy when
run in a secure environment as a privileged user. So despite thoughts of
deprecating and removing the hcat CLI, it lived on for this purpose, with
WebHCat running it behind the scenes for all its DDL actions.
Associated with that, there was an attempt to specify a Java client to talk to
WebHCat; the notion was to determine a "proper" API that would allow a user to
connect to WebHCat without needing any traditional Hive jars on the client
side. This specification is what HCatClient eventually wound up being. Once
the specification was in, the next goal was to come up with a proper client
that implemented HCatClient and talked to WebHCat. However, due to a lack of
user interest in such a thing, the only implementation of HCatClient that ever
existed was HMSHCatClient, which was initially intended only to test the
HCatClient interface, not to be its sole implementation.
The HCatClient interface, however, has proven popular enough that it has
attracted a lot of users, and it has been a useful interface for us to
recommend as well, because it means external tools like Falcon can use an
abstracted interface like HCatClient rather than be tied to interfaces like
IMetaStoreClient, which we would prefer to keep internal to Hive.
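For illustration, here's a minimal sketch of what that abstracted usage looks
like; the wildcard pattern is just an example, and error handling is elided:
{code:java}
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hive.hcatalog.api.HCatClient;

public class ListDatabases {
  public static void main(String[] args) throws Exception {
    // HCatClient.create() hides the metastore wiring from the caller.
    HCatClient client = HCatClient.create(new Configuration());
    try {
      // List all databases visible through the metastore.
      List<String> databases = client.listDatabaseNamesByPattern("*");
      for (String db : databases) {
        System.out.println(db);
      }
    } finally {
      client.close();
    }
  }
}
{code}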
However, (a) we do not have an implementation of HCatClient other than the HMS
one, and (b) the original goal of jar separation never found enough traction.
About two years back, I suggested deprecating all of the webhcat-java-client
package, with a view to replacing it with a top-level hive-api package that
would contain equivalent APIs intended for public consumption. This met with
some balking from the community, so we have the sporadic spread of packages
and APIs that we currently do.
At this point, I do not think WebHCat itself has very many users, and it could
probably stand to be spun off out of Hive to trim and clean Hive up; and I
still think HCatClient should become a top-level hive-api module and shed the
REST expectations.
> Add locking to HCatClient
> -------------------------
>
> Key: HIVE-12285
> URL: https://issues.apache.org/jira/browse/HIVE-12285
> Project: Hive
> Issue Type: Improvement
> Components: HCatalog
> Affects Versions: 2.0.0
> Reporter: Elliot West
> Assignee: Elliot West
> Labels: concurrency, hcatalog, lock, locking, locks
>
> With the introduction of a concurrency model (HIVE-1293), Hive uses locks to
> coordinate access and updates to both table data and metadata. Within the
> Hive CLI such lock management is seamless. However, Hive provides additional
> APIs that permit interaction with data repositories, namely the HCatalog
> APIs. Currently, operations implemented by this API do not participate in
> Hive's locking scheme. Furthermore, access to the locking mechanisms is not
> exposed by the APIs (as it is by the Metastore Thrift API), and so users are
> not able to explicitly interact with locks either. This has created a less
> than ideal situation where users of the APIs have no choice but to
> manipulate these data repositories outside the control of Hive's lock
> management, potentially resulting in data inconsistencies both for external
> processes using the API and for queries executing within Hive.
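> For context, here is a hedged sketch of what explicit locking looks like
> today when going directly through the Thrift metastore API; the database,
> table, user, and host values are hypothetical:
> {code:java}
> import java.util.Collections;
> import org.apache.hadoop.hive.conf.HiveConf;
> import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
> import org.apache.hadoop.hive.metastore.IMetaStoreClient;
> import org.apache.hadoop.hive.metastore.api.LockComponent;
> import org.apache.hadoop.hive.metastore.api.LockLevel;
> import org.apache.hadoop.hive.metastore.api.LockRequest;
> import org.apache.hadoop.hive.metastore.api.LockResponse;
> import org.apache.hadoop.hive.metastore.api.LockState;
> import org.apache.hadoop.hive.metastore.api.LockType;
>
> public class ExplicitLockExample {
>   public static void main(String[] args) throws Exception {
>     IMetaStoreClient client = new HiveMetaStoreClient(new HiveConf());
>     // Describe the resource to lock: a shared read lock on a table.
>     LockComponent component =
>         new LockComponent(LockType.SHARED_READ, LockLevel.TABLE, "default");
>     component.setTablename("web_logs");
>     // user and hostname are required fields of the Thrift struct.
>     LockRequest request = new LockRequest(
>         Collections.singletonList(component), "etl_user", "client-host");
>     LockResponse response = client.lock(request);
>     try {
>       // A WAITING state would require polling checkLock() until acquired.
>       if (response.getState() == LockState.ACQUIRED) {
>         // ... safely read the table here ...
>       }
>     } finally {
>       client.unlock(response.getLockid());
>     }
>   }
> }
> {code}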
> h3. Scope of work
> This ticket is concerned with sections of the HCatalog API that deal with DDL
> type operations using the metastore, not with those whose purpose is to
> read/write table data. A separate issue already exists for adding locking to
> HCat readers and writers (HIVE-6207).
> h3. Proposed work
> The following work items would serve as a minimum deliverable that allows
> API users to effectively work with locks:
> * Comprehensively document on the wiki the locks required for various Hive
> operations. At a minimum this should cover all operations exposed by
> {{HCatClient}}. The [Locking design
> document|https://cwiki.apache.org/confluence/display/Hive/Locking] can be
> used as a starting point or perhaps updated.
> * Implement methods and types in the {{HCatClient}} API that allow users to
> manipulate Hive locks (a sketch follows this list). For the most part I'd
> expect these to delegate to the metastore API implementations:
> ** {{org.apache.hadoop.hive.metastore.IMetaStoreClient.lock(LockRequest)}}
> ** {{org.apache.hadoop.hive.metastore.IMetaStoreClient.checkLock(long)}}
> ** {{org.apache.hadoop.hive.metastore.IMetaStoreClient.unlock(long)}}
> ** -{{org.apache.hadoop.hive.metastore.IMetaStoreClient.showLocks()}}-
> ** {{org.apache.hadoop.hive.metastore.IMetaStoreClient.heartbeat(long, long)}}
> ** {{org.apache.hadoop.hive.metastore.api.LockComponent}}
> ** {{org.apache.hadoop.hive.metastore.api.LockRequest}}
> ** {{org.apache.hadoop.hive.metastore.api.LockResponse}}
> ** {{org.apache.hadoop.hive.metastore.api.LockLevel}}
> ** {{org.apache.hadoop.hive.metastore.api.LockType}}
> ** {{org.apache.hadoop.hive.metastore.api.LockState}}
> ** -{{org.apache.hadoop.hive.metastore.api.ShowLocksResponse}}-
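> For illustration only, a minimal sketch of that delegation, assuming the new
> {{HCatClient}} methods simply mirror the {{IMetaStoreClient}} signatures;
> the class and field names here are hypothetical:
> {code:java}
> import org.apache.hadoop.hive.metastore.IMetaStoreClient;
> import org.apache.hadoop.hive.metastore.api.LockRequest;
> import org.apache.hadoop.hive.metastore.api.LockResponse;
> import org.apache.thrift.TException;
>
> /** Hypothetical lock-aware extension of the HMS-backed HCatClient impl. */
> public abstract class LockAwareHCatClient {
>   // Assumed to be the same metastore client the HMS impl already holds.
>   protected IMetaStoreClient hmsClient;
>
>   /** Acquire the requested locks by delegating to the metastore. */
>   public LockResponse lock(LockRequest request) throws TException {
>     return hmsClient.lock(request);
>   }
>
>   /** Re-check the state of a previously requested lock. */
>   public LockResponse checkLock(long lockId) throws TException {
>     return hmsClient.checkLock(lockId);
>   }
>
>   /** Release a previously acquired lock. */
>   public void unlock(long lockId) throws TException {
>     hmsClient.unlock(lockId);
>   }
>
>   /** Keep a transaction and/or lock alive. */
>   public void heartbeat(long txnId, long lockId) throws TException {
>     hmsClient.heartbeat(txnId, lockId);
>   }
> }
> {code}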
> h3. Additional proposals
> Explicit lock management should be fairly simple to add to {{HCatClient}};
> however, it puts the onus on the API user to correctly understand and
> implement code that uses locks in an appropriate manner. Failure to do so
> may have undesirable consequences. With a simpler user model, the operations
> exposed on the API would automatically acquire and release the locks that
> they need. This might work well for small numbers of operations, but perhaps
> not for large sequences of invocations. (Do we need to worry about this
> though, as the API methods usually accept batches?) Additionally, tasks such
> as heartbeat management could also be handled implicitly for long-running
> sets of operations. With these concerns in mind it may also be beneficial to
> deliver some of the following:
> * A means to automatically acquire/release appropriate locks for
> {{HCatClient}} operations.
> * A component that maintains a lock heartbeat from the client (see the
> sketch after this list).
> * A strategy for switching between manual/automatic lock management,
> analogous to SQL's {{autocommit}} for transactions.
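> As a hedged sketch of such a heartbeat component, assuming direct access to
> an {{IMetaStoreClient}}; the class name and interval are hypothetical:
> {code:java}
> import java.util.concurrent.Executors;
> import java.util.concurrent.ScheduledExecutorService;
> import java.util.concurrent.TimeUnit;
>
> import org.apache.hadoop.hive.metastore.IMetaStoreClient;
> import org.apache.thrift.TException;
>
> /** Periodically renews a lock so the metastore does not expire it. */
> public class LockHeartbeater implements AutoCloseable {
>   private final ScheduledExecutorService scheduler =
>       Executors.newSingleThreadScheduledExecutor();
>
>   public LockHeartbeater(final IMetaStoreClient client, final long txnId,
>       final long lockId, long intervalSeconds) {
>     scheduler.scheduleAtFixedRate(new Runnable() {
>       public void run() {
>         try {
>           // Renew the transaction/lock pair held by this client.
>           client.heartbeat(txnId, lockId);
>         } catch (TException e) {
>           // In real code the lock should be treated as lost at this point.
>           throw new RuntimeException("Heartbeat failed: lock " + lockId, e);
>         }
>       }
>     }, intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
>   }
>
>   public void close() {
>     scheduler.shutdownNow();
>   }
> }
> {code}
> The interval would need to be comfortably shorter than the metastore's
> configured transaction timeout.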
> An API for lock and heartbeat management already exists in the HCatalog
> Mutation API (see:
> {{org.apache.hive.hcatalog.streaming.mutate.client.lock}}). It will likely
> make sense to refactor this code and/or the code that uses it.