[
https://issues.apache.org/jira/browse/HIVE-12285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14981551#comment-14981551
]
Sushanth Sowmyan edited comment on HIVE-12285 at 10/29/15 11:29 PM:
--------------------------------------------------------------------
Reading your comments brought on a sad smile, [~teabot] :)
As with any old project, half-implemented intentions and legacy compatibility
eventually wind up causing issues. Eugene answered most of the specific yes/no
aspects of your questions, so I'll ramble on a bit and give some historical
context. :)
HCatalog has been a loose collection of API points. The original goal behind
HCatalog was to be a metastore-based storage abstraction layer for all of
Hadoop, not just Hive. To that end, the original architectural goal was for
HCatalog to replace Hive's metastore and StorageHandler subsystems, so that
Hive would sit on top of HCatalog and HCatalog would sit on top of M/R. In
addition, the goal was to add multiple API points to HCatalog so that products
other than Hive, such as Pig or custom MapReduce programs, could share the
same backend.
Now, the way integration with Hive actually went, most of the changes that
happened in the HCat metastore wound up being contributed back into the Hive
metastore, which remained a separate entity. In addition, instead of HCat
replacing Hive's StorageHandler, since there was a lot of disagreement in the
community and cross-tool compatibility was still a primary goal, we wound up
going the route where HCat plugs into and uses Hive's StorageHandler system
(with a bit of enhancement added to it along the way). So now, instead of HCat
being a common core, it sits in parallel with Hive, using the same bits Hive
does, but in a repeat-implementation sort of manner, and its primary users are
not Hive but other tools like Pig, custom M/R jobs, etc.
WebHCat was an attempt at a gateway service that exposed various table
management functions and some minor scheduling services, and was intended to
act as a secure REST endpoint that people could use, and so it does what it
does. However, given everything it tries to do, I think that as of today Oozie
or Hue might be of more use than WebHCat.
The hcat CLI initially mimicked the hive CLI, but was additionally aware of
HCat's StorageDriver system alongside traditional IF/OF systems. With HCat
adopting Hive's StorageHandler concept and deprecating and removing
StorageDrivers, the hcat CLI became a thin dupe of the hive CLI, except for
one thing it did differently: it allowed easy blocking of non-DDL commands.
The hcat CLI would therefore always run only pure Hive code, with no
user-defined classes ever being loaded, which makes it more trustworthy when
run in a secure environment as a privileged user. So despite thoughts of
deprecating and removing the hcat CLI, it lived on for this purpose, with
WebHCat running it behind the scenes for all its DDL actions.
Associated with that, there was an attempt to specify a Java client to talk to
WebHCat; the notion was to determine a "proper" API that would allow a user to
connect to WebHCat without needing any traditional Hive jars on the client
side. This specification is what HCatClient eventually wound up being. Once
the specification was in, the next goal was to come up with a proper client
that implemented HCatClient and talked to WebHCat. However, due to a lack of
user interest in such a thing, the only implementation of HCatClient that ever
existed was HMSHCatClient, which was initially intended only to test the
HCatClient interface, not to be its sole implementation.
The HCatClient interface, however, has proven popular enough that it has
attracted a lot of users, and it has been a useful interface for us to
recommend as well, because it means external tools like Falcon can use an
abstracted interface like HCatClient rather than be tied to interfaces like
IMetaStoreClient, which we would prefer to keep internal to Hive.
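For illustration, here's a minimal sketch of what that abstracted usage looks
like; the wildcard pattern is just an example, and error handling is elided:
{code:java}
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hive.hcatalog.api.HCatClient;

public class ListDatabases {
  public static void main(String[] args) throws Exception {
    // HCatClient.create() hides the metastore wiring from the caller.
    HCatClient client = HCatClient.create(new Configuration());
    try {
      // List all databases visible through the metastore.
      List<String> databases = client.listDatabaseNamesByPattern("*");
      for (String db : databases) {
        System.out.println(db);
      }
    } finally {
      client.close();
    }
  }
}
{code}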
However, (a) we do not have an implementation of HCatClient other than the HMS
one, and (b) the original goal of jar separation never found enough traction.
About two years back, I suggested deprecating all of the webhcat-java-client
package, with a view to replacing it with a top-level hive-api package that
would contain equivalent APIs intended for public consumption. This met with
some balking from the community, so we have the sporadic spread of packages
and APIs that we currently do.
At this point, I do not think WebHCat itself has very many users, and it could
probably stand to be spun off out of Hive to trim and clean Hive up; and I
still think HCatClient should become a top-level hive-api module and shed the
REST expectations.
> Add locking to HCatClient
> -------------------------
>
> Key: HIVE-12285
> URL: https://issues.apache.org/jira/browse/HIVE-12285
> Project: Hive
> Issue Type: Improvement
> Components: HCatalog
> Affects Versions: 2.0.0
> Reporter: Elliot West
> Assignee: Elliot West
> Labels: concurrency, hcatalog, lock, locking, locks
>
> With the introduction of a concurrency model (HIVE-1293), Hive uses locks to
> coordinate access and updates to both table data and metadata. Within the
> Hive CLI such lock management is seamless. However, Hive provides additional
> APIs that permit interaction with data repositories, namely the HCatalog
> APIs. Currently, operations implemented by this API do not participate in
> Hive's locking scheme. Furthermore, access to the locking mechanisms is not
> exposed by the APIs (as it is by the Metastore Thrift API), and so users are
> not able to explicitly interact with locks either. This has created a less
> than ideal situation where users of the APIs have no choice but to
> manipulate these data repositories outside the control of Hive's lock
> management, potentially resulting in data inconsistencies both for external
> processes using the API and for queries executing within Hive.
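> For context, here is a hedged sketch of what explicit locking looks like
> today when going directly through the Thrift metastore API; the database,
> table, user, and host values are hypothetical:
> {code:java}
> import java.util.Collections;
> import org.apache.hadoop.hive.conf.HiveConf;
> import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
> import org.apache.hadoop.hive.metastore.IMetaStoreClient;
> import org.apache.hadoop.hive.metastore.api.LockComponent;
> import org.apache.hadoop.hive.metastore.api.LockLevel;
> import org.apache.hadoop.hive.metastore.api.LockRequest;
> import org.apache.hadoop.hive.metastore.api.LockResponse;
> import org.apache.hadoop.hive.metastore.api.LockState;
> import org.apache.hadoop.hive.metastore.api.LockType;
>
> public class ExplicitLockExample {
>   public static void main(String[] args) throws Exception {
>     IMetaStoreClient client = new HiveMetaStoreClient(new HiveConf());
>     // Describe the resource to lock: a shared read lock on a table.
>     LockComponent component =
>         new LockComponent(LockType.SHARED_READ, LockLevel.TABLE, "default");
>     component.setTablename("web_logs");
>     // user and hostname are required fields of the Thrift struct.
>     LockRequest request = new LockRequest(
>         Collections.singletonList(component), "etl_user", "client-host");
>     LockResponse response = client.lock(request);
>     try {
>       // A WAITING state would require polling checkLock() until acquired.
>       if (response.getState() == LockState.ACQUIRED) {
>         // ... safely read the table here ...
>       }
>     } finally {
>       client.unlock(response.getLockid());
>     }
>   }
> }
> {code}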
> h3. Scope of work
> This ticket is concerned with sections of the HCatalog API that deal with DDL
> type operations using the metastore, not with those whose purpose is to
> read/write table data. A separate issue already exists for adding locking to
> HCat readers and writers (HIVE-6207).
> h3. Proposed work
> The following work items would serve as a minimum deliverable that allows
> API users to effectively work with locks:
> * Comprehensively document on the wiki the locks required for various Hive
> operations. At a minimum this should cover all operations exposed by
> {{HCatClient}}. The [Locking design
> document|https://cwiki.apache.org/confluence/display/Hive/Locking] can be
> used as a starting point or perhaps updated.
> * Implement methods and types in the {{HCatClient}} API that allow users to
> manipulate Hive locks (a sketch follows this list). For the most part I'd
> expect these to delegate to the metastore API implementations:
> ** {{org.apache.hadoop.hive.metastore.IMetaStoreClient.lock(LockRequest)}}
> ** {{org.apache.hadoop.hive.metastore.IMetaStoreClient.checkLock(long)}}
> ** {{org.apache.hadoop.hive.metastore.IMetaStoreClient.unlock(long)}}
> ** -{{org.apache.hadoop.hive.metastore.IMetaStoreClient.showLocks()}}-
> ** {{org.apache.hadoop.hive.metastore.IMetaStoreClient.heartbeat(long, long)}}
> ** {{org.apache.hadoop.hive.metastore.api.LockComponent}}
> ** {{org.apache.hadoop.hive.metastore.api.LockRequest}}
> ** {{org.apache.hadoop.hive.metastore.api.LockResponse}}
> ** {{org.apache.hadoop.hive.metastore.api.LockLevel}}
> ** {{org.apache.hadoop.hive.metastore.api.LockType}}
> ** {{org.apache.hadoop.hive.metastore.api.LockState}}
> ** -{{org.apache.hadoop.hive.metastore.api.ShowLocksResponse}}-
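> For illustration only, a minimal sketch of that delegation, assuming the new
> {{HCatClient}} methods simply mirror the {{IMetaStoreClient}} signatures;
> the class and field names here are hypothetical:
> {code:java}
> import org.apache.hadoop.hive.metastore.IMetaStoreClient;
> import org.apache.hadoop.hive.metastore.api.LockRequest;
> import org.apache.hadoop.hive.metastore.api.LockResponse;
> import org.apache.thrift.TException;
>
> /** Hypothetical lock-aware extension of the HMS-backed HCatClient impl. */
> public abstract class LockAwareHCatClient {
>   // Assumed to be the same metastore client the HMS impl already holds.
>   protected IMetaStoreClient hmsClient;
>
>   /** Acquire the requested locks by delegating to the metastore. */
>   public LockResponse lock(LockRequest request) throws TException {
>     return hmsClient.lock(request);
>   }
>
>   /** Re-check the state of a previously requested lock. */
>   public LockResponse checkLock(long lockId) throws TException {
>     return hmsClient.checkLock(lockId);
>   }
>
>   /** Release a previously acquired lock. */
>   public void unlock(long lockId) throws TException {
>     hmsClient.unlock(lockId);
>   }
>
>   /** Keep a transaction and/or lock alive. */
>   public void heartbeat(long txnId, long lockId) throws TException {
>     hmsClient.heartbeat(txnId, lockId);
>   }
> }
> {code}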
> h3. Additional proposals
> Explicit lock management should be fairly simple to add to {{HCatClient}};
> however, it puts the onus on the API user to correctly understand and
> implement code that uses locks in an appropriate manner. Failure to do so
> may have undesirable consequences. With a simpler user model, the operations
> exposed on the API would automatically acquire and release the locks that
> they need. This might work well for small numbers of operations, but perhaps
> not for large sequences of invocations. (Do we need to worry about this
> though, as the API methods usually accept batches?) Additionally, tasks such
> as heartbeat management could also be handled implicitly for long-running
> sets of operations. With these concerns in mind it may also be beneficial to
> deliver some of the following:
> * A means to automatically acquire/release appropriate locks for
> {{HCatClient}} operations.
> * A component that maintains a lock heartbeat from the client (see the
> sketch after this list).
> * A strategy for switching between manual/automatic lock management,
> analogous to SQL's {{autocommit}} for transactions.
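> As a hedged sketch of such a heartbeat component, assuming direct access to
> an {{IMetaStoreClient}}; the class name and interval are hypothetical:
> {code:java}
> import java.util.concurrent.Executors;
> import java.util.concurrent.ScheduledExecutorService;
> import java.util.concurrent.TimeUnit;
>
> import org.apache.hadoop.hive.metastore.IMetaStoreClient;
> import org.apache.thrift.TException;
>
> /** Periodically renews a lock so the metastore does not expire it. */
> public class LockHeartbeater implements AutoCloseable {
>   private final ScheduledExecutorService scheduler =
>       Executors.newSingleThreadScheduledExecutor();
>
>   public LockHeartbeater(final IMetaStoreClient client, final long txnId,
>       final long lockId, long intervalSeconds) {
>     scheduler.scheduleAtFixedRate(new Runnable() {
>       public void run() {
>         try {
>           // Renew the transaction/lock pair held by this client.
>           client.heartbeat(txnId, lockId);
>         } catch (TException e) {
>           // In real code the lock should be treated as lost at this point.
>           throw new RuntimeException("Heartbeat failed: lock " + lockId, e);
>         }
>       }
>     }, intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
>   }
>
>   public void close() {
>     scheduler.shutdownNow();
>   }
> }
> {code}
> The interval would need to be comfortably shorter than the metastore's
> configured transaction timeout.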
> An API for lock and heartbeat management already exists in the HCatalog
> Mutation API (see:
> {{org.apache.hive.hcatalog.streaming.mutate.client.lock}}). It will likely
> make sense to refactor this code and/or the code that uses it.