[
https://issues.apache.org/jira/browse/HIVE-10165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14592029#comment-14592029
]
Alan Gates commented on HIVE-10165:
-----------------------------------
bq. I wanted the ability to mock them in the TestMutatorCoordinator test. They
are package private, so this separation doesn't leak into the public API.
If this is undesirable, can you recommend an alternative approach?
That's fine. I think comments to reflect that those arguments are only for
testing purposes would be helpful.
bq. This class relies on the correct grouping of the data (by partition,bucket)
to avoid the problem that you describe. ... Very keen to hear your thoughts on
this.
I am fine with pushing this responsibility to the client. But the following in
the class javadoc is confusing. It starts by saying {{Events must be grouped
by partition, then bucket}} but then later says {{Events are free to target any
bucket and partition, including new partitions if {@link
MutatorDestination#createPartitions()} is set. Internally the coordinator
creates and closes {@link Mutator Mutators} as needed to write to the
appropriate partition and bucket.}} The latter makes it sound like random
order is OK. I think you're trying to say "group by partition, then bucket,
and the MutatorCoordinator will seamlessly handle the transitions between
groups". Is
that right? I think we should be very clear to users that there is an extreme
performance and storage penalty for jumping around in random order.
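To make the grouping contract concrete, here is a minimal client-side sketch. Note that {{MutationEvent}}, its fields, and {{sortForCoordinator}} are invented for illustration only and are not part of the proposed API; the point is that callers must order events by partition, then bucket, before handing them to the coordinator.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a mutation event; the real API types differ.
public class MutationEvent {
    final String partition;   // e.g. "date=2015-06-18"
    final int bucket;
    final long writeSequence; // preserves ordering within a bucket

    MutationEvent(String partition, int bucket, long writeSequence) {
        this.partition = partition;
        this.bucket = bucket;
        this.writeSequence = writeSequence;
    }

    // Events ordered by partition, then bucket, so a coordinator can open
    // one writer per (partition, bucket) group and never reopen a group.
    static final Comparator<MutationEvent> GROUPING_ORDER =
        Comparator.comparing((MutationEvent e) -> e.partition)
                  .thenComparingInt(e -> e.bucket)
                  .thenComparingLong(e -> e.writeSequence);

    static List<MutationEvent> sortForCoordinator(List<MutationEvent> events) {
        List<MutationEvent> sorted = new ArrayList<>(events);
        sorted.sort(GROUPING_ORDER);
        return sorted;
    }
}
```

With this ordering, a coordinator only ever transitions forward between groups, which is what avoids the performance and storage penalty of reopening writers.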
bq. I now wonder whether the work I’m doing in UgiMetaStoreClientFactory is
already available in an existing Hive class as it seems like a common
requirement. Can you advise?
There are a number of places Hive does UGI calls, but I'm not aware of any
where it does them for metastore calls.
At this point the only issues I see remaining before this can be committed are
the two javadoc comments I've pointed out above.
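For context, the merge step described in the issue below (compare a ground-truth snapshot with a modified snapshot, keyed by a natural key, and emit only the changed records) can be sketched roughly as follows. {{ChangeDetector}} and {{detectChanges}} are hypothetical names, and real datasets would be keyed by {{RecordIdentifier}} rather than strings.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: classify rows as INSERT, UPDATE, or DELETE by
// diffing a modified snapshot against a ground-truth snapshot.
public class ChangeDetector {
    public enum ChangeType { INSERT, UPDATE, DELETE }

    static Map<String, ChangeType> detectChanges(
            Map<String, String> groundTruth, Map<String, String> modified) {
        Map<String, ChangeType> changes = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : modified.entrySet()) {
            String old = groundTruth.get(e.getKey());
            if (old == null) {
                changes.put(e.getKey(), ChangeType.INSERT);
            } else if (!old.equals(e.getValue())) {
                changes.put(e.getKey(), ChangeType.UPDATE);
            } // unchanged rows are not rewritten at all
        }
        for (String key : groundTruth.keySet()) {
            if (!modified.containsKey(key)) {
                changes.put(key, ChangeType.DELETE);
            }
        }
        return changes;
    }
}
```

Only the returned changes would be written to the transactional table, which is what makes the write volume proportional to the number of mutated records rather than to partition size.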
> Improve hive-hcatalog-streaming extensibility and support updates and deletes.
> ------------------------------------------------------------------------------
>
> Key: HIVE-10165
> URL: https://issues.apache.org/jira/browse/HIVE-10165
> Project: Hive
> Issue Type: Improvement
> Components: HCatalog
> Affects Versions: 1.2.0
> Reporter: Elliot West
> Assignee: Elliot West
> Labels: streaming_api
> Attachments: HIVE-10165.0.patch, HIVE-10165.4.patch,
> HIVE-10165.5.patch, HIVE-10165.6.patch, HIVE-10165.7.patch,
> mutate-system-overview.png
>
>
> h3. Overview
> I'd like to extend the
> [hive-hcatalog-streaming|https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest]
> API so that it also supports the writing of record updates and deletes in
> addition to the already supported inserts.
> h3. Motivation
> We have many Hadoop processes outside of Hive that merge changed facts into
> existing datasets. Traditionally we achieve this by reading in a
> ground-truth dataset and a modified dataset, grouping by a key, sorting by a
> sequence, and then applying a function to determine inserted, updated, and
> deleted rows. However, in our current scheme we must rewrite all partitions
> that may potentially contain changes. In practice the number of mutated
> records is very small when compared with the records contained in a
> partition. This approach results in a number of operational issues:
> * Excessive amount of write activity required for small data changes.
> * Downstream applications cannot robustly read these datasets while they are
> being updated.
> * Due to the scale of the updates (hundreds of partitions), the scope for
> contention is high.
> I believe we can address this problem by instead writing only the changed
> records to a Hive transactional table. This should drastically reduce the
> amount of data that we need to write and also provide a means for managing
> concurrent access to the data. Our existing merge processes can read and
> retain each record's {{ROW_ID}}/{{RecordIdentifier}} and pass this through to
> an updated form of the hive-hcatalog-streaming API which will then have the
> required data to perform an update or insert in a transactional manner.
> h3. Benefits
> * Enables the creation of large-scale dataset merge processes.
> * Opens up Hive transactional functionality in an accessible manner to
> processes that operate outside of Hive.