[
https://issues.apache.org/jira/browse/HIVE-10165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596104#comment-14596104
]
Elliot West commented on HIVE-10165:
------------------------------------
I've submitted a patch ([^HIVE-10165.9.patch]) that includes javadoc comments
to call out which members are visible/present purely for the purposes of
testing. I also made a start on underlining the serious drawbacks of
inappropriate grouping of records, namely that mutators would have to be
repeatedly opened and closed. However, on investigating this further I believe
that what was previously a recommended grouping is in fact mandatory. It appears that one
cannot reopen a closed ORC delta file with the {{OrcRecordUpdater}}. If I'm
reading the code correctly,
{{org.apache.hadoop.hive.ql.io.orc.WriterImpl.getStream():2103}} does not allow
the file to be reopened for append or even overwritten. Therefore, if we are to avoid keeping a file open for each group, it is imperative that all records in a (partition, bucket) group are processed contiguously.
I have updated the javadoc and package HTML to reflect this and have also
implemented a {{GroupingValidator}} to enforce this constraint.
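For illustration, here is a minimal sketch of the kind of check such a validator performs; the class and method names below are illustrative only and are not taken from the patch.
{code:java}
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/**
 * Sketch of a grouping check: records must arrive in contiguous
 * (partition, bucket) runs, i.e. a group may never be revisited once
 * processing has moved on to another group.
 */
public class GroupingValidatorSketch {

  private final Set<String> closedGroups = new HashSet<>();
  private String currentGroup;

  /** Returns false if a previously closed (partition, bucket) group reappears. */
  public boolean isInSequence(List<String> partitionValues, int bucketId) {
    String group = partitionValues + "/" + bucketId;
    if (group.equals(currentGroup)) {
      return true; // still inside the contiguous run for this group
    }
    if (closedGroups.contains(group)) {
      return false; // group resurfaced after we moved on: not contiguous
    }
    if (currentGroup != null) {
      closedGroups.add(currentGroup); // the previous group is now closed
    }
    currentGroup = group;
    return true;
  }
}
{code}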
> Improve hive-hcatalog-streaming extensibility and support updates and deletes.
> ------------------------------------------------------------------------------
>
> Key: HIVE-10165
> URL: https://issues.apache.org/jira/browse/HIVE-10165
> Project: Hive
> Issue Type: Improvement
> Components: HCatalog
> Affects Versions: 1.2.0
> Reporter: Elliot West
> Assignee: Elliot West
> Labels: streaming_api
> Attachments: HIVE-10165.0.patch, HIVE-10165.4.patch,
> HIVE-10165.5.patch, HIVE-10165.6.patch, HIVE-10165.7.patch,
> HIVE-10165.9.patch, mutate-system-overview.png
>
>
> h3. Overview
> I'd like to extend the
> [hive-hcatalog-streaming|https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest]
> API so that it also supports the writing of record updates and deletes in
> addition to the already supported inserts.
> h3. Motivation
> We have many Hadoop processes outside of Hive that merge changed facts into
> existing datasets. Traditionally we achieve this by: reading in a
> ground-truth dataset and a modified dataset, grouping by a key, sorting by a
> sequence and then applying a function to determine inserted, updated, and
> deleted rows. However, in our current scheme we must rewrite all partitions
> that may potentially contain changes. In practice the number of mutated
> records is very small when compared with the records contained in a
> partition. This approach results in a number of operational issues:
> * Excessive amount of write activity required for small data changes.
> * Downstream applications cannot robustly read these datasets while they are
> being updated.
> * Due to the scale of the updates (hundreds of partitions), the scope for
> contention is high.
> I believe we can address this problem by instead writing only the changed
> records to a Hive transactional table. This should drastically reduce the
> amount of data that we need to write and also provide a means for managing
> concurrent access to the data. Our existing merge processes can read and
> retain each record's {{ROW_ID}}/{{RecordIdentifier}} and pass this through to
> an updated form of the hive-hcatalog-streaming API which will then have the
> required data to perform an update or insert in a transactional manner.
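> To make the intended flow concrete, here is a rough sketch of how a merge
> process might carry the retained {{RecordIdentifier}} through to such an API.
> The {{Mutator}} interface and its methods are hypothetical placeholders for
> the extended streaming API, not its actual surface; {{RecordIdentifier}} is
> the existing Hive class that carries the {{ROW_ID}}.
> {code:java}
> import org.apache.hadoop.hive.ql.io.RecordIdentifier;
>
> /** Hypothetical stand-in for the extended streaming API surface. */
> interface Mutator {
>   void insert(Object record) throws Exception;            // new row, no ROW_ID yet
>   void update(RecordIdentifier rowId, Object record) throws Exception;
>   void delete(RecordIdentifier rowId) throws Exception;
> }
>
> /** Applies one reconciled change, passing the retained ROW_ID through. */
> class MergeStep {
>   static void apply(Mutator mutator, String changeType,
>                     RecordIdentifier rowId, Object record) throws Exception {
>     switch (changeType) {
>       case "INSERT": mutator.insert(record);        break;
>       case "UPDATE": mutator.update(rowId, record); break;
>       case "DELETE": mutator.delete(rowId);         break;
>       default: throw new IllegalArgumentException("Unknown change type: " + changeType);
>     }
>   }
> }
> {code}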
> h3. Benefits
> * Enables the creation of large-scale dataset merge processes
> * Opens up Hive transactional functionality in an accessible manner to
> processes that operate outside of Hive.