[ https://issues.apache.org/jira/browse/HIVE-10165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Elliot West updated HIVE-10165:
-------------------------------
    Description: 
h3. Overview
I'd like to extend the 
[hive-hcatalog-streaming|https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest]
 API so that it also supports the writing of record updates and deletes in 
addition to the already supported inserts.

h3. Motivation
We have many Hadoop processes outside of Hive that merge changed facts into 
existing datasets. Traditionally we achieve this by reading in a ground-truth 
dataset and a modified dataset, grouping by a key, sorting by a sequence, and 
then applying a function to determine inserted, updated, and deleted rows (a 
sketch of this classification step follows the list below). However, in our 
current scheme we must rewrite all partitions that may potentially contain 
changes. In practice the number of mutated records is very small compared with 
the number of records contained in a partition. This approach results in a 
number of operational issues:
* An excessive amount of write activity is required for small data changes.
* Downstream applications cannot robustly read these datasets while they are 
being updated.
* Due to the scale of the updates (hundreds of partitions), the scope for 
contention is high.
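
For illustration, a minimal sketch of the classification function used in such 
a merge process might look like the following (all names here are hypothetical 
and are not taken from our actual jobs):

{code:java}
// Hypothetical sketch of the merge classification described above: after
// grouping by key and sorting by sequence, the latest ground-truth and
// modified versions of a record are compared to classify the change.
enum ChangeType { INSERT, UPDATE, DELETE, UNCHANGED }

class ChangeClassifier {
  // 'existing' is the ground-truth record for a key (null if absent);
  // 'incoming' is the latest modified record for that key (null if absent).
  ChangeType classify(Object existing, Object incoming) {
    if (existing == null && incoming != null) return ChangeType.INSERT;
    if (existing != null && incoming == null) return ChangeType.DELETE;
    if (existing != null && !existing.equals(incoming)) return ChangeType.UPDATE;
    return ChangeType.UNCHANGED;
  }
}
{code}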

I believe we can address this problem by instead writing only the changed 
records to a Hive transactional table. This should drastically reduce the 
amount of data that we need to write and also provide a means of managing 
concurrent access to the data. Our existing merge processes can read and retain 
each record's {{ROW_ID}}/{{RecordIdentifier}} and pass this through to an 
updated form of the hive-hcatalog-streaming API, which will then have the data 
required to perform an update or delete in a transactional manner.
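
As a rough sketch of the intended interaction, assuming a hypothetical 
{{MutationWriter}} type that merely stands in for the extended API (it is not 
part of the current hive-hcatalog-streaming API):

{code:java}
// Hypothetical sketch: a merge process retains each record's
// RecordIdentifier and hands it, together with the mutated row, to an
// update/delete-capable writer. 'MutationWriter' is invented for
// illustration only.
import org.apache.hadoop.hive.ql.io.RecordIdentifier;

public class MergeSketch {

  interface MutationWriter {
    void update(RecordIdentifier rowId, Object updatedRow) throws Exception;
    void delete(RecordIdentifier rowId) throws Exception;
  }

  static void applyChange(MutationWriter writer, RecordIdentifier rowId,
                          Object mutatedRow, boolean deleted) throws Exception {
    if (deleted) {
      writer.delete(rowId);              // remove the existing record
    } else {
      writer.update(rowId, mutatedRow);  // rewrite the existing record
    }
  }
}
{code}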

h3. Benefits
* Enables the creation of large-scale dataset merge processes.
* Opens up Hive's transactional functionality in an accessible manner to 
processes that operate outside of Hive.

h3. Implementation
We've patched the API to expose the underlying {{OrcRecordUpdater}} and to 
allow extension of the {{AbstractRecordWriter}} by third parties outside of the 
package. We've also updated the user-facing interfaces to provide update and 
delete functionality. I've provided the modifications as three incremental 
patches. Generally speaking, each patch makes the API less backwards compatible 
but more consistent with respect to offering updates and deletes as well as 
writes (inserts). Ideally I hope that all three patches have merit, but only 
the first patch is absolutely necessary to enable the features we need on the 
API, and it does so in a backwards-compatible way. I'll summarise the contents 
of each patch:

h4. [^HIVE-10165.0.patch] - Required
This patch contains what we consider to be the minimum set of changes 
required to allow users to create {{RecordWriter}} subclasses that can insert, 
update, and delete records. These changes also maintain backwards 
compatibility, at the expense of confusing the API a little. Note that the row 
representation has been changed from {{byte[]}} to {{Object}}. Within our data 
processing jobs our records are often available in a strongly typed and decoded 
form, such as a POJO or a Tuple object. Therefore it seems to make sense that 
we are able to pass these through to the {{OrcRecordUpdater}} without having to 
go through a {{byte[]}} encoding step. This of course still allows users to use 
{{byte[]}} if they wish.
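
For illustration, a minimal sketch of what the widened row type permits (the 
{{OperationWriter}} type and its method below are invented and are not taken 
from the patch):

{code:java}
// Hypothetical sketch: with the row representation widened from byte[] to
// Object, an already-decoded record can be handed straight to the writer.
// 'OperationWriter' and 'ingest' are invented for illustration only.
public class ObjectRowExample {

  static class Customer {                  // a strongly typed, decoded record
    final long id;
    final String name;
    Customer(long id, String name) { this.id = id; this.name = name; }
  }

  interface OperationWriter {
    void write(long transactionId, Object row) throws Exception;
  }

  static void ingest(OperationWriter writer, long txnId, Customer customer)
      throws Exception {
    writer.write(txnId, customer);                // pass the POJO directly
    writer.write(txnId, "42,Alice".getBytes());   // byte[] rows still work
  }
}
{code}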

h4. [^HIVE-10165.1.patch] - Nice to have
This patch builds on the changes made in the *required* patch and aims to make 
the API cleaner and more consistent while accommodating updates and deletes. It 
also adds some logic to prevent the user from submitting multiple operation 
types to a single {{TransactionBatch}}, as we found that this creates data 
inconsistencies within the Hive table. This patch breaks backwards 
compatibility.
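
A minimal sketch of the kind of guard this adds, assuming invented names 
throughout (the real patch may structure this differently):

{code:java}
// Hypothetical sketch of the single-operation-type guard described above:
// the first operation fixes the batch's type, and any subsequent operation
// of a different type is rejected before it can reach the table.
enum OperationType { INSERT, UPDATE, DELETE }

class TransactionBatchGuard {
  private OperationType current;   // locked in by the first operation

  void checkOperation(OperationType requested) {
    if (current == null) {
      current = requested;
    } else if (current != requested) {
      throw new IllegalStateException("TransactionBatch already contains "
          + current + " operations; cannot also apply " + requested);
    }
  }
}
{code}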

h4. [^HIVE-10165.2.patch] - Nomenclature
This final patch simply renames some of the existing types to more accurately 
convey their increased responsibilities. The API is no longer writing just new 
records; it is now also responsible for writing operations that are applied to 
existing records. This patch breaks backwards compatibility.

h3. Example
I've attached a simple example of typical API usage. This is not a patch and 
is intended as an illustration only: [^ReflectiveOperationWriter.java]

h3. Known issues
I have not yet provided any unit tests for the extended functionality. I fully 
expect that these will be required and will work on them if these patches are 
judged to have merit.

*Note: Attachments to follow.*

> Improve hive-hcatalog-streaming extensibility and support updates and deletes.
> ------------------------------------------------------------------------------
>
>                 Key: HIVE-10165
>                 URL: https://issues.apache.org/jira/browse/HIVE-10165
>             Project: Hive
>          Issue Type: Improvement
>          Components: HCatalog
>            Reporter: Elliot West
>            Assignee: Alan Gates
>              Labels: streaming_api
>             Fix For: 1.2.0
>
>         Attachments: HIVE-10165.0.patch, HIVE-10165.1.patch, 
> HIVE-10165.2.patch, ReflectiveOperationWriter.java
>
>


