[
https://issues.apache.org/jira/browse/SQOOP-1744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231912#comment-14231912
]
Vinoth Chandar commented on SQOOP-1744:
---------------------------------------
[~rdblue]
>> Can you put any bounds on what records might change?
We have our own usage patterns, but I don't think we can expect only records
from the last 5 minutes to change, even for typical OLTP workloads, right
(e.g. an Uber table, a profile table, etc.)?
>> Actually, we can select a subset of the records in HBase and copy them to
>> Parquet
Not sure I explained myself clearly... let me take another shot.
Once we do a full fetch, we could do something like the below for the
subsequent incremental fetch.
(Assume we did a "select * from users;" and produced a number of Parquet files
that contain records from a User table, with rows organized by the table's
primary key, userid.)
1) Obtain all rows that changed since the last run.
2) Write those rows into HBase to merge them.
3) Then pull them out again and rewrite the affected Parquet files.
(A rough sketch of steps 1 and 2 follows below.)
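To make the flow concrete, here is a minimal sketch of steps 1 and 2, assuming
a hypothetical users table with a last_modified column, a made-up JDBC URL, and
an HBase table named "users" with a single column family "d". None of these
names come from Sqoop; this only shows the shape of the flow.
{code:java}
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class IncrementalFetch {
  public static void main(String[] args) throws Exception {
    // Time of the last successful run, e.g. "2014-12-01 00:00:00".
    Timestamp lastRun = Timestamp.valueOf(args[0]);

    Configuration conf = HBaseConfiguration.create();
    try (java.sql.Connection jdbc =
             DriverManager.getConnection("jdbc:mysql://dbhost/app");
         org.apache.hadoop.hbase.client.Connection hbase =
             ConnectionFactory.createConnection(conf);
         Table users = hbase.getTable(TableName.valueOf("users"))) {

      // Step 1: obtain all rows that changed since the last run.
      PreparedStatement ps = jdbc.prepareStatement(
          "SELECT userid, name, email FROM users WHERE last_modified > ?");
      ps.setTimestamp(1, lastRun);
      ResultSet rs = ps.executeQuery();

      // Step 2: write those rows into HBase to merge. A put with an existing
      // row key simply becomes the latest version of that row.
      while (rs.next()) {
        Put put = new Put(Bytes.toBytes(rs.getLong("userid")));
        put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("name"),
            Bytes.toBytes(rs.getString("name")));
        put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("email"),
            Bytes.toBytes(rs.getString("email")));
        users.put(put);
      }
      // Step 3 (not shown): the changed rows must still be scanned back out
      // and the affected Parquet files rewritten.
    }
  }
}
{code}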
But step 2 in this flow does not buy us anything, right? We still need to do
the work of identifying the affected Parquet files and overwriting only those.
That's why I was saying that only if you convert the whole dataset from HFiles
to Parquet do you get an out-of-the-box solution.
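To illustrate the work that step 2 does not remove, here is a hypothetical
sketch of mapping changed primary keys back to the Parquet files that contain
them. The per-file key index is an assumption for illustration; in practice it
could be built from the min/max userid statistics in each file's footer.
{code:java}
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class AffectedFiles {

  // Hypothetical index: first userid covered by each Parquet file -> file
  // path. Files are assumed sorted by userid and non-overlapping.
  static final NavigableMap<Long, String> fileByStartKey = new TreeMap<>();

  // Given the keys that changed since the last run, return the Parquet files
  // that must be rewritten. This lookup is needed whether or not the changed
  // rows took a detour through HBase first.
  static Set<String> affectedFiles(List<Long> changedKeys) {
    Set<String> files = new TreeSet<>();
    for (long key : changedKeys) {
      // floorEntry finds the file whose key range contains this userid.
      Map.Entry<Long, String> e = fileByStartKey.floorEntry(key);
      if (e != null) {
        files.add(e.getValue());
      }
    }
    return files;
  }
}
{code}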
Maybe I am missing something?
> TO-side: Write data to HBase
> ----------------------------
>
> Key: SQOOP-1744
> URL: https://issues.apache.org/jira/browse/SQOOP-1744
> Project: Sqoop
> Issue Type: Sub-task
> Components: connectors
> Reporter: Qian Xu
> Assignee: Qian Xu
> Fix For: 1.99.5
>
>
> Propose to write data into HBase. Note that, different from HDFS, HBase is
> append-only. Merge does not work for HBase.