[
https://issues.apache.org/jira/browse/SQOOP-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16292487#comment-16292487
]
Szabolcs Vasas commented on SQOOP-3267:
---------------------------------------
Hi [~dvoros],
Thank you for reporting this bug, this is a really nice catch!
I have created a couple of test cases to clarify the issue.
1) Updating a column without history in HBase
Let's create the test tables and import the initial values to HBase:
{code:java}
export MYCONN=...
export MYUSER=...
sqoop eval --connect $MYCONN --username $MYUSER --query "drop table hbase_test"
sqoop eval --connect $MYCONN --username $MYUSER --query "create table
hbase_test(id int primary key ,name varchar(30), date_modified DATETIME)"
sqoop eval --connect $MYCONN --username $MYUSER --query "INSERT INTO hbase_test
(id, name, date_modified) VALUES(1, 'name1', CURRENT_TIMESTAMP)"
sqoop eval --connect $MYCONN --username $MYUSER --query "INSERT INTO hbase_test
(id, name, date_modified) VALUES(2, 'name2', CURRENT_TIMESTAMP)"
echo "disable 'hbase_test'" | hbase shell
echo "drop 'hbase_test'" | hbase shell
echo "create 'hbase_test', 'data'" | hbase shell
sqoop import --connect $MYCONN --username $MYUSER --table "hbase_test"
--hbase-table hbase_test --column-family data -m 1
echo "scan 'hbase_test'" | hbase shell
scan 'hbase_test'
ROW COLUMN+CELL
1
column=data:date_modified, timestamp=1513340516549, value=2017-12-15 12:17:26.0
1 column=data:name,
timestamp=1513340516549, value=name1
2
column=data:date_modified, timestamp=1513340516549, value=2017-12-15 12:17:29.0
2 column=data:name,
timestamp=1513340516549, value=name2
2 row(s)
{code}
The output of the HBase scan is the expected, let's update a column to null in
the database and execute an incremental import:
{code:java}
sqoop eval --connect $MYCONN --username $MYUSER --query "update hbase_test set
name=null, date_modified=CURRENT_TIMESTAMP where id = 1"
sqoop import --connect $MYCONN --username $MYUSER --table "hbase_test"
--hbase-table hbase_test --column-family data -m 1 --incremental lastmodified
--check-column date_modified --last-value "2017-12-15 12:17:26.0" --merge-key id
echo "scan 'hbase_test'" | hbase shell
scan 'hbase_test'
ROW COLUMN+CELL
1
column=data:date_modified, timestamp=1513340614595, value=2017-12-15 12:23:00.0
2
column=data:date_modified, timestamp=1513340614595, value=2017-12-15 12:17:29.0
2 column=data:name,
timestamp=1513340614595, value=name2
2 row(s)
{code}
The output of the HBase scan is pretty much the behavior expected after
SQOOP-3149, the name column is updated to null, so it "disappeared" from the
output of the scan, if we check that column in HBase using the Java API we will
get null value. So far so good.
2) Updating a column with history in HBase
This test case is really similar to the previous one but first we update the
name column to a non-null value:
{code:java}
export MYCONN=jdbc:mysql://`hostname`/test
export MYUSER=sqoop
sqoop eval --connect $MYCONN --username $MYUSER --query "drop table hbase_test"
sqoop eval --connect $MYCONN --username $MYUSER --query "create table
hbase_test(id int primary key ,name varchar(30), date_modified DATETIME)"
sqoop eval --connect $MYCONN --username $MYUSER --query "INSERT INTO hbase_test
(id, name, date_modified) VALUES(1, 'name1', CURRENT_TIMESTAMP)"
sqoop eval --connect $MYCONN --username $MYUSER --query "INSERT INTO hbase_test
(id, name, date_modified) VALUES(2, 'name2', CURRENT_TIMESTAMP)"
echo "disable 'hbase_test'" | hbase shell
echo "drop 'hbase_test'" | hbase shell
echo "create 'hbase_test', 'data'" | hbase shell
sqoop import --connect $MYCONN --username $MYUSER --table "hbase_test"
--hbase-table hbase_test --column-family data -m 1
echo "scan 'hbase_test'" | hbase shell
scan 'hbase_test'
ROW COLUMN+CELL
1
column=data:date_modified, timestamp=1513335407294, value=2017-12-15 10:49:05.0
1 column=data:name,
timestamp=1513335407294, value=name1
2
column=data:date_modified, timestamp=1513335407294, value=2017-12-15 10:49:09.0
2 column=data:name,
timestamp=1513335407294, value=name2
2 row(s)
sqoop eval --connect $MYCONN --username $MYUSER --query "update hbase_test set
name='name_modified', date_modified=CURRENT_TIMESTAMP where id = 1"
sqoop import --connect $MYCONN --username $MYUSER --table "hbase_test"
--hbase-table hbase_test --column-family data -m 1 --incremental lastmodified
--check-column date_modified --last-value "2017-12-15 10:49:09.0" --merge-key id
echo "scan 'hbase_test'" | hbase shell
scan 'hbase_test'
ROW COLUMN+CELL
1
column=data:date_modified, timestamp=1513335602892, value=2017-12-15 10:58:44.0
1 column=data:name,
timestamp=1513335602892, value=name_modified
2
column=data:date_modified, timestamp=1513335602892, value=2017-12-15 10:49:09.0
2 column=data:name,
timestamp=1513335602892, value=name2
2 row(s)
{code}
Everything is expected so far, the output of the second HBase scan command
shows us the updated name value. Now let's set that column to null and see what
happens in HBase:
{code:java}
sqoop eval --connect $MYCONN --username $MYUSER --query "update hbase_test set
name=null, date_modified=CURRENT_TIMESTAMP where id = 1"
sqoop import --connect $MYCONN --username $MYUSER --table "hbase_test"
--hbase-table hbase_test --column-family data -m 1 --incremental lastmodified
--check-column date_modified --last-value "2017-12-15 10:58:44.0" --merge-key id
echo "scan 'hbase_test'" | hbase shell
scan 'hbase_test'
ROW COLUMN+CELL
1
column=data:date_modified, timestamp=1513335804486, value=2017-12-15 11:01:35.0
1 column=data:name,
timestamp=1513335407294, value=name1
2
column=data:date_modified, timestamp=1513335602892, value=2017-12-15 10:49:09.0
2 column=data:name,
timestamp=1513335602892, value=name2
2 row(s)
Took 0.4017 seconds
{code}
The output of the last scan shows the bug reported in this JIRA. We can see
that the data:date_modified column is updated in row 1, but the value data:name
column is not null but it is "reset" to its previous value.
I think we all agree that the current behavior is not what we want the question
is how we should address this bug.
I am not a big fan of deleting the full history, since some users might rely on
it, if it is possible we should put a null value to the HBase table as well.
Regards,
Szabolcs
> Incremental import to HBase deletes only last version of column
> ---------------------------------------------------------------
>
> Key: SQOOP-3267
> URL: https://issues.apache.org/jira/browse/SQOOP-3267
> Project: Sqoop
> Issue Type: Bug
> Components: hbase-integration
> Affects Versions: 1.4.7
> Reporter: Daniel Voros
> Assignee: Daniel Voros
> Attachments: SQOOP-3267.1.patch
>
>
> Deletes are supported since SQOOP-3149, but we're only deleting the last
> version of a column when the corresponding cell was set to NULL in the source
> table.
> This can lead to unexpected and misleading results if the row has been
> transferred multiple times, which can easily happen if it's being modified on
> the source side.
> Also SQOOP-3149 is using a new Put command for every column instead of a
> single Put per row as before. This could probably lead to a performance drop
> for wide tables (for which HBase is otherwise usually recommended).
> [~jilani], [~anna.szonyi] could you please comment on what you think would be
> the expected behavior here?
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)