[ 
https://issues.apache.org/jira/browse/SQOOP-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16337301#comment-16337301
 ] 

Szabolcs Vasas commented on SQOOP-3267:
---------------------------------------

*re: "or every column, but I've already addressed this issue in 
[^SQOOP-3267.1.patch] (see first comment on this issue)."*

Sorry, I missed this. It is a nice improvement!

Even if we ignore the slight performance overhead, the problem with the default 
null string could be that the output HBase table of a regular import would be 
different (we would get defined columns with empty strings instead of undefined 
columns), and this kind of behavior change is unexpected from a bug JIRA. It 
would solve this particular bug but could lead to confusion in the future.

I am not sure I understand how you would split up the work between the two 
JIRAs, and I wasn't really clear in my previous comment, so let me summarize 
what I suggest:
 * This JIRA would add the --hbase-null-incremental-mode option with two 
possible values: ignore (default) and delete. This would basically restore the 
behavior we had prior to SQOOP-3149 while keeping the intended functionality 
it introduced. It would be a fairly localized change, and it would not affect 
users who do not do incremental imports at all.
 * Another JIRA would introduce a new possible value (null-string) for 
--hbase-null-incremental-mode and a new option, --hbase-null-string, to specify 
the string to write. I think this change should be classified as a new feature. 
--hbase-null-string could be usable with regular imports too, but if the user 
does not specify it we should stick to the current behavior and not insert any 
null string into the columns which have nulls in the RDBMS.
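
To make the first bullet a bit more concrete, here is a rough sketch in plain 
Java of how the two proposed modes could differ for a single source row. It 
does not use the HBase client or any Sqoop class: NullIncrementalMode, 
planMutations and the string form of the "mutations" are hypothetical names 
made up for illustration, not code from [^SQOOP-3267.1.patch]. It also assumes 
one Put per row, and it deliberately leaves open how many versions a delete 
would remove.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NullModeSketch {

    // Hypothetical mirror of the proposed --hbase-null-incremental-mode values.
    enum NullIncrementalMode { IGNORE, DELETE }

    // Plans the mutations for one source row. Non-null columns are collected
    // into a single PUT; null columns are skipped (IGNORE) or turned into a
    // DELETE entry (DELETE). The result is a readable description only.
    static List<String> planMutations(Map<String, String> row,
                                      NullIncrementalMode mode) {
        List<String> putCells = new ArrayList<>();
        List<String> plan = new ArrayList<>();
        for (Map.Entry<String, String> column : row.entrySet()) {
            if (column.getValue() != null) {
                putCells.add(column.getKey() + "=" + column.getValue());
            } else if (mode == NullIncrementalMode.DELETE) {
                plan.add("DELETE " + column.getKey());
            }
            // IGNORE: a null column produces no mutation at all.
        }
        if (!putCells.isEmpty()) {
            // One PUT for the whole row, placed before any deletes.
            plan.add(0, "PUT " + String.join(",", putCells));
        }
        return plan;
    }

    public static void main(String[] args) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("name", "alice");
        row.put("email", null); // column set to NULL in the source table

        System.out.println(planMutations(row, NullIncrementalMode.IGNORE));
        System.out.println(planMutations(row, NullIncrementalMode.DELETE));
    }
}
```

With ignore, the null column leaves the existing HBase cell untouched; with 
delete, it is removed. The null-string value from the second bullet would be 
a third branch that adds the configured string to the PUT instead.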

 

> Incremental import to HBase deletes only last version of column
> ---------------------------------------------------------------
>
>                 Key: SQOOP-3267
>                 URL: https://issues.apache.org/jira/browse/SQOOP-3267
>             Project: Sqoop
>          Issue Type: Bug
>          Components: hbase-integration
>    Affects Versions: 1.4.7
>            Reporter: Daniel Voros
>            Assignee: Daniel Voros
>            Priority: Major
>         Attachments: SQOOP-3267.1.patch
>
>
> Deletes are supported since SQOOP-3149, but we're only deleting the last 
> version of a column when the corresponding cell was set to NULL in the source 
> table.
> This can lead to unexpected and misleading results if the row has been 
> transferred multiple times, which can easily happen if it's being modified on 
> the source side.
> Also SQOOP-3149 is using a new Put command for every column instead of a 
> single Put per row as before. This could probably lead to a performance drop 
> for wide tables (for which HBase is otherwise usually recommended).
> [~jilani], [~anna.szonyi] could you please comment on what you think would be 
> the expected behavior here?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
