[ 
https://issues.apache.org/jira/browse/SQOOP-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16335941#comment-16335941
 ] 

Szabolcs Vasas commented on SQOOP-3267:
---------------------------------------

Hi [~dvoros],

Option B seems like a good direction to me. I agree that ideally the target 
HBase table should reflect that a column is set to null in the source RDBMS, and 
I would not make this dependent on incremental mode, since in theory only 
"lastmodified" mode should change already existing rows in the target table.
However, after thinking about this more thoroughly, a performance-related 
question has arisen. Let's say the users want to import new rows (so it would 
be a regular import, not an incremental one) from a wide table where most of 
the columns are null and only a couple of values are defined. In this case the 
current implementation would use only a few Put commands, but the suggested 
implementation would need significantly more Put commands just to add the null 
strings to the HBase table. I think this is something the users would not 
prefer in this case. On the other hand, you are right that it would be great if 
we could keep the representation of nulls in the HBase table consistent, so 
that it does not differ between regular import and incremental import...
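To make the performance concern concrete, here is a rough Python sketch (not 
Sqoop code) that counts how many cells would be written for one sparse row, 
with and without a null-string placeholder. The function name cells_written, 
the column names and the "\N" placeholder are all made up for illustration. 
Since the issue description notes that SQOOP-3149 uses one Put per column, the 
cell count approximates the number of Put commands.

```python
def cells_written(row, null_string=None):
    """Return the {column: value} cells that would be written for one row.

    row: dict of column name -> value, where None stands for SQL NULL.
    null_string: if given, NULL columns are written with this placeholder;
                 if None, NULL columns are skipped and no cell is written.
    """
    cells = {}
    for column, value in row.items():
        if value is not None:
            cells[column] = str(value)
        elif null_string is not None:
            cells[column] = null_string
    return cells


# Hypothetical wide row: 100 columns, only 2 of them non-NULL.
wide_row = {f"c{i}": None for i in range(100)}
wide_row["c0"] = 1
wide_row["c7"] = "x"

skipped = cells_written(wide_row)                    # NULLs skipped
filled = cells_written(wide_row, null_string="\\N")  # NULLs written as "\N"
print(len(skipped), len(filled))  # prints: 2 100
```

For this made-up row the placeholder variant writes 50 times as many cells, 
which is the overhead a user importing a sparse table would probably not want 
by default.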
Considering the above, I suggest the following solution:
 * Introduce an --hbase-null-incremental-mode (or similarly named) option which 
would enable the users to specify what Sqoop should do with the null values in 
the source RDBMS table. The options could be:
 ** ignore (default) - This would basically be the behavior before SQOOP-3149
 ** delete - This would be similar to the behavior introduced in SQOOP-3149, but 
we would delete the whole history (all versions of the column)
 ** null-string - Instead of null, Sqoop would put the null string specified 
in the new --hbase-null-string option
 * Introduce a new option called --hbase-null-string which could be used to 
specify which null string Sqoop should put into the HBase table instead of 
null. This could be used for regular imports too, but if it is not specified 
Sqoop should not use null strings, to avoid the above-mentioned potential 
performance problem.
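
The three proposed modes could be sketched roughly as follows. This is a 
Python illustration of the intended semantics only, not Sqoop or HBase code: 
plan_mutations and the tuple-based mutation format are hypothetical, and 
raising an error when null-string mode is chosen without a null string is my 
assumption, since the proposal above does not say what should happen then.

```python
MODES = ("ignore", "delete", "null-string")


def plan_mutations(row, mode="ignore", null_string=None):
    """Plan the HBase mutations for one source row (illustrative only).

    row: dict of column name -> value, where None stands for SQL NULL.
    mode: proposed value of --hbase-null-incremental-mode.
    null_string: proposed value of --hbase-null-string.
    """
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    if mode == "null-string" and null_string is None:
        # Assumption: this combination is rejected rather than ignored.
        raise ValueError("null-string mode requires a null string")

    mutations = []
    for column, value in row.items():
        if value is not None:
            mutations.append(("put", column, str(value)))
        elif mode == "delete":
            # Remove every stored version of the column, not just the latest.
            mutations.append(("delete-all-versions", column))
        elif mode == "null-string":
            mutations.append(("put", column, null_string))
        # mode == "ignore": NULL columns produce no mutation at all.
    return mutations


row = {"name": "alice", "email": None}
print(plan_mutations(row))
print(plan_mutations(row, mode="delete"))
print(plan_mutations(row, mode="null-string", null_string="NULL"))
```

With the default "ignore" mode only the non-null column yields a mutation, so 
existing imports would keep their current behavior unless the user opts in.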

The benefit of this solution would be that the users would have more control 
over how the null values are handled, and it would not change the existing 
behavior unexpectedly (I might be paranoid, but I feel introducing the new 
--hbase-null-string option is safer than overloading the already existing 
--null-string).

Implementing all of this might be overkill for addressing this bug, so we 
could also move the null-string handling part to another Jira.

 

> Incremental import to HBase deletes only last version of column
> ---------------------------------------------------------------
>
>                 Key: SQOOP-3267
>                 URL: https://issues.apache.org/jira/browse/SQOOP-3267
>             Project: Sqoop
>          Issue Type: Bug
>          Components: hbase-integration
>    Affects Versions: 1.4.7
>            Reporter: Daniel Voros
>            Assignee: Daniel Voros
>            Priority: Major
>         Attachments: SQOOP-3267.1.patch
>
>
> Deletes are supported since SQOOP-3149, but we're only deleting the last 
> version of a column when the corresponding cell was set to NULL in the source 
> table.
> This can lead to unexpected and misleading results if the row has been 
> transferred multiple times, which can easily happen if it's being modified on 
> the source side.
> Also SQOOP-3149 is using a new Put command for every column instead of a 
> single Put per row as before. This could probably lead to a performance drop 
> for wide tables (for which HBase is otherwise usually recommended).
> [~jilani], [~anna.szonyi] could you please comment on what you think would be 
> the expected behavior here?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
