[ https://issues.apache.org/jira/browse/SQOOP-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16335941#comment-16335941 ]
Szabolcs Vasas commented on SQOOP-3267:
---------------------------------------

Hi [~dvoros],

Option B seems like a good direction to me. I agree that ideally the target HBase table should reflect that a column has been set to null in the source RDBMS, and I would not make this dependent on incremental mode, since in theory only "lastmodified" mode should change already existing rows in the target table.

However, after thinking about this more thoroughly, a performance-related question has arisen. Let's say the users want to import new rows (so it would be a regular import, not an incremental one) from a wide table where most of the columns are null and only a couple of values are defined. In this case the current implementation would use only a few Put commands, but the suggested implementation would need significantly more Put commands just to add the null strings to the HBase table. I think this is something the users would not prefer in this case.

On the other hand, you are right that it would be great if we could keep the representation of nulls in the HBase table consistent, so that it does not differ between regular import and incremental import...

Considering the above, I suggest the following solution:
* Introduce an --hbase-null-incremental-mode (or similarly named) option which would enable the users to specify what Sqoop should do with the null values in the source RDBMS table. The options could be:
** ignore (default) - This would basically be the behavior before SQOOP-3149
** delete - This would be similar to the behavior introduced in SQOOP-3149, but we would delete the whole history
** null-string - Sqoop would put the null string value specified in the new --hbase-null-string option instead of null
* Introduce a new option called --hbase-null-string which could be used to specify which null string Sqoop should put into the HBase table instead of null. This could be used for regular imports too, but if it is not specified Sqoop should not use null strings, to avoid the potential performance problem mentioned above.

The benefit of this solution would be that the users would have more control over how the null values are handled, and it would not change the behavior unexpectedly (I might be paranoid, but I feel introducing the new --hbase-null-string is safer than overloading the already existing --null-string).

Implementing this might be overkill for addressing this bug; we could move the null-string handling part to another Jira as well.

> Incremental import to HBase deletes only last version of column
> ---------------------------------------------------------------
>
>                 Key: SQOOP-3267
>                 URL: https://issues.apache.org/jira/browse/SQOOP-3267
>             Project: Sqoop
>          Issue Type: Bug
>          Components: hbase-integration
>    Affects Versions: 1.4.7
>            Reporter: Daniel Voros
>            Assignee: Daniel Voros
>            Priority: Major
>         Attachments: SQOOP-3267.1.patch
>
>
> Deletes are supported since SQOOP-3149, but we're only deleting the last
> version of a column when the corresponding cell was set to NULL in the source
> table.
> This can lead to unexpected and misleading results if the row has been
> transferred multiple times, which can easily happen if it's being modified on
> the source side.
> Also SQOOP-3149 is using a new Put command for every column instead of a
> single Put per row as before. This could probably lead to a performance drop
> for wide tables (for which HBase is otherwise usually recommended).
> [~jilani], [~anna.szonyi] could you please comment on what you think would be
> the expected behavior here?

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
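
As a rough illustration of the three null-handling modes proposed in the comment above, here is a minimal Java sketch that plans the operations for a single source row. Everything in it is hypothetical: the `NullMode` enum, the `plan` method and the string stand-ins for HBase mutations are illustrative names only, not Sqoop or HBase APIs, and the two options are merely proposed in this thread. It also assumes that the null-string mode falls back to ignoring nulls when no null string is given.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NullModeSketch {

    /** Hypothetical values of the proposed --hbase-null-incremental-mode option. */
    enum NullMode { IGNORE, DELETE, NULL_STRING }

    /**
     * Plans the operations for one source row. Plain strings stand in for
     * HBase mutations so that the sketch has no external dependencies.
     *
     * @param row        column name to value; a null value means SQL NULL
     * @param mode       how null cells should be handled
     * @param nullString value of the proposed --hbase-null-string option, or
     *                   null if the option was not specified
     */
    static List<String> plan(Map<String, String> row, NullMode mode, String nullString) {
        List<String> ops = new ArrayList<>();
        for (Map.Entry<String, String> cell : row.entrySet()) {
            String column = cell.getKey();
            String value = cell.getValue();
            if (value != null) {
                // Non-null cells are written in every mode.
                ops.add("PUT " + column + "=" + value);
            } else if (mode == NullMode.DELETE) {
                // Proposed "delete" mode: remove the whole history of the
                // column, not only its latest version.
                ops.add("DELETE_ALL_VERSIONS " + column);
            } else if (mode == NullMode.NULL_STRING && nullString != null) {
                // Proposed "null-string" mode: write the placeholder instead.
                ops.add("PUT " + column + "=" + nullString);
            }
            // IGNORE, or NULL_STRING without a null string (assumption):
            // emit nothing, so wide and mostly-null rows stay cheap.
        }
        return ops;
    }

    public static void main(String[] args) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("cf:name", "alice");
        row.put("cf:email", null);
        for (NullMode mode : NullMode.values()) {
            System.out.println(mode + " -> " + plan(row, mode, "\\N"));
        }
    }
}
```

The sketch lists one operation per cell only to make the difference between the modes visible; a real implementation would presumably group the writes for a row into a single Put, which is the performance concern raised in the issue description.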