[jira] [Updated] (SQOOP-3263) Duplicate rows found when split-by column is of textual type due to different charset difference of sqoop and hadoop

Yulei Yang (JIRA) Sun, 26 Nov 2017 07:44:27 -0800

     [ 
https://issues.apache.org/jira/browse/SQOOP-3263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Yulei Yang updated SQOOP-3263:
------------------------------
    Description: 
This is issue can be found in any kind of RMDBS, because the root cause is not 
on RMDBS. Steps to reproduce this issue:
1. create a mysql table: create table ora_test (id varchar(32) primary key not 
null);
2.  insert *5* rows:
insert into ora_test values ('08125FC4C8FDA064E053C0A8028DA064');
insert into ora_test values ('4FFE68419D3502E2E0537F000001F3E8');
insert into ora_test values ('4FFF9CF5861E003EE0537F0000017FF7');
insert into ora_test values ('56DAC2D0F14901B0E0537F000001D3FA');
insert into ora_test values ('4 ABC');
3. import it to hive with sqoop import -m 32. (m=189 is also ok)。 Then you will 
get *7* rows in hive. Check screenshot-1.png
part-32 is duplicated with part-26.

so I print their split boundary values in unicode and plain text, check 
screenshot-2.png for part-26, screenshot-3.png for part-32.
According to boundary values, we can know that part-26 has no problem while 
part-32 is wrong, because '\u4\ud836' is larger than ‘4F', so part-32 should 
have no records.

So '?' in plain text of part-32 is suspicious, does its unicode is still 
'\ud836' when query on RMDBS？
So I do next test, check screenshot-4.png. Two different unicode characters are 
mapped to a same character in utf-8.
This caused the duplication.

How is happens?
1. split boundary values are unicode
2. when the import MR start to run, it read split boundary values to Text type. 
Text always use utf-8, so some characters are wrong, like above case. 

My solution is convert sqoop generated split boundary values to utf-8 first, 
and resort them.

  was:
This is issue can be found in any kind of RMDBS, because the root cause is not 
on RMDBS. Steps to reproduce this issue:
1. create a mysql table: create table ora_test (id varchar(32) primary key not 
null);
2.  insert *4* rows:
insert into ora_test values ('08125FC4C8FDA064E053C0A8028DA064');
insert into ora_test values ('4FFE68419D3502E2E0537F000001F3E8');
insert into ora_test values ('4FFF9CF5861E003EE0537F0000017FF7');
insert into ora_test values ('56DAC2D0F14901B0E0537F000001D3FA');
3. import it to hive with sqoop import -m 32. (m=189 is also ok)。 Then you will 
get *6* rows in hive.



> Duplicate rows found when split-by column is of textual type due to different 
> charset difference of sqoop and hadoop
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: SQOOP-3263
>                 URL: https://issues.apache.org/jira/browse/SQOOP-3263
>             Project: Sqoop
>          Issue Type: Bug
>    Affects Versions: 1.4.6
>            Reporter: Yulei Yang
>         Attachments: screenshot-1.png, screenshot-2.png, screenshot-3.png
>
>
> This is issue can be found in any kind of RMDBS, because the root cause is 
> not on RMDBS. Steps to reproduce this issue:
> 1. create a mysql table: create table ora_test (id varchar(32) primary key 
> not null);
> 2.  insert *5* rows:
> insert into ora_test values ('08125FC4C8FDA064E053C0A8028DA064');
> insert into ora_test values ('4FFE68419D3502E2E0537F000001F3E8');
> insert into ora_test values ('4FFF9CF5861E003EE0537F0000017FF7');
> insert into ora_test values ('56DAC2D0F14901B0E0537F000001D3FA');
> insert into ora_test values ('4 ABC');
> 3. import it to hive with sqoop import -m 32. (m=189 is also ok)。 Then you 
> will get *7* rows in hive. Check screenshot-1.png
> part-32 is duplicated with part-26.
> so I print their split boundary values in unicode and plain text, check 
> screenshot-2.png for part-26, screenshot-3.png for part-32.
> According to boundary values, we can know that part-26 has no problem while 
> part-32 is wrong, because '\u4\ud836' is larger than ‘4F', so part-32 should 
> have no records.
> So '?' in plain text of part-32 is suspicious, does its unicode is still 
> '\ud836' when query on RMDBS？
> So I do next test, check screenshot-4.png. Two different unicode characters 
> are mapped to a same character in utf-8.
> This caused the duplication.
> How is happens?
> 1. split boundary values are unicode
> 2. when the import MR start to run, it read split boundary values to Text 
> type. Text always use utf-8, so some characters are wrong, like above case. 
> My solution is convert sqoop generated split boundary values to utf-8 first, 
> and resort them.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Updated] (SQOOP-3263) Duplicate rows found when split-by column is of textual type due to different charset difference of sqoop and hadoop

Reply via email to