[ https://issues.apache.org/jira/browse/SQOOP-3263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yulei Yang updated SQOOP-3263: ------------------------------ Attachment: sqoop-3263.patch > Duplicate rows found when split-by column is of textual type due to different > charset difference of sqoop and hadoop > -------------------------------------------------------------------------------------------------------------------- > > Key: SQOOP-3263 > URL: https://issues.apache.org/jira/browse/SQOOP-3263 > Project: Sqoop > Issue Type: Bug > Affects Versions: 1.4.6 > Reporter: Yulei Yang > Attachments: screenshot-1.png, screenshot-2.png, screenshot-3.png, > screenshot-4.png, sqoop-3263.patch > > > This is issue can be found in any kind of RMDBS, because the root cause is > not on RMDBS. Steps to reproduce this issue: > 1. create a mysql table: create table ora_test (id varchar(32) primary key > not null); > 2. insert *5* rows: > insert into ora_test values ('08125FC4C8FDA064E053C0A8028DA064'); > insert into ora_test values ('4FFE68419D3502E2E0537F000001F3E8'); > insert into ora_test values ('4FFF9CF5861E003EE0537F0000017FF7'); > insert into ora_test values ('56DAC2D0F14901B0E0537F000001D3FA'); > insert into ora_test values ('4 ABC'); > 3. import it to hive with sqoop import -m 32. (m=189 is also ok)。 Then you > will get *7* rows in hive. Check screenshot-1.png > part-32 is duplicated with part-26. > so I print their split boundary values in unicode and plain text, check > screenshot-2.png for part-26, screenshot-3.png for part-32. > According to boundary values, we can know that part-26 has no problem while > part-32 is wrong, because '\u4\ud836' is larger than ‘4F', so part-32 should > have no records. > So '?' in plain text of part-32 is suspicious, does its unicode is still > '\ud836' when query on RMDBS? > So I do next test, check screenshot-4.png. Two different unicode characters > are mapped to a same character in utf-8. > This caused the duplication. > How is happens? > 1. split boundary values are unicode > 2. when the import MR start to run, it read split boundary values to Text > type. Text always use utf-8, so some characters are wrong, like above case. > My solution is convert sqoop generated split boundary values to utf-8 first, > and resort them. -- This message was sent by Atlassian JIRA (v6.4.14#64029)