[jira] [Commented] (SQOOP-3262) Duplicate rows found when split-by column is of type String

Yulei Yang (JIRA) Sun, 26 Nov 2017 08:14:47 -0800

    [ 
https://issues.apache.org/jira/browse/SQOOP-3262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16266081#comment-16266081
 ]


Yulei Yang commented on SQOOP-3262:
-----------------------------------

If a user wants to use this patch, the usage is:
sqoop import -D org.apache.sqoop.db.type=mysql

> Duplicate rows found when split-by column is of type String
> -----------------------------------------------------------
>
>                 Key: SQOOP-3262
>                 URL: https://issues.apache.org/jira/browse/SQOOP-3262
>             Project: Sqoop
>          Issue Type: Bug
>          Components: connectors/generic
>    Affects Versions: 1.4.6
>            Reporter: Yulei Yang
>         Attachments: sqoop_3262.patch
>
>
> When a string (or char) column is used as the split-by column, the import 
> sometimes produces duplicate rows. This usually happens because the source 
> RDBMS compares strings case-insensitively (CI). For example, take these two 
> split queries:
> 1. where id >= 'A' and id < 'E'
> 2. where id >= 'a' and id < 'e'
> If the RDBMS is CI, these two different splits return the same rows, which 
> causes the duplication.
> Oracle and DB2 are case sensitive (CS) by default, but a DBA can change 
> that, so the setting needs to be checked before the import.
> SQL Server is CI by default; the solution is --split-by '<your_column> collate 
> xxx_collation', for example Chinese_PRC_Bin.
> InterSystems Caché is CI by default; the solution is --split-by 
> "%sqlstring(<your_column>)".
> MySQL is CI by default, but --split-by 'binary <your_column>' throws the 
> exception below:
> ERROR tool.ImportTool: Encountered IOException running import job: 
> java.io.IOException: Sqoop does not have the splitter for the given SQL data 
> type. Please use either different split column (argument --split-by) or lower 
> the number of mappers to 1. Unknown SQL data type: -3
>         at org.apache.sqoop.mapreduce.db.DataDrivenDBInputFormat.getSplits(DataDrivenDBInputFormat.java:165)
>         at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
>         at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
>         at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
>         at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
>         at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>         at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
>         at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
>         at org.apache.sqoop.mapreduce.ImportJobBase.doSubmitJob(ImportJobBase.java:196)
>         at org.apache.sqoop.mapreduce.ImportJobBase.runJob(ImportJobBase.java:169)
>         at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:266)
>         at org.apache.sqoop.manager.SqlManager.importQuery(SqlManager.java:729)
>         at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:499)
>         at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:605)
>         at org.apache.sqoop.Sqoop.run(Sqoop.java:143)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>         at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:179)
>         at org.apache.sqoop.Sqoop.runTool(Sqoop.java:218)
>         at org.apache.sqoop.Sqoop.runTool(Sqoop.java:227)
>         at org.apache.sqoop.Sqoop.main(Sqoop.java:236)
> I have attached a patch for the MySQL case.
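
The duplication described in the issue can be reproduced outside Sqoop. The following is only a rough sketch, assuming an in-memory SQLite database as a stand-in for a case-insensitive source: the table t, the sample ids, and the function names are hypothetical, and SQLite's NOCASE and BINARY collations play the role of the CI default and of a binary-collation override in --split-by. It is not Sqoop's splitter code.

```python
import sqlite3

# Sample ids that differ only by case (illustrative data, not from the issue).
ROWS = ["B", "b", "D", "d"]


def run_split(conn, split_expr, lo, hi):
    """Run one split's range query: lo <= split_expr < hi."""
    sql = (
        f"SELECT id FROM t WHERE {split_expr} >= ? AND {split_expr} < ? "
        "ORDER BY rowid"
    )
    return [row[0] for row in conn.execute(sql, (lo, hi))]


def import_with_splits(split_expr):
    """Simulate an import made of the two splits quoted in the issue."""
    conn = sqlite3.connect(":memory:")
    # NOCASE makes comparisons on this column case-insensitive,
    # standing in for a source database that is CI by default.
    conn.execute("CREATE TABLE t (id TEXT COLLATE NOCASE)")
    conn.executemany("INSERT INTO t (id) VALUES (?)", [(v,) for v in ROWS])
    rows = run_split(conn, split_expr, "A", "E")   # split 1
    rows += run_split(conn, split_expr, "a", "e")  # split 2
    conn.close()
    return rows


if __name__ == "__main__":
    # CI comparison: both splits match every row, so each row is imported twice.
    print(import_with_splits("id"))
    # Forcing a binary collation in the split expression separates the ranges.
    print(import_with_splits("id COLLATE BINARY"))
```

With the plain column as the split expression, both range queries return all four rows, so the combined result holds eight rows. With the binary collation applied to the split expression, each row falls into exactly one split.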



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
