Sqoop documentation stating that the --split-by column must be a unique key
seems to be wrong
----------------------------------------------------------------------------------

                 Key: MAPREDUCE-1449
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1449
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: contrib/sqoop
    Affects Versions: 0.20.1
            Reporter: mingran wang


http://archive.cloudera.com/docs/sqoo... 

The document above states: "To guarantee correctness of your input, you 
must select an ordering column for which each row has a unique value. If 
duplicate values appear in the ordering column, the results of the import are 
undefined, and Sqoop will not be able to detect the error." 

I read the Sqoop source code, and it seems that the split-by column doesn't 
have to be a unique key. In addition, when the primary key is a composite key, 
the Sqoop code takes only the first column of the composite key, which in most 
cases is not unique by itself anyway. 

I also checked the output of an import that was split on a non-unique column, 
and there was nothing wrong with the result. 
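
To illustrate why duplicate values need not cause trouble, here is a minimal
Python sketch. It assumes that splits are formed as contiguous value ranges
over the split column; that is an assumption made for illustration, not a copy
of Sqoop's code. The names make_ranges and select_split and the sample "dept"
column are hypothetical.

```python
# Hypothetical illustration of range-based splitting; not Sqoop's actual code.

def make_ranges(lo, hi, num_splits):
    """Cut [lo, hi] into num_splits contiguous (low, high) ranges."""
    step = (hi - lo) / num_splits
    bounds = [lo + i * step for i in range(num_splits)] + [hi]
    return list(zip(bounds[:-1], bounds[1:]))

def select_split(rows, col, low, high, is_last):
    """Rows with low <= value < high; the last split also includes value == high."""
    return [r for r in rows
            if low <= r[col] < high or (is_last and r[col] == high)]

# Sample table: "dept" is deliberately non-unique.
rows = [{"id": i, "dept": d}
        for i, d in enumerate([10, 10, 20, 20, 20, 30, 40, 40])]

values = [r["dept"] for r in rows]
ranges = make_ranges(min(values), max(values), 3)
splits = [select_split(rows, "dept", low, high, i == len(ranges) - 1)
          for i, (low, high) in enumerate(ranges)]

# Concatenating the splits yields every row exactly once, despite duplicates.
imported = [r for split in splits for r in split]
```

Under this assumption, rows that share a split value always land in the same
range, so duplicates can only make the splits uneven in size. A scheme that
instead selects rows by position in a sorted result (for example with LIMIT
and OFFSET) could behave differently, because SQL does not define the relative
order of rows with equal sort keys.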

I am wondering whether the documentation is wrong, or whether there is some 
hidden subtlety that I am not aware of. 

I am using Sqoop 0.20.1.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
