[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12758087#action_12758087
 ] 

Aaron Kimball commented on MAPREDUCE-1017:
------------------------------------------

In addition to the unit tests added in the {{org.apache.hadoop.sqoop.io}} 
package, I also performed a larger-scale test of this functionality. A 1.5 GB 
table was imported from MySQL to HDFS; the data was highly redundant, so 
compression shrank the files considerably and also improved the import time. 
The arguments {{\-z \-\-direct-split-size 25000000}} were given, so that it 
would generate files of approximately 25 MB each. This worked, and three files 
were generated. I verified using {{head}} and {{tail}} that the files did not 
lose any records and that no record spanned multiple files.

I also verified that {{\-\-direct-split-size}} worked without compression, 
which it does. 
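For reference, the test above can be reproduced with an invocation along these 
lines. This is a sketch, not the exact command I ran: the connect string, 
database, and table names here are placeholders, and the {{sqoop}} launcher 
path may differ depending on how the contrib module is built.

```shell
# Direct-mode MySQL import with gzip compression (-z) and output
# splitting at ~25 MB per compressed file (--direct-split-size is in bytes).
# The JDBC URL and table name below are placeholders.
bin/sqoop \
  --connect jdbc:mysql://localhost/testdb \
  --table large_table \
  --direct \
  -z \
  --direct-split-size 25000000

# Spot-check the resulting files: list them, then inspect the first and
# last records of each split to confirm no record was lost or truncated
# across split boundaries. The HDFS path is a placeholder.
bin/hadoop fs -ls large_table/
bin/hadoop fs -cat large_table/data-00000.gz | gunzip | head -n 2
bin/hadoop fs -cat large_table/data-00000.gz | gunzip | tail -n 2
```

The same invocation without {{\-z}} exercises splitting on uncompressed 
output.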

> Compression and output splitting for Sqoop
> ------------------------------------------
>
>                 Key: MAPREDUCE-1017
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1017
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: contrib/sqoop
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-1017.patch
>
>
> Sqoop "direct mode" writing will generate a single large text file in HDFS. 
> It is important to be able to compress this data before it reaches HDFS. Due 
> to the difficulty in splitting compressed files in HDFS for use by MapReduce 
> jobs, data should also be split at compression time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
