[
https://issues.apache.org/jira/browse/MAPREDUCE-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12758087#action_12758087
]
Aaron Kimball commented on MAPREDUCE-1017:
------------------------------------------
In addition to the unit tests added in the {{org.apache.hadoop.sqoop.io}}
package, I also performed a larger-scale test of this functionality. A 1.5 GB
table was imported from MySQL to HDFS; the data was highly redundant, so
compression shrank the files considerably and also improved the import time.
The arguments {{\-z \-\-direct-split-size 25000000}} were given, so that it
would generate approximately 25 MB files. This worked, and three files were
generated. I verified using {{head}} and {{tail}} that the files did not lose
any records and that records did not span multiple files.
I also verified that {{\-\-direct-split-size}} worked without compression,
which it does.
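The behavior verified above can be sketched in miniature. The snippet below is an illustrative Python sketch, not Sqoop's actual Java implementation in {{org.apache.hadoop.sqoop.io}}: it writes newline-terminated records into a series of gzip-compressed buffers, rolling over to a new buffer once a byte threshold (the analogue of {{\-\-direct-split-size}}) is reached. Rollover happens only between records, which is why no record spans multiple output files. The function name and the choice to count uncompressed bytes are assumptions for illustration.

```python
import gzip
import io

def write_split_compressed(records, split_size):
    """Illustrative sketch of split-at-compression-time writing.

    Writes each record as a newline-terminated line into gzip-compressed
    in-memory buffers, starting a new buffer once `split_size` uncompressed
    bytes have been written to the current one. Because the rollover check
    happens before each record is written, records never span two outputs.
    """
    parts = []
    buf = io.BytesIO()
    gz = gzip.GzipFile(fileobj=buf, mode="wb")
    written = 0
    for rec in records:
        line = rec.encode("utf-8") + b"\n"
        # Roll to a new file between records, never mid-record.
        if written and written + len(line) > split_size:
            gz.close()
            parts.append(buf.getvalue())
            buf = io.BytesIO()
            gz = gzip.GzipFile(fileobj=buf, mode="wb")
            written = 0
        gz.write(line)
        written += len(line)
    gz.close()
    parts.append(buf.getvalue())
    return parts
```

Decompressing each part and concatenating the lines should reproduce the original record stream exactly, which mirrors the {{head}}/{{tail}} check described above.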
> Compression and output splitting for Sqoop
> ------------------------------------------
>
> Key: MAPREDUCE-1017
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1017
> Project: Hadoop Map/Reduce
> Issue Type: New Feature
> Components: contrib/sqoop
> Reporter: Aaron Kimball
> Assignee: Aaron Kimball
> Attachments: MAPREDUCE-1017.patch
>
>
> Sqoop "direct mode" writing will generate a single large text file in HDFS.
> It is important to be able to compress this data before it reaches HDFS. Due
> to the difficulty in splitting compressed files in HDFS for use by MapReduce
> jobs, data should also be split at compression time.