[
https://issues.apache.org/jira/browse/MAPREDUCE-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12727100#action_12727100
]
Aaron Kimball commented on MAPREDUCE-705:
-----------------------------------------
I have implemented delimiter control for Sqoop. Users can now customize which
characters are used as delimiters between fields and records. Fields can also
be enclosed in user-configurable quote characters, which can be made mandatory
(all fields always quoted) or optional (fields only quoted when they contain a
delimiter). An optional user-configurable escape character allows users to
escape quotes (and more escape characters) inside fields.
This modifies the toString() method of generated classes to use the proper
quote, delimiter, and escape characters for each record.
Sqoop-generated record types now also contain a parse() method which can read a
record separated by the specified field delimiters and terminated by the
specified record-termination delimiter, and can contain escapes and quotes as
described above. The record will repopulate all its fields by reading the
record. For example, suppose a table was created with:
{{CREATE TABLE mytable (foo INT, bar VARCHAR(32), baz DATE);}}
Importing this table from the database into HDFS via Sqoop would generate a
class named {{mytable}} which can contain a single record consisting of
{{foo}}, {{bar}}, and {{baz}} fields. The text file emitted in HDFS may contain
lines like:
{{42,'hello','2009-07-04'}}
The {{mytable.parse()}} method will read a line of this form and populate its
{{foo}}, {{bar}} and {{baz}} variables based on the three fields it can
interpret.
The delimiters used when writing text to HDFS are controlled by these options:
{code}
Output line formatting options:
--fields-terminated-by (char) Sets the field separator character
--lines-terminated-by (char) Sets the end-of-line character
--optionally-enclosed-by (char) Sets a field enclosing character
--enclosed-by (char) Sets a required field enclosing char
--escaped-by (char) Sets the escape character
--mysql-delimiters Uses MySQL's default delimiter set
fields: , lines: \n escaped-by: \ optionally-enclosed-by: '
{code}
The {{(char)}} argument above can be specified either as a normal character
(e.g., {{\-\-fields-terminated-by ,}}) or via an escape sequence. Arguments of
the form {{\0xhhh}} will be interpreted as a hexidecimal representation of a
character with hex number _hhh_. Arguments of the form {{\0ooo}} will be
treated as an octal representation of a character represented by octal number
_ooo_. The special escapes {{\n}}, {{\r}}, {{\"}}, and {{\b}} as well as two
backslashes (which JIRA refuses to format...) act as they do inside Java
strings. {{\0}} will be treated as {{NUL}}. This will insert {{NUL}} characters
between fields or lines ((if used for {{\-\-fields-terminated-by}} or
{{\-\-lines-terminated-by}}), or will disable enclosing/escaping if used for
one of the {{\-\-enclosed-by}}, {{\-\-optionally-enclosed-by}}, or
{{\-\-escaped-by}} arguments. Enclosing and escaping are disabled by default.
For unambiguous parsing, both must be enabled, e.g., via
{{\-\-mysql-delimiters}}.
Sqoop can also transcode one delimiter set into another. So if your data is
currently enclosed in a certain set of delimiters, you can use Sqoop to help
you convert it to a new set of delimiters.
The options described above will set the output delimiters used by
{{toString()}}. If none of the following options are used, they also set the
input delimiters used by {{parse()}}. But the input delimiters can be
overridden on an individual basis with the options:
{code}
Input parsing options:
--input-fields-terminated-by (char) Sets the input field separator
--input-lines-terminated-by (char) Sets the input end-of-line char
--input-optionally-enclosed-by (char) Sets a field enclosing character
--input-enclosed-by (char) Sets a required field encloser
--input-escaped-by (char) Sets the input escape character
{code}
If you use these options, then the {{parse()}} method generated for your type
will use these delimiters, while the {{toString()}} method will use the
delimiters set with the output options above. You can then use the generated
class in a mapper which calls {{parse()}} on the input records of the old
delimiter set, and emits {{Text}} objects generated with the output delimiter
set from the record's {{toString()}} method.
The default delimiters are {{,}} for fields, {{\n}} for records, no quote
character, and no escape character. Note that this can lead to
ambiguous/unparsible records if you import database records containing commas
or newlines in the field data.
This patch includes Hive support for user-defined delimiters. It will print a
warning if you use {{\-\-escaped-by}}, {{\-\-enclosed-by}}, or
{{\-\-optionally-enclosed-by}} since Hive does not know how to parse these. It
will pass the field and record terminators through to Hive. If you do not set
any delimiters and do use {{\-\-hive-import}}, the field delimiter will be set
to {{^A}} and the record delimiter will be set to {{\n}} to be consistent with
Hive's defaults.
This patch includes integrated {{mysqldump}} support for user-defined
delimiters. If your delimiters exactly match the delimiters used by
{{mysqldump}}, then Sqoop will use a fast-path that copies the data directly
from {{mysqldump}}'s output into HDFS. Otherwise, Sqoop will parse
{{mysqldump}}'s output into fields and transcode them into the user-specified
delimiter set. This is about 50% slower (as measured on a 1.5 GB test dataset
import on my machine). For convenience, the {{\-\-mysql-delimiters}} argument
will set all the output delimiters to be consistent with {{mysqldump}}'s format.
The output of the {{\-\-mysql-delimiters}}-based test says:
{{INFO manager.LocalMySQLManager: Transferred 1.5747 GB in 124.7751 seconds
(12.9231 MB/sec)}}
Whereas with the default delimiters, Sqoop reports:
{{INFO manager.LocalMySQLManager: Transferred 1.5328 GB in 181.3032 seconds
(8.6571 MB/sec)}}
This patch is based after MAPREDUCE-685. That should be committed to trunk
before this one.
This patch adds another file in {{testdata/hive/scripts}} so the number of
release audit warnings is expected to increase by 1.
I have run all the Sqoop unit tests (including the several new tests added by
this patch) and they pass on my machine. This includes the LocalMySQLTest which
tests {{mysqldump}} compatibility, but is not run by Hudson. As noted above, I
also run an at-scale test importing 1.5 GB of data from MySQL to HDFS.
> User-configurable quote and delimiter characters for Sqoop records and record
> reparsing
> ---------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-705
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-705
> Project: Hadoop Map/Reduce
> Issue Type: New Feature
> Components: contrib/sqoop
> Reporter: Aaron Kimball
> Assignee: Aaron Kimball
> Attachments: MAPREDUCE-705.patch
>
>
> Sqoop needs a mechanism for users to govern how fields are quoted and what
> delimiter characters separate fields and records. With delimiters providing
> an unambiguous format, a parse method can reconstitute the generated record
> data object from a text-based representation of the same record.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.