[jira] Commented: (MAPREDUCE-705) User-configurable quote and delimiter characters for Sqoop records and record reparsing

Aaron Kimball (JIRA) Fri, 03 Jul 2009 12:17:13 -0700

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12727100#action_12727100
 ]


Aaron Kimball commented on MAPREDUCE-705:
-----------------------------------------


I have implemented delimiter control for Sqoop. Users can now customize which 
characters are used as delimiters between fields and records. Fields can also 
be enclosed in user-configurable quote characters, which can be made mandatory 
(all fields always quoted) or optional (fields only quoted when they contain a 
delimiter). An optional user-configurable escape character allows users to 
escape quotes (and more escape characters) inside fields.

This modifies the toString() method of generated classes to use the proper 
quote, delimiter, and escape characters for each record.

Sqoop-generated record types now also contain a parse() method which can read a 
record separated by the specified field delimiters and terminated by the 
specified record-termination delimiter, and can contain escapes and quotes as 
described above. The record will repopulate all its fields by reading the 
record. For example, suppose a table was created with:

{{CREATE TABLE mytable (foo INT, bar VARCHAR(32), baz DATE);}}

Importing this table from the database into HDFS via Sqoop would generate a 
class named {{mytable}} which can contain a single record consisting of 
{{foo}}, {{bar}}, and {{baz}} fields. The text file emitted in HDFS may contain 
lines like:

{{42,'hello','2009-07-04'}}

The {{mytable.parse()}} method will read a line of this form and populate its 
{{foo}}, {{bar}} and {{baz}} variables based on the three fields it can 
interpret.

The delimiters used when writing text to HDFS are controlled by these options:

{code}
Output line formatting options:
--fields-terminated-by (char)    Sets the field separator character
--lines-terminated-by (char)     Sets the end-of-line character
--optionally-enclosed-by (char)  Sets a field enclosing character
--enclosed-by (char)             Sets a required field enclosing char
--escaped-by (char)              Sets the escape character
--mysql-delimiters               Uses MySQL's default delimiter set
  fields: ,  lines: \n  escaped-by: \  optionally-enclosed-by: '
{code}

The {{(char)}} argument above can be specified either as a normal character 
(e.g., {{\-\-fields-terminated-by ,}}) or via an escape sequence. Arguments of 
the form {{\0xhhh}} will be interpreted as a hexidecimal representation of a 
character with hex number _hhh_. Arguments of the form {{\0ooo}} will be 
treated as an octal representation of a character represented by octal number 
_ooo_. The special escapes {{\n}}, {{\r}}, {{\"}}, and {{\b}} as well as two 
backslashes (which JIRA refuses to format...) act as they do inside Java 
strings. {{\0}} will be treated as {{NUL}}. This will insert {{NUL}} characters 
between fields or lines ((if used for {{\-\-fields-terminated-by}} or 
{{\-\-lines-terminated-by}}), or will disable enclosing/escaping if used for 
one of the {{\-\-enclosed-by}}, {{\-\-optionally-enclosed-by}}, or 
{{\-\-escaped-by}} arguments. Enclosing and escaping are disabled by default. 
For unambiguous parsing, both must be enabled, e.g., via 
{{\-\-mysql-delimiters}}.

Sqoop can also transcode one delimiter set into another. So if your data is 
currently enclosed in a certain set of delimiters, you can use Sqoop to help 
you convert it to a new set of delimiters.

The options described above will set the output delimiters used by 
{{toString()}}. If none of the following options are used, they also set the 
input delimiters used by {{parse()}}. But the input delimiters can be 
overridden on an individual basis with the options:

{code}
Input parsing options:
--input-fields-terminated-by (char)    Sets the input field separator
--input-lines-terminated-by (char)     Sets the input end-of-line char
--input-optionally-enclosed-by (char)  Sets a field enclosing character
--input-enclosed-by (char)             Sets a required field encloser
--input-escaped-by (char)              Sets the input escape character
{code}

If you use these options, then the {{parse()}} method generated for your type 
will use these delimiters, while the {{toString()}} method will use the 
delimiters set with the output options above. You can then use the generated 
class in a mapper which calls {{parse()}} on the input records of the old 
delimiter set, and emits {{Text}} objects generated with the output delimiter 
set from the record's {{toString()}} method.

The default delimiters are {{,}} for fields, {{\n}} for records, no quote 
character, and no escape character. Note that this can lead to 
ambiguous/unparsible records if you import database records containing commas 
or newlines in the field data.

This patch includes Hive support for user-defined delimiters. It will print a 
warning if you use {{\-\-escaped-by}}, {{\-\-enclosed-by}}, or 
{{\-\-optionally-enclosed-by}} since Hive does not know how to parse these. It 
will pass the field and record terminators through to Hive. If you do not set 
any delimiters and do use {{\-\-hive-import}}, the field delimiter will be set 
to {{^A}} and the record delimiter will be set to {{\n}} to be consistent with 
Hive's defaults.

This patch includes integrated {{mysqldump}} support for user-defined 
delimiters. If your delimiters exactly match the delimiters used by 
{{mysqldump}}, then Sqoop will use a fast-path that copies the data directly 
from {{mysqldump}}'s output into HDFS. Otherwise, Sqoop will parse 
{{mysqldump}}'s output into fields and transcode them into the user-specified 
delimiter set. This is about 50% slower (as measured on a 1.5 GB test dataset 
import on my machine). For convenience, the {{\-\-mysql-delimiters}} argument 
will set all the output delimiters to be consistent with {{mysqldump}}'s format.

The output of the {{\-\-mysql-delimiters}}-based test says:

{{INFO manager.LocalMySQLManager: Transferred 1.5747 GB in 124.7751 seconds 
(12.9231 MB/sec)}}

Whereas with the default delimiters, Sqoop reports:

{{INFO manager.LocalMySQLManager: Transferred 1.5328 GB in 181.3032 seconds 
(8.6571 MB/sec)}}

This patch is based after MAPREDUCE-685. That should be committed to trunk 
before this one.
This patch adds another file in {{testdata/hive/scripts}} so the number of 
release audit warnings is expected to increase by 1.
I have run all the Sqoop unit tests (including the several new tests added by 
this patch) and they pass on my machine. This includes the LocalMySQLTest which 
tests {{mysqldump}} compatibility, but is not run by Hudson. As noted above, I 
also run an at-scale test importing 1.5 GB of data from MySQL to HDFS.


> User-configurable quote and delimiter characters for Sqoop records and record 
> reparsing
> ---------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-705
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-705
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: contrib/sqoop
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-705.patch
>
>
> Sqoop needs a mechanism for users to govern how fields are quoted and what 
> delimiter characters separate fields and records. With delimiters providing 
> an unambiguous format, a parse method can reconstitute the generated record 
> data object from a text-based representation of the same record.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAPREDUCE-705) User-configurable quote and delimiter characters for Sqoop records and record reparsing

Reply via email to