[jira] [Commented] (MAPREDUCE-2208) Flexible CSV text parser InputFormat

Maksym Kovalenko (Commented) (JIRA) Wed, 26 Oct 2011 18:41:58 -0700

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13136680#comment-13136680
 ]


Maksym Kovalenko commented on MAPREDUCE-2208:
---------------------------------------------

So what regex would one need to specify to parse "normal" CSV that uses a 
comma as the delimiter and happens to have a comma in one of the values, for 
example:

value1,value2,"more,complex,with,commas,value3"

Just providing "," as pattern1 will no longer work, as it will produce 7 
columns for the above case instead of 3.
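
To make the column count concrete, here is a minimal sketch that splits the 
line above on a bare "," regex. The class name NaiveCommaSplit and its split 
helper are made up for illustration; they are not part of the attached 
CSVTextInputFormat, and this only assumes that the pattern ends up being used 
as a plain java.util.regex split.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not taken from the attached patch.
class NaiveCommaSplit {
    // Splits a line on the given regex; limit -1 keeps trailing empty fields.
    static String[] split(String line, String regex) {
        return Pattern.compile(regex).split(line, -1);
    }

    public static void main(String[] args) {
        String line = "value1,value2,\"more,complex,with,commas,value3\"";
        String[] columns = split(line, ",");
        // The quoted value is cut at every comma inside it.
        System.out.println(columns.length); // prints 7
        for (String column : columns) {
            System.out.println(column);
        }
    }
}
```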

Also consider the use case where a value contains a double quote. In this 
case, according to CSV escaping rules, it has to be escaped by another double 
quote, for example:

column1,"thank you, ""User"" for the report, again, thank you",column3

Considering the above two cases, what value should I provide for pattern1?

I think configuring CSVTextInputFormat would be more natural if, instead of 
patterns, one had to provide a delimiter character (comma by default) and a 
quote character (double quote by default). Then I and other users wouldn't 
have to struggle with possible regex patterns (see my questions above; I'm 
still curious whether you can come up with one).
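
As a rough sketch of what a delimiter/quote-character configuration could look 
like, the parser below walks a line one character at a time. 
SimpleCsvLineParser and its parse method are hypothetical names, not the 
existing CSVTextInputFormat API, and the sketch assumes one record per line 
(no quoted newlines).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the attached patch.
class SimpleCsvLineParser {
    // Parses one line into fields using a delimiter and a quote character.
    // Inside a quoted field, a doubled quote stands for one literal quote.
    static List<String> parse(String line, char delimiter, char quote) {
        List<String> fields = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c != quote) {
                    current.append(c);
                } else if (i + 1 < line.length() && line.charAt(i + 1) == quote) {
                    current.append(quote); // escaped quote
                    i++;                   // skip the second quote
                } else {
                    inQuotes = false;      // closing quote
                }
            } else if (c == quote) {
                inQuotes = true;           // opening quote
            } else if (c == delimiter) {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());    // last field
        return fields;
    }

    public static void main(String[] args) {
        String first = "value1,value2,\"more,complex,with,commas,value3\"";
        String second =
            "column1,\"thank you, \"\"User\"\" for the report, again, thank you\",column3";
        for (String field : parse(first, ',', '"')) {
            System.out.println(field);
        }
        for (String field : parse(second, ',', '"')) {
            System.out.println(field);
        }
    }
}
```

With a comma and a double quote as arguments, both example lines from this 
comment come out as three fields each.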

Another benefit is that from the delimiter and quote characters you can build 
any regexes you need (if you want to stick with the current implementation). 
By the way, right now there is some fragility in the implementation where you 
prepend the user-provided regex with a "\\". This will break when the 
user-supplied pattern itself starts with "\\".
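
A small sketch of that failure mode follows. It assumes the prefix is simply 
concatenated in front of the user's string before compiling; the 
prependBackslash helper is a hypothetical stand-in, not the actual code from 
CSVTextInputFormat. Pattern.quote is shown as one possible alternative, which 
only fits if the setting is meant to be a literal delimiter rather than a 
regex.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not taken from the attached patch.
class PrependEscapeDemo {
    // Assumed behaviour: a single backslash is put in front of the pattern.
    static Pattern prependBackslash(String userPattern) {
        return Pattern.compile("\\" + userPattern);
    }

    // Alternative: treat the whole string as a literal delimiter.
    static Pattern literal(String delimiter) {
        return Pattern.compile(Pattern.quote(delimiter));
    }

    public static void main(String[] args) {
        // "|" becomes the regex \| and matches a literal pipe, as intended.
        System.out.println(prependBackslash("|").split("a|b|c").length); // prints 3

        // The regex \t (a tab) becomes \\t, which matches a backslash
        // followed by the letter t, so the tab-separated line is not split.
        System.out.println(prependBackslash("\\t").split("a\tb\tc").length); // prints 1

        // Quoting escapes the whole delimiter regardless of its first character.
        System.out.println(literal("|").split("a|b|c").length); // prints 3
    }
}
```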
                
> Flexible CSV text parser InputFormat
> ------------------------------------
>
>                 Key: MAPREDUCE-2208
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2208
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>            Reporter: Lance Norskog
>            Priority: Trivial
>         Attachments: CSVTextInputFormat.java, TestCSVTextFormat.java
>
>
> CSVTextInputFormat is a configurable CSV parser tuned to most of the 
> csv-style datasets I've found. The Hadoop samples I've seen all use 
> FileInputFormat and Mapper<LongWritable,Text>. They drop the LongWritable key 
> and parse the Text value as a CSV line. But, they are all custom-coded for 
> the format.
> CSVTextInputFormat takes any csv-encoded file and rearranges the fields into 
> the format required by a Mapper. You can drop fields & rearrange them. There 
> is also a random sampling option to make training/test runs easier.
> Attached are CSVTextInputFormat.java and a unit test for it. Both go into 
> org.apache.hadoop.mapreduce.lib.input under src/java and test/mapred/src.
> This is compiled against hadoop-0.0.20.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
