[ https://issues.apache.org/jira/browse/MAPREDUCE-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13136680#comment-13136680 ]
Maksym Kovalenko commented on MAPREDUCE-2208:
---------------------------------------------

So what regex would one need to specify to parse "normal" CSV that uses a comma as the delimiter and happens to have commas inside one of the values? For example:

value1,value2,"more,complex,with,commas,value3"

Just providing "," as pattern1 will no longer work: it produces 7 columns for the line above instead of 3.

Also consider the case where a value contains a double quote. According to CSV escaping rules it has to be escaped by another double quote, for example:

column1,"thank you, ""User"" for the report, again, thank you",column3

Considering the two cases above, what value should I provide for pattern1?

I think configuring CSVTextInputFormat would be more natural if, instead of patterns, one provided a delimiter character (comma by default) and a quote character (double quote by default). Then I and other users would not have to struggle with possible regex patterns (see my questions above; I'm still curious whether you can come up with one). Another benefit is that you can build any regexes you need from the delimiter and quote characters, if you want to stick with the current implementation.

By the way, the implementation is currently fragile where it prepends "\\" to the user-provided regex. This will break when the user-supplied pattern itself starts with "\\".

> Flexible CSV text parser InputFormat
> ------------------------------------
>
>                 Key: MAPREDUCE-2208
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2208
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>            Reporter: Lance Norskog
>            Priority: Trivial
>         Attachments: CSVTextInputFormat.java, TestCSVTextFormat.java
>
>
> CSVTextInputFormat is a configurable CSV parser tuned to most of the
> csv-style datasets I've found. The Hadoop samples I've seen all use
> FileInputFormat and Mapper<LongWritable,Text>.
> They drop the LongWritable key
> and parse the Text value as a CSV line. But they are all custom-coded for
> the format.
> CSVTextInputFormat takes any csv-encoded file and rearranges the fields into
> the format required by a Mapper. You can drop fields & rearrange them. There
> is also a random sampling option to make training/test runs easier.
> Attached are CSVTextInputFormat.java and a unit test for it. Both go into
> org.apache.hadoop.mapreduce.lib.input under src/java and test/mapred/src.
> This is compiled against hadoop-0.0.20.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
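
As a rough illustration of the delimiter-and-quote configuration proposed in the comment above, the following Java sketch splits a single CSV line using a delimiter character and a quote character instead of a regex. CsvLineSplitter and its split method are hypothetical names and are not part of the attached CSVTextInputFormat.java. The sketch assumes each record fits on one line (no embedded newlines) and treats a doubled quote inside a quoted field as a literal quote.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not the attached CSVTextInputFormat. */
final class CsvLineSplitter {

    /**
     * Splits one CSV line into fields. Delimiters inside a quoted field are
     * kept as data, and a doubled quote inside a quoted field becomes one quote.
     */
    static List<String> split(String line, char delimiter, char quote) {
        List<String> fields = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == quote) {
                    if (i + 1 < line.length() && line.charAt(i + 1) == quote) {
                        current.append(quote); // escaped quote: "" becomes "
                        i++;                   // skip the second quote of the pair
                    } else {
                        inQuotes = false;      // closing quote
                    }
                } else {
                    current.append(c);         // delimiters are plain data here
                }
            } else if (c == quote) {
                inQuotes = true;               // opening quote
            } else if (c == delimiter) {
                fields.add(current.toString());
                current.setLength(0);          // start the next field
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());        // last field has no trailing delimiter
        return fields;
    }

    public static void main(String[] args) {
        // The two example lines from the comment.
        System.out.println(split(
            "value1,value2,\"more,complex,with,commas,value3\"", ',', '"'));
        System.out.println(split(
            "column1,\"thank you, \"\"User\"\" for the report, again, thank you\",column3",
            ',', '"'));
    }
}
```

With ',' as the delimiter and '"' as the quote character, the first example line from the comment comes back as 3 fields rather than 7, and the middle field of the second line keeps "User" wrapped in single double quotes.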