Does OpenCSV (http://opencsv.sourceforge.net/#what-features) support your format? There's a Hive wrapper for it: http://ogrodnek.github.com/csv-serde and IIRC also a newer InputFormat at https://github.com/mvallebr/CSVInputFormat (via https://issues.apache.org/jira/browse/MAPREDUCE-2208).
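
To make the multi-line issue from the quoted message concrete, here is a minimal sketch in Python using only the standard library `csv` module. It is not one of the libraries mentioned in this thread, and the sample data and function names are made up for illustration. It only shows why a line-oriented reader miscounts records when a quoted field contains a line break, while a quote-aware parser does not.

```python
import csv
import io

# Hypothetical sample: a header plus two records. The first record has a
# quoted field containing a line break and an escaped double quote (""),
# both of which RFC 4180 permits.
SAMPLE = 'id,comment\r\n1,"first line\r\nsecond ""quoted"" line"\r\n2,plain\r\n'


def naive_line_records(text):
    """Split on line breaks, the way a line-oriented record reader would."""
    return text.splitlines()


def csv_records(text):
    """Parse with a quote-aware CSV parser so embedded line breaks stay inside the field."""
    # newline='' stops StringIO from translating the embedded line endings.
    return list(csv.reader(io.StringIO(text, newline='')))


if __name__ == "__main__":
    print(len(naive_line_records(SAMPLE)))  # physical lines
    for row in csv_records(SAMPLE):         # logical records
        print(row)
```

The line-based split sees four physical lines, while the parser returns three logical records (the header and two rows). The same mismatch is why splitting such a file at an arbitrary byte offset is risky: a split boundary can fall inside a quoted field.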
On Mon, Mar 18, 2013 at 3:44 PM, Christian Tzolov <christian.tzo...@gmail.com> wrote:
> Hi,
>
> I am working on ETL projects that consume and produce data in the RFC4180
> [1] CSV format. Although unreliable IMO, this RFC is used as an exchange
> format by several Dutch government agencies.
>
> The RFC4180 spec supports multi-line fields (e.g. fields with line
> breaks) and escaping of double quotes and delimiters within fields. Because
> of the multi-line feature one can't directly use the
> FileInputFormat/TextInputFormat or LineRecordReader implementations.
> Furthermore, as I see it, the input splitting must be disabled (not sure if
> any efficient splitting strategy is possible at all).
>
> There are several Java libraries that provide some RFC4180 support [3]. For
> Pig a slightly modified CSVExcelStorage UDF [2] seems to do the job (not
> sure about the input splitting though). Also the "Hadoop in Practice"
> example [4] does not support the multi-line fields.
>
> Has someone used similar 'multi-line fields' formats? I wonder how common
> this use case is.
>
> Also shall we provide support for it in Crunch?
>
> Cheers,
> Chris
>
> [1] RFC 4180 - http://tools.ietf.org/html/rfc4180
> [2] Pig CSVExcelStorage UDF -
> http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java
> [3] jCSV, OpenCSV, SuperCSV
> [4]
> https://github.com/alexholmes/hadoop-book/blob/master/src/main/java/com/manning/hip/ch3/csv/CSVInputFormat.java

--
Harsh J