On Monday, 2013-03-18, Josh Wills wrote:
> I personally try to steer people away from multi-line input formats b/c of
> how tedious they are to write/maintain.

Same here.

> To me, the question of supporting
> CSVs maps to a more general question about whether we should support some
> kind of named Record/Row type for processing data from
> CSV/Hive/Avro/PB/Thrift/etc. in a generic way. I could make arguments
> either way, which I'm happy to do if folks are interested, but I'd rather
> hear from other people first, esp. if anyone feels strongly about it.

I have used something like it in aggregation and machine learning systems
and I've grown quite fond of it. It is basically a HashMap that is partially
immutable - once you add a value you can't change it anymore. You can
structure your system as a sequence of rules that each add fields to the
record. This is quite flexible: you can work with changing schemas and
different sets of rules easily.

Regards,
Matthias

>
> On Mon, Mar 18, 2013 at 3:14 AM, Christian Tzolov <
> christian.tzo...@gmail.com> wrote:
>
> > Hi,
> >
> > I am working on ETL projects that consume and produce data in the RFC4180
> > [1] CSV format. Although unreliable IMO, this RFC is used as an exchange
> > format by several Dutch government agencies.
> >
> > The RFC4180 spec supports multi-line fields (e.g. fields with line
> > breaks) and escaping of double quotes and delimiters within fields. Because
> > of the multi-line feature one can't use directly the
> > FileInputFormat/TextInputFormat or LineRecordReader implementations.
> > Furthermore as I see it the input splitting must be disabled (not sure if
> > any efficient splitting strategy is possible at all).
> >
> > There are several java libraries that provide some RFC4180 support [3]. For
> > Pig a slightly modified CSVExcelStorage UDF [2] seems to do the job (not
> > sure about the input splitting though). Also the "Hadoop in Practice"
> > example [4] does not support the multi-line fields.
> >
> > Has someone used similar 'multi-line fields' formats? I wonder how common
> > is this use case.
> >
> > Also shall we provide support for it in Crunch?
> >
> > Cheers,
> > Chris
> >
> > [1] RFC 4180 - http://tools.ietf.org/html/rfc4180
> > [2] PIG CSVExcelStorage UDF -
> > http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java
> > [3] jCSV, OpenCSV, SuperCSV
> > [4]
> > https://github.com/alexholmes/hadoop-book/blob/master/src/main/java/com/manning/hip/ch3/csv/CSVInputFormat.java
> >
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
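
For illustration, the partially immutable record described in the reply
above could look roughly like the following. This is only a minimal sketch
under my own assumptions: the names WriteOnceRecord, Rule and
WriteOnceRecordDemo, as well as the sample fields, are hypothetical and are
not part of Crunch or of any library mentioned in this thread.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not a Crunch API: a record whose fields can be
// added but never overwritten.
final class WriteOnceRecord {
    private final Map<String, Object> fields = new LinkedHashMap<>();

    // Adds a new field; throws if the field has already been set.
    void put(String name, Object value) {
        if (fields.containsKey(name)) {
            throw new IllegalStateException("field already set: " + name);
        }
        fields.put(name, value);
    }

    Object get(String name) {
        return fields.get(name);
    }

    boolean has(String name) {
        return fields.containsKey(name);
    }

    // Read-only view of the fields, in insertion order.
    Map<String, Object> asMap() {
        return Collections.unmodifiableMap(fields);
    }
}

// A rule reads existing fields and adds new ones to the record.
interface Rule {
    void apply(WriteOnceRecord record);
}

final class WriteOnceRecordDemo {
    static WriteOnceRecord run() {
        WriteOnceRecord record = new WriteOnceRecord();
        record.put("price", 10);
        record.put("quantity", 3);

        // Each rule contributes fields; none may change what is there.
        List<Rule> rules = Arrays.asList(
            r -> r.put("total", (Integer) r.get("price") * (Integer) r.get("quantity")),
            r -> { if (!r.has("currency")) r.put("currency", "EUR"); }
        );
        for (Rule rule : rules) {
            rule.apply(record);
        }
        return record;
    }

    public static void main(String[] args) {
        System.out.println(run().asMap());
    }
}
```

Because a rule can only add fields, rules can be reordered or dropped
without one silently clobbering another's output; a conflicting rule fails
loudly with an IllegalStateException instead.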