Inlined.

On Tue, Mar 19, 2013 at 6:54 AM, Christian Tzolov <christian.tzo...@gmail.com> wrote:
> @Josh, most of the time I can manage to steer away from multiline records
> but with gov. organisations it is difficult to alter what they
> have considered as a 'standard'.
> Can you please elaborate on your idea for named records/rows?
>

Yeah, I posted a library of Crunch-based tools for machine learning that
I've been working on for the past couple of months:

https://github.com/cloudera/ml

The core module defines a Record interface that should eventually support
working w/Avro records, HCatalog records, CSV files, and even Vectors--
anything that can be made to look/feel like a typed tuple of values, and the
parallel module defines associated PTypes for the various implementations. I
don't have the sophistication on the APIs that Matthias mentioned (in terms
of evolving immutable objects), but that is the direction I expect to go in.

J

> @Harsh, thanks for the references. I remember I had some issues with
> OpenCSV (either the iterator support or some RFC4180 limitations). But I
> would check the other sources.
>
> Thanks,
> Chris
>
> On Tue, Mar 19, 2013 at 12:44 AM, Harsh J <ha...@cloudera.com> wrote:
>
> > Does OpenCSV (http://opencsv.sourceforge.net/#what-features) support
> > your format? There's a Hive wrapper for it:
> > http://ogrodnek.github.com/csv-serde and IIRC also a newer InputFormat
> > at https://github.com/mvallebr/CSVInputFormat (via
> > https://issues.apache.org/jira/browse/MAPREDUCE-2208).
> >
> > On Mon, Mar 18, 2013 at 3:44 PM, Christian Tzolov
> > <christian.tzo...@gmail.com> wrote:
> > > Hi,
> > >
> > > I am working on ETL projects that consume and produce data in the RFC4180
> > > [1] CSV format. Although unreliable IMO, this RFC is used as an exchange
> > > format by several Dutch government agencies.
> > >
> > > The RFC4180 spec supports multi-line fields (e.g. fields with line
> > > breaks) and escaping of double quotes and delimiters within fields.
> > > Because of the multi-line feature one can't directly use the
> > > FileInputFormat/TextInputFormat or LineRecordReader implementations.
> > > Furthermore, as I see it, input splitting must be disabled (not sure if
> > > any efficient splitting strategy is possible at all).
> > >
> > > There are several Java libraries that provide some RFC4180 support [3]. For
> > > Pig a slightly modified CSVExcelStorage UDF [2] seems to do the job (not
> > > sure about the input splitting though). Also the "Hadoop in Practice"
> > > example [4] does not support multi-line fields.
> > >
> > > Has someone used similar 'multi-line fields' formats? I wonder how common
> > > this use case is.
> > >
> > > Also, shall we provide support for it in Crunch?
> > >
> > > Cheers,
> > > Chris
> > >
> > > [1] RFC 4180 - http://tools.ietf.org/html/rfc4180
> > > [2] Pig CSVExcelStorage UDF -
> > > http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java
> > > [3] jCSV, OpenCSV, SuperCSV
> > > [4] https://github.com/alexholmes/hadoop-book/blob/master/src/main/java/com/manning/hip/ch3/csv/CSVInputFormat.java
> >
> > --
> > Harsh J
>

--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
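
To make the multi-line problem from the original message concrete, here is a minimal Python sketch. It is not from the thread and it is not Crunch or Hadoop code: Python's standard csv module stands in for an RFC 4180 parser, and the sample data and the helper names (naive_line_records, rfc4180_records, ends_on_record_boundary) are made up for illustration. It compares what a line-oriented reader would see with what a quote-aware reader returns, and shows one simple way to test whether a given line break is really a record boundary, assuming well-formed input.

```python
import csv
import io

# Hypothetical sample: the second record has a field with an embedded line
# break and escaped double quotes ("" inside a quoted field), per RFC 4180.
SAMPLE = (
    'id,name,comment\r\n'
    '1,"Smith, J.","first line\r\nsecond line with ""quotes"""\r\n'
    '2,Jones,plain\r\n'
)


def naive_line_records(text):
    """Split on every line break, the way a line-oriented reader would."""
    return text.splitlines()


def rfc4180_records(text):
    """Parse with a quote-aware reader so quoted line breaks stay in the field."""
    # newline="" stops StringIO from translating the embedded \r\n.
    return list(csv.reader(io.StringIO(text, newline="")))


def ends_on_record_boundary(prefix):
    """Assumption: well-formed input, so an even quote count means we are
    outside any quoted field and a line break here ends a record."""
    return prefix.count('"') % 2 == 0


if __name__ == "__main__":
    print(len(naive_line_records(SAMPLE)))  # 4 "records": the quoted field is cut in two
    print(len(rfc4180_records(SAMPLE)))     # 3 records: header plus two data rows
```

Note that the parity check needs every quote from the start of the input up to the candidate position. That seems consistent with the concern above about input splitting: a reader dropped into the middle of a file cannot tell from the local bytes alone whether a line break sits inside a quoted field.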