Hey Mac, FWIW, I would be very happy to have it in the project and would be glad for the contribution.
J

On Fri, Feb 28, 2014 at 10:04 AM, Champion,Mac <[email protected]> wrote:

> Hi all,
>
> Late last year, my team decided to change the way we processed large CSV
> files. Up until then, we had been parsing them locally, sending the Avros
> to HDFS and processing them from there with Crunch. This was irritating
> and limiting in a few ways, so we decided to stop processing locally and
> do the parsing/loading entirely in Crunch. Everything went well using
> TextFileSource until we ran into one of our files which contained a CSV
> record with multiple lines in one field.
>
> Here's a possible example of a record spanning multiple lines:
>
> "Champion, Mac","1234 Hoth St.
> Apartment 101
> Atlanta, GA
> 64086","30","M","5/28/2010 12:00:00 AM","Just some guy"
>
> To deal with this, I wrote a CSVInputFormat and CSVRecordReader that can
> intelligently split and parse CSV files while maintaining the integrity of
> each record. This works great, but using it is a little messy.
>
> We have to read from the files like this:
>
> final PTable<Long, String> csvFile =
>     pipeline.read(disableFileCombine(From.formattedFile(outputPath,
>         CSVInputFormat.class, Writables.longs(), Writables.strings())));
>
> What I propose is that we extend FileSourceImpl in a way similar to
> NLineFileSource and/or TextFileSource and submit the extension and its CSV
> parsing logic as a patch to Crunch. Is this a valid idea for a new JIRA?
> Would other users of Crunch find this ability to reliably parse out CSV
> records valuable? If so, I would like to log a JIRA and begin working on
> it in the very near future.

--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
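
For anyone following the thread, here is a minimal sketch of the core problem the quoted message describes: a line-oriented reader breaks a record whenever a quoted field contains a newline, so the reader has to track whether it is inside double quotes. This is not Mac's CSVRecordReader; the class name QuoteAwareCsvSplitter, the method readRecords, and the second sample row are hypothetical and exist only for illustration. It also ignores the harder part of a real InputFormat, which is deciding where an HDFS input split may begin without landing in the middle of a quoted field.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Crunch or of the proposed patch.
public class QuoteAwareCsvSplitter {

    /**
     * Splits CSV text into whole records. A newline ends a record only when
     * it appears outside double quotes; inside quotes it is kept as field data.
     */
    public static List<String> readRecords(Reader in) throws IOException {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        int c;
        while ((c = in.read()) != -1) {
            char ch = (char) c;
            if (ch == '"') {
                // An escaped quote ("") toggles the flag twice, so the
                // quoted/unquoted state is unchanged after the pair.
                inQuotes = !inQuotes;
                current.append(ch);
            } else if (ch == '\n' && !inQuotes) {
                // Record boundary: drop a trailing '\r' from CRLF input.
                int len = current.length();
                if (len > 0 && current.charAt(len - 1) == '\r') {
                    current.setLength(len - 1);
                }
                records.add(current.toString());
                current.setLength(0);
            } else {
                current.append(ch);
            }
        }
        // Keep a final record that has no terminating newline.
        if (current.length() > 0) {
            records.add(current.toString());
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        // First row is the example from the thread; second row is made up.
        String csv =
            "\"Champion, Mac\",\"1234 Hoth St.\nApartment 101\nAtlanta, GA\n64086\","
                + "\"30\",\"M\",\"5/28/2010 12:00:00 AM\",\"Just some guy\"\n"
                + "\"Doe, Jane\",\"1 Main St.\",\"41\",\"F\",\"1/1/2011 12:00:00 AM\",\"Hypothetical\"\n";
        List<String> records = readRecords(new StringReader(csv));
        for (String record : records) {
            System.out.println("RECORD: " + record);
        }
    }
}
```

The multi-line address stays inside a single record because the three embedded newlines are seen while inQuotes is true. A reader built on this idea would still need per-field parsing and a split-boundary strategy before it could stand in for TextFileSource.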
