Well, the test itself looks a little odd to me: why are you calling pipeline.run() right after pipeline.read(new CSVFileSource(...))? There's nothing for the pipeline to do at that point.
J

On Tue, Jun 24, 2014 at 10:53 AM, Champion,Mac <mac.champ...@cerner.com> wrote:

> Now that the CSVFileSource is in crunch 0.8.3, I've been trying to
> integrate it into the project that originally spurred its creation.
> However, I'm running into some weird issues.
>
> Reading and directly materializing and using a new CSVFileSource works
> fine, that scenario is already in the CSVFileSourceIT.
>
> https://github.com/apache/crunch/blob/apache-crunch-0.8.3/crunch-core/src/it/java/org/apache/crunch/io/text/csv/CSVFileSourceIT.java#L41
>
> But, as soon as I try to do something extra with that PCollection, say,
> use count() to turn it into a PTable, grab its key set, then print it out,
> everything falls apart.
>
> New Test:
> https://github.com/champgm/crunch/blob/master/crunch-core/src/it/java/org/apache/crunch/io/text/csv/CSVFileSourceIT.java#L56
>
> Result:
> http://pastebin.com/f7iUQ73N
>
> It seems that, when some additional actions are added to the pipeline, a
> CSVRecordReader is being created in CrunchRecordReader without going
> through the CSVFileSource or CSVInputFormat flow, where its various parsing
> options are normally configured.
>
> I was able to fix this issue by copying the "configure" method from
> CSVInputFormat and adding it to the beginning of the "initialize" method of
> the CSVRecordReader, which forces it to check the job config and configure
> itself if some options are null, but I don't really feel like this is
> ideal. Did I miss something when I was designing this set of classes? Is
> this behavior expected?

--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
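
For readers following the thread, the workaround described in the quoted message (having the record reader configure itself from the job config when its options are still null) can be modeled in isolation. The sketch below is only an illustration under stated assumptions: it does not use the real Crunch or Hadoop classes, a plain Map stands in for the job configuration, and every class name, key, and default value in it is hypothetical rather than taken from CSVInputFormat or CSVRecordReader.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Stand-alone model of the "configure yourself in initialize" workaround.
 * Nothing here is Crunch or Hadoop API; all names and values are hypothetical.
 */
class CsvReaderConfigSketch {

    // Hypothetical config keys, standing in for whatever the input format
    // would normally write into the job configuration.
    static final String BUFFER_SIZE_KEY = "sketch.csv.buffersize";
    static final String ENCODING_KEY = "sketch.csv.encoding";

    /** Models a record reader whose options may not have been set by its creator. */
    static class SketchRecordReader {
        private Integer bufferSize; // null means nobody configured this reader
        private String encoding;    // null means nobody configured this reader

        /** Models the problem path: the reader is built without any options. */
        SketchRecordReader() {
        }

        /** Models the normal path: the input format passes the options in. */
        SketchRecordReader(int bufferSize, String encoding) {
            this.bufferSize = bufferSize;
            this.encoding = encoding;
        }

        /**
         * Fills in only the options that are still null, reading them from the
         * job config. The fallback defaults (4096, UTF-8) are assumptions.
         */
        void initialize(Map<String, String> jobConfig) {
            if (bufferSize == null) {
                bufferSize = Integer.parseInt(jobConfig.getOrDefault(BUFFER_SIZE_KEY, "4096"));
            }
            if (encoding == null) {
                encoding = jobConfig.getOrDefault(ENCODING_KEY, "UTF-8");
            }
        }

        // Both getters assume initialize() has already been called.
        int getBufferSize() {
            return bufferSize;
        }

        String getEncoding() {
            return encoding;
        }
    }

    public static void main(String[] args) {
        Map<String, String> jobConfig = new HashMap<>();
        jobConfig.put(BUFFER_SIZE_KEY, "8192");
        jobConfig.put(ENCODING_KEY, "UTF-16");

        // Reader created without options: it picks them up from the job config.
        SketchRecordReader bare = new SketchRecordReader();
        bare.initialize(jobConfig);
        System.out.println(bare.getBufferSize() + " " + bare.getEncoding()); // prints "8192 UTF-16"

        // Reader created with options: initialize leaves them untouched.
        SketchRecordReader preset = new SketchRecordReader(1024, "UTF-8");
        preset.initialize(jobConfig);
        System.out.println(preset.getBufferSize() + " " + preset.getEncoding()); // prints "1024 UTF-8"
    }
}
```

The point of the null checks is that a reader built through the normal flow keeps the options it was given, while a reader built on some other path still ends up with the values stored in the job config. Whether that fallback belongs in the reader at all is the open design question raised in the thread.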