Now that the CSVFileSource is in Crunch 0.8.3, I’ve been trying to integrate it into the project that originally spurred its creation. However, I’m running into some weird issues.
Reading a new CSVFileSource and directly materializing and using it works fine; that scenario is already covered in CSVFileSourceIT:
https://github.com/apache/crunch/blob/apache-crunch-0.8.3/crunch-core/src/it/java/org/apache/crunch/io/text/csv/CSVFileSourceIT.java#L41

But as soon as I try to do something extra with that PCollection (say, use count() to turn it into a PTable, grab its key set, then print it out), everything falls apart.

New test: https://github.com/champgm/crunch/blob/master/crunch-core/src/it/java/org/apache/crunch/io/text/csv/CSVFileSourceIT.java#L56
Result: http://pastebin.com/f7iUQ73N

It seems that, when additional operations are added to the pipeline, a CSVRecordReader is created in CrunchRecordReader without going through the CSVFileSource or CSVInputFormat flow, where its various parsing options are normally configured.

I was able to fix this issue by copying the “configure” method from CSVInputFormat and adding it to the beginning of the “initialize” method of CSVRecordReader, which forces it to check the job config and configure itself if some options are null, but I don’t really feel like this is ideal.

Did I miss something when I was designing this set of classes? Is this behavior expected?
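
To make the workaround concrete, here is a rough, self-contained sketch of the pattern, not the actual Crunch code. Every name in it (LazyConfigSketch, SketchRecordReader, the two config keys, the default values) is hypothetical, and a plain Map stands in for the Hadoop job configuration. The only point is the null check at the top of initialize: a reader that was constructed without its options reads them from the job config, while a reader that was already configured keeps what it was given.

```java
import java.util.HashMap;
import java.util.Map;

public class LazyConfigSketch {

  // Hypothetical config keys; the keys actually used by CSVInputFormat may differ.
  static final String BUFFER_SIZE_KEY = "sketch.csv.buffersize";
  static final String ENCODING_KEY = "sketch.csv.inputfileencoding";

  /** Stand-in for a record reader whose options may not have been set by its input format. */
  static class SketchRecordReader {
    private Integer bufferSize; // null means "never configured"
    private String encoding;    // null means "never configured"

    /** Path taken when the reader is created without going through the input format. */
    SketchRecordReader() {
    }

    /** Path taken when the input format passes the options in explicitly. */
    SketchRecordReader(int bufferSize, String encoding) {
      this.bufferSize = bufferSize;
      this.encoding = encoding;
    }

    /** Fills in any missing option from the job config before doing anything else. */
    void initialize(Map<String, String> jobConf) {
      if (bufferSize == null) {
        // Assumed default, for illustration only.
        bufferSize = Integer.parseInt(jobConf.getOrDefault(BUFFER_SIZE_KEY, "65536"));
      }
      if (encoding == null) {
        // Assumed default, for illustration only.
        encoding = jobConf.getOrDefault(ENCODING_KEY, "UTF-8");
      }
    }

    String describe() {
      return bufferSize + "/" + encoding;
    }
  }

  public static void main(String[] args) {
    Map<String, String> jobConf = new HashMap<>();
    jobConf.put(BUFFER_SIZE_KEY, "1024");
    jobConf.put(ENCODING_KEY, "UTF-16");

    // Created "bare": picks its options up from the job config.
    SketchRecordReader bare = new SketchRecordReader();
    bare.initialize(jobConf);
    System.out.println(bare.describe());

    // Created with explicit options: the job config does not override them.
    SketchRecordReader preset = new SketchRecordReader(2048, "UTF-8");
    preset.initialize(jobConf);
    System.out.println(preset.describe());
  }
}
```

The downside, and the reason this feels less than ideal to me, is that the configuration logic ends up living in two places (the input format and the reader), so the two can drift apart.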