1) One thing I really don¹t understand is how the tests which need extra configuration (like the one which uses asterisks as quotation marks) can pass if this configure method isn¹t being run. I know it¹s not; I¹ve even commented out that entire method and, somehow, those files are still parsed as expected. How is CSVInputFormat being configured in these cases?
2) That¹s how I fixed things at first. Though, at the time I felt that I was duplicating the configuration logic, so it didn¹t seem very appealing to me. Anyway, my confusion aside, I went with your first suggestion. CSVInputFormat now implements Configurable and the overridden setConf(Configuration) method handles the local configuration of the various control characters. All tests pass as expected now. I will clean up what I¹ve done (including pipeline.run in the tests) create a JIRA, and attach a patch ASAP. Let me know if you think I¹ve overlooked anything or if there are any test cases that I¹ve missed. Thank you for taking a look, -Mac On 6/24/14, 5:56 PM, "Josh Wills" <jwi...@cloudera.com> wrote: >Okay, looking a bit closer: > >1) CSVInputFormat specifies an implementation of configure(JobConf), but >should extend the Configured abstract base class so that the system knows >to call configure() on the class once it gets instantiated-- right now, >that configure(JobConf) isn't being called, and so the params for the CSV >file we use to configure the CSVRecordReader aren't getting set. It should >also have the signature be configure(Configuration) instead of >configure(JobConf). > >2) The other option would be to read the configuration parameters we need >for the CSVRecordReader from the Configuration object that is attached to >the TaskAttemptContext in the call to createRecordReader. That's what we >do >in AvroInputFormat to pull the schema out of the configuration: >https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org >/apache/crunch/types/avro/AvroInputFormat.java > >J > > >On Tue, Jun 24, 2014 at 3:28 PM, Josh Wills <jwi...@cloudera.com> wrote: > >> Well, the test itself looks a little odd to me- why are you calling >> pipeline.run() right after pipeline.read(new CSVFileSource(...))? >>There's >> nothing for the pipeline to do at that point. >> >> J >> >> >> On Tue, Jun 24, 2014 at 10:53 AM, Champion,Mac <mac.champ...@cerner.com> >> wrote: >> >>> Now that the CSVFileSource is in crunch 0.8.3, I¹ve been trying to >>> integrate it into the project that originally spurred its creation. >>> However, I¹m running into some weird issues. >>> >>> Reading and directly materializing and using a new CSVFileSource works >>> fine, that scenario is already in the CSVFileSourceIT. >>> >>> >>>https://github.com/apache/crunch/blob/apache-crunch-0.8.3/crunch-core/sr >>>c/it/java/org/apache/crunch/io/text/csv/CSVFileSourceIT.java#L41 >>> >>> But, as soon as I try to do something extra with that PCollection, say, >>> use count() to turn it into a PTable, grab its key set, then print it >>>out, >>> everything falls apart >>> New Test: >>> >>> >>>https://github.com/champgm/crunch/blob/master/crunch-core/src/it/java/or >>>g/apache/crunch/io/text/csv/CSVFileSourceIT.java#L56 >>> >>> Result: >>> http://pastebin.com/f7iUQ73N >>> >>> It seems that, when some additional actions are added to the pipeline, >>>a >>> CSVRecordReader is being created in CrunchRecordReader without going >>> through the CSVFileSource or CSVInputFormat flow, where its various >>>parsing >>> options are normally configured. >>> >>> I was able to fix this issue by copying the "configure² method from >>> CSVInputFormat and adding it to the beginning of the ³initialize² >>>method of >>> the CSVRecordReader, which forces it to check the job config and >>>configure >>> itself if some options are null, but I don¹t really feel like this is >>> ideal. Did I miss something when I was designing this set of classes? >>>Is >>> this behavior expected? >>> >>> CONFIDENTIALITY NOTICE This message and any included attachments are >>>from >>> Cerner Corporation and are intended only for the addressee. The >>>information >>> contained in this message is confidential and may constitute inside or >>> non-public information under international, federal, or state >>>securities >>> laws. Unauthorized forwarding, printing, copying, distribution, or use >>>of >>> such information is strictly prohibited and may be unlawful. If you >>>are not >>> the addressee, please promptly delete this message and notify the >>>sender of >>> the delivery error by e-mail or you may call Cerner's corporate >>>offices in >>> Kansas City, Missouri, U.S.A at (+1) (816)221-1024. >>> >> >> >> >> -- >> Director of Data Science >> Cloudera <http://www.cloudera.com> >> Twitter: @josh_wills <http://twitter.com/josh_wills> >> > > > >-- >Director of Data Science >Cloudera <http://www.cloudera.com> >Twitter: @josh_wills <http://twitter.com/josh_wills>