[ 
https://issues.apache.org/jira/browse/CRUNCH-663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16345731#comment-16345731
 ] 

Ben Roling commented on CRUNCH-663:
-----------------------------------

So, my initial quick and dirty solution to this problem is for the 
CrunchRecordReader to share the path to the current file via a property on the 
Configuration.  The property might be called something like "crunch.split.file" 
and each time initNextRecordReader() is invoked to move to the next chunk of 
the CombineFile input split, that property would get updated to point to the 
new file.

DoFns that want to know which file they are working on would read that 
property.
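
A minimal sketch of the idea, for illustration only. It does not use the real 
Hadoop or Crunch classes: Conf and CombinedReader are hypothetical stand-ins for 
Configuration and CrunchRecordReader, and tagWithFile stands in for the lookup a 
DoFn would do. Only the property name "crunch.split.file" and the method name 
initNextRecordReader() come from the proposal above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SplitFileSketch {
    // Property name suggested in the proposal; not an existing Crunch constant.
    static final String SPLIT_FILE_KEY = "crunch.split.file";

    /** Hypothetical stand-in for org.apache.hadoop.conf.Configuration. */
    static final class Conf {
        private final Map<String, String> props = new HashMap<>();
        void set(String key, String value) { props.put(key, value); }
        String get(String key) { return props.get(key); }
    }

    /** Hypothetical stand-in for a reader over a combined split (one chunk per file). */
    static final class CombinedReader {
        private final Conf conf;
        private final List<String> paths;
        private final List<List<String>> chunks;
        private int chunk = -1; // index of the current chunk; -1 before the first one
        private int pos = 0;    // position inside the current chunk

        CombinedReader(Conf conf, List<String> paths, List<List<String>> chunks) {
            this.conf = conf;
            this.paths = paths;
            this.chunks = chunks;
        }

        /** Moves to the next chunk and publishes its path on the shared config. */
        boolean initNextRecordReader() {
            chunk++;
            pos = 0;
            if (chunk >= paths.size()) {
                return false;
            }
            conf.set(SPLIT_FILE_KEY, paths.get(chunk));
            return true;
        }

        /** Returns the next record, or null once every chunk is exhausted. */
        String next() {
            while (chunk < paths.size()
                    && (chunk < 0 || pos >= chunks.get(chunk).size())) {
                if (!initNextRecordReader()) {
                    return null;
                }
            }
            return chunk < paths.size() ? chunks.get(chunk).get(pos++) : null;
        }
    }

    /** What a DoFn might do: look up the current file at the time it sees the record. */
    static String tagWithFile(Conf conf, String record) {
        return conf.get(SPLIT_FILE_KEY) + ":" + record;
    }

    /** Runs the reader over two made-up files and tags each record with its path. */
    static List<String> demo() {
        Conf conf = new Conf();
        CombinedReader reader = new CombinedReader(
                conf,
                Arrays.asList("/data/a.txt", "/data/b.txt"),
                Arrays.asList(Arrays.asList("r1", "r2"), Arrays.asList("r3")));
        List<String> out = new ArrayList<>();
        for (String rec = reader.next(); rec != null; rec = reader.next()) {
            out.add(tagWithFile(conf, rec));
        }
        return out;
    }

    public static void main(String[] args) {
        for (String line : demo()) {
            System.out.println(line);
        }
    }
}
```

In this sketch the property always reflects the file the reader is currently 
on, so the lookup has to happen per record rather than once up front. Whether 
the same holds in a real pipeline depends on the DoFn seeing the same 
Configuration instance the reader updates, which is an assumption here.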

I will share a proof-of-concept patch.  I'm curious for feedback on whether or 
not Crunch would find such a solution acceptable.  Obviously, any DoFn that 
chooses to use this property to get the file path is tied to the assumption 
that it is actually reading from a file-based source.

This solution was somewhat inspired by this thread on StackOverflow:
[https://stackoverflow.com/questions/17105173/hadoop-how-to-get-each-file-path-in-combinefileinputformat]

That thread revealed to me that the native 
org.apache.hadoop.mapred.lib.CombineFileRecordReader sets a config property 
named "map.input.file".

> Expose Record-level File Path to Processing Functions
> -----------------------------------------------------
>
>                 Key: CRUNCH-663
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-663
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Ben Roling
>            Assignee: Josh Wills
>            Priority: Major
>
> We have some processing pipelines where we want to know the file path that 
> each record being processed came from.  It would be nice if this could be 
> exposed to the DoFns in our pipelines.
>  
> This same desire was expressed a little over 1 year ago on the mailing list:
> [http://mail-archives.apache.org/mod_mbox/crunch-user/201611.mbox/%3CCAG-tO+Y42KRFiocg1RJT4qFcyvkPjFSfZa4z=wk34arip4w...@mail.gmail.com%3E]
>  
> Unfortunately, that thread dead-ended.
>  
> I will use the comments section and a patch to propose a simple, albeit 
> slightly hacky solution.  Another alternative would be to create a new Source 
> that provides a PCollection<Pair<Path, Record>>, but I'm not sure of the 
> effort it would take to create that.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
