[ 
https://issues.apache.org/jira/browse/BEAM-7753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on BEAM-7753 started by Soumabrata Chakraborty.
----------------------------------------------------
> All File based IO to provide flexibility to plugin custom logic to create 
> output element from data and file metadata
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: BEAM-7753
>                 URL: https://issues.apache.org/jira/browse/BEAM-7753
>             Project: Beam
>          Issue Type: Improvement
>          Components: io-java-files
>            Reporter: Soumabrata Chakraborty
>            Assignee: Soumabrata Chakraborty
>            Priority: Major
>
> Currently the structure of the different File IO classes seems to be to let the 
> format-specific IO (e.g. TextIO, XmlIO, etc.) provide a SourceFunction that 
> knows how to split a file for that specific format and how to read records 
> for that format.
> However, the developer/end-user has no choice in terms of how the output 
> element is constructed or what its type would be.
> For example, each format specific IO will typically convert from 
> PCollection<ReadableFile> --> PCollection<T> where T varies for different 
> file formats (E.g. T = String for TextIO while T = Pojo generated from XSD 
> for XmlIO and so on)
> At the moment, the end-user can add a ParDo of <T> --> <OUT> i.e. convert the 
> PCollection<T> --> PCollection<OUT>
> However, OUT in the above case can only be constructed from file data and the 
> user has no easy way to get access to the file metadata from which the record 
> T originated.
> For example, the OUT record might need to contain metadata of the file 
> location from which the record originated.
> i.e. We want f(T, ReadableFile) -> OUT instead of f(T) -> OUT
> To do this, every File based IO should give the user the flexibility to 
> plug in a function that controls how OUT is created from Data + 
> Metadata (T + ReadableFile + Other Metadata where applicable)
> I would be happy to take up and implement this task if folks feel that this 
> is a worthy goal to achieve in the File based IOs.
> Possible solutions:
> 1. The simpler solution (but less flexible) would be to simply convert 
> ReadAllViaFileBasedSource.ReadFileRangesFn from DoFn<KV<ReadableFile, 
> OffsetRange>, T> --> to --> DoFn<KV<ReadableFile, OffsetRange>, 
> KV<ReadableFile, T>>
>  or by extension convert ReadAllViaFileBasedSource from 
> PTransform<PCollection<ReadableFile>, PCollection<T>> --> to --> 
> PTransform<PCollection<ReadableFile>, PCollection<KV<ReadableFile, T>>>
> However, this approach is restrictive in the sense that we assume that the 
> only metadata the user is interested in is the metadata available within 
> ReadableFile.
>  If the user needs to have access to other metadata information like offset 
> ranges or other format specific metadata, then this design won't allow for 
> that.
> 2. The more flexible solution is to allow the user to configure a function, 
> say EncodeFn<T, OUT> with a signature that looks like OUT 
> encode(ReadableFile, T). That way the user has full control over the type of 
> OUT and the user also has access to metadata (ReadableFile) and can thus 
> build OUT from data + metadata (T + ReadableFile)
> The first option then simply becomes a special case of this, where we use 
> EncodeFn<T, KV<ReadableFile, T>> (i.e. OUT = KV<ReadableFile, T>)
>  Also, it is easy to maintain backward compatibility with existing readAll() 
> features of all File Based IOs since they essentially evaluate to a special 
> case where we use EncodeFn<T, T> (OUT = T)
> This change would need to be done in a homogeneous way across all the existing 
> File Based IO classes
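
A minimal sketch of the second option, to make the shape of the proposed hook concrete. It does not use Beam: FileRef is a hypothetical stand-in for ReadableFile, Map.Entry stands in for KV, and readAll() here is an illustrative helper rather than the real transform. Only the name EncodeFn and the signature OUT encode(ReadableFile, T) come from the issue text; everything else is an assumption.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class EncodeFnSketch {

  /** Hypothetical stand-in for Beam's ReadableFile; it only carries a path. */
  static final class FileRef {
    final String path;

    FileRef(String path) {
      this.path = path;
    }
  }

  /** The hook proposed in the issue: build OUT from a record plus its source file. */
  @FunctionalInterface
  interface EncodeFn<T, OUT> {
    OUT encode(FileRef file, T record);
  }

  /** Backward-compatible special case: OUT = T, file metadata is ignored. */
  static <T> EncodeFn<T, T> identity() {
    return (file, record) -> record;
  }

  /** Option 1 as a special case: OUT = (file, record) pair. */
  static <T> EncodeFn<T, Map.Entry<FileRef, T>> withFile() {
    return (file, record) -> new SimpleImmutableEntry<>(file, record);
  }

  /** Illustrative read step: applies the EncodeFn to every record of one file. */
  static <T, OUT> List<OUT> readAll(FileRef file, List<T> records, EncodeFn<T, OUT> fn) {
    List<OUT> out = new ArrayList<>();
    for (T record : records) {
      out.add(fn.encode(file, record));
    }
    return out;
  }

  public static void main(String[] args) {
    FileRef file = new FileRef("/data/input.txt");
    List<String> lines = Arrays.asList("a", "b");

    // Custom OUT: tag each line with the path of the file it came from.
    List<String> tagged = readAll(file, lines, (f, line) -> f.path + ":" + line);
    System.out.println(tagged);

    // The two special cases described in the issue.
    EncodeFn<String, String> same = identity();
    EncodeFn<String, Map.Entry<FileRef, String>> paired = withFile();
    System.out.println(readAll(file, lines, same));
    System.out.println(readAll(file, lines, paired).get(0).getKey().path);
  }
}
```

In this sketch the existing behaviour and option 1 are both just particular EncodeFn instances, which is the backward-compatibility argument made in the issue. Passing further metadata such as offset ranges would need a wider signature than the one shown.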



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
