[
https://issues.apache.org/jira/browse/BEAM-7753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Work on BEAM-7753 started by Soumabrata Chakraborty.
----------------------------------------------------
> All File based IO to provide flexibility to plugin custom logic to create
> output element from data and file metadata
> --------------------------------------------------------------------------------------------------------------------
>
> Key: BEAM-7753
> URL: https://issues.apache.org/jira/browse/BEAM-7753
> Project: Beam
> Issue Type: Improvement
> Components: io-java-files
> Reporter: Soumabrata Chakraborty
> Assignee: Soumabrata Chakraborty
> Priority: Major
>
> Currently, the File IO classes seem to be structured so that each
> format-specific IO (e.g. TextIO, XmlIO, etc.) provides a SourceFunction that
> knows how to split a file of that format and how to read its records.
> However, the developer/end-user has no choice in terms of how the output
> element is constructed or what its type would be.
> For example, each format-specific IO will typically convert
> PCollection<ReadableFile> --> PCollection<T>, where T varies by file
> format (e.g. T = String for TextIO, while T = a POJO generated from an XSD
> for XmlIO, and so on).
> At the moment, the end user can add a ParDo of <T> --> <OUT>, i.e. convert
> PCollection<T> --> PCollection<OUT>.
> However, OUT in the above case can only be constructed from file data and the
> user has no easy way to get access to the file metadata from which the record
> T originated.
> For example, the OUT record might need to contain metadata of the file
> location from which the record originated.
> i.e. we want f(T, ReadableFile) -> OUT instead of f(T) -> OUT.
> To do this, every File based IO should give the user the flexibility to
> plug in a function that controls how OUT is created from Data +
> Metadata (T + ReadableFile + Other Metadata where applicable).
> I would be happy to take up and implement this task if folks feel that this
> is a worthy goal to achieve in the File based IOs.
> Possible solutions:
> 1. The simpler solution (but less flexible) would be to simply convert
> ReadAllViaFileBasedSource.ReadFileRangesFn from DoFn<KV<ReadableFile,
> OffsetRange>, T> --> to --> DoFn<KV<ReadableFile, OffsetRange>,
> KV<ReadableFile, T>>
> or, by extension, convert ReadAllViaFileBasedSource from
> PTransform<PCollection<ReadableFile>, PCollection<T>> --> to -->
> PTransform<PCollection<ReadableFile>, PCollection<KV<ReadableFile, T>>>
> However, this approach is restrictive in the sense that we assume that the
> only metadata the user is interested in is the metadata available within
> ReadableFile.
> If the user needs access to other metadata, such as offset
> ranges or other format-specific metadata, then this design won't allow for
> that.
> 2. The more flexible solution is to allow the user to configure a function,
> say EncodeFn<T, OUT> with a signature that looks like OUT
> encode(ReadableFile, T). That way the user has full control over the type of
> OUT and the user also has access to metadata (ReadableFile) and can thus
> build OUT from data + metadata (T + ReadableFile)
> The first option then simply becomes a special case of this, where we use
> EncodeFn<T, KV<ReadableFile, T>> (i.e. OUT = KV<ReadableFile, T>).
> Also, it is easy to maintain backward compatibility with existing readAll()
> features of all File Based IOs since they essentially evaluate to a special
> case where we use EncodeFn<T, T> (OUT = T)
> This change would need to be made in a homogeneous way across all the existing
> File Based IO classes.
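
For illustration, the second option described above could be sketched roughly
as follows. This is only a minimal, Beam-free sketch and not the proposed
implementation: ReadableFileStub is a hypothetical stand-in for
FileIO.ReadableFile, Map.Entry stands in for KV, and readAll(), identity() and
withFile() are made-up helper names. Only EncodeFn and its encode(ReadableFile, T)
shape come from the description.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

class EncodeFnSketch {

  /** Hypothetical stand-in for ReadableFile: a path plus the records it holds. */
  static final class ReadableFileStub {
    final String path;
    final List<String> lines;

    ReadableFileStub(String path, String... lines) {
      this.path = path;
      this.lines = Arrays.asList(lines);
    }
  }

  /** The function from option 2: builds OUT from file metadata plus a record T. */
  interface EncodeFn<T, OUT> {
    OUT encode(ReadableFileStub file, T record);
  }

  /** Backward-compatible special case: OUT = T, the metadata is ignored. */
  static <T> EncodeFn<T, T> identity() {
    return (file, record) -> record;
  }

  /** Option 1 as a special case: OUT = (file, record) pair, standing in for KV. */
  static <T> EncodeFn<T, Map.Entry<ReadableFileStub, T>> withFile() {
    return (file, record) -> new SimpleImmutableEntry<>(file, record);
  }

  /** Assumed helper that mimics a readAll(): applies the encoder to every record. */
  static <OUT> List<OUT> readAll(List<ReadableFileStub> files, EncodeFn<String, OUT> fn) {
    List<OUT> out = new ArrayList<>();
    for (ReadableFileStub file : files) {
      for (String line : file.lines) {
        out.add(fn.encode(file, line));
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<ReadableFileStub> files = Arrays.asList(
        new ReadableFileStub("/data/a.txt", "x", "y"),
        new ReadableFileStub("/data/b.txt", "z"));

    // Custom OUT built from data + metadata: each record is tagged with its path.
    List<String> tagged = readAll(files, (file, line) -> file.path + ":" + line);
    System.out.println(tagged);

    // Existing behaviour: records pass through unchanged.
    System.out.println(readAll(files, EncodeFnSketch.<String>identity()));
  }
}
```

In this sketch the caller chooses OUT by choosing the EncodeFn, so the existing
readAll() behaviour and the KV<ReadableFile, T> variant both fall out as
particular encoders rather than separate code paths.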