[
https://issues.apache.org/jira/browse/BEAM-7753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Soumabrata Chakraborty updated BEAM-7753:
-----------------------------------------
Description:
Currently, the different File IO classes seem to be structured so that each
format-specific IO (e.g. TextIO, XmlIO, etc.) provides a SourceFunction that
knows how to split a file for that specific format and how to read records for
that format.
However, the developer/end-user has no choice in terms of how the output
element is constructed or what its type would be.
For example, each format-specific IO will typically convert from
PCollection<ReadableFile> --> PCollection<T>, where T varies for different file
formats (e.g. T = String for TextIO, while T = a POJO generated from XSD for
XmlIO, and so on).
At the moment, the end-user can add a ParDo of <T> --> <OUT>, i.e. convert the
PCollection<T> --> PCollection<OUT>
However, OUT in the above case can only be constructed from file data, and the
user has no easy way to access the metadata of the file from which the record T
originated.
For example, the OUT record might need to contain the location of the file
from which the record originated, i.e. we want f(T, ReadableFile) -> OUT
instead of f(T) -> OUT
To do this, every File based IO should give the user the flexibility to
plug in a function that controls how OUT is created from Data +
Metadata (T + ReadableFile + Other Metadata where applicable)
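
To make the difference between the two shapes concrete, here is a minimal
sketch in plain Java. It does not use the Beam API: FileRef is a hypothetical
stand-in for ReadableFile that models only the file location, and the class,
field and path names are illustrative assumptions.

```java
import java.util.function.BiFunction;
import java.util.function.Function;

// Hypothetical stand-in for Beam's ReadableFile; only the location is modeled.
final class FileRef {
    final String location;

    FileRef(String location) {
        this.location = location;
    }
}

final class RecordMappingSketch {
    // f(T) -> OUT: the shape available to a downstream ParDo today.
    // Only the record is visible; the originating file is not.
    static final Function<String, String> DATA_ONLY = line -> line.trim();

    // f(T, ReadableFile) -> OUT: the requested shape, where the file the
    // record came from is a second input.
    static final BiFunction<String, FileRef, String> DATA_AND_FILE =
        (line, file) -> file.location + ": " + line.trim();

    public static void main(String[] args) {
        // Illustrative path, not taken from the issue.
        FileRef file = new FileRef("/data/input/part-0001.txt");
        System.out.println(DATA_ONLY.apply("  hello  "));
        System.out.println(DATA_AND_FILE.apply("  hello  ", file));
    }
}
```

The second function can tag each output with its source file, which the first
cannot do without some side channel.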
I would be happy to take up and implement this task if folks feel that this is
a worthy goal to achieve in the File based IOs.
Possible solutions:
1. The simpler (but less flexible) solution would be to convert
ReadAllViaFileBasedSource.ReadFileRangesFn from DoFn<KV<ReadableFile,
OffsetRange>, T> --> to --> DoFn<KV<ReadableFile, OffsetRange>,
KV<ReadableFile, T>>
or, by extension, convert ReadAllViaFileBasedSource from
PTransform<PCollection<ReadableFile>, PCollection<T>> --> to -->
PTransform<PCollection<ReadableFile>, PCollection<KV<ReadableFile, T>>>
However, this approach is restrictive in the sense that we assume that the only
metadata the user is interested in is the metadata available within
ReadableFile.
If the user needs access to other metadata, such as offset ranges or other
format-specific metadata, then this design won't allow for that.
2. The more flexible solution is to allow the user to configure a function, say
EncodeFn<T, OUT> with a signature that looks like OUT encode(ReadableFile, T).
That way the user has full control over the type of OUT and the user also has
access to metadata (ReadableFile) and can thus build OUT from data + metadata
(T + ReadableFile)
The first option then simply becomes a special case of this, where we use
EncodeFn<T, KV<ReadableFile, T>> (i.e. OUT = KV<ReadableFile, T>)
Also, it is easy to maintain backward compatibility with existing readAll()
features of all File Based IOs since they essentially evaluate to a special
case where we use EncodeFn<T, T> (OUT = T)
This change would need to be made in a homogeneous way across all the existing
File Based IO classes.
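
As a rough illustration of the second option, the sketch below models EncodeFn
in plain Java, together with the two special cases mentioned above. It is not
Beam code: SourceFile stands in for ReadableFile, Map.Entry stands in for KV,
and readAll, identity and withFile are hypothetical helper names chosen for
this example only.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for Beam's ReadableFile.
final class SourceFile {
    final String location;

    SourceFile(String location) {
        this.location = location;
    }
}

// The proposed hook: build OUT from a record T plus the file it came from.
interface EncodeFn<T, OUT> {
    OUT encode(SourceFile file, T record);
}

final class EncodeFnSketch {
    // Backward-compatible special case (OUT = T): the file is ignored.
    static <T> EncodeFn<T, T> identity() {
        return (file, record) -> record;
    }

    // Option 1 as a special case: OUT is a (file, record) pair,
    // with Map.Entry standing in for KV<ReadableFile, T>.
    static <T> EncodeFn<T, Map.Entry<SourceFile, T>> withFile() {
        return (file, record) -> new SimpleImmutableEntry<>(file, record);
    }

    // Stand-in for the read step: applies the user's EncodeFn to every
    // record that was read from a single file.
    static <T, OUT> List<OUT> readAll(
            SourceFile file, List<T> records, EncodeFn<T, OUT> fn) {
        List<OUT> out = new ArrayList<>();
        for (T record : records) {
            out.add(fn.encode(file, record));
        }
        return out;
    }

    public static void main(String[] args) {
        SourceFile file = new SourceFile("/data/a.txt");
        List<String> lines = Arrays.asList("x", "y");
        // Existing behaviour: records pass through unchanged.
        System.out.println(readAll(file, lines, EncodeFnSketch.<String>identity()));
        // Custom OUT built from data + metadata.
        System.out.println(readAll(file, lines, (f, line) -> f.location + "#" + line));
    }
}
```

Additional metadata such as offset ranges could be passed as further
parameters of encode, although the issue text does not specify how.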
> All File based IO to provide flexibility to plugin custom logic to create
> output element from data and file metadata
> --------------------------------------------------------------------------------------------------------------------
>
> Key: BEAM-7753
> URL: https://issues.apache.org/jira/browse/BEAM-7753
> Project: Beam
> Issue Type: Improvement
> Components: io-java-files
> Reporter: Soumabrata Chakraborty
> Priority: Major
>
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)