JingGe opened a new pull request #17501:
URL: https://github.com/apache/flink/pull/17501
## What is the purpose of the change
The goal of this PR to provide AvroParquetRecordFormat implementation to
read avro GenericRecords from parquet via the new Flink FileSource.
This is the draft PR, there is one failed unit test case, which is
documented out for now with TODO comment. I am still working on it.
## Brief change log
- Create RecordFormat interface which focuses on Path with default methods
implementation and delegate createReader() calls to the overloaded methods from
StreamFormat that focuses on FDDataInputStream.
- Create AvroParquetRecordFormat implementation. Only reading avro
GenericRecord from parquet file or stream is supported in this version. Support
for other avro record types will be implemented later.
- Splitting is not supported in this version.
## Open Questions
To give you some background, the original idea was to let
AvroParquetRecordFormat implement FileRecordFormat. After considering that
FileRecordFormat and StreamFormat have too many commons and StreamFormat has
more built-in features like compression support(via StreamFormatAdapter),
current design is based on StreamFormat. In order to keep SIP clear, 2-levels
interfaces have been defined. Let StreamFormat focuses on the abstract input
stream and let RecordFormat pay attention to the concrete FileSystem, i.e. the
Path. RecordFormat provides default implementation for the overloaded
createReader(...) methods. Subclasses are therefore not forced to implement it.
Following are some questions open for discussion:
1. Compare to the 2-levels interfaces design, as another option, all default
methods implemented in the RecordFormat could be merged into the StreamFormat.
Such design keeps all createReader(...) methods in one interface(StreamFormat
only, no more RecordFormat) that is in some ways easier to use. The downside is
the SIP violation. Based on these consideration, I chose the current 2-levels
API design.
2. After considering priorities, Splitting is currently not supported, since
we didn't get strong requirement from the business side. It will be implemented
later, when the time comes.
3. If this design works well, as next step, we should consider replacing the
FileRecordFormat with the RecordFormat. In this way, the duplicated code and
javacode could be avoided. This question is a little bit out of the scope of
this PR, but does relate to the topic, could be therefore discussed too.
## Verifying this change
This change added tests and can be verified as follows:
New unit test validates that :
- Reader can be created and restored correctly via Path.
- Violation constraint checks for no splitting, null path, and no
restoreOffset are working well.
- GenericRecords can be read correctly from parquet file.
## Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): (yes / **no**)
- The public API, i.e., is any changed class annotated with
`@Public(Evolving)`: (**yes** / no)
- The serializers: (yes / **no** / don't know)
- The runtime per-record code paths (performance sensitive): (**yes** / no
/ don't know)
- Anything that affects deployment or recovery: JobManager (and its
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / **no** / don't
know)
- The S3 file system connector: (yes / **no** / don't know)
## Documentation
- Does this pull request introduce a new feature? (**yes** / no)
- If yes, how is the feature documented? (not applicable / docs /
**JavaDocs** / not documented)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]