JingGe opened a new pull request #17501:
URL: https://github.com/apache/flink/pull/17501


   ## What is the purpose of the change
   
   The goal of this PR is to provide an AvroParquetRecordFormat implementation 
that reads Avro GenericRecords from Parquet via the new Flink FileSource. 
   
   Another goal of this PR is to get everyone on the same page w.r.t. the 
implementation design. Once the design is settled, splitting support and Avro 
record types beyond GenericRecord can be added as new features in follow-up 
PRs.
   
   This is a draft PR. One unit test case still fails; it is commented out for 
now with a TODO comment, and I am still working on it.
   
   ## Brief change log
   
   - Create a RecordFormat interface that focuses on Path. Its default method 
implementations delegate createReader() calls to the overloaded methods from 
StreamFormat, which focuses on FSDataInputStream.
   - Create the AvroParquetRecordFormat implementation. Only reading Avro 
GenericRecord from a Parquet file or stream is supported in this version. 
Support for other Avro record types will be implemented later.
   - Splitting is not supported in this version. 
   
   ## Open Questions
   
   To give you some background, the original idea was to let 
AvroParquetRecordFormat implement FileRecordFormat. Because FileRecordFormat 
and StreamFormat have a lot in common, and StreamFormat has more built-in 
features such as compression support (via StreamFormatAdapter), the current 
design is based on StreamFormat. In order to keep SIP clear, a two-level 
interface hierarchy has been defined: StreamFormat focuses on the abstract 
input stream, and RecordFormat pays attention to the concrete FileSystem, i.e. 
the Path. RecordFormat provides default implementations for the overloaded 
createReader(...) methods, so subclasses are not forced to implement them.
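   
   The two-level idea can be illustrated with a minimal, self-contained sketch. 
It uses plain `java.io` and `java.nio` types instead of Flink's 
FSDataInputStream and Path, and the interfaces below are simplified, 
hypothetical stand-ins that only share names with the ones in this PR. 
`RecordReader`, `LineFormat` and `readAll` are made up for illustration; the 
real signatures take more parameters.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified stand-ins for illustration only; NOT Flink's actual interfaces.
class TwoLevelFormatSketch {

    /** Reads one record per call; returns null once the input is exhausted. */
    interface RecordReader<T> extends AutoCloseable {
        T read() throws IOException;

        @Override
        void close() throws IOException;
    }

    /** Level 1: only knows about an abstract input stream. */
    interface StreamFormat<T> {
        RecordReader<T> createReader(InputStream stream) throws IOException;
    }

    /** Level 2: knows about the path and delegates to the stream overload. */
    interface RecordFormat<T> extends StreamFormat<T> {
        default RecordReader<T> createReader(Path path) throws IOException {
            // Open the file, then hand the stream to the level-1 method.
            return createReader(Files.newInputStream(path));
        }
    }

    /** Toy format (one text line per record); implements only the stream method. */
    static final class LineFormat implements RecordFormat<String> {
        @Override
        public RecordReader<String> createReader(InputStream stream) {
            BufferedReader in =
                    new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8));
            return new RecordReader<String>() {
                @Override
                public String read() throws IOException {
                    return in.readLine();
                }

                @Override
                public void close() throws IOException {
                    in.close();
                }
            };
        }
    }

    /** Reads every record of a file through the path-based default method. */
    static List<String> readAll(RecordFormat<String> format, Path path) throws IOException {
        List<String> records = new ArrayList<>();
        try (RecordReader<String> reader = format.createReader(path)) {
            for (String r = reader.read(); r != null; r = reader.read()) {
                records.add(r);
            }
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("records", ".txt");
        Files.write(file, Arrays.asList("first", "second"));
        System.out.println(readAll(new LineFormat(), file));
        Files.delete(file);
    }
}
```

   In this sketch the concrete format never touches the Path overload, which is 
the property the default methods are meant to provide.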
   
   The following questions are open for discussion:
   
   1. Compared to the two-level interface design, another option is to merge 
all default methods implemented in RecordFormat into StreamFormat. That design 
keeps all createReader(...) methods in one interface (StreamFormat only, no 
RecordFormat), which is in some ways easier to use. The downside is the SIP 
violation. Based on these considerations, I chose the current two-level API 
design.
   2. Splitting is currently not supported. After weighing priorities, we have 
no strong requirement for it from the business side. It will be implemented 
later, when the need arises.
   3. If this design works well, the next step should be to consider replacing 
FileRecordFormat with RecordFormat. That way, the duplicated code and Javadoc 
could be avoided. This question is a little out of the scope of this PR, but it 
relates to the topic and could therefore be discussed too.
   
   ## Verifying this change
   
   This change added tests and can be verified as follows:
   
     New unit tests validate that:
   - A reader can be created and restored correctly via Path.
   - Constraint violation checks for no splitting, null path, and no 
restoreOffset work as expected.
   - GenericRecords can be read correctly from a Parquet file.
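   
   As a rough illustration of the constraint checks listed above, the sketch 
below shows one way such preconditions could be expressed. The class, the 
method `checkWholeFileRead`, its parameters and the `NO_OFFSET` sentinel are 
all hypothetical and are not taken from this PR's code.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Objects;

// Hypothetical precondition checks for a reader without splitting or restore support.
class ReaderPreconditionsSketch {

    // Assumed sentinel meaning "no offset to restore from".
    static final long NO_OFFSET = -1L;

    static void checkWholeFileRead(
            Path path, long splitOffset, long splitLength, long fileLength, long restoredOffset) {
        // Null path: fail fast with a NullPointerException.
        Objects.requireNonNull(path, "path must not be null");
        // No splitting: the split has to cover the whole file.
        if (splitOffset != 0L || splitLength != fileLength) {
            throw new IllegalArgumentException("splitting is not supported");
        }
        // No restoreOffset: only the sentinel value is accepted.
        if (restoredOffset != NO_OFFSET) {
            throw new IllegalArgumentException("restoring from an offset is not supported");
        }
    }

    /** Returns true if the given check throws one of the expected exceptions. */
    static boolean rejects(Runnable check) {
        try {
            check.run();
            return false;
        } catch (IllegalArgumentException | NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Path file = Paths.get("data.parquet");
        // A whole-file read without a restored offset passes all checks.
        System.out.println(rejects(() -> checkWholeFileRead(file, 0L, 100L, 100L, NO_OFFSET)));
        // A partial split, a null path, and a restored offset are each rejected.
        System.out.println(rejects(() -> checkWholeFileRead(file, 10L, 50L, 100L, NO_OFFSET)));
        System.out.println(rejects(() -> checkWholeFileRead(null, 0L, 100L, 100L, NO_OFFSET)));
        System.out.println(rejects(() -> checkWholeFileRead(file, 0L, 100L, 100L, 42L)));
    }
}
```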
   
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): (**yes** / no)
     - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: (**yes** / no)
     - The serializers: (yes / **no** / don't know)
     - The runtime per-record code paths (performance sensitive): (**yes** / no 
/ don't know)
     - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / **no** / don't 
know)
     - The S3 file system connector: (yes / **no** / don't know)
   
   ## Documentation
   
     - Does this pull request introduce a new feature? (**yes** / no)
     - If yes, how is the feature documented? (not applicable / docs / 
**JavaDocs** / not documented)
   

