[GitHub] [flink] JingGe opened a new pull request #17501: [Draft][FLINK-21406][RecordFormat] build AvroParquetRecordFormat for the new FileSource

GitBox Sat, 16 Oct 2021 17:47:30 -0700


JingGe opened a new pull request #17501:
URL: https://github.com/apache/flink/pull/17501



   ## What is the purpose of the change
   
   The goal of this PR to provide AvroParquetRecordFormat implementation to 
read avro GenericRecords from parquet via the new Flink FileSource. 
   
   This is the draft PR, there is one failed unit test case, which is 
documented out for now with TODO comment. I am still working on it.
   
   ## Brief change log
   
   - Create RecordFormat interface which focuses on Path with default methods 
implementation and delegate createReader() calls to the overloaded methods from 
StreamFormat that focuses on FDDataInputStream.
   - Create AvroParquetRecordFormat implementation. Only reading avro 
GenericRecord from parquet file or stream is supported in this version. Support 
for other avro record types will be implemented later.
   - Splitting is not supported in this version. 
   
   ## Open Questions
   
   To give you some background, the original idea was to let 
AvroParquetRecordFormat implement FileRecordFormat. After considering that 
FileRecordFormat and StreamFormat have too many commons and StreamFormat has 
more built-in features like compression support(via StreamFormatAdapter), 
current design is based on StreamFormat. In order to keep SIP clear, 2-levels 
interfaces have been defined.  Let StreamFormat focuses on the abstract input 
stream and let RecordFormat pay attention to the concrete FileSystem, i.e. the 
Path. RecordFormat provides default implementation for the overloaded 
createReader(...) methods. Subclasses are therefore not forced to implement it.
   
   Following are some questions open for discussion:
   
   1. Compare to the 2-levels interfaces design, as another option, all default 
methods implemented in the RecordFormat could be merged into the StreamFormat. 
Such design keeps all createReader(...) methods in one interface(StreamFormat 
only, no more RecordFormat) that is in some ways easier to use. The downside is 
the SIP violation. Based on these consideration, I chose the current 2-levels 
API design.
   2. After considering priorities, Splitting is currently not supported, since 
we didn't get strong requirement from the business side. It will be implemented 
later, when the time comes.
   3. If this design works well, as next step, we should consider replacing the 
FileRecordFormat with the RecordFormat. In this way, the duplicated code and 
javacode could be avoided. This question is a little bit out of the scope of 
this PR, but does relate to the topic, could be therefore discussed too.
   
   ## Verifying this change
   
   This change added tests and can be verified as follows:
   
     New unit test validates that : 
   - Reader can be created and restored correctly via Path.
   - Violation constraint checks for no splitting, null path, and no 
restoreOffset are working well.
   - GenericRecords can be read correctly from parquet file.
   
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): (yes / **no**)
     - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: (**yes** / no)
     - The serializers: (yes / **no** / don't know)
     - The runtime per-record code paths (performance sensitive): (**yes** / no 
/ don't know)
     - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / **no** / don't 
know)
     - The S3 file system connector: (yes / **no** / don't know)
   
   ## Documentation
   
     - Does this pull request introduce a new feature? (**yes** / no)
     - If yes, how is the feature documented? (not applicable / docs / 
**JavaDocs** / not documented)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [flink] JingGe opened a new pull request #17501: [Draft][FLINK-21406][RecordFormat] build AvroParquetRecordFormat for the new FileSource

Reply via email to