slinkydeveloper edited a comment on pull request #17520:
URL: https://github.com/apache/flink/pull/17520#issuecomment-948377409


   @tsreaper I'm starting to get your point about laziness, but now I wonder: why is 
Avro different from, for example, Parquet? In the Parquet format, I see that when 
you invoke `readBatch` you load the actual data into memory and then return an 
iterator over it. So the I/O is performed within the 
`readBatch` invocation. The same holds for all the other formats I've looked at.
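
   To make the distinction concrete, here is a minimal, self-contained sketch of the two strategies. It does not use the Flink API: `EagerVsLazy`, `readBatchEagerly`, `readBatchLazily`, `fetch` and `ioLog` are hypothetical names, and the "I/O" is only simulated by logging each read.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical illustration only; none of these names exist in Flink.
class EagerVsLazy {
    // Records every simulated read, so we can see when the I/O happens.
    static final List<String> ioLog = new ArrayList<>();

    // Stand-in for an expensive read from a file.
    static String fetch(int i) {
        ioLog.add("read " + i);
        return "record-" + i;
    }

    // Eager: all I/O happens inside this call; the iterator only walks memory.
    static Iterator<String> readBatchEagerly(int size) {
        List<String> batch = new ArrayList<>();
        for (int i = 0; i < size; i++) {
            batch.add(fetch(i));
        }
        return batch.iterator();
    }

    // Lazy: no I/O happens here; each next() call performs one read.
    static Iterator<String> readBatchLazily(int size) {
        return new Iterator<String>() {
            private int next = 0;

            @Override
            public boolean hasNext() {
                return next < size;
            }

            @Override
            public String next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return fetch(next++);
            }
        };
    }
}
```

   With the eager variant, `ioLog` is fully populated as soon as `readBatchEagerly` returns. With the lazy variant it stays empty until whoever holds the iterator starts calling `next()`.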
   
   Your lazy reading changes the threading model of the source, because 
with this PR the "consumer" thread (`SourceReaderBase::pollNext` in particular) 
becomes the thread that performs the actual I/O. Why is Avro 
special in this sense? Why don't we do the same for Parquet? And are we sure 
this is the way the source architecture is intended to work? Couldn't this 
cause issues, because the original design might not have considered performing 
I/O within the consumer thread?   
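
   The threading concern can be shown with a simplified model in which one thread produces a batch and hands it over a queue to another thread that consumes it. This is only a sketch under that assumption, not Flink code: `ThreadingSketch`, `run`, `readBatch`, `read` and the thread name `fetcher` are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical model of a fetcher/consumer handover; not the Flink classes.
class ThreadingSketch {
    // Simulated read that records the name of the thread performing it.
    static int read(int i, List<String> ioThreads) {
        ioThreads.add(Thread.currentThread().getName());
        return i;
    }

    static Iterator<Integer> readBatch(boolean lazy, int size, List<String> ioThreads) {
        if (!lazy) {
            // Eager: reads happen now, on whichever thread calls readBatch.
            List<Integer> loaded = new ArrayList<>();
            for (int i = 0; i < size; i++) {
                loaded.add(read(i, ioThreads));
            }
            return loaded.iterator();
        }
        // Lazy: reads are deferred to whichever thread calls next().
        return new Iterator<Integer>() {
            private int next = 0;

            @Override
            public boolean hasNext() {
                return next < size;
            }

            @Override
            public Integer next() {
                return read(next++, ioThreads);
            }
        };
    }

    // Produces a batch on a "fetcher" thread, then drains it on the calling
    // (consumer) thread. Returns the names of the threads that did the reads.
    static List<String> run(boolean lazy) {
        List<String> ioThreads = Collections.synchronizedList(new ArrayList<>());
        BlockingQueue<Iterator<Integer>> handover = new ArrayBlockingQueue<>(1);

        Thread fetcher = new Thread(() -> {
            try {
                handover.put(readBatch(lazy, 2, ioThreads));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "fetcher");
        fetcher.start();

        try {
            Iterator<Integer> batch = handover.take();
            fetcher.join();
            while (batch.hasNext()) {
                batch.next();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while consuming", e);
        }
        return ioThreads;
    }
}
```

   In this model `run(false)` attributes every read to the `fetcher` thread, while `run(true)` attributes every read to the thread that called `run`, which is the shift I am asking about.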
   
   Even with laziness, I still think we don't need concurrency primitives to 
serialize `nextBatch` invocations, because each `SourceOperator` has its own 
instance of `SourceReader` (look at the method `SourceOperator::initReader`), 
and `SourceOperator` is what invokes `SourceReaderBase::pollNext`, 
which triggers the actual reading of Avro records from your lazy operator.


