[GitHub] [arrow] westonpace commented on a change in pull request #10509: ARROW-12996: Add bytes_read() to StreamingReader

GitBox Tue, 15 Jun 2021 12:12:30 -0700


westonpace commented on a change in pull request #10509:
URL: https://github.com/apache/arrow/pull/10509#discussion_r652079281




##########
File path: cpp/src/arrow/csv/reader.h
##########
@@ -73,6 +73,9 @@ class ARROW_EXPORT StreamingReader : public RecordBatchReader 
{
 
   virtual Future<std::shared_ptr<RecordBatch>> ReadNextAsync() = 0;
 
+  /// \brief Returns the number of bytes which have been read by this reader
+  virtual int64_t bytes_read() const = 0;

Review comment:
       I think `bytes_read` means "bytes the CSV reader is completely finished 
with".  So serial readahead (e.g. the readahead happening on the I/O context or 
"data read but not parsed or decoded") should not be included.  Caller 
consumption should be irrelevant.
   
   For parallel readahead (e.g. the CSV reader reading/parsing/decoding 
multiple batches of data at the same time) then my opinion is that `bytes_read` 
should be incremented as soon as a batch is ready to be delivered (even if 
there are other batches in front of it that aren't ready).
   
   Perhaps `bytes_processed` or `bytes_finished` would remove the ambiguity?  
Or maybe just a clearer docstring.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] westonpace commented on a change in pull request #10509: ARROW-12996: Add bytes_read() to StreamingReader

Reply via email to