westonpace commented on pull request #9802: URL: https://github.com/apache/arrow/pull/9802#issuecomment-814212658
Postmortem comments now that I'm reviewing this in more detail to merge :smiley:

* If you accept a record batch reader then "in-memory" could be misleading. There is nothing preventing you from passing an IPC reader of any kind. In the future we might want to rename this to something like `ExternalDataset`, `PipedDataset`, or `StreamingDataset`.
* Until we interface with Python async there is no way to really scan this asynchronously. I can either scan it on a background thread or use the CPU thread and simply pray that the reader doesn't block. For now I'll do the latter, but going forward maybe we should split this into two different dataset classes: an `InMemoryDataset` which wraps a list of batches (or a table), and a piped dataset which wraps a reader/iterator (see the sketch below). The latter would be consumed by an I/O thread while the former would just get consumed on the CPU thread.
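A minimal sketch of the distinction, using today's `pyarrow` Python API (the constructor and scanner signatures here are current pyarrow, not necessarily what this PR ships; `PipedDataset`/`StreamingDataset` don't exist, they're just the names floated above):

```python
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"x": [1, 2, 3]})

# Fully materialized data: every batch already lives in memory, so scanning
# on the CPU thread pool can never block on I/O.
in_memory = ds.InMemoryDataset(table)
print(in_memory.to_table().num_rows)

# A reader: this one happens to be backed by in-memory batches, but nothing
# stops a caller from handing us an IPC stream reading off a socket. This is
# the case a hypothetical PipedDataset/StreamingDataset would own, draining
# the reader from an I/O thread instead of the CPU thread.
reader = pa.RecordBatchReader.from_batches(table.schema, table.to_batches())
scanner = ds.Scanner.from_batches(reader)
print(scanner.to_table().num_rows)
```

Note that the reader case is also single-shot: once the scanner has drained it, a second scan would come up empty, which is another way "in-memory" semantics and "piped" semantics differ.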
