westonpace commented on pull request #9802: URL: https://github.com/apache/arrow/pull/9802#issuecomment-814212658
Postmortem comments now that I'm reviewing this in more detail to merge :smiley:

* If you accept a record batch reader then "in-memory" could be misleading. There is nothing preventing you from passing an IPC reader of any kind. In the future we might want to rename this to something like `ExternalDataset`, `PipedDataset`, or `StreamingDataset`.
* Until we interface with Python async there is no way to really scan this asynchronously. I can either scan it on a background thread or use the CPU thread and simply pray that the reader doesn't block. For now I'll do the latter, but going forward maybe we should split this into two different dataset classes: an `InMemoryDataset` which wraps a list of batches (or a table), and a piped dataset which wraps a reader/iterator (see the sketch below). The latter would be consumed by an I/O thread while the former would just get consumed on the CPU thread.
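A minimal sketch of the distinction, using today's `pyarrow` Python API (the constructor and scanner signatures here are current pyarrow, not necessarily what this PR ships; `PipedDataset`/`StreamingDataset` don't exist, they're just the names floated above):

```python
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"x": [1, 2, 3]})

# Fully materialized data: every batch already lives in memory, so scanning
# on the CPU thread pool can never block on I/O.
in_memory = ds.InMemoryDataset(table)
print(in_memory.to_table().num_rows)

# A reader: this one happens to be backed by in-memory batches, but nothing
# stops a caller from handing us an IPC stream reading off a socket. This is
# the case a hypothetical PipedDataset/StreamingDataset would own, draining
# the reader from an I/O thread instead of the CPU thread.
reader = pa.RecordBatchReader.from_batches(table.schema, table.to_batches())
scanner = ds.Scanner.from_batches(reader)
print(scanner.to_table().num_rows)
```

Note that the reader case is also single-shot: once the scanner has drained it, a second scan would come up empty, which is another way "in-memory" semantics and "piped" semantics differ.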
