ianmcook commented on issue #44561:
URL: https://github.com/apache/arrow/issues/44561#issuecomment-2444773143
It's easy enough to create a record batch reader from a collection of
Arrow IPC stream files that share the same schema, like this:
```py
import glob
import pyarrow as pa

def get_schema(paths):
    # All files are assumed to share the same schema, so read it from the first one
    with open(paths[0], "rb") as file:
        reader = pa.ipc.open_stream(file)
        return reader.schema

def get_batches(paths):
    # Lazily yield the record batches from each file in turn
    for path in paths:
        with open(path, "rb") as file:
            reader = pa.ipc.open_stream(file)
            for batch in reader:
                yield batch

paths = glob.glob("*.arrows")
reader = pa.RecordBatchReader.from_batches(
    get_schema(paths),
    get_batches(paths),
)
```
Still, it would be nice to have a method in PyArrow that makes this more
efficient and concise to express.
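For illustration only, a sketch of the kind of one-liner this might enable (the function name here is purely hypothetical and does not exist in PyArrow today):
```py
import glob
import pyarrow as pa

paths = glob.glob("*.arrows")

# Hypothetical convenience API: open several IPC stream files that share
# a schema as a single RecordBatchReader (name is illustrative only)
reader = pa.ipc.open_streams(paths)
```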