Luosuu commented on issue #38275:
URL: https://github.com/apache/arrow/issues/38275#issuecomment-1769031185

   @kou Hi, thank you for your reply!
   
   Yes, essentially I would like to parallelize Apache Arrow data reads, and I also hope to understand the read mechanism better.
   
   Sorry for the confusion; let me describe my problem in more detail. I have 20 Arrow files, each of them 55GB.
   My application needs to read a few random rows (for example, 32 rows at random indices) from these 20 Arrow files.
   
   Since they cannot all be loaded into RAM at once, they are memory-mapped:
   
   ```python
   import os
   import pyarrow as pa

   # Memory-map each file and read its IPC stream into a Table; for
   # uncompressed IPC data this only references the mapped pages.
   mmap_files = [pa.memory_map(os.path.join(dir_path, file_name), 'r') for file_name in file_names]
   mmap_tables = [pa.ipc.open_stream(memory_mapped_stream).read_all() for memory_mapped_stream in mmap_files]
   ```
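   
   To make "random rows" concrete, the access pattern is roughly the following (a minimal sketch, not my exact code; picking a single table per lookup and using `Table.take()` are simplifications for illustration):
   
   ```python
   import random
   import numpy as np

   # Sketch of one random-row read: pick one of the memory-mapped tables and
   # gather a handful of rows at random indices. Table.take() gathers just the
   # requested rows from the memory-mapped buffers.
   def read_random_rows(mmap_tables, num_rows=32):
       table = random.choice(mmap_tables)
       indices = np.random.randint(0, table.num_rows, size=num_rows)
       return table.take(pa.array(indices))
   ```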
   
   My real concern is that when the files are large, reading multiple RecordBatches becomes slow, while this does not seem to be an issue with smaller Arrow files. Is this expected?
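   
   For reference, this is roughly how the slowdown shows up (a hypothetical timing sketch using the helper above, not my exact benchmark):
   
   ```python
   import time

   # Time the same random-row read against tables backed by large vs. small
   # memory-mapped files to compare the latency.
   start = time.perf_counter()
   batch = read_random_rows(mmap_tables, num_rows=32)
   print(f"read 32 random rows in {time.perf_counter() - start:.3f}s")
   ```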
   
   Initially I thought that, when the file is large, reading multiple non-contiguous RecordBatches triggers page faults that block the read. So I was trying to parallelize the reads of different RecordBatches so that they would not block each other, as in the sketch below. As @mapleFU mentioned, maybe this is caused by "swap".
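   
   The parallelization I experimented with looks roughly like this (a sketch only; `read_random_rows` is the hypothetical helper above, and whether threads actually help depends on whether the page faults occur with the GIL released):
   
   ```python
   from concurrent.futures import ThreadPoolExecutor

   # Issue several independent random-row reads concurrently, hoping that a
   # page fault in one read does not stall the others.
   def read_random_rows_parallel(mmap_tables, num_requests=4, num_rows=32):
       with ThreadPoolExecutor(max_workers=num_requests) as pool:
           futures = [pool.submit(read_random_rows, mmap_tables, num_rows)
                      for _ in range(num_requests)]
           return [f.result() for f in futures]
   ```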

