Luosuu opened a new issue, #38275:
URL: https://github.com/apache/arrow/issues/38275

   ### Describe the usage question you have. Please include as many useful details as possible.
   
   
   Currently I have some Arrow streaming-format files that I read into lists of RecordBatches, and I have the indices of the RecordBatches I want to read. I would like to read them in parallel for efficiency.
   
   ```python
   import os
   import time
   from concurrent.futures import ThreadPoolExecutor

   import numpy as np
   import pyarrow as pa

   def read_rbatch(rbatch):
       # Read a single row from the batch and record wall-clock start/end times.
       start_time = time.time()
       res = rbatch.take(pa.array([1]))
       end_time = time.time()
       return res, [start_time, end_time]

   # dir_path and file_names point to the streaming-format files mentioned above.
   # Memory-map each file and read it into a Table.
   mmap_files = [pa.memory_map(os.path.join(dir_path, file_name), 'r')
                 for file_name in file_names]
   mmap_tables = [pa.ipc.open_stream(memory_mapped_stream).read_all()
                  for memory_mapped_stream in mmap_files]
   large_table = pa.concat_tables(mmap_tables)
   batches_list = large_table.to_batches()

   # Pick 32 random batches to read.
   random_indices = np.random.randint(0, len(batches_list) - 1, size=32).tolist()
   batches_to_read = [batches_list[idx] for idx in random_indices]

   results = []
   with ThreadPoolExecutor() as executor:
       results_generator = executor.map(read_rbatch, batches_to_read)
       results.extend(results_generator)

   # Separate the results and timing information
   res_batches, timings = zip(*results)

   res_table = pa.Table.from_batches(res_batches)
   print(res_table)
   ```
   
   However, based on the recorded timing data, the reads do not seem to be well parallelized.
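   
   A quick way to quantify the overlap from `timings` (a rough sketch, not part of the measurement above) is to compare the overall wall-clock span with the sum of the individual durations; with good parallelism the span should be much smaller than the sum:
   
   ```python
   # Rough overlap check (illustrative sketch): if the 32 reads ran truly in
   # parallel, the wall-clock span should be much smaller than the summed durations.
   wall_span = max(end for _, end in timings) - min(start for start, _ in timings)
   summed = sum(end - start for start, end in timings)
   print(f"wall-clock span: {wall_span:.4f}s, summed durations: {summed:.4f}s")
   ```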
   
   I sorted the start_time and end_time to show this:
   
![time_gantt](https://github.com/apache/arrow/assets/43507393/28a01f0f-08bb-4cfa-a30d-fee7bd124651)
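   
   (A timeline like the one above can be drawn from `timings` with something like the following matplotlib sketch; this is only illustrative, not the exact plotting code used.)
   
   ```python
   import matplotlib.pyplot as plt

   # Illustrative sketch: one horizontal bar per read_rbatch call,
   # spanning its recorded start_time..end_time, sorted by start time.
   t0 = min(start for start, _ in timings)
   for i, (start, end) in enumerate(sorted(timings)):
       plt.barh(i, end - start, left=start - t0)
   plt.xlabel("seconds since first read started")
   plt.ylabel("read call (sorted by start time)")
   plt.show()
   ```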
   
   
   
   ### Component(s)
   
   Python

