Josh-Hiz opened a new issue, #43213:
URL: https://github.com/apache/arrow/issues/43213
### Describe the usage question you have. Please include as many useful details as possible.
I am trying to use a `ThreadPoolExecutor` to process a large query in smaller batches that can be quickly converted to pandas. However, whenever I call ``to_batches`` on a PyArrow Dataset, every batch contains just one row: for a 1000-row table it produces 1000 batches, even when I set the `batch_size` to a number greater than the number of rows (in which case I would expect a single batch). Am I using this function incorrectly? What exactly am I missing?
Code:
```python
import concurrent.futures
import threading

import pandas as pd
from pyarrow.dataset import ParquetReadOptions

# `dt` (the source table), `columns`, and `max_threads` are defined elsewhere
dataset = dt.to_pyarrow_dataset(
    parquet_read_options=ParquetReadOptions(coerce_int96_timestamp_unit="ms")
)
batch_iter = dataset.to_batches(columns=columns, batch_size=4000)  # the size of the data is 1000 rows

dfs: list[pd.DataFrame] = []

def process_data(batch_data) -> pd.DataFrame:
    print(f"Processing batch on thread {threading.current_thread().name}, DFS: {len(dfs)}")
    return batch_data.to_pandas()

with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
    future_to_data = {executor.submit(process_data, batch): batch for batch in batch_iter}
    for future in concurrent.futures.as_completed(future_to_data):
        try:
            data = future.result()
            dfs.append(data)
        except Exception as e:
            print(f"Error occurred converting batch: {e}")
```
### Component(s)
Python