Josh-Hiz opened a new issue, #43213:
URL: https://github.com/apache/arrow/issues/43213
### Describe the usage question you have. Please include as many useful details as possible.
I am trying to use a `ThreadPoolExecutor` to process a large query in smaller batches that can be quickly converted to pandas. However, whenever I call ``to_batches`` on a PyArrow Dataset, every batch contains just one row: for a 1000-row table it produces 1000 batches, even when I set the `batch_size` to a number greater than the number of rows (in which case I would expect a single batch). Am I using this function incorrectly? What exactly am I missing?
Code:
```python
import concurrent.futures
import threading

import pandas as pd
from pyarrow.dataset import ParquetReadOptions

# `dt` (the source table), `columns`, and `max_threads` are defined elsewhere
dataset = dt.to_pyarrow_dataset(
    parquet_read_options=ParquetReadOptions(coerce_int96_timestamp_unit="ms")
)
batch_iter = dataset.to_batches(columns=columns, batch_size=4000)  # the size of the data is 1000 rows

dfs: list[pd.DataFrame] = []

def process_data(batch_data) -> pd.DataFrame:
    print(f"Processing batch on thread {threading.current_thread().name}, DFS: {len(dfs)}")
    return batch_data.to_pandas()

with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
    future_to_data = {executor.submit(process_data, batch): batch for batch in batch_iter}
    for future in concurrent.futures.as_completed(future_to_data):
        try:
            data = future.result()
            dfs.append(data)
        except Exception as e:
            print(f"Error occurred converting batch: {e}")
```
### Component(s)
Python