izapolsk opened a new issue, #14699:
URL: https://github.com/apache/arrow/issues/14699
### Describe the bug, including details regarding any error messages, version, and platform.
I encountered an issue with `CompressedOutputStream` in pyarrow 10.0.0: when compressing a file larger than 16 GB, the write finishes without any error, but the output is silently truncated, so the decompressed file comes back as exactly 16 GB. I reproduced this several times with different Arrow files.

Environment: Debian/Ubuntu
```python
import pyarrow as pa
from pathlib import Path
import datasets as ds
#%%
pa.__version__
>> '10.0.0'
#%%
data_dir = Path('~/tmp').expanduser()
big_dataset = data_dir.joinpath('train.arrow')
#%%
!ls -lh ~/tmp/train.arrow
>> -rw-rw-r-- 1 yzapols yzapols 28G Nov 11 13:35 ~/tmp/train.arrow
#%%
!md5sum ~/tmp/train.arrow
>> 5afe31d206ce07249c127e067bcfa0fb ~/tmp/train.arrow
#%%
schema = pa.schema([...])
#%%
compressed_dataset = data_dir.joinpath('train.arrow.bz2')
with pa.ipc.open_stream(str(big_dataset)) as istream:
    with pa.OSFile(str(compressed_dataset), 'wb') as output_file:
        with pa.CompressedOutputStream(output_file, compression='bz2') as ostream:
            with pa.RecordBatchStreamWriter(ostream, schema) as writer:
                try:
                    while True:
                        writer.write_batch(istream.read_next_batch())
                except StopIteration:
                    print('done')
>> done
#%%
!ls -lh ~/tmp/train.arrow.bz2
>> -rw-rw-r-- 1 yzapols yzapols 2.4G Nov 11 17:54 ~/tmp/train.arrow.bz2
#%%
!mv ~/tmp/train.arrow ~/tmp/train.arrow.old
#%%
!bunzip2 -k ~/tmp/train.arrow.bz2
#%%
!ls -lh ~/tmp/train.arrow
>> -rw-rw-r-- 1 yzapols yzapols 16G Nov 11 17:54 ~/tmp/train.arrow
#%%
!md5sum ~/tmp/train.arrow
>> 2460c2c81c5c8672f4b488cfa2ecd8c1 ~/tmp/train.arrow
```
### Component
Python