lmocsi opened a new issue, #39768:
URL: https://github.com/apache/arrow/issues/39768
### Describe the bug, including details regarding any error messages, version, and platform.
I'd like to create a sample hive-partitioned dataset in parquet format.
Parameter "b" controls the amount of data.
- b = 1010000 -> 10000 customer ids / 7440000 records -> Runs in 2 seconds
- b = 1050000 -> 50000 customer ids / 37200000 records -> Runs in 8 minutes
- b = 1100000 -> 100000 customer ids / 74400000 records -> Did not finish in 19 minutes
Shouldn't the runtime scale linearly with the number of records?
```
#!pip install --upgrade polars==0.20.5
#!pip install --upgrade pyarrow==15.0.0
import polars as pl
import pyarrow.dataset as ds
from dateutil import rrule
from datetime import datetime
def ido():
    # Current timestamp, used to label the progress messages below
    return datetime.now().strftime('%Y.%m.%d. %H:%M:%S')
print(ido(),'started')
a = 1000000
b = 1010000 # Runs in 2 seconds
#b = 1050000 # Runs in 8 minutes
#b = 1100000 # Did not finish in 19 minutes
df1 = pl.DataFrame({'PARTY_ID': [i for i in range(a, b)]})
df2 = pl.DataFrame({'CALENDAR_DATE': [datetime.strftime(i, '%Y-%m-%d %H:%M:%S')
                                      for i in rrule.rrule(rrule.DAILY, count=186,
                                                           dtstart=datetime(2023, 7, 21))]})
df3 = pl.DataFrame({'CREDIT_FL': ['Y', 'N', 'Y', 'Y'],
                    'AMOUNT': [123, 789, 22, 44]})
# Cross join: every customer id x every date x every df3 row
df4 = (df1.join(df2, how='cross')
          .join(df3, how='cross'))
print(ido(),'data created')
print(ido(),df4.shape)
dft = df4.to_arrow()
print(ido(),'data converted to arrow')
ds.write_dataset(
    dft,
    'my_table',
    format="parquet",
    partitioning=["CALENDAR_DATE"],
    partitioning_flavor="hive",
    existing_data_behavior="delete_matching",
)
print(ido(),'finished')
```
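For reference, the record counts quoted above can be checked with a few lines of arithmetic, along with what a linear extrapolation from the 2-second run would predict. This is only a sketch: the helper `expected_rows` and the constant names are made up for illustration, and the 2-second baseline is taken from the timings reported above, not measured here. Note that the number of partitions stays at 186 for every value of `b`; only the number of rows per partition grows.

```python
# Sizes taken from the reproduction script above.
DAYS = 186           # count=186 in the rrule call -> 186 distinct CALENDAR_DATE values
ROWS_PER_PAIR = 4    # number of rows in df3


def expected_rows(n_ids):
    """Rows produced by cross-joining n_ids customers with the dates and df3 (hypothetical helper)."""
    return n_ids * DAYS * ROWS_PER_PAIR


# Baseline reported in the issue: 10000 customer ids ran in about 2 seconds.
BASELINE_IDS = 10_000
BASELINE_SECONDS = 2

for n_ids in (10_000, 50_000, 100_000):
    rows = expected_rows(n_ids)
    # If runtime were proportional to the record count, it would grow by n_ids / BASELINE_IDS.
    linear_estimate = BASELINE_SECONDS * n_ids / BASELINE_IDS
    print(f"{n_ids:>7} ids -> {rows:>10,} rows, linear estimate ~{linear_estimate:.0f} s")
```

Under that assumption the two larger runs would take roughly 10 and 20 seconds, compared with the observed 8 minutes and more than 19 minutes.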
### Component(s)
Python