[I] Write_dataset() does not scale linearly with dataset size [arrow]

via GitHub Tue, 23 Jan 2024 11:10:38 -0800


lmocsi opened a new issue, #39768:
URL: https://github.com/apache/arrow/issues/39768


   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   I'd like to create a sample hive-partitioned dataset in parquet format.
   Parameter "b" controls the amount of data.
   b = 1010000 -> 10000 customer ids / 7440000 records -> Runs in 2 seconds
   b = 1050000 -> 50000 customer ids / 37200000 records -> Runs in 8 minutes
   b = 1100000 -> 100000 customer ids / 74400000 records -> Did not finish in 19 minutes
   Shouldn't it scale linearly?
   ```
   #!pip install --upgrade polars==0.20.5
   #!pip install --upgrade pyarrow==15.0.0
    
   import polars as pl
   import pyarrow.dataset as ds
    
   from dateutil import rrule
   from datetime import datetime
   
   def ido():
       return datetime.now().strftime('%Y.%m.%d. %H:%M:%S')
   
   print(ido(),'started')
   
   a = 1000000
   b = 1010000 # Runs in 2 seconds
   #b = 1050000 # Runs in 8 minutes
   #b = 1100000 # Did not finish in 19 minutes
   df1 = pl.DataFrame({'PARTY_ID': [i for i in range(a, b)]})
   df2 = pl.DataFrame({'CALENDAR_DATE': [datetime.strftime(i,'%Y-%m-%d %H:%M:%S')
                                         for i in list(rrule.rrule(rrule.DAILY, count=186,
                                                                   dtstart=datetime(2023, 7, 21)))]})
   df3 = pl.DataFrame({'CREDIT_FL': ['Y','N','Y', 'Y'],
                       'AMOUNT': [123, 789, 22, 44]})
    
   df4 = (df1.join(df2,
                   how='cross'
                   )
             .join(df3,
                   how='cross'
                   )
          )
   print(ido(),'data created')
   print(ido(),df4.shape)
   
   dft = df4.to_arrow()
   print(ido(),'data converted to arrow')
   
   ds.write_dataset(
           dft,
           'my_table',
           format="parquet",
           partitioning=["CALENDAR_DATE"],
           partitioning_flavor="hive",
           existing_data_behavior="delete_matching",
       )
   
   print(ido(),'finished')
   ```
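
For reference, the record counts quoted above follow from the two cross joins: customer ids x 186 calendar days x 4 rows in df3. The sketch below recomputes them and shows what linear scaling from the 2-second run would roughly predict. The helper names `expected_rows` and `linear_estimate_seconds` are made up for illustration, and the 2-second baseline is simply the timing reported above, not a separate measurement.

```python
# Rough sanity check of the numbers in this report (standard library only).
A = 1_000_000    # lower bound of the PARTY_ID range ("a" in the script)
DAYS = 186       # count passed to rrule.rrule
DF3_ROWS = 4     # rows in df3


def expected_rows(b: int) -> int:
    """Rows produced by df1 x df2 x df3 for a given upper bound b."""
    return (b - A) * DAYS * DF3_ROWS


def linear_estimate_seconds(b: int, base_b: int = 1_010_000,
                            base_seconds: float = 2.0) -> float:
    """Runtime expected if the whole script scaled linearly with row count.

    base_seconds is the reported timing for base_b (an assumption taken
    from the figures in this issue).
    """
    return base_seconds * expected_rows(b) / expected_rows(base_b)


for b in (1_010_000, 1_050_000, 1_100_000):
    print(b, expected_rows(b), linear_estimate_seconds(b))
```

Under that assumption, the b = 1050000 case would take about 10 seconds rather than the observed 8 minutes, which is roughly 48 times slower than linear.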
   
   ### Component(s)
   
   Python


