thisisnic commented on issue #40224:
URL: https://github.com/apache/arrow/issues/40224#issuecomment-1998454021

   OK, so I've been experimenting with various combinations of this, and have found that it happens with both Python and R, so it looks like a C++ issue.
   
   I'm running this in a Docker container I've built from `ubuntu:latest`, with 8 GB of RAM, 2 GB of swap, and 50% of my CPU.
   
   Here's what I've found so far:
   * everything is fine when I partition on an existing variable plus a new one I've created via projection (so I think the new-column thing was a red herring); it's slow, but it eventually completes
   * as soon as I partition on 3 variables, it eventually crashes (in both Python and R)
   
   Here's an example in pyarrow using the NYC taxi dataset (this should result 
in 924 partitions):
   
   ```py
   import pyarrow.dataset as ds

   # read the existing hive-partitioned dataset
   dataset = ds.dataset("data", partitioning="hive")

   target_dir = "data2"

   # repartition on three variables
   ds.write_dataset(
       dataset,
       target_dir,
       partitioning=["year", "month", "rate_code"]
   )
   ```
   
   I was wondering if it was related to the number of partitions, but when I run this example (which should produce fewer partitions, 396), it also eats memory until the Python process is killed.
   
   ```py
   import pyarrow.dataset as ds

   # read the existing hive-partitioned dataset
   dataset = ds.dataset("data", partitioning="hive")

   target_dir = "data2"

   # repartition on three variables - should produce ~396 partitions
   ds.write_dataset(
       dataset,
       target_dir,
       partitioning=["year", "month", "vendor_name"]
   )
   ```
   
   Happy to investigate further and capture some output, but I wasn't sure whether it would be more useful to log memory usage via `free` or to run with a debugger attached and log its output?

