westonpace commented on issue #40224:
URL: https://github.com/apache/arrow/issues/40224#issuecomment-1998920711
> OK, so I've been experimenting with various combinations of this, and have found that it happens with both Python and R, so it looks like a C++ issue.
>
> I'm running this in a Docker container I've created based on `ubuntu:latest`, with 8 GB of RAM, 2 GB of swap, and 50% of my CPU.
>
> Here's what I've found so far:
>
> * everything is fine when I partition on an existing variable and a new one that I've created via projection (so I think the new-column thing was a red herring); even if it's slow, it eventually completes
>
> * as soon as I partition on 3 variables, it eventually crashes (both in Python and R)
>
>
> Here's an example in pyarrow using the NYC taxi dataset (this should result in 924 partitions):
>
> ```python
> import pyarrow.dataset as ds
> import pyarrow as pa
>
> dataset = ds.dataset("data", partitioning="hive")
>
> target_dir = "data2"
>
> ds.write_dataset(
>     dataset,
>     target_dir,
>     partitioning=["year", "month", "rate_code"]
> )
> ```
>
> I was wondering if it was related to the number of partitions, though when I run this example (which should produce fewer partitions, 396), it also eats memory until the Python process is killed.
>
> ```python
> import pyarrow.dataset as ds
> import pyarrow as pa
>
> dataset = ds.dataset("data", partitioning="hive")
>
> target_dir = "data2"
>
> ds.write_dataset(
>     dataset,
>     target_dir,
>     partitioning=["year", "month", "vendor_name"]
> )
> ```
>
> Happy to investigate further and try to get some output, but I wasn't sure whether it'd be more useful to log memory via `free` or to run with a debugger attached and log the output.
Is the `data` in these examples something I can easily download? I will try to reproduce / study this over the weekend.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]