Thomas Cercato created ARROW-15045:
--------------------------------------

             Summary: PyArrow SIGSEGV error when using UnionDatasets
                 Key: ARROW-15045
                 URL: https://issues.apache.org/jira/browse/ARROW-15045
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 6.0.1
         Environment: Fedora Linux 35 (Workstation Edition), AMD Ryzen 5950X.
            Reporter: Thomas Cercato


h3. The context:
I am using PyArrow to read a folder structured as 
{{exchange/symbol/date.parquet}}. The folder contains multiple exchanges, each 
with multiple symbols, and each symbol with multiple files. At the time of 
writing, the folder holds about 30GB across 1.85M files.

If I use a single PyArrow Dataset to read/manage the entire folder, the 
simplest process with just the dataset defined occupies 2.3GB of RAM. The 
problem is that I am instantiating this dataset in multiple processes, and 
since every process only needs some exchanges (typically just one), I don't 
need to read all folders and files in every single process.

So I tried to use a UnionDataset composed of single-exchange Datasets. This 
way, every process loads only the required folders/files as a dataset. In a 
simple test, every process now occupies just 868MB of RAM, roughly 62% less.
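
A minimal sketch of the setup described above, not my exact code: one Dataset 
per exchange folder, combined by passing the list of child datasets to 
{{pyarrow.dataset.dataset()}}. The helper names {{exchange_dirs}} and 
{{build_union_dataset}} and the exchange names in the usage note are 
hypothetical, and the sketch assumes all exchanges share a compatible schema.

```python
from pathlib import Path


def exchange_dirs(root, exchanges):
    """Return the per-exchange directories under root that actually exist."""
    root = Path(root)
    return [root / name for name in exchanges if (root / name).is_dir()]


def build_union_dataset(root, exchanges):
    """Build one Parquet dataset per exchange and combine them.

    Assumes every exchange folder holds Parquet files with a compatible schema.
    """
    import pyarrow.dataset as ds  # imported lazily; requires pyarrow

    children = [
        ds.dataset(str(path), format="parquet")
        for path in exchange_dirs(root, exchanges)
    ]
    # Passing a list of Dataset objects to ds.dataset() creates a UnionDataset.
    return ds.dataset(children)
```

A process that only needs one or two exchanges would call something like 
{{build_union_dataset("/data/root", ["binance", "kraken"])}} and then read 
filtered data from the result with {{to_table(filter=...)}}.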

h3. The problem:
When using a single Dataset for the entire folder/files, I have no problem at 
all. I can read filtered data without problems and it's fast as duck.

But when I read filtered data from the UnionDataset, I always get a {{Process 
finished with exit code 139 (interrupted by signal 11: SIGSEGV)}} error. After 
looking into every possible source of the problem, I noticed that if I create 
a dummy folder with multiple exchanges but just some symbols, in order to 
limit the number of files to read, I don't get that error and it works 
normally. If I then copy new symbol folders (any of them) into it, I get that 
error again.

I came to think that the problem is not in my code, but is linked instead to 
the number of files that the UnionDataset is able to manage.
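
To help narrow down whether the file count is really the trigger, the number 
of Parquet files per exchange can be tallied while growing the dummy folder. 
This is only a sketch using the standard library; the function name 
{{count_parquet_files}} is hypothetical, and it assumes the 
{{exchange/symbol/date.parquet}} layout described above.

```python
import os
from collections import Counter


def count_parquet_files(root):
    """Count .parquet files under each top-level (exchange) folder of root."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        if rel == ".":
            continue  # files directly in root belong to no exchange
        exchange = rel.split(os.sep)[0]
        counts[exchange] += sum(name.endswith(".parquet") for name in filenames)
    return counts
```

Comparing these counts between a folder that works and one that crashes 
should show whether there is a consistent threshold.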

Am I correct or am I doing something wrong? Thank you all, have a nice day and 
good work.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
