[
https://issues.apache.org/jira/browse/ARROW-15045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17458340#comment-17458340
]
Thomas Cercato commented on ARROW-15045:
----------------------------------------
This is an example tree of my data folder:
data_dir
---- exchange_0_dir
-------- symbol_0_dir
------------ 2017-01-01.parquet
------------ 2017-01-02.parquet
------------ date-n.parquet
-------- symbol_1_dir
------------ 2017-01-01.parquet
------------ 2017-01-02.parquet
------------ date-n.parquet
---- exchange_1_dir
-------- symbol_0_dir
------------ 2017-01-01.parquet
------------ 2017-01-02.parquet
------------ date-n.parquet
-------- symbol_1_dir
------------ 2017-01-01.parquet
------------ 2017-01-02.parquet
------------ date-n.parquet
If I create a dataset as {{dataset(source='path/to/data_dir/',
format='parquet', partitioning=partitioning(field_names=['exchange',
'asset']))}}, it reads all the exchange directories with their content, and
since I have tons of files, that single instance occupies 2.3GB of memory per
process.
So I tried to create a UnionDataset as
{{dataset(source=[dataset(source=exchange, format='parquet',
partitioning=partitioning(field_names=['asset'])) for exchange in
[exchange_0_dir, exchange_6_dir, exchange_9_dir]])}} and it returns that
SIGSEGV error.
I just checked the data folder details, it's *28.3GB for 880219 files*, so
yeah, sorry for that mistake.
> PyArrow SIGSEGV error when using UnionDatasets
> ----------------------------------------------
>
> Key: ARROW-15045
> URL: https://issues.apache.org/jira/browse/ARROW-15045
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 6.0.1
> Environment: Fedora Linux 35 (Workstation Edition), AMD Ryzen 5950X.
> Reporter: Thomas Cercato
> Priority: Blocker
> Labels: dataset
>
> h3. The context:
> I am using PyArrow to read a folder structured as
> {{exchange/symbol/date.parquet}}. The folder contains multiple exchanges,
> multiple symbols and multiple files. At the time of writing, the folder is
> about 30GB / 1.85M files.
> If I use a single PyArrow Dataset to read/manage the entire folder, the
> simplest process with just the dataset defined will occupy 2.3GB of RAM. The
> problem is that I am instantiating this dataset in multiple processes, but
> since every process only needs some exchanges (typically just one), I don't
> need to read all folders and files in every single process.
> So I tried to use a UnionDataset composed of single-exchange Datasets. This
> way, every process just loads the required folders/files as a dataset. In a
> simple test, doing so brought every process down to just 868MB of RAM, -63%.
> h3. The problem:
> When using a single Dataset for the entire folder/files, I have no problem at
> all. I can read filtered data without problems and it's very fast.
> But when I read the UnionDataset's filtered data, I always get a {{Process
> finished with exit code 139 (interrupted by signal 11: SIGSEGV)}} error. So,
> after checking every possible source of the problem, I noticed that if I
> create a dummy folder with multiple exchanges but only some symbols, in order
> to limit the amount of files to read, I don't get that error and it works
> normally.
> If I then copy in new symbol folders (any), I get that error again.
> I ended up thinking that the problem is not in my code, but is linked instead
> to the amount of files that the UnionDataset is able to manage.
> Am I correct, or am I doing something wrong? Thank you all, have a nice day
> and good work.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)