[ https://issues.apache.org/jira/browse/ARROW-15045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17458380#comment-17458380 ]

Thomas Cercato commented on ARROW-15045:
----------------------------------------

Currently the workstation is running a long computation, so I can't run the 
code with gdb. Is gdb any different from the default debugger available in 
PyCharm? I am not that familiar with UNIX, gdb or debuggers in general, but I 
find the built-in tools from JetBrains easy to understand.

I am creating per-date files because it's easier to manage them when fetching 
data from the exchanges. If I discover corrupted or incomplete data for a 
particular date, I only have to fetch the data for that single date, which on 
average amounts to 3 calls to the exchange. This gives me control over the 
entire process of fetch/store/load/evaluate. Monthly files are bigger, and if I 
have to re-fetch just part of one, the entire file must be loaded into memory, 
modified and stored again. With yearly files this is even worse. And no, weeks 
are excluded; they are a pain in the ass to manage on year change. I hate weeks.
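The per-date layout and the "re-fetch only the bad date" check described above might be sketched like this (a minimal sketch; the function names, the `YYYY-MM-DD.parquet` naming, and the emptiness heuristic are my own assumptions, not the reporter's actual code):

```python
from pathlib import Path

# Assumed layout: root/exchange/symbol/YYYY-MM-DD.parquet
def date_file(root: Path, exchange: str, symbol: str, date: str) -> Path:
    """Path of the single daily file covering one date for one symbol."""
    return root / exchange / symbol / f"{date}.parquet"

def needs_refetch(root: Path, exchange: str, symbol: str, date: str) -> bool:
    """Treat a missing or empty daily file as corrupted/incomplete.

    Because each date lives in its own file, repairing it means
    re-fetching and rewriting only that one file, never a whole
    month or year.
    """
    p = date_file(root, exchange, symbol, date)
    return not p.exists() or p.stat().st_size == 0
```

With monthly or yearly files, the equivalent repair would require reading the whole file, patching one date's rows, and rewriting it, which is exactly the cost the per-date layout avoids.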

> PyArrow SIGSEGV error when using UnionDatasets
> ----------------------------------------------
>
>                 Key: ARROW-15045
>                 URL: https://issues.apache.org/jira/browse/ARROW-15045
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 6.0.1
>         Environment: Fedora Linux 35 (Workstation Edition), AMD Ryzen 5950X.
>            Reporter: Thomas Cercato
>            Priority: Blocker
>              Labels: dataset
>
> h3. The context:
> I am using PyArrow to read a folder structured as 
> {{exchange/symbol/date.parquet}}. The folder contains multiple exchanges, 
> multiple symbols and multiple files. At the time I am writing the folder is 
> about 30GB/1.85M files.
> If I use a single PyArrow Dataset to read/manage the entire folder, the 
> simplest process with just the dataset defined occupies 2.3 GB of RAM. The 
> problem is that I am instantiating this dataset in multiple processes, but 
> since every process only needs some exchanges (typically just one), I don't 
> need to read all folders and files in every single process.
> So I tried to use a UnionDataset composed of single-exchange Datasets. In 
> this way, every process loads only the required folders/files as a dataset. 
> In a simple test, every process now occupies just 868 MB of RAM, a 63% 
> reduction.
> h3. The problem:
> When using a single Dataset for the entire folder/files, I have no problem at 
> all. I can read filtered data without problems and it's fast as duck.
> But when I read the UnionDataset's filtered data, I always get a {{Process 
> finished with exit code 139 (interrupted by signal 11: SIGSEGV)}} error. 
> After checking every possible source of the problem, I noticed that if I 
> create a dummy folder with multiple exchanges but only some symbols, in 
> order to limit the number of files to read, I don't get that error and it 
> works normally. If I then copy in new symbol folders (any of them), I get 
> that error again.
> I have come to think that the problem is not in my code, but is instead 
> linked to the number of files that the UnionDataset is able to manage.
> Am I correct or am I doing something wrong? Thank you all, have a nice day 
> and good work.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
