[ 
https://issues.apache.org/jira/browse/ARROW-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-5791.
---------------------------------
       Resolution: Fixed
    Fix Version/s: 0.14.0

Issue resolved by pull request 4762
[https://github.com/apache/arrow/pull/4762]

> [Python] pyarrow.csv.read_csv hangs + eats all RAM
> --------------------------------------------------
>
>                 Key: ARROW-5791
>                 URL: https://issues.apache.org/jira/browse/ARROW-5791
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.13.0
>         Environment: Ubuntu Xenial, python 2.7
>            Reporter: Bogdan Klichuk
>            Assignee: Micah Kornfield
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.14.0
>
>         Attachments: csvtest.py, graph.svg, sample_32768_cols.csv, 
> sample_32769_cols.csv
>
>          Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> I have a fairly sparse dataset in CSV format: a wide table with only a few 
> rows but many (32k) columns. Total size is ~540K.
> When I read the dataset using `pyarrow.csv.read_csv`, it hangs, gradually 
> consumes all memory, and gets killed.
> More details on the conditions follow. The script to run and all mentioned 
> files are attached.
> 1) `sample_32769_cols.csv` is the dataset that suffers the problem.
> 2) `sample_32768_cols.csv` is the dataset that DOES NOT suffer and is read in 
> under 400ms on my machine. It is the same dataset minus the ONE last column. 
> That last column is no different from the others and has empty values.
> I have no idea why exactly this one column makes the difference between 
> proper execution and a hang that looks like a memory leak.
> I have created a flame graph for case (1) to help resolve this issue 
> (`graph.svg`).
>  
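
A minimal sketch of how the two cases described above might be reproduced without the attachments. It is not the attached `csvtest.py`, whose contents are not shown here. The helper names `write_wide_csv` and `read_with_pyarrow`, the `c0, c1, ...` column names, and the row count are hypothetical, and the sketch assumes Python 3 with pyarrow installed. The only library call relied on is `pyarrow.csv.read_csv`, as named in the report. Note that 32768 is 2**15, so the failing file is the first one to exceed that column count.

```python
import csv
import os
import tempfile


def write_wide_csv(path, n_cols, n_rows=3):
    """Write a sparse CSV with n_cols columns and n_rows mostly-empty rows.

    Hypothetical helper: the layout only approximates the attached samples.
    """
    header = ["c%d" % i for i in range(n_cols)]
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for r in range(n_rows):
            row = [""] * n_cols
            row[0] = str(r)  # one non-empty cell per row; the rest stay empty
            writer.writerow(row)


def read_with_pyarrow(path):
    """Read the file with pyarrow.csv.read_csv (imported lazily)."""
    from pyarrow import csv as pa_csv
    return pa_csv.read_csv(path)


if __name__ == "__main__":
    tmp_dir = tempfile.mkdtemp()
    # 32768 columns is the case reported as working, 32769 as hanging.
    for n_cols in (32768, 32769):
        path = os.path.join(tmp_dir, "sample_%d_cols.csv" % n_cols)
        write_wide_csv(path, n_cols)
        table = read_with_pyarrow(path)
        print(n_cols, table.num_columns)
```

On an affected version (0.13.0 per the report), the second read would be expected to hang and consume memory, so it may be prudent to run it under a memory limit or be ready to interrupt it.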



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
