[ 
https://issues.apache.org/jira/browse/ARROW-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16875623#comment-16875623
 ] 

Brian Hulette commented on ARROW-5791:
--------------------------------------

Thanks for the concise bug report! I haven't had a chance to dig into this very 
far, but I'm sure it's not a coincidence that 32768 == 2^15. 32767 is the max 
of a signed 16-bit integer, so if we're assigning a signed int16 index to each 
column somewhere, it would overflow once you get beyond 32768 columns (since the 
first column gets index 0).
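
As a rough illustration of the arithmetic only (not the actual Arrow code path, which hasn't been identified), here is how a column index would wrap if it were stored in a signed 16-bit integer. The helper `as_int16` is hypothetical and just emulates two's-complement truncation:

```python
def as_int16(n):
    """Interpret the low 16 bits of n as a signed two's-complement value."""
    n &= 0xFFFF
    return n - 0x10000 if n >= 0x8000 else n

# 32768 columns use indices 0..32767, which all fit in an int16.
print(as_int16(32767))   # 32767
# The 32769th column would need index 32768, which wraps to a negative value.
print(as_int16(32768))   # -32768
```

A negative index used as a size or offset could plausibly explain runaway allocation, but that is speculation until the real location is found.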

I'm not sure where exactly that would be happening though. My first inclination 
was that it would be in the element count for the [vector of 
fields|https://github.com/apache/arrow/blob/master/format/Schema.fbs#L321], but 
according to the [flatbuffers 
page|https://google.github.io/flatbuffers/flatbuffers_internals.html] vectors 
are prefixed by a 32-bit element count.
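
For comparison, a quick check that a 32-bit count has plenty of room for 32769 elements. This only packs the number the way a little-endian unsigned 32-bit prefix would hold it; it is not a real flatbuffer, and the variable names are made up for the example:

```python
import struct

count = 32769                          # number of fields in the failing file
prefix = struct.pack("<I", count)      # 4-byte little-endian unsigned int
roundtrip = struct.unpack("<I", prefix)[0]
max_count = 2**32 - 1                  # largest value a 32-bit count can hold

print(roundtrip == count, max_count)
```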

> pyarrow.csv.read_csv hangs + eats all RAM
> -----------------------------------------
>
>                 Key: ARROW-5791
>                 URL: https://issues.apache.org/jira/browse/ARROW-5791
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.13.0
>         Environment: Ubuntu Xenial, python 2.7
>            Reporter: Bogdan Klichuk
>            Priority: Major
>         Attachments: csvtest.py, graph.svg, sample_32768_cols.csv, 
> sample_32769_cols.csv
>
>
> I have quite a sparse dataset in CSV format. A wide table that has several 
> rows but many (32k) columns. Total size ~540K.
> When I read the dataset using `pyarrow.csv.read_csv` it hangs, gradually eats 
> all memory and gets killed.
> More details on the conditions follow. The script to run and all mentioned 
> files are attached.
> 1) `sample_32769_cols.csv` is the dataset that suffers the problem.
> 2) `sample_32768_cols.csv` is the dataset that DOES NOT suffer and is read in 
> under 400ms on my machine. It's the same dataset without ONE last column. 
> That last column is no different than others and has empty values.
> I have no idea why exactly this column makes the difference between proper 
> execution and a hang that looks like a memory leak.
> I have created a flame graph for case (1) to help resolve this issue 
> (`graph.svg`).
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
