Renan Alves Fonseca created ARROW-6380:
------------------------------------------

             Summary: Method pyarrow.parquet.read_table has memory spikes from 
version 0.14
                 Key: ARROW-6380
                 URL: https://issues.apache.org/jira/browse/ARROW-6380
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
    Affects Versions: 0.14.1, 0.14.0
         Environment: ubuntu 18, 16GB ram, 4 cpus
            Reporter: Renan Alves Fonseca
             Fix For: 0.13.0


Method pyarrow.parquet.read_table is very slow and causes RAM spikes since 
version 0.14.0.

Reading a 40MB parquet file takes less than 1 second in versions 0.11, 0.12 and 
0.13, whereas it takes from 6 to 30 seconds in versions 0.14.x.
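
A minimal sketch of how the read time could be compared across versions. The 
file name "data.parquet" and the helper names are placeholders, not part of the 
original report; the script would be run once per installed pyarrow version.

```python
import time


def time_call(fn, *args, repeats=3):
    """Return the best wall-clock time in seconds of fn(*args) over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


def time_read_table(path, repeats=3):
    """Time pyarrow.parquet.read_table on `path` (hypothetical helper)."""
    # Imported lazily so time_call stays usable without pyarrow installed.
    import pyarrow.parquet as pq

    return time_call(pq.read_table, path, repeats=repeats)


# Example usage (assumes a local file named "data.parquet"):
#   import pyarrow
#   print(pyarrow.__version__, time_read_table("data.parquet"))
```

Taking the best of several runs reduces the influence of disk caching on the 
first read.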

This performance regression is easy to measure. However, there is another 
problem that I could only detect on the htop screen. While opening a 40MB 
parquet file, the process occupies almost 16GB for a few milliseconds. The 
resulting pyarrow table takes around 300MB in the python process (measured 
using memory-profiler). This does not happen in version 0.13 or earlier.
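
Because the spike lasts only milliseconds, a sampling profiler may miss it. One 
possible way to capture it, sketched here under the assumption of a Linux host 
(where ru_maxrss is reported in kilobytes), is to read the process's peak 
resident set size before and after the call. The helper names and the file name 
are placeholders, not taken from the original report.

```python
import resource


def peak_rss_mib():
    """Peak resident set size of this process so far, in MiB (assumes Linux units)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024


def peak_rss_growth(fn, *args):
    """Run fn(*args) and return (result, growth of the peak RSS in MiB)."""
    before = peak_rss_mib()
    result = fn(*args)
    after = peak_rss_mib()
    # ru_maxrss is a high-water mark, so the difference is never negative.
    return result, after - before


# Example usage (assumes a local file named "data.parquet"):
#   import pyarrow.parquet as pq
#   table, growth = peak_rss_growth(pq.read_table, "data.parquet")
#   print(f"peak RSS grew by {growth:.0f} MiB")
```

The measurement is most meaningful in a fresh process, since an earlier, larger 
peak in the same process would hide the spike.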



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
