Rauli Ruohonen created ARROW-8801:
-------------------------------------
Summary: Pyarrow leaks memory on read from parquet file with UTC timestamps using pandas
Key: ARROW-8801
URL: https://issues.apache.org/jira/browse/ARROW-8801
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.17.0, 0.16.0
Environment: Tested with pyarrow 0.17.0, pandas 1.0.3, Python 3.7.5 on macOS Mojave,
and with pyarrow 0.16.0, pandas 1.0.3, Python 3.8.2 on Ubuntu 20.04 (Linux).
Reporter: Rauli Ruohonen
Given dump.py script
{code:python}
import numpy as np
import pandas as pd

# 2**20 random millisecond timestamps, made timezone-aware with utc=True.
x = pd.to_datetime(np.random.randint(0, 2**32, size=2**20), unit='ms', utc=True)
pd.DataFrame({'x': x}).to_parquet('data.parquet', engine='pyarrow', compression=None)
{code}
and load.py script
{code:python}
import sys

import pandas as pd

def foo(engine):
    # Read the same file repeatedly; memory should be released after each read.
    for _ in range(2**9):
        pd.read_parquet('data.parquet', engine=engine)
    print('Done')
    input()  # Pause so memory usage can be inspected externally.

foo(sys.argv[1])
{code}
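Since load.py only exercises pandas.read_parquet, a variant that bypasses pandas' engine dispatch and calls pyarrow.parquet directly may help isolate the leak. This is an illustrative sketch, not part of the original report, and whether it shows the same growth is untested here:
{code:python}
import pyarrow.parquet as pq

# Same read loop as load.py, but reading via pyarrow.parquet directly
# and converting to pandas explicitly.
for _ in range(2**9):
    pq.read_table('data.parquet').to_pandas()
print('Done')
input()
{code}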
running first "python dump.py" and then "python load.py pyarrow", on my machine
python memory usage stays at 4+ GB while it waits for input. If using "python
load.py fastparquet" instead, it is about 100 MB, so it should be a pyarrow
issue instead of a pandas issue. The leak disappears if "utc=True" is removed
from dump.py, in which case the timestamp is timezone-unaware.
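One way to quantify the growth per iteration is to sample the process's peak RSS alongside Arrow's own memory-pool counter; the following is a sketch added for illustration (the use of the resource module and the sampling interval are my choices, not from the report):
{code:python}
import resource

import pandas as pd
import pyarrow as pa

def max_rss():
    # Peak resident set size so far; ru_maxrss is reported in kilobytes
    # on Linux and in bytes on macOS.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

for i in range(2**9):
    pd.read_parquet('data.parquet', engine='pyarrow')
    if i % 64 == 0:
        # pa.total_allocated_bytes() reports what Arrow's default memory
        # pool still holds; comparing it to the process RSS indicates
        # whether the retained memory sits inside the Arrow pool.
        print(i, max_rss(), pa.total_allocated_bytes())
{code}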
--
This message was sent by Atlassian Jira
(v8.3.4#803005)