[
https://issues.apache.org/jira/browse/ARROW-8801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wes McKinney resolved ARROW-8801.
---------------------------------
Resolution: Fixed
Issue resolved by pull request 7522
[https://github.com/apache/arrow/pull/7522]
> [Python] Memory leak on read from parquet file with UTC timestamps using
> pandas
> -------------------------------------------------------------------------------
>
> Key: ARROW-8801
> URL: https://issues.apache.org/jira/browse/ARROW-8801
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.16.0, 0.17.0
> Environment: Tested using pyarrow 0.17.0, pandas 1.0.3, python 3.7.5,
> mojave (macos). Also tested using pyarrow 0.16.0, pandas 1.0.3, python 3.8.2,
> ubuntu 20.04 (linux).
> Reporter: Rauli Ruohonen
> Assignee: Wes McKinney
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 1.0.0
>
> Time Spent: 1h 20m
> Remaining Estimate: 0h
>
> Given dump.py script
>
> {code:python}
> import pandas as pd
> import numpy as np
> x = pd.to_datetime(np.random.randint(0, 2**32, size=2**20), unit='ms', utc=True)
> pd.DataFrame({'x': x}).to_parquet('data.parquet', engine='pyarrow', compression=None)
> {code}
> and load.py script
>
> {code:python}
> import sys
> import pandas as pd
> def foo(engine):
>     for _ in range(2**9):
>         pd.read_parquet('data.parquet', engine=engine)
>     print('Done')
>     input()
>
> foo(sys.argv[1])
> {code}
> running first "python dump.py" and then "python load.py pyarrow", Python's
> memory usage on my machine stays above 4 GB while the script waits for input.
> Running "python load.py fastparquet" instead uses about 100 MB, so this
> appears to be a pyarrow issue rather than a pandas issue. The leak disappears
> if "utc=True" is removed from dump.py, in which case the timestamps are
> timezone-naive.
>
>
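To quantify the leak from inside the process itself (rather than eyeballing an external monitor), the peak resident set size can be read with only the standard library. A minimal sketch; the platform-dependent `ru_maxrss` units are an assumption based on the Linux and macOS conventions, matching the two environments tested above:

```python
import resource
import sys

def peak_rss_mb():
    """Return this process's peak resident set size in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS but in KiB on Linux
    scale = 1 if sys.platform == 'darwin' else 1024
    return peak * scale / 2**20

print(f'peak RSS: {peak_rss_mb():.0f} MiB')
```

Printing this after the read loop in foo() should show the gap reported above: roughly 4 GB with the pyarrow engine versus roughly 100 MB with fastparquet.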
--
This message was sent by Atlassian Jira
(v8.3.4#803005)