[ 
https://issues.apache.org/jira/browse/ARROW-6791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Prichard updated ARROW-6791:
-----------------------------------
    Description: 
A memory leak when reading large string columns crashes the program. This only seems to
affect 0.14.x; it works fine for me in 0.13.0. It might be related to earlier,
similar issues, e.g. [https://github.com/apache/arrow/issues/2624]

Below is a reprex that works in earlier versions but crashes on read (writing
is fine) in this one. The real-life version of the data has URLs as the
strings.

Oddly, it crashes on my 32 GB Ubuntu 18.04 machine, but runs (if very slowly on the
read) on my 16 GB MacBook.

Thanks so much for the excellent tools!
{code:python}
import pandas as pd

n_rows = int(1e6)
n_cols = 10
col_length = 100

df = pd.DataFrame()

# Fill each column with random fixed-length strings.
for i in range(n_cols):
    df[f'col_{i}'] = pd.util.testing.rands_array(col_length, n_rows)

print('Generated df', df.shape)
filename = 'tmp.parquet'

print('Writing parquet')
df.to_parquet(filename)

print('Reading parquet')
pd.read_parquet(filename)
{code}

> Memory Leak 
> ------------
>
>                 Key: ARROW-6791
>                 URL: https://issues.apache.org/jira/browse/ARROW-6791
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.14.0, 0.14.1
>         Environment: Ubuntu 18.04, 32GB ram, conda-forge installation
>            Reporter: George Prichard
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
