[ https://issues.apache.org/jira/browse/ARROW-7305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16992615#comment-16992615 ]

Wes McKinney commented on ARROW-7305:
-------------------------------------

There may be some things we could do about this. Do you have an example file we 
could use to help profile the internal memory allocations during the write 
process?
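
Since no example file is attached here, a minimal sketch for generating a stand-in file like the one described (1 row, 10 columns, one column holding a ~43 MB string) might look like the following; the file name, column names, and sizes are made up for illustration.

{code:python}
# Sketch: build a synthetic stand-in for the reporter's 43mb.csv
# (1 row, 10 columns, one ~43 MB string column). Names and sizes
# are illustrative, not taken from the actual issue attachment.
import csv
import random
import string


def make_example_csv(path="43mb_synthetic.csv", big_size=43 * 1024 * 1024):
    # Random text keeps the column roughly incompressible, closer to
    # worst-case memory behaviour during the Parquet write.
    big_string = "".join(random.choices(string.ascii_letters, k=big_size))
    header = [f"col{i}" for i in range(10)]
    row = [f"value{i}" for i in range(9)] + [big_string]
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerow(row)


if __name__ == "__main__":
    make_example_csv()
{code}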

> [Python] High memory usage writing pyarrow.Table with large strings to parquet
> ------------------------------------------------------------------------------
>
>                 Key: ARROW-7305
>                 URL: https://issues.apache.org/jira/browse/ARROW-7305
>             Project: Apache Arrow
>          Issue Type: Task
>          Components: Python
>    Affects Versions: 0.15.1
>         Environment: Mac OSX
>            Reporter: Bogdan Klichuk
>            Priority: Major
>              Labels: parquet
>
> My dataset is a specific case: it contains large strings (1-100 MB each).
> Let's take a single row as an example.
> 43mb.csv is a 1-row CSV with 10 columns; one column is a 43 MB string.
> When I read this CSV with pandas and then dump it to Parquet, my script 
> consumes roughly 10x the 43 MB.
> With an increasing number of such rows the relative memory overhead 
> diminishes, but I want to focus on this specific case.
> Here's the footprint after running under memory_profiler:
> {code:java}
> Line #    Mem usage    Increment   Line Contents
> ================================================
>      4     48.9 MiB     48.9 MiB   @profile
>      5                             def test():
>      6    143.7 MiB     94.7 MiB       data = pd.read_csv('43mb.csv')
>      7    498.6 MiB    354.9 MiB       data.to_parquet('out.parquet')
>  {code}
> Is this typical for Parquet in the case of big strings?
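
For profiling the write path itself, a minimal sketch (assuming pyarrow and memory_profiler are installed, and using the reporter's '43mb.csv' or the synthetic stand-in above) could convert the DataFrame to an Arrow Table and call pyarrow.parquet.write_table directly, so allocations in the Parquet writer can be separated from pandas' own overhead:

{code:python}
# Sketch: profile the Arrow -> Parquet write path directly. Assumes
# pyarrow and memory_profiler are installed and the input CSV exists;
# the paths are placeholders.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from memory_profiler import profile


@profile
def write_parquet(csv_path="43mb.csv", out_path="out.parquet"):
    df = pd.read_csv(csv_path)         # load the wide-string data
    table = pa.Table.from_pandas(df)   # convert to an Arrow Table
    pq.write_table(table, out_path)    # write via the Parquet writer


if __name__ == "__main__":
    write_parquet()
{code}

Running the script prints memory_profiler's line-by-line usage, similar to the trace quoted above, which should make it easier to see whether the extra allocations happen during the pandas-to-Arrow conversion or inside the Parquet write itself.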



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
