[ 
https://issues.apache.org/jira/browse/PARQUET-2106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated PARQUET-2106:
-------------------------------------
    Description: 
*Background*

While writing out large Parquet tables using Spark, we've noticed that 
BinaryComparator is a source of substantial churn of extremely short-lived 
`HeapByteBuffer` objects: it accounts for up to *16%* of all allocations in 
our benchmarks, putting substantial pressure on the Garbage Collector:

!Screen Shot 2021-12-03 at 3.26.31 PM.png|width=828,height=521!

[^profile_48449_alloc_1638494450_sort_by.html]

 

*Proposal*

We propose adjusting the lexicographic comparison (at least) to avoid any 
allocations: this code lies on the hot path of every Parquet write, so each 
per-comparison allocation is amplified into substantial churn.

  was:
While writing out large Parquet tables using Spark, we've noticed that 
BinaryComparator is a source of substantial churn of extremely short-lived 
`HeapByteBuffer` objects: it accounts for up to *16%* of all allocations in 
our benchmarks, putting substantial pressure on the Garbage Collector:

!Screen Shot 2021-12-03 at 3.26.31 PM.png|width=828,height=521!

[^profile_48449_alloc_1638494450_sort_by.html]

        Summary: BinaryComparator should avoid doing ByteBuffer.wrap in the 
hot-path  (was: BinaryComparator should avoid doing ByteBuffer.wrap)

> BinaryComparator should avoid doing ByteBuffer.wrap in the hot-path
> -------------------------------------------------------------------
>
>                 Key: PARQUET-2106
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2106
>             Project: Parquet
>          Issue Type: Task
>          Components: parquet-mr
>    Affects Versions: 1.12.2
>            Reporter: Alexey Kudinkin
>            Priority: Major
>         Attachments: Screen Shot 2021-12-03 at 3.26.31 PM.png, 
> profile_48449_alloc_1638494450_sort_by.html
>
>
> *Background*
> While writing out large Parquet tables using Spark, we've noticed that 
> BinaryComparator is a source of substantial churn of extremely short-lived 
> `HeapByteBuffer` objects: it accounts for up to *16%* of all allocations in 
> our benchmarks, putting substantial pressure on the Garbage Collector:
> !Screen Shot 2021-12-03 at 3.26.31 PM.png|width=828,height=521!
> [^profile_48449_alloc_1638494450_sort_by.html]
>  
> *Proposal*
> We propose adjusting the lexicographic comparison (at least) to avoid any 
> allocations: this code lies on the hot path of every Parquet write, so each 
> per-comparison allocation is amplified into substantial churn.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)