[
https://issues.apache.org/jira/browse/PARQUET-2106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alexey Kudinkin updated PARQUET-2106:
-------------------------------------
Description:
*Background*
While writing out large Parquet tables using Spark, we've noticed that
BinaryComparator is the source of substantial churn of extremely short-lived
`HeapByteBuffer` objects: it accounts for up to *16%* of all allocations in our
benchmarks, putting substantial pressure on the Garbage Collector:
!Screen Shot 2021-12-03 at 3.26.31 PM.png|width=828,height=521!
[^profile_48449_alloc_1638494450_sort_by.html]
*Proposal*
We're proposing to adjust (at least) the lexicographical comparison to avoid
any allocations, since this code lies on the hot path of every Parquet write
and therefore amplifies allocation churn substantially.
was:
While writing out large Parquet tables using Spark, we've noticed that
BinaryComparator is the source of substantial churn of extremely short-lived
`HeapByteBuffer` objects: it accounts for up to *16%* of all allocations in our
benchmarks, putting substantial pressure on the Garbage Collector:
!Screen Shot 2021-12-03 at 3.26.31 PM.png|width=828,height=521!
[^profile_48449_alloc_1638494450_sort_by.html]
Summary: BinaryComparator should avoid doing ByteBuffer.wrap in the
hot-path (was: BinaryComparator should avoid doing ByteBuffer.wrap)
> BinaryComparator should avoid doing ByteBuffer.wrap in the hot-path
> -------------------------------------------------------------------
>
> Key: PARQUET-2106
> URL: https://issues.apache.org/jira/browse/PARQUET-2106
> Project: Parquet
> Issue Type: Task
> Components: parquet-mr
> Affects Versions: 1.12.2
> Reporter: Alexey Kudinkin
> Priority: Major
> Attachments: Screen Shot 2021-12-03 at 3.26.31 PM.png,
> profile_48449_alloc_1638494450_sort_by.html
>
>
> *Background*
> While writing out large Parquet tables using Spark, we've noticed that
> BinaryComparator is the source of substantial churn of extremely short-lived
> `HeapByteBuffer` objects: it accounts for up to *16%* of all allocations in
> our benchmarks, putting substantial pressure on the Garbage Collector:
> !Screen Shot 2021-12-03 at 3.26.31 PM.png|width=828,height=521!
> [^profile_48449_alloc_1638494450_sort_by.html]
>
> *Proposal*
> We're proposing to adjust (at least) the lexicographical comparison to avoid
> any allocations, since this code lies on the hot path of every Parquet write
> and therefore amplifies allocation churn substantially.
>
>
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)