This is an automated email from the ASF dual-hosted git repository.
uwe pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git
The following commit(s) were added to refs/heads/master by this push:
new 87c976a ARROW-2855: [C++] Blog post that outlines the benefits of
using jemalloc
87c976a is described below
commit 87c976adcd438d43719912d48e200032729959d3
Author: Korn, Uwe <[email protected]>
AuthorDate: Sat Jul 21 18:04:43 2018 +0200
ARROW-2855: [C++] Blog post that outlines the benefits of using jemalloc
I will update the date once this has passed review.
Author: Korn, Uwe <[email protected]>
Closes #2271 from xhochy/ARROW-2855 and squashes the following commits:
dbb7bc4a <Korn, Uwe> Prepare post for publish
7410ada2 <Korn, Uwe> Review comments and typos
298f4eb8 <Korn, Uwe> ARROW-2855: Blog post that outlines the benefits of
using jemalloc
---
site/_posts/2018-07-20-jemalloc.md | 141 +++++++++++++++++++++++++++++++++++++
1 file changed, 141 insertions(+)
diff --git a/site/_posts/2018-07-20-jemalloc.md
b/site/_posts/2018-07-20-jemalloc.md
new file mode 100644
index 0000000..ea47665
--- /dev/null
+++ b/site/_posts/2018-07-20-jemalloc.md
@@ -0,0 +1,141 @@
+---
+layout: post
+title: "Faster, scalable memory allocations in Apache Arrow with jemalloc"
+date: "2018-07-20 12:00:00 +0100"
+author: uwe
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+With the release of the 0.9 version of Apache Arrow, we have switched our
+default allocator for array buffers from the system allocator to jemalloc on
+macOS and Linux. This applies to the C++, GLib, and Python implementations of
+Arrow. Changing the default allocator is usually done to avoid problems that
+occur with many small, frequent (de)allocations. In contrast, in Arrow we
+normally deal with large in-memory datasets. While jemalloc provides good
+strategies for [avoiding RAM fragmentation for allocations that are smaller
+than a memory page (4 KB)][1], it also provides functionality that improves
+performance on allocations that span several memory pages.
+
+Outside of Apache Arrow, [jemalloc powers the infrastructure of Facebook][2]
+(this is also where most of its development happens). It is also the
+[default allocator in Rust][3], and it helps [Redis reduce memory
+fragmentation on Linux][4] ("Allocator").
+
+One allocation specialty that we require in Arrow is that memory should be
+64-byte aligned. This is so that we can get the most performance out of SIMD
+instruction sets like AVX. While most modern SIMD instructions also work on
+unaligned memory, their performance is much better on aligned memory. To get
+the best performance for our analytical applications, we want all memory to be
+allocated such that SIMD performance is maximized.
+
+For aligned allocations, the standard C and POSIX APIs only provide functions
+to create aligned memory: C11's `aligned_alloc(size_t alignment, size_t size)`
+and POSIX's `posix_memalign(void **ptr, size_t alignment, size_t size)`. But
+neither of them caters for expansion of an existing allocation. While the
+`realloc` function can often expand allocations without moving them
+physically, it does not ensure that the alignment is kept in the case the
+allocation is moved.
+
+In the case when Arrow was built without jemalloc enabled, this resulted in
+copying the data on each expansion of an allocation. To reduce the number of
+memory copies, we use jemalloc's `*allocx()` APIs to create, modify and free
+aligned allocations. One of the typical tasks where this gives us a major
+speedup is the incremental construction of an Arrow table that consists of
+several columns. We often don't know the size of the table in advance and need
+to expand our allocations as the data is loaded.
+
+To incrementally build a vector, doubling its capacity on each expansion, we
+would use the following C code with the standard APIs:
+
+```c
+#include <stdlib.h>
+#include <string.h>
+
+size_t size = 128 * 1024;
+void* ptr = aligned_alloc(64, size);
+for (int i = 0; i < 10; i++) {
+ size_t new_size = size * 2;
+ void* ptr2 = aligned_alloc(64, new_size);
+ memcpy(ptr2, ptr, size);
+ free(ptr);
+ ptr = ptr2;
+ size = new_size;
+}
+free(ptr);
+```
+
+With jemalloc's special APIs, we are able to omit the explicit call to
+`memcpy`. In the case where a memory expansion cannot be done in place, the
+allocator still copies the data internally, but this is no longer needed on
+every expansion. This simplifies our user code to:
+
+```c
+#include <jemalloc/jemalloc.h>
+
+size_t size = 128 * 1024;
+void* ptr = mallocx(size, MALLOCX_ALIGN(64));
+for (int i = 0; i < 10; i++) {
+ size *= 2;
+ ptr = rallocx(ptr, size, MALLOCX_ALIGN(64));
+}
+dallocx(ptr, MALLOCX_ALIGN(64));
+```
+
+To see the real-world benefits of using jemalloc, we look at the benchmarks in
+Arrow C++. There we have modeled a typical use case of incrementally building
+up an array of primitive values. For the build-up of the array, we don't know
+the number of elements in the final array, so we need to continuously expand
+the memory region in which the data is stored. The code for this benchmark is
+part of the `builder-benchmark` in the Arrow C++ sources as
+`BuildPrimitiveArrayNoNulls`.
+
+Runtimes without `jemalloc`:
+
+```
+BM_BuildPrimitiveArrayNoNulls/repeats:3 636726 us 804.114MB/s
+BM_BuildPrimitiveArrayNoNulls/repeats:3 621345 us 824.019MB/s
+BM_BuildPrimitiveArrayNoNulls/repeats:3 625008 us 819.19MB/s
+BM_BuildPrimitiveArrayNoNulls/repeats:3_mean 627693 us 815.774MB/s
+BM_BuildPrimitiveArrayNoNulls/repeats:3_median 625008 us 819.19MB/s
+BM_BuildPrimitiveArrayNoNulls/repeats:3_stddev 8034 us 10.3829MB/s
+```
+
+Runtimes with `jemalloc`:
+
+```
+BM_BuildPrimitiveArrayNoNulls/repeats:3 630881 us 811.563MB/s
+BM_BuildPrimitiveArrayNoNulls/repeats:3 352891 us 1.41687GB/s
+BM_BuildPrimitiveArrayNoNulls/repeats:3 351039 us 1.42434GB/s
+BM_BuildPrimitiveArrayNoNulls/repeats:3_mean 444937 us 1.21125GB/s
+BM_BuildPrimitiveArrayNoNulls/repeats:3_median 352891 us 1.41687GB/s
+BM_BuildPrimitiveArrayNoNulls/repeats:3_stddev 161035 us 371.335MB/s
+```
+
+The benchmark was run three times for each configuration to see the
+performance differences. The first run in each configuration yielded the same
+performance, but in all subsequent runs, the version using jemalloc was about
+twice as fast. In these cases, the memory region that was used for
+constructing the array could be expanded in place without moving the data
+around. This was possible because memory pages were assigned to the process
+that were unused but not yet reclaimed by the operating system. Without
+`jemalloc`, we cannot make use of them, simply because the default allocator
+has no API that provides aligned reallocation.
+
+[1]: https://zapier.com/engineering/celery-python-jemalloc/
+[2]: https://www.facebook.com/notes/facebook-engineering/scalable-memory-allocation-using-jemalloc/480222803919/
+[3]: https://github.com/rust-lang/rust/pull/6895
+[4]: http://download.redis.io/redis-stable/README.md