nealrichardson commented on a change in pull request #19: ARROW-6419: [Website] Blog post about Parquet C++ read performance improvements in Arrow 0.15
URL: https://github.com/apache/arrow-site/pull/19#discussion_r320430467
 
 

 ##########
 File path: _posts/2019-09-03-faster-strings-cpp-parquet.md
 ##########
 @@ -0,0 +1,233 @@
+---
+layout: post
+title: "Faster C++ Apache Parquet performance on string-heavy data coming in 
Apache Arrow 0.15"
+date: "2019-09-01 00:00:00 -0600"
+author: Wes McKinney
+categories: [application]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+We have been implementing a series of optimizations in the Apache Parquet C++
+internals to improve read and write efficiency (both performance and memory
+use) for Arrow columnar binary and string types, including native support for
+Arrow's dictionary types. This should have a big impact on users of the C++,
+MATLAB, Python, R, and Ruby interfaces to Parquet files.
+
+We discuss the work that was done and show benchmarks comparing Arrow 0.11.0
+(released in October 2018) with the current development version (to be
+released soon as Arrow 0.15.0).
+
+# Summary of work
+
+One of the challenges of developing the Parquet C++ library is that we
+maintain low-level read and write APIs that do not involve the Arrow columnar
+data structures. So we have had to take care to do Arrow-related optimizations
+without impacting non-Arrow Parquet users, which include database systems like
+Clickhouse and Vertica.
+
+One of the largest and most complex optimizations involves encoding and
+decoding Parquet files' internal dictionary-encoded data streams to and from
+Arrow's in-memory dictionary-encoded `DictionaryArray`
+representation. Dictionary encoding is a compression strategy in Parquet, and
+there is no formal "dictionary" or "categorical" type. I will go into more
+detail about this below.
+
+Some of the particular JIRA issues related to this work include:
+
+- Vectorize comparators for computing statistics ([PARQUET-1523][1])
+- Read binary data directly into `DictionaryBuilder<T>`
+  ([ARROW-3769][2])
+- Write Parquet's dictionary indices directly into `DictionaryBuilder<T>`
+  ([ARROW-3772][3])
+- Write dense (non-dictionary) Arrow arrays directly into Parquet data encoders
+  ([ARROW-6152][4])
+- Write `arrow::DictionaryArray` directly to Parquet column writers
+  ([ARROW-3246][5])
+- Support changing dictionaries ([ARROW-3144][6])
+- Internal IO optimizations and improved raw `BYTE_ARRAY` encoding performance
+  ([ARROW-4398][7])
+
+# Background: how Parquet files do dictionary encoding
+
+Many direct and indirect users of Apache Arrow use dictionary encoding to
+improve performance and memory use on binary or string data types that include
+many repeated values. pandas users will know this as the [Categorical type][8]
+while in R such encoding is known as [`factor`][9]. In the Arrow C++ library
+and various bindings we have the `DictionaryArray` object for representing such
+data in memory.
+
+For example, an array such as
+
+```
+['apple', 'orange', 'apple', NULL, 'orange', 'orange']
+```
+
+has dictionary-encoded form
+
+```
+dictionary: ['apple', 'orange']
+indices: [0, 1, 0, NULL, 1, 1]
+```
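+
+As an aside for readers who want to see this representation directly, here is
+a minimal sketch using the pyarrow Python bindings on the same illustrative
+fruit data; this is only an illustration, not code from the optimization work
+itself:
+
+```
+import pyarrow as pa
+
+# A dense string array with repeated values
+dense = pa.array(['apple', 'orange', 'apple', None, 'orange', 'orange'])
+
+# Convert to Arrow's dictionary-encoded DictionaryArray representation
+dict_array = dense.dictionary_encode()
+
+print(dict_array.dictionary)  # ["apple", "orange"]
+print(dict_array.indices)     # [0, 1, 0, null, 1, 1]
+```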
+
+The [Parquet format uses dictionary encoding][10] to compress data, and it is
+used for all Parquet data types, not just binary or string data. Parquet
+further uses bit-packing and run-length encoding (RLE) to compress the
+dictionary indices, so if you had data like
+
+```
+['apple', 'apple', 'apple', 'apple', 'apple', 'apple', 'orange']
+```
+
+the indices would be encoded like
+
+```
+[rle-run=(6, 0),
+ bit-packed-run=[1]]
+```
+
+The full details of the RLE and bit-packing encodings can be found in the
+Parquet specification.
+
+When writing a Parquet file, most implementations will use dictionary encoding
+to compress a column until the dictionary itself reaches a certain size
+threshold, usually around 1 megabyte. At this point, the column writer will
+"fall back" to `PLAIN` encoding where values are written end-to-end in "data
+pages" and then usually compressed with Snappy or Gzip. See the following rough
+diagram:
+
+<div align="center">
+<img src="{{ site.base-url }}/img/20190903-parquet-dictionary-column-chunk.png"
+     alt="Internal ColumnChunk structure"
+     width="80%" class="img-responsive">
+</div>
+
+When reading a Parquet file, the dictionary-encoded portions are usually
+materialized to their non-dictionary-encoded form, causing binary or string
+values to be duplicated in memory. So an obvious (but not trivial) optimization
+is to skip this "dense" materialization. There are several issues to deal with:
+
+* A Parquet file often contains multiple ColumnChunks for each semantic column,
+  and the dictionary values may be different in each ColumnChunk
+* We must gracefully handle the "fall back" portion which is not
+  dictionary-encoded
+
+We pursued several avenues to help with this:
+
+* Allowing each `arrow::DictionaryArray` to have a different dictionary
+  (before, the dictionary was part of the `DictionaryType`, which caused
+  problems)
+* We enabled the Parquet dictionary indices to be directly written into an
+  Arrow `DictionaryBuilder` without rehashing the data
+* When decoding a ColumnChunk, we first append the dictionary values and
+  indices into an Arrow `DictionaryBuilder`, and when we encounter the "fall
+  back" portion we use a hash table to convert those values to
+  dictionary-encoded form
+* We override the "fall back" logic when writing a ColumnChunk from an
+  `arrow::DictionaryArray` so that reading such data back is more efficient
+
+All of these things together have produced some excellent performance results
+that we will detail below.
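+
+On the read side, this work is exposed in the Python bindings through the
+`read_dictionary` option, which asks for the named columns to be returned as
+`DictionaryArray` rather than densely materialized. A minimal sketch, reusing
+the hypothetical file name from the write example above:
+
+```
+import pyarrow.parquet as pq
+
+# Return the 'fruit' column as an Arrow DictionaryArray instead of a dense
+# string array, skipping the "dense" materialization described above
+table = pq.read_table('fruit_dict.parquet', read_dictionary=['fruit'])
+print(table.schema)  # fruit: dictionary<values=string, indices=int32, ...>
+```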
+
+# Pushing Arrow columnar read and write lower in the Parquet stack
+
+The other class of optimizations we implemented was removing an abstraction
+layer between the low-level Parquet column data encoder and decoder classes and
+the Arrow columnar data structures. This involves:
+
+* Adding `ColumnWriter::WriteArrow` and `Encoder::Put` methods that accept
+  `arrow::Array` objects directly
+* Adding a `ByteArrayDecoder::DecodeArrow` method to decode binary data
+  directly into an `arrow::BinaryBuilder`
+
+While the performance improvements from this work are less dramatic than for
+dictionary-encoded data, they are still meaningful in real-world applications.
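+
+These `WriteArrow` / `DecodeArrow` entry points are internal C++ details, but
+they sit underneath the ordinary dense read and write path that users reach
+through the bindings. A hedged Python sketch of that default path (the file
+name is hypothetical):
+
+```
+import pyarrow as pa
+import pyarrow.parquet as pq
+
+table = pa.table({'s': ['some', 'string', 'values'] * 100000})
+
+# Without read_dictionary, string columns come back as dense (non-dictionary)
+# arrays, i.e. the binary encode/decode path discussed in this section
+pq.write_table(table, 'dense_strings.parquet')
+result = pq.read_table('dense_strings.parquet')
+print(result.schema)  # s: string
+```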
+
+# Performance Benchmarks
+
+We run some benchmarks comparing Arrow 0.11.1 with the current master
+branch. We construct two kinds of Arrow tables with 10 columns each:
+
+* "Low cardinality" and "high cardinality" variants. The "low cardinality" case
+  has 1,000 unique string values of 32-bytes each. The "high cardinality" has
+  100,000 unique values
+* "Dense" (non-dictionary) and "Dictionary" variants
+
+[Click here for the full benchmark script.][11]
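+
+For a rough feel of the setup without reading the full script, here is a
+hedged sketch of how such test tables could be constructed with pyarrow; the
+sizes, column names, and helper function below are illustrative rather than
+the exact benchmark parameters:
+
+```
+import numpy as np
+import pyarrow as pa
+
+def make_table(num_values, num_unique, value_size=32, num_cols=10,
+               dictionary=False):
+    # Build a pool of unique fixed-size strings, then sample with repetition
+    unique = ['x' * (value_size - 8) + '{:08d}'.format(i)
+              for i in range(num_unique)]
+    values = [unique[i] for i in np.random.randint(0, num_unique, num_values)]
+    arr = pa.array(values)
+    if dictionary:
+        arr = arr.dictionary_encode()
+    return pa.table({'f{}'.format(i): arr for i in range(num_cols)})
+
+# "Low cardinality": 1,000 unique 32-byte values; "high cardinality": 100,000
+low_card_dense = make_table(1_000_000, 1_000)
+high_card_dict = make_table(1_000_000, 100_000, dictionary=True)
+```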
+
+We show both single-threaded and multithreaded read performance. The test
+machine is an Intel i9-9960X using gcc 8.3.0 (on Ubuntu 18.04) with 16 physical
+cores and 32 virtual cores.
+
+First, the writing benchmarks:
+
+<div align="center">
+<img src="{{ site.base-url }}/img/20190903_parquet_write_perf.png"
+     alt="Parquet write benchmarks"
+     width="80%" class="img-responsive">
+</div>
+
+Here we note that writing `arrow::DictionaryArray` is dramatically faster due
+to the optimizations described above. We have achieved a small improvement in
+writing dense (non-dictionary) binary arrays.
+
+Then, the reading benchmarks:
+
+<div align="center">
+<img src="{{ site.base-url }}/img/20190903_parquet_read_perf.png"
+     alt="Parquet read benchmarks"
+     width="80%" class="img-responsive">
+</div>
+
+Here, similarly, reading `DictionaryArray` directly is many times faster.
+
+In these benchmarks we note that reading the dense binary data is slower in
 
 Review comment:
   ```suggestion
   These benchmarks show that reading the dense binary data is slower in the
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
