[arrow] branch master updated (4a7dd43 -> d818299)
This is an automated email from the ASF dual-hosted git repository. kou pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git.

from 4a7dd43 ARROW-6415: [R] Remove usage of R CMD config CXXCPP
add  d818299 ARROW-6358: [C++] Add FileSystem::DeleteDirContents

No new revisions were added by this update.

Summary of changes:
 cpp/src/arrow/filesystem/filesystem.cc          |  5 +++
 cpp/src/arrow/filesystem/filesystem.h           |  7 +
 cpp/src/arrow/filesystem/localfs.cc             | 12
 cpp/src/arrow/filesystem/localfs.h              |  1 +
 cpp/src/arrow/filesystem/mockfs.cc              | 28 +++--
 cpp/src/arrow/filesystem/mockfs.h               |  1 +
 cpp/src/arrow/filesystem/s3fs.cc                | 41 -
 cpp/src/arrow/filesystem/s3fs.h                 |  1 +
 cpp/src/arrow/filesystem/s3fs_narrative_test.cc | 20 +---
 cpp/src/arrow/filesystem/test_util.cc           | 25 +++
 cpp/src/arrow/filesystem/test_util.h            |  3 ++
 cpp/src/arrow/util/io_util.cc                   | 28 -
 cpp/src/arrow/util/io_util.h                    |  3 ++
 cpp/src/arrow/util/io_util_test.cc              | 41 +
 14 files changed, 180 insertions(+), 36 deletions(-)
[arrow] branch master updated (6b714a1 -> 4a7dd43)
This is an automated email from the ASF dual-hosted git repository. kou pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git.

from 6b714a1 ARROW-6423: [C++] Fix crash when trying to instantiate Snappy CompressedOutputStream
add  4a7dd43 ARROW-6415: [R] Remove usage of R CMD config CXXCPP

No new revisions were added by this update.

Summary of changes:
 r/configure | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
[arrow] branch master updated (327057e -> 6b714a1)
This is an automated email from the ASF dual-hosted git repository. kou pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git.

from 327057e ARROW-6424: [C++] Fix IPC fuzzing test name
add  6b714a1 ARROW-6423: [C++] Fix crash when trying to instantiate Snappy CompressedOutputStream

No new revisions were added by this update.

Summary of changes:
 cpp/src/arrow/io/compressed.cc      |  3 ++-
 cpp/src/arrow/io/compressed_test.cc | 18 ++
 python/pyarrow/io.pxi               |  6 ++
 3 files changed, 22 insertions(+), 5 deletions(-)
[GitHub] [arrow-site] wesm commented on a change in pull request #19: ARROW-6419: [Website] Blog post about Parquet C++ read performance improvements in Arrow 0.15
wesm commented on a change in pull request #19: ARROW-6419: [Website] Blog post about Parquet C++ read performance improvements in Arrow 0.15
URL: https://github.com/apache/arrow-site/pull/19#discussion_r320453511

## File path: _posts/2019-09-03-faster-strings-cpp-parquet.md ##

+Many direct and indirect users of Apache Arrow use dictionary encoding to
+improve performance and memory use on binary or string data types that include
+many repeated values. pandas users will know this as the [Categorical type][8]

Review comment: Lowercase is the official styling of the project name =)
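For readers following the dictionary-encoding discussion in the quoted draft, here is a minimal sketch of the in-memory representation using the Python bindings, assuming any pyarrow build that exposes `Array.dictionary_encode` (the array contents mirror the draft's example; nothing here is part of the patch under review):

```python
import pyarrow as pa

# Dense string array with repeated values, as in the draft's example.
fruits = pa.array(['apple', 'orange', 'apple', None, 'orange', 'orange'])

# Re-encode into Arrow's in-memory DictionaryArray representation.
encoded = fruits.dictionary_encode()

print(encoded.dictionary)  # unique values: ['apple', 'orange']
print(encoded.indices)     # positions into the dictionary: [0, 1, 0, null, 1, 1]
```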
[GitHub] [arrow-site] wesm commented on a change in pull request #19: ARROW-6419: [Website] Blog post about Parquet C++ read performance improvements in Arrow 0.15
wesm commented on a change in pull request #19: ARROW-6419: [Website] Blog post about Parquet C++ read performance improvements in Arrow 0.15
URL: https://github.com/apache/arrow-site/pull/19#discussion_r320453000

## File path: _posts/2019-09-03-faster-strings-cpp-parquet.md ##
@@ -0,0 +1,233 @@

---
layout: post
title: "Faster C++ Apache Parquet performance on string-heavy data coming in Apache Arrow 0.15"
date: "2019-09-01 00:00:00 -0600"
author: Wes McKinney
categories: [application]
---

We have been implementing a series of optimizations in the Apache Parquet C++
internals to improve read and write efficiency (both performance and memory
use) for Arrow columnar binary and string types, including native support for
Arrow's dictionary types. This should have a big impact on users of the C++,
MATLAB, Python, R, and Ruby interfaces to Parquet files.

We discuss the work that was done and show benchmarks comparing Arrow 0.11.0
(released in October 2018) with the current development version (to be
released soon as Arrow 0.15.0).

# Summary of work

One of the challenges of developing the Parquet C++ library is that we
maintain low-level read and write APIs that do not involve the Arrow columnar
data structures. So we have had to take care to do Arrow-related optimizations
without impacting non-Arrow Parquet users, which include database systems like
Clickhouse and Vertica.

One of the largest and most complex optimizations involves encoding and
decoding Parquet files' internal dictionary-encoded data streams to and from
Arrow's in-memory dictionary-encoded `DictionaryArray` representation.
Dictionary encoding is a compression strategy in Parquet, and there is no
formal "dictionary" or "categorical" type. I will go into more detail about
this below.

Some of the particular JIRA issues related to this work include:

- Vectorize comparators for computing statistics ([PARQUET-1523][1])
- Read binary data directly into DictionaryBuilder ([ARROW-3769][2])
- Writing Parquet's dictionary indices directly into DictionaryBuilder ([ARROW-3772][3])
- Write dense (non-dictionary) Arrow arrays directly into Parquet data encoders ([ARROW-6152][4])
- Direct writing of arrow::DictionaryArray to Parquet column writers ([ARROW-3246][5])
- Supporting changing dictionaries ([ARROW-3144][6])
- Internal IO optimizations and improved raw `BYTE_ARRAY` encoding performance ([ARROW-4398][7])

# Background: how Parquet files do dictionary encoding

Many direct and indirect users of Apache Arrow use dictionary encoding to
improve performance and memory use on binary or string data types that include
many repeated values. pandas users will know this as the [Categorical type][8],
while in R such encoding is known as [`factor`][9]. In the Arrow C++ library
and various bindings we have the `DictionaryArray` object for representing such
data in memory.

For example, an array such as

```
['apple', 'orange', 'apple', NULL, 'orange', 'orange']
```

has the dictionary-encoded form

```
dictionary: ['apple', 'orange']
indices: [0, 1, 0, NULL, 1, 1]
```

The [Parquet format uses dictionary encoding][10] to compress data, and it is
used for all Parquet data types, not just binary or string data. Parquet
further uses bit-packing and run-length encoding (RLE) to compress the
dictionary indices, so if you had data like

```
['apple', 'apple', 'apple', 'apple', 'apple', 'apple', 'orange']
```

the indices would be encoded like

```
[rle-run=(6, 0),
 bit-packed-run=[1]]
```

The full details of the RLE/bit-packing encoding are found in the Parquet
specification.

When writing a Parquet file, most implementations will use dictionary encoding
to compress a column until the dictionary itself reaches a certain size
threshold, usually around 1 megabyte. At this point, the column writer will
"fall back" to `PLAIN` encoding, where values are written end-to-end in "data
pages" and then usually compressed with Snappy or Gzip. See the following rough
diagram:

[diagram not included in the quoted diff]

When reading a Parquet file, the dictionary-encoded portions are usually
materialized to their non-dictionary-encoded form, causing binary or string
values to be duplicated in memory. So an obvious (but not trivial) optimization
is to skip this "dense" materialization. There are several issues to deal with:

* A Parquet file often contains multiple ColumnChunks for each semantic column,
  and the dictionary values may be different in each ColumnChunk
* We must gracefully handle the "fall back" portion which is not
  dictionary-encoded

We pursued several avenues to help with this:

* Allowing each `arrow::DictionaryArray` to have a different dictionary
  (before, the dictionary was part of the `DictionaryType`, which caused
  problems)
* We enabled the Parquet dictionary indices to be directly written into an
  Arrow `DictionaryBuilder` without [...]
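The "dense materialization" discussed at the end of the quoted draft is observable from the Python bindings via the `read_dictionary` read option that ships with the 0.15 work described above. A minimal sketch, assuming a pyarrow build new enough to accept that option (file and column names here are illustrative only):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Write a small string-heavy column; Parquet dictionary-encodes it on disk.
fruits = pa.array(['apple', 'orange'] * 1000)
table = pa.Table.from_arrays([fruits], names=['fruit'])
pq.write_table(table, 'fruit.parquet')

# Default read: dictionary pages are materialized to a dense string column.
dense = pq.read_table('fruit.parquet')
print(dense.column('fruit').type)    # string

# Opt in to keeping the column dictionary-encoded, skipping materialization.
encoded = pq.read_table('fruit.parquet', read_dictionary=['fruit'])
print(encoded.column('fruit').type)  # dictionary<values=string, indices=int32, ...>
```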
[GitHub] [arrow-site] nealrichardson commented on a change in pull request #19: ARROW-6419: [Website] Blog post about Parquet C++ read performance improvements in Arrow 0.15
nealrichardson commented on a change in pull request #19: ARROW-6419: [Website] Blog post about Parquet C++ read performance improvements in Arrow 0.15
URL: https://github.com/apache/arrow-site/pull/19#discussion_r320433177

## File path: _posts/2019-09-03-faster-strings-cpp-parquet.md ##

+When reading a Parquet file, the dictionary-encoded portions are usually
+materialized to their non-dictionary-encoded form, causing binary or string
+values to be duplicated in memory.

Review comment: This seems to start a new section, where you've moved on from "background" and are starting to discuss your optimizations.
[GitHub] [arrow-site] nealrichardson commented on a change in pull request #19: ARROW-6419: [Website] Blog post about Parquet C++ read performance improvements in Arrow 0.15
nealrichardson commented on a change in pull request #19: ARROW-6419: [Website] Blog post about Parquet C++ read performance improvements in Arrow 0.15
URL: https://github.com/apache/arrow-site/pull/19#discussion_r320432044

## File path: _posts/2019-09-03-faster-strings-cpp-parquet.md ##

+many repeated values. pandas users will know this as the [Categorical type][8]

Review comment: lowercase "pandas" reads odd to start a sentence
[GitHub] [arrow-site] nealrichardson commented on a change in pull request #19: ARROW-6419: [Website] Blog post about Parquet C++ read performance improvements in Arrow 0.15
nealrichardson commented on a change in pull request #19: ARROW-6419: [Website] Blog post about Parquet C++ read performance improvements in Arrow 0.15
URL: https://github.com/apache/arrow-site/pull/19#discussion_r320431812

## File path: _posts/2019-09-03-faster-strings-cpp-parquet.md ##

+# Summary of work
+
+One of the challenges of developing the Parquet C++ library is that we
+maintain low-level read and write APIs that do not involve the Arrow columnar
+data structures.

Review comment: I'd move this paragraph to after the list of Jiras. This is an odd way to start a "summary"
[GitHub] [arrow-site] nealrichardson commented on a change in pull request #19: ARROW-6419: [Website] Blog post about Parquet C++ read performance improvements in Arrow 0.15
nealrichardson commented on a change in pull request #19: ARROW-6419: [Website] Blog post about Parquet C++ read performance improvements in Arrow 0.15
URL: https://github.com/apache/arrow-site/pull/19#discussion_r320432165

## File path: _posts/2019-09-03-faster-strings-cpp-parquet.md ##

+the indices would be encoded like
+
+```
+[rle-run=(6, 0),
+ bit-packed-run=[1]]
+```
+
+The full details of the rle-bitpacking encoding are found in the Parquet
+specification.

Review comment: Link to that please
[GitHub] [arrow-site] nealrichardson commented on a change in pull request #19: ARROW-6419: [Website] Blog post about Parquet C++ read performance improvements in Arrow 0.15
nealrichardson commented on a change in pull request #19: ARROW-6419: [Website] Blog post about Parquet C++ read performance improvements in Arrow 0.15
URL: https://github.com/apache/arrow-site/pull/19#discussion_r320431254

## File path: _posts/2019-09-03-faster-strings-cpp-parquet.md ##

+We discuss the work that was done and show benchmarks comparing Arrow 0.11.0

Review comment:
```suggestion
This post reviews the work that was done and shows benchmarks comparing Arrow 0.11.0
```
[arrow] branch master updated (561f86d -> 327057e)
This is an automated email from the ASF dual-hosted git repository. apitrou pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git.

from 561f86d ARROW-6412: [C++] Improve TCP port allocation in tests
add  327057e ARROW-6424: [C++] Fix IPC fuzzing test name

No new revisions were added by this update.

Summary of changes:
 cpp/cmake_modules/BuildUtils.cmake    | 3 +++
 cpp/src/arrow/filesystem/s3fs_test.cc | 2 --
 2 files changed, 3 insertions(+), 2 deletions(-)
[arrow] branch master updated (48d9069 -> 561f86d)
This is an automated email from the ASF dual-hosted git repository. apitrou pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git.

from 48d9069 ARROW-6422: [Gandiva] Fix double-conversion linker issue
add  561f86d ARROW-6412: [C++] Improve TCP port allocation in tests

No new revisions were added by this update.

Summary of changes:
 cpp/CMakeLists.txt            |  5 +++
 cpp/src/arrow/testing/util.cc | 72 +++
 cpp/src/arrow/testing/util.h  |  2 --
 3 files changed, 71 insertions(+), 8 deletions(-)
[arrow] branch master updated (c39e350 -> 48d9069)
This is an automated email from the ASF dual-hosted git repository. praveenbingo pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git.

from c39e350 ARROW-5610: [Python] define extension types in Python
add  48d9069 ARROW-6422: [Gandiva] Fix double-conversion linker issue

No new revisions were added by this update.

Summary of changes:
 dev/tasks/gandiva-jars/build-cpp-linux.sh | 1 -
 1 file changed, 1 deletion(-)
[arrow] branch master updated (9517ade -> c39e350)
This is an automated email from the ASF dual-hosted git repository. apitrou pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git.

from 9517ade ARROW-6269: [C++] check decimal precision in IPC code
add  c39e350 ARROW-5610: [Python] define extension types in Python

No new revisions were added by this update.

Summary of changes:
 cpp/src/arrow/python/extension_type.cc      |  26 ++--
 cpp/src/arrow/python/extension_type.h       |  11 +-
 docs/source/format/Columnar.rst             |   2 +
 docs/source/python/api/arrays.rst           |   1 +
 docs/source/python/api/datatypes.rst        |  11 ++
 docs/source/python/extending_types.rst      | 180
 python/pyarrow/__init__.py                  |   3 +-
 python/pyarrow/includes/libarrow.pxd        |   7 +-
 python/pyarrow/lib.pxd                      |   4 +
 python/pyarrow/public-api.pxi               |   9 +-
 python/pyarrow/tests/test_extension_type.py | 145 +-
 python/pyarrow/types.pxi                    | 129 ++--
 12 files changed, 496 insertions(+), 32 deletions(-)
[arrow] branch master updated (ab908cc -> 9517ade)
This is an automated email from the ASF dual-hosted git repository. apitrou pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/arrow.git.

from ab908cc ARROW-6411: [Python][Parquet] Improve performance of DictEncoder::PutIndices
add  9517ade ARROW-6269: [C++] check decimal precision in IPC code

No new revisions were added by this update.

Summary of changes:
 cpp/src/arrow/ipc/metadata_internal.cc |  3 +--
 cpp/src/arrow/type.cc                  | 13 +++--
 cpp/src/arrow/type.h                   |  7 +++
 3 files changed, 19 insertions(+), 4 deletions(-)