[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #9702: ARROW-11297: [C++][Python] Add ORC writer options

GitBox Mon, 29 Nov 2021 05:18:44 -0800


jorisvandenbossche commented on a change in pull request #9702:
URL: https://github.com/apache/arrow/pull/9702#discussion_r758307660




##########
File path: cpp/src/arrow/adapters/orc/adapter_options.h
##########
@@ -0,0 +1,269 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <set>
+#include <sstream>
+
+#include "arrow/io/interfaces.h"
+#include "arrow/util/visibility.h"
+#include "orc/OrcFile.hh"
+
+namespace liborc = orc;
+
+namespace arrow {
+
+namespace adapters {
+
+namespace orc {
+// Components of ORC Writer Options
+
+struct WriterOptionsPrivate;
+
+enum CompressionKind {
+  CompressionKind_NONE = 0,
+  CompressionKind_ZLIB = 1,
+  CompressionKind_SNAPPY = 2,
+  CompressionKind_LZO = 3,
+  CompressionKind_LZ4 = 4,
+  CompressionKind_ZSTD = 5,
+  CompressionKind_MAX = INT32_MAX
+};
+
+enum CompressionStrategy {
+  CompressionStrategy_SPEED = 0,
+  CompressionStrategy_COMPRESSION
+};
+
+enum RleVersion { RleVersion_1 = 0, RleVersion_2 = 1 };
+
+enum BloomFilterVersion {
+  // Include both the BLOOM_FILTER and BLOOM_FILTER_UTF8 streams to support
+  // both old and new readers.
+  ORIGINAL = 0,
+  // Only include the BLOOM_FILTER_UTF8 streams that consistently use UTF8.
+  // See ORC-101
+  UTF8 = 1,
+  FUTURE = INT32_MAX
+};
+
+class ARROW_EXPORT FileVersion {
+ private:
+  uint32_t major_version;
+  uint32_t minor_version;
+
+ public:
+  static const FileVersion& v_0_11();
+  static const FileVersion& v_0_12();
+
+  FileVersion(uint32_t major, uint32_t minor)
+      : major_version(major), minor_version(minor) {}
+
+  /**
+   * Get major version
+   */
+  uint32_t major() const { return this->major_version; }
+
+  /**
+   * Get minor version
+   */
+  uint32_t minor() const { return this->minor_version; }
+
+  bool operator==(const FileVersion& right) const {
+    return this->major_version == right.major() && this->minor_version == 
right.minor();
+  }
+
+  bool operator!=(const FileVersion& right) const { return !(*this == right); }
+
+  std::string ToString() const;
+};
+
+/**
+ * Options for creating a Writer.
+ */
+class ARROW_EXPORT WriterOptions {

Review comment:
       Small naming comment: `WriterOptions` is consistent with ORC, but I 
think `WriteOptions` might be more consistent with other Arrow parts (or maybe 
`OrcWriteOptions`)

##########
File path: python/pyarrow/_orc.pyx
##########
@@ -38,6 +38,283 @@ from pyarrow.lib cimport (check_status, _Weakrefable,
                           get_writer)
 from pyarrow.lib import tobytes
 
+cdef compression_kind_from_enum(CompressionKind compression_kind_):
+    return {
+        _CompressionKind_NONE: 'NONE',
+        _CompressionKind_ZLIB: 'ZLIB',
+        _CompressionKind_SNAPPY: 'SNAPPY',
+        _CompressionKind_LZO: 'LZO',
+        _CompressionKind_LZ4: 'LZ4',
+        _CompressionKind_ZSTD: 'ZSTD',
+    }.get(compression_kind_, 'UNKNOWN')
+
+cdef CompressionKind compression_kind_from_name(name):
+    name = name.upper()
+    if name == 'ZLIB':
+        return _CompressionKind_ZLIB
+    elif name == 'SNAPPY':
+        return _CompressionKind_SNAPPY
+    elif name == 'LZO':
+        return _CompressionKind_LZO
+    elif name == 'LZ4':
+        return _CompressionKind_LZ4
+    elif name == 'ZSTD':
+        return _CompressionKind_ZSTD
+    else:
+        return _CompressionKind_NONE

Review comment:
       Can you raise an error here for an unknown compression (after checking 
explicitly for the None case)? So that you don't silently get no compression if 
you have for example a typo.

##########
File path: python/pyarrow/orc.py
##########
@@ -117,8 +117,38 @@ def read(self, columns=None):
         return self.reader.read(columns=columns)
 
 
+_orc_writer_arg_docs = """file_version : {"0.11", "0.12"}, default "0.12"
+    Determine which ORC file version to use. Hive 0.11 / ORC v0 is the older
+    version as defined `here <https://orc.apache.org/specification/ORCv0/>`
+    while Hive 0.12 / ORC v1 is the newer one as defined
+    `here <https://orc.apache.org/specification/ORCv1/>`.
+stripe_size : int, default 64 * 1024 * 1024
+    Size of each ORC stripe.
+compression : string, default 'snappy'

Review comment:
       The default is ZLIB based on the C++?

##########
File path: python/pyarrow/orc.py
##########
@@ -143,17 +210,52 @@ def write(self, table):
         schema : pyarrow.lib.Table
             The table to be written into the ORC file
         """
+        assert self.is_open
         self.writer.write(table)
 
     def close(self):
         """
         Close the ORC file
         """
-        self.writer.close()
+        if self.is_open:
+            self.writer.close()
+            self.is_open = False
 
 
-def write_table(table, where):
-    """
+def write_table(table, where, file_version='0.12',
+                stripe_size=67108864,
+                compression='snappy',
+                compression_block_size=65536,
+                compression_strategy='speed',
+                row_index_stride=10000,
+                padding_tolerance=0.0,
+                dictionary_key_size_threshold=0.0,
+                bloom_filter_columns=None,
+                bloom_filter_fpp=0.05):
+    if isinstance(where, Table):
+        warnings.warn(
+            "The order of the arguments has changed. Pass as "
+            "'write_table(table, where)' instead. The old order will raise "
+            "an error in the future.", FutureWarning, stacklevel=2
+        )
+        table, where = where, table
+    writer = ORCWriter(
+        where,
+        file_version=file_version,
+        stripe_size=stripe_size,
+        compression=compression,
+        compression_block_size=compression_block_size,
+        compression_strategy=compression_strategy,
+        row_index_stride=row_index_stride,
+        padding_tolerance=padding_tolerance,
+        dictionary_key_size_threshold=dictionary_key_size_threshold,
+        bloom_filter_columns=bloom_filter_columns,
+        bloom_filter_fpp=bloom_filter_fpp
+    )
+    writer.write(table)
+    writer.close()
+
+    write_table.__doc__ = """

Review comment:
       ```suggestion
   
   write_table.__doc__ = """
   ```

##########
File path: python/pyarrow/_orc.pyx
##########
@@ -38,6 +38,283 @@ from pyarrow.lib cimport (check_status, _Weakrefable,
                           get_writer)
 from pyarrow.lib import tobytes
 
+cdef compression_kind_from_enum(CompressionKind compression_kind_):
+    return {
+        _CompressionKind_NONE: 'NONE',
+        _CompressionKind_ZLIB: 'ZLIB',
+        _CompressionKind_SNAPPY: 'SNAPPY',
+        _CompressionKind_LZO: 'LZO',
+        _CompressionKind_LZ4: 'LZ4',
+        _CompressionKind_ZSTD: 'ZSTD',
+    }.get(compression_kind_, 'UNKNOWN')
+
+cdef CompressionKind compression_kind_from_name(name):
+    name = name.upper()
+    if name == 'ZLIB':
+        return _CompressionKind_ZLIB
+    elif name == 'SNAPPY':
+        return _CompressionKind_SNAPPY
+    elif name == 'LZO':
+        return _CompressionKind_LZO
+    elif name == 'LZ4':
+        return _CompressionKind_LZ4
+    elif name == 'ZSTD':
+        return _CompressionKind_ZSTD
+    else:
+        return _CompressionKind_NONE
+
+cdef compression_strategy_from_enum(CompressionStrategy compression_strategy_):
+    return {
+        _CompressionStrategy_SPEED: 'SPEED',
+        _CompressionStrategy_COMPRESSION: 'COMPRESSION',
+    }.get(compression_strategy_, 'UNKNOWN')
+
+cdef CompressionStrategy compression_strategy_from_name(name):
+    name = name.upper()
+    # SPEED is the default value in the ORC C++ implementaton
+    if name == 'COMPRESSION':
+        return _CompressionStrategy_COMPRESSION
+    else:
+        return _CompressionStrategy_SPEED
+
+cdef rle_version_from_enum(RleVersion rle_version_):
+    return {
+        _RleVersion_1: '1',
+        _RleVersion_2: '2',
+    }.get(rle_version_, 'UNKNOWN')
+
+cdef bloom_filter_version_from_enum(BloomFilterVersion bloom_filter_version_):
+    return {
+        _BloomFilterVersion_ORIGINAL: 'ORIGINAL',
+        _BloomFilterVersion_UTF8: 'UTF8',
+        _BloomFilterVersion_FUTURE: 'FUTURE',
+    }.get(bloom_filter_version_, 'UNKNOWN')
+
+cdef file_version_from_class(FileVersion file_version_):
+    cdef object file_version = file_version_.ToString()
+    return file_version
+
+cdef class ORCWriterOptions(_Weakrefable):
+    cdef:
+        unique_ptr[WriterOptions] options

Review comment:
       Is this `ORCWriterOptions` class needed? I think the 
`_create_writer_options` function below can also directly construct a C++ 
WriterOptions object, and set the options that way? (this is how 
`_parquet.pyx::_create_writer_properties` does it)
   
   (this class is currently also not publicly exposed)

##########
File path: python/pyarrow/_orc.pyx
##########
@@ -38,6 +38,283 @@ from pyarrow.lib cimport (check_status, _Weakrefable,
                           get_writer)
 from pyarrow.lib import tobytes
 
+cdef compression_kind_from_enum(CompressionKind compression_kind_):
+    return {
+        _CompressionKind_NONE: 'NONE',
+        _CompressionKind_ZLIB: 'ZLIB',
+        _CompressionKind_SNAPPY: 'SNAPPY',
+        _CompressionKind_LZO: 'LZO',
+        _CompressionKind_LZ4: 'LZ4',
+        _CompressionKind_ZSTD: 'ZSTD',
+    }.get(compression_kind_, 'UNKNOWN')
+
+cdef CompressionKind compression_kind_from_name(name):

Review comment:
       Small styling nit: can you leave 2 blank lines between module-level 
functions/classes?

##########
File path: python/pyarrow/_orc.pyx
##########
@@ -38,6 +38,283 @@ from pyarrow.lib cimport (check_status, _Weakrefable,
                           get_writer)
 from pyarrow.lib import tobytes
 
+cdef compression_kind_from_enum(CompressionKind compression_kind_):
+    return {
+        _CompressionKind_NONE: 'NONE',

Review comment:
       We can maybe call this `UNCOMPRESSED` ?

##########
File path: cpp/src/arrow/adapters/orc/adapter_options.cc
##########
@@ -0,0 +1,283 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/adapters/orc/adapter_options.h"
+
+#include "arrow/adapters/orc/adapter.h"
+#include "orc/Common.hh"
+#include "orc/Int128.hh"
+#include "orc/Writer.hh"
+
+namespace liborc = orc;
+
+namespace arrow {
+
+namespace adapters {
+
+namespace orc {
+
+std::string FileVersion::ToString() const {
+  std::stringstream ss;
+  ss << major() << '.' << minor();
+  return ss.str();
+}
+
+const FileVersion& FileVersion::v_0_11() {
+  static FileVersion version(0, 11);
+  return version;
+}
+
+const FileVersion& FileVersion::v_0_12() {
+  static FileVersion version(0, 12);
+  return version;
+}
+
+struct WriterOptionsPrivate {
+  uint64_t stripe_size_;
+  uint64_t compression_block_size_;
+  uint64_t row_index_stride_;
+  CompressionKind compression_;
+  CompressionStrategy compression_strategy_;
+  liborc::MemoryPool* memory_pool_;
+  double padding_tolerance_;
+  std::ostream* error_stream_;
+  FileVersion file_version_;
+  double dictionary_key_size_threshold_;
+  bool enable_index_;
+  std::set<uint64_t> columns_use_bloom_filter_;
+  double bloom_filter_false_positive_prob_;
+  BloomFilterVersion bloom_filter_version_;
+
+  WriterOptionsPrivate() : file_version_(FileVersion::v_0_12()) {
+    stripe_size_ = 64 * 1024 * 1024;      // 64M
+    compression_block_size_ = 64 * 1024;  // 64K
+    row_index_stride_ = 10000;
+    compression_ = CompressionKind_ZLIB;
+    compression_strategy_ = CompressionStrategy_SPEED;
+    memory_pool_ = liborc::getDefaultPool();
+    padding_tolerance_ = 0.0;
+    error_stream_ = &std::cerr;
+    dictionary_key_size_threshold_ = 0.0;
+    enable_index_ = true;
+    bloom_filter_false_positive_prob_ = 0.05;
+    bloom_filter_version_ = UTF8;
+  }
+};
+
+WriterOptions::WriterOptions()
+    : private_bits_(std::unique_ptr<WriterOptionsPrivate>(new 
WriterOptionsPrivate())) {
+  // PASS
+}
+
+WriterOptions::WriterOptions(const WriterOptions& rhs)
+    : private_bits_(std::unique_ptr<WriterOptionsPrivate>(
+          new WriterOptionsPrivate(*(rhs.private_bits_.get())))) {
+  // PASS
+}
+
+WriterOptions::WriterOptions(WriterOptions& rhs) {
+  // swap private_bits with rhs
+  private_bits_.swap(rhs.private_bits_);
+}
+
+WriterOptions& WriterOptions::operator=(const WriterOptions& rhs) {
+  if (this != &rhs) {
+    private_bits_.reset(new WriterOptionsPrivate(*(rhs.private_bits_.get())));
+  }
+  return *this;
+}
+
+WriterOptions::~WriterOptions() {
+  // PASS
+}
+RleVersion WriterOptions::rle_version() const {
+  if (private_bits_->file_version_ == FileVersion::v_0_11()) {
+    return RleVersion_1;
+  }
+
+  return RleVersion_2;
+}
+
+WriterOptions& WriterOptions::set_stripe_size(uint64_t size) {
+  private_bits_->stripe_size_ = size;

Review comment:
       Wondering for the implementation strategy: now it seems you mimic the 
WriterOptions class from the ORC library (including the private_bits way), and 
in the end have one function that converts our Options class into a liborc 
Options class. 
   Would an alternative be to have our WriterOptions class wrap more directly 
liborc? So that eg this `set_stripe_size` method would call 
`orc_writer_options_->setStripeSize(size)` (more directly forwarding the calls 
to liborc::WriterOptions)
   
   (just wondering, I don't know if that's necessarily better) 

##########
File path: python/pyarrow/orc.py
##########
@@ -117,8 +117,38 @@ def read(self, columns=None):
         return self.reader.read(columns=columns)
 
 
+_orc_writer_arg_docs = """file_version : {"0.11", "0.12"}, default "0.12"
+    Determine which ORC file version to use. Hive 0.11 / ORC v0 is the older
+    version as defined `here <https://orc.apache.org/specification/ORCv0/>`
+    while Hive 0.12 / ORC v1 is the newer one as defined
+    `here <https://orc.apache.org/specification/ORCv1/>`.
+stripe_size : int, default 64 * 1024 * 1024
+    Size of each ORC stripe.
+compression : string, default 'snappy'
+    Specify the compression codec.
+    Valid values: {'NONE', 'SNAPPY', 'ZLIB', 'LZ0', 'LZ4', 'ZSTD'}
+compression_block_size : int, default 64 * 1024
+    Specify the size of each compression block.
+compression_strategy : string, default 'speed'
+    Specify the compression strategy i.e. speed vs size reduction.
+    Valid values: {'SPEED', 'COMPRESSION'}
+row_index_stride : int, default 10000
+    Specify the row index stride i.e. the number of rows per
+    an entry in the row index.
+padding_tolerance : double, default 0.0
+    Set the padding tolerance.
+dictionary_key_size_threshold : double, default 0.0
+    Specify if we should write statistics in general (default is True) or only
+    for some columns.

Review comment:
       I think this can use some more explanation of the link between the 
default of True being 0.0




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #9702: ARROW-11297: [C++][Python] Add ORC writer options

Reply via email to