[jira] [Created] (ORC-465) [C++] Add a test for PATCHED_BASE encoding
Fang Zheng created ORC-465:
--
Summary: [C++] Add a test for PATCHED_BASE encoding
Key: ORC-465
URL: https://issues.apache.org/jira/browse/ORC-465
Project: ORC
Issue Type: Test
Reporter: Fang Zheng

Add a test case to verify that PATCHED_BASE encoding can properly handle a gap larger than 255.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
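For readers unfamiliar with why 255 matters: as far as I understand the RLE v2 patch list, the gap between two patched positions is stored in a field of at most 8 bits, so a larger gap has to be spread over filler entries (gap 255, patch 0) followed by the real entry. The sketch below illustrates that splitting under this assumption. `splitPatchGap` and `patchedPosition` are hypothetical helpers written for this note, not functions from the ORC code base.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper (not ORC API): split one logical gap into
// (gap, patch) entries whose gap component fits in 8 bits.
std::vector<std::pair<uint32_t, int64_t>> splitPatchGap(uint32_t gap,
                                                        int64_t patch) {
  const uint32_t kMaxGap = 255;  // assumed largest gap a single entry can hold
  std::vector<std::pair<uint32_t, int64_t>> entries;
  while (gap > kMaxGap) {
    entries.emplace_back(kMaxGap, 0);  // filler: advance 255, patch nothing
    gap -= kMaxGap;
  }
  entries.emplace_back(gap, patch);  // the real patch lands here
  return entries;
}

// Reader side of the sketch: the patched position is the sum of all gaps.
uint32_t patchedPosition(
    const std::vector<std::pair<uint32_t, int64_t>>& entries) {
  uint32_t pos = 0;
  for (const auto& e : entries) {
    pos += e.first;
  }
  return pos;
}

// Usage example: a gap of 600 becomes two fillers plus one real entry.
void exampleSplitPatchGap() {
  const auto entries = splitPatchGap(600, 7);
  assert(entries.size() == 3);
  assert(patchedPosition(entries) == 600);
}
```

A test along the lines requested in the issue would feed the encoder data whose outliers sit more than 255 positions apart and check that the values round-trip.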
[jira] [Created] (ORC-464) [C++] Avoid computing zigzag values for DELTA and SHORT_REPEAT encoding
Fang Zheng created ORC-464:
--
Summary: [C++] Avoid computing zigzag values for DELTA and SHORT_REPEAT encoding
Key: ORC-464
URL: https://issues.apache.org/jira/browse/ORC-464
Project: ORC
Issue Type: Improvement
Components: C++
Reporter: Fang Zheng

In the current implementation, RleEncoderV2::determineEncoding() always computes zigzag values before determining the encoding type. However, zigzag values are needed only for DIRECT and PATCHED_BASE encoding, not for DELTA and SHORT_REPEAT. For efficiency, we should perform the zigzag computation only once it is determined to be necessary.
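To make the proposal concrete, here is a minimal sketch of zigzag encoding and of deferring it until the encoding type is known. It is not the RleEncoderV2 code: the `Encoding` enum, `needsZigZag`, and `zigZagIfNeeded` are hypothetical names chosen for illustration, and the only fact taken from the issue is which encodings need zigzag values.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Zigzag maps signed to unsigned so small magnitudes stay small:
// 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
inline uint64_t zigZag(int64_t v) {
  const uint64_t sign = v < 0 ? UINT64_MAX : 0;  // all ones for negatives
  return (static_cast<uint64_t>(v) << 1) ^ sign;
}

inline int64_t unZigZag(uint64_t u) {
  // (~(u & 1) + 1) is all ones when the low bit is set, zero otherwise.
  return static_cast<int64_t>((u >> 1) ^ (~(u & 1) + 1));
}

// Hypothetical enum for this sketch only.
enum class Encoding { SHORT_REPEAT, DIRECT, PATCHED_BASE, DELTA };

inline bool needsZigZag(Encoding e) {
  return e == Encoding::DIRECT || e == Encoding::PATCHED_BASE;
}

// Compute zigzag values only after the encoding has been chosen.
std::vector<uint64_t> zigZagIfNeeded(Encoding e,
                                     const std::vector<int64_t>& literals) {
  std::vector<uint64_t> out;
  if (!needsZigZag(e)) {
    return out;  // DELTA and SHORT_REPEAT skip the pass entirely
  }
  out.reserve(literals.size());
  for (int64_t v : literals) {
    out.push_back(zigZag(v));
  }
  return out;
}

// Usage example.
void exampleZigZag() {
  assert(zigZag(-2) == 3);
  assert(unZigZag(zigZag(-2)) == -2);
  assert(zigZagIfNeeded(Encoding::DELTA, {1, 2, 3}).empty());
}
```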
[jira] [Created] (ORC-456) [C++] Simplify code logic in RleEncoderV2
Fang Zheng created ORC-456:
--
Summary: [C++] Simplify code logic in RleEncoderV2
Key: ORC-456
URL: https://issues.apache.org/jira/browse/ORC-456
Project: ORC
Issue Type: Improvement
Components: C++
Reporter: Fang Zheng

There is suboptimal code in the RleEncoderV2::write() function (between lines 132-146). When a trailing min repeat run (i.e., 3 identical values) is detected and there is an ongoing variable run, the code copies the trailing identical values to a buffer named "tailVals", writes out the values that precede them in the literals buffer, and then copies the 3 values back to the beginning of the literals buffer. Given that the last 3 values are known to be equal to the current value passed to write(), we do not need to copy those values back and forth through the tailVals buffer.
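The idea can be sketched as follows. This is a simplified stand-in, not the actual RleEncoderV2 code: `LiteralBuffer`, `written`, and `flushBeforeRepeat` are hypothetical names, and "writing out" is modeled as appending to a vector.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kMinRepeat = 3;  // "3 identical values" per the issue

// Hypothetical stand-in for the encoder's literals buffer.
struct LiteralBuffer {
  std::vector<int64_t> literals;
  std::vector<int64_t> written;  // stand-in for the encoded output

  // Precondition: the last kMinRepeat literals all equal `val`.
  // Flush everything before the repeat, then refill the buffer from
  // `val` directly instead of copying through a temporary tail buffer.
  void flushBeforeRepeat(int64_t val) {
    assert(literals.size() >= kMinRepeat);
    const std::ptrdiff_t prefix =
        static_cast<std::ptrdiff_t>(literals.size() - kMinRepeat);
    written.insert(written.end(), literals.begin(),
                   literals.begin() + prefix);
    literals.assign(kMinRepeat, val);  // three copies of the known value
  }
};

// Usage example.
void exampleFlushBeforeRepeat() {
  LiteralBuffer buf;
  buf.literals = {1, 2, 5, 5, 5};
  buf.flushBeforeRepeat(5);
  assert(buf.written.size() == 2);
  assert(buf.literals.size() == kMinRepeat);
}
```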
[jira] [Created] (ORC-445) [C++] Code improvements in RLEV2Util
Fang Zheng created ORC-445:
--
Summary: [C++] Code improvements in RLEV2Util
Key: ORC-445
URL: https://issues.apache.org/jira/browse/ORC-445
Project: ORC
Issue Type: Improvement
Components: C++
Reporter: Fang Zheng

This is a follow-up of ORC-444. The following functions in RLEV2Util.hh can be optimized by replacing the if-else statements with direct array lookup:

inline uint32_t getClosestFixedBits(uint32_t n);
inline uint32_t getClosestAlignedFixedBits(uint32_t n);
inline uint32_t encodeBitWidth(uint32_t n);
inline uint32_t findClosestNumBits(int64_t value);
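A rough sketch of the if-else-to-table transformation for one of these functions follows. The bit-width set used here (1 through 24, then 26, 28, 30, 32, 40, 48, 56, 64) is how I recall the RLE v2 fixed bit sizes and should be treated as an assumption. The function names are hypothetical and do not match RLEV2Util.hh.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Branchy reference version (assumed width set, see note above).
inline uint32_t closestFixedBitsBranchy(uint32_t n) {
  if (n == 0) return 1;
  if (n <= 24) return n;
  if (n <= 26) return 26;
  if (n <= 28) return 28;
  if (n <= 30) return 30;
  if (n <= 32) return 32;
  if (n <= 40) return 40;
  if (n <= 48) return 48;
  if (n <= 56) return 56;
  return 64;
}

// Table built once from the reference version, so the two cannot drift.
inline const std::array<uint8_t, 65>& closestFixedBitsTable() {
  static const std::array<uint8_t, 65> table = [] {
    std::array<uint8_t, 65> t{};
    for (uint32_t n = 0; n <= 64; ++n) {
      t[n] = static_cast<uint8_t>(closestFixedBitsBranchy(n));
    }
    return t;
  }();
  return table;
}

// Lookup version: a single bounds check plus one array read.
inline uint32_t closestFixedBitsLookup(uint32_t n) {
  return n <= 64 ? closestFixedBitsTable()[n] : 64;
}

// Usage example.
void exampleClosestFixedBits() {
  assert(closestFixedBitsLookup(25) == 26);
  assert(closestFixedBitsLookup(33) == 40);
}
```

In production code the table would more likely be a literal constant array. Building it from the branchy version here keeps the sketch short and makes the equivalence easy to check.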
[jira] [Created] (ORC-443) [C++] Code improvements in ColumnWriter
Fang Zheng created ORC-443:
--
Summary: [C++] Code improvements in ColumnWriter
Key: ORC-443
URL: https://issues.apache.org/jira/browse/ORC-443
Project: ORC
Issue Type: Improvement
Components: C++
Reporter: Fang Zheng

A few changes to ColumnWriter and its derived classes:

1. In the add() function, re-order the code to verify input parameters before modifying any internal state.
2. In the add() function, move the calls to colIndexStatistics->increase(1) out of the loop. Many of those are virtual function calls.
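Both points can be illustrated with a small, self-contained sketch. None of the types below are ORC classes: `ColumnStatsSketch`, `CountingStats`, and `addBatch` are hypothetical stand-ins that only mimic the shape of the pattern (validate first, count in the loop, make one virtual call afterwards).

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical statistics interface with a virtual increase().
struct ColumnStatsSketch {
  virtual ~ColumnStatsSketch() = default;
  virtual void increase(uint64_t count) = 0;
};

// Test double that records how often increase() is invoked.
struct CountingStats : ColumnStatsSketch {
  uint64_t values = 0;
  uint64_t calls = 0;
  void increase(uint64_t count) override {
    values += count;
    ++calls;
  }
};

void addBatch(ColumnStatsSketch& stats, const std::vector<char>& notNull,
              uint64_t offset, uint64_t numValues) {
  // 1. Validate inputs before touching any state.
  if (offset > notNull.size() || numValues > notNull.size() - offset) {
    throw std::out_of_range("offset/numValues exceed batch size");
  }
  // 2. Count in the loop, then make a single virtual call.
  uint64_t count = 0;
  for (uint64_t i = 0; i < numValues; ++i) {
    if (notNull[offset + i]) {
      ++count;
    }
  }
  stats.increase(count);
}

// Usage example: three rows inspected, two non-null, one virtual call.
void exampleAddBatch() {
  CountingStats stats;
  addBatch(stats, {1, 0, 1, 1}, 1, 3);
  assert(stats.values == 2);
  assert(stats.calls == 1);
}
```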
[jira] [Created] (ORC-442) [C++] Code improvements in Statistics and Writer
Fang Zheng created ORC-442:
--
Summary: [C++] Code improvements in Statistics and Writer
Key: ORC-442
URL: https://issues.apache.org/jira/browse/ORC-442
Project: ORC
Issue Type: Improvement
Components: C++
Reporter: Fang Zheng

A few code changes in the Statistics and Writer classes:

1. Change StatisticsImpl to use vector instead of list for storing ColumnStatistics. The required operations are push_back() in the ctor, iteration in the dtor, and random element access in getColumnStatistics(). Because list does not support random access in constant time, vector is more appropriate.
2. InternalBooleanStatistics is currently typedef-ed as InternalStatisticsImpl. Since min/max/sum do not apply to BooleanColumnStatistics, we should define InternalBooleanStatistics to be InternalStatisticsImpl to save 21 bytes per instance.
3. Misc. changes to ColumnWriter.hh, Writer.cc, Compression.hh, and Statistics.hh to fix typos in Doxygen and reduce object copies.

Please see the PR for details.
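The message as archived does not preserve which template arguments are involved in item 2, so the following is only an illustration of how a saving of that size can arise. It assumes, hypothetically, a struct holding min, max, and sum fields of a template type, instantiated once with a 64-bit integer and once with `char`. `MinMaxSumSketch` is not an ORC type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical layout: three fields whose size follows the template type.
template <typename T>
struct MinMaxSumSketch {
  T min;
  T max;
  T sum;
};

// Bytes saved per instance when the unused fields shrink to one byte each.
constexpr std::size_t savedBytes() {
  return sizeof(MinMaxSumSketch<int64_t>) - sizeof(MinMaxSumSketch<char>);
}

// Usage example: 3 * 8 bytes versus 3 * 1 byte on mainstream platforms.
void exampleSavedBytes() {
  assert(sizeof(MinMaxSumSketch<int64_t>) == 24);
  assert(sizeof(MinMaxSumSketch<char>) == 3);
}
```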
[jira] [Created] (ORC-431) Fix typo in exception message and simplify code logic
Fang Zheng created ORC-431:
--
Summary: Fix typo in exception message and simplify code logic
Key: ORC-431
URL: https://issues.apache.org/jira/browse/ORC-431
Project: ORC
Issue Type: Improvement
Components: C++
Reporter: Fang Zheng

1. Fix typo in the exception message in WriterOptions::setFileVersion(): "Unpoorted" should be "Unsupported".
2. Simplify some code in Writer.cc and OutputStream.cc to avoid re-computation.
3. Simplify a condition check in WriterImpl::buildFooterType().
[jira] [Created] (ORC-429) Refactor code in TypeImpl.cc
Fang Zheng created ORC-429:
--
Summary: Refactor code in TypeImpl.cc
Key: ORC-429
URL: https://issues.apache.org/jira/browse/ORC-429
Project: ORC
Issue Type: Improvement
Reporter: Fang Zheng

I propose two changes to the code in TypeImpl.cc:

1. In the convertType() function: in the case of proto::Type_Kind_STRUCT, two vectors are created but never used. They should be removed.
2. In the TypeImpl::parseType() function: the function calls input.substr() to copy the substring before parsing it. This string copy can be avoided by parsing the input string directly.

Please see the pull request for details.
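The second point amounts to passing a [start, end) range into the parser instead of a copied substring. Here is a minimal sketch of that pattern. `parseUnsigned` is a hypothetical helper, not a TypeImpl function, and the type string in the example is only sample input.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Parse the decimal digits in input[start, end) in place, without
// allocating a temporary string via input.substr(start, end - start).
uint64_t parseUnsigned(const std::string& input, std::size_t start,
                       std::size_t end) {
  if (start >= end || end > input.size()) {
    throw std::logic_error("empty or out-of-range span");
  }
  uint64_t value = 0;
  for (std::size_t i = start; i < end; ++i) {
    const char c = input[i];
    if (c < '0' || c > '9') {
      throw std::logic_error("not a digit");
    }
    value = value * 10 + static_cast<uint64_t>(c - '0');
  }
  return value;
}

// Usage example: read precision and scale straight out of the type string.
void exampleParseUnsigned() {
  const std::string type = "decimal(10,2)";
  assert(parseUnsigned(type, 8, 10) == 10);
  assert(parseUnsigned(type, 11, 12) == 2);
}
```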
[jira] [Created] (ORC-428) Use ORC_UNIQUE_PTR consistently in OrcFile, OrcHdfsFile, and Writer
Fang Zheng created ORC-428:
--
Summary: Use ORC_UNIQUE_PTR consistently in OrcFile, OrcHdfsFile, and Writer
Key: ORC-428
URL: https://issues.apache.org/jira/browse/ORC-428
Project: ORC
Issue Type: Bug
Components: C++
Reporter: Fang Zheng

In OrcFile.hh, the declarations of readLocalFile() and four other functions return ORC_UNIQUE_PTR:

ORC_UNIQUE_PTR readLocalFile(const std::string& path);
ORC_UNIQUE_PTR readHdfsFile(const std::string& path);
ORC_UNIQUE_PTR createReader(ORC_UNIQUE_PTR stream, const ReaderOptions& options);
ORC_UNIQUE_PTR createWriter(const Type& type, OutputStream* stream, const WriterOptions& options);

However, these functions' definitions all return std::unique_ptr. On a system where ORC_UNIQUE_PTR is defined as std::auto_ptr rather than std::unique_ptr, the declarations and definitions are inconsistent.
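The fix pattern is to spell the return type through the same macro in both the declaration and the definition. The sketch below shows this with made-up names: `DEMO_UNIQUE_PTR`, `DemoStream`, and `openDemoStream` are hypothetical and stand in for the ORC macro and functions, whose exact template arguments are not shown in the message above.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical smart-pointer macro; a build could map it to another type.
#ifndef DEMO_UNIQUE_PTR
#define DEMO_UNIQUE_PTR std::unique_ptr
#endif

struct DemoStream {
  std::string path;
};

// Declaration (as it would appear in a header).
DEMO_UNIQUE_PTR<DemoStream> openDemoStream(const std::string& path);

// Definition (as it would appear in a source file): it uses the same macro,
// so it keeps matching the declaration whatever the macro expands to.
DEMO_UNIQUE_PTR<DemoStream> openDemoStream(const std::string& path) {
  return DEMO_UNIQUE_PTR<DemoStream>(new DemoStream{path});
}

// Usage example.
void exampleOpenDemoStream() {
  DEMO_UNIQUE_PTR<DemoStream> stream = openDemoStream("demo.orc");
  assert(stream->path == "demo.orc");
}
```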
[jira] [Created] (ORC-427) Fix errors in ORC C++ public API Doxygen documentation
Fang Zheng created ORC-427:
--
Summary: Fix errors in ORC C++ public API Doxygen documentation
Key: ORC-427
URL: https://issues.apache.org/jira/browse/ORC-427
Project: ORC
Issue Type: Bug
Components: C++, documentation
Reporter: Fang Zheng

There are a few typos, grammar errors, and formatting errors in the C++ public API Doxygen documentation. For example, there is a broken sentence in the createReader() function in OrcFile.hh: "Create a reader to the for the ORC file." Please see the pull request for details.
[jira] [Created] (ORC-426) Errors in ORC Specification
Fang Zheng created ORC-426:
--
Summary: Errors in ORC Specification
Key: ORC-426
URL: https://issues.apache.org/jira/browse/ORC-426
Project: ORC
Issue Type: Bug
Components: documentation
Reporter: Fang Zheng

There are some errors in the ORC format specifications:

1. In specification/ORCv1.md and specification/ORCv2.md, the following sentence appears twice in the description of "Patched Base":
   Data values (W * L bits padded to the byte) - A sequence of W bit positive values that are added to the base value.
2. In specification/ORCv0.md, specification/ORCv1.md, and specification/ORCv2.md, there is an error in the description of "Map Columns":
   Maps are encoded as the PRESENT stream and a length stream with number of items in each list.
   The last word "list" should be changed to "map".
3. In specification/ORCv1.md and specification/ORCv2.md, the word "BloomFilterEntry" should be changed to "bloom filter entry", as "BloomFilterEntry" does not exist in the source code or ProtocolBuffer definition.