[jira] [Created] (ORC-465) [C++] Add a test for PATCHED_BASE encoding

2019-01-28 Thread Fang Zheng (JIRA)
Fang Zheng created ORC-465:
--

 Summary: [C++] Add a test for PATCHED_BASE encoding
 Key: ORC-465
 URL: https://issues.apache.org/jira/browse/ORC-465
 Project: ORC
  Issue Type: Test
Reporter: Fang Zheng


Add a test case to verify that PATCHED_BASE encoding can properly handle a gap 
larger than 255.
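
As a rough illustration of the case such a test would exercise: in the ORC RLE v2 format, each patch list entry stores the distance from the previous patched position in at most 8 bits, so a gap above 255 is split into filler entries with gap 255 and patch 0. The sketch below is an assumption-laden stand-in, not the ORC encoder; the helper name `buildPatchList` and its signature are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper, not part of the ORC code base.
// positions: strictly increasing indexes of the values that need a patch.
// patches:   the patch bits for each of those positions.
// Returns (gap, patch) entries in which every gap fits in 8 bits.
std::vector<std::pair<uint32_t, int64_t>> buildPatchList(
    const std::vector<size_t>& positions,
    const std::vector<int64_t>& patches) {
  std::vector<std::pair<uint32_t, int64_t>> entries;
  size_t prev = 0;
  for (size_t i = 0; i < positions.size(); ++i) {
    size_t gap = positions[i] - prev;
    // A gap above 255 is split into filler entries (gap 255, patch 0).
    while (gap > 255) {
      entries.emplace_back(255u, 0);
      gap -= 255;
    }
    entries.emplace_back(static_cast<uint32_t>(gap), patches[i]);
    prev = positions[i];
  }
  return entries;
}
```

With patched positions 10 and 600, the second patch is 590 slots after the first, so this sketch emits two filler entries followed by an entry with gap 80.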




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ORC-464) [C++] Avoid computing zigzag values for DELTA and SHORT_REPEAT encoding

2019-01-28 Thread Fang Zheng (JIRA)
Fang Zheng created ORC-464:
--

 Summary: [C++] Avoid computing zigzag values for DELTA and 
SHORT_REPEAT encoding
 Key: ORC-464
 URL: https://issues.apache.org/jira/browse/ORC-464
 Project: ORC
  Issue Type: Improvement
  Components: C++
Reporter: Fang Zheng


In the current implementation, RleEncoderV2::determineEncoding() always computes 
zigzag values before determining the encoding type. However, zigzag values are 
needed only for DIRECT and PATCHED_BASE encoding, not for DELTA or 
SHORT_REPEAT. For efficiency, we should compute zigzag values only once the 
chosen encoding is known to need them.
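
The idea can be sketched as follows. This is a toy chooser under simplifying assumptions, not RleEncoderV2::determineEncoding(): the names `zigZag` and `chooseEncoding` are hypothetical, PATCHED_BASE is left out, and the DELTA test is reduced to a monotonicity check. It only shows the zigzag pass being deferred until an encoding that needs it has been selected.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Standard zigzag mapping: 0, -1, 1, -2, 2, ... -> 0, 1, 2, 3, 4, ...
inline uint64_t zigZag(int64_t v) {
  const uint64_t signMask = v < 0 ? ~uint64_t{0} : uint64_t{0};
  return (static_cast<uint64_t>(v) << 1) ^ signMask;
}

enum class Encoding { SHORT_REPEAT, DELTA, DIRECT };  // simplified set

// Hypothetical, heavily simplified chooser. zigzagOut is filled only
// when the selected encoding actually consumes zigzag values.
Encoding chooseEncoding(const std::vector<int64_t>& vals,
                        std::vector<uint64_t>& zigzagOut) {
  zigzagOut.clear();
  bool allEqual = true, nonDecreasing = true, nonIncreasing = true;
  for (size_t i = 1; i < vals.size(); ++i) {
    allEqual = allEqual && vals[i] == vals[0];
    nonDecreasing = nonDecreasing && vals[i] >= vals[i - 1];
    nonIncreasing = nonIncreasing && vals[i] <= vals[i - 1];
  }
  if (allEqual && vals.size() >= 3 && vals.size() <= 10) {
    return Encoding::SHORT_REPEAT;  // no zigzag pass
  }
  if (nonDecreasing || nonIncreasing) {
    return Encoding::DELTA;  // no zigzag pass
  }
  zigzagOut.reserve(vals.size());
  for (int64_t v : vals) zigzagOut.push_back(zigZag(v));
  return Encoding::DIRECT;
}
```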






[jira] [Created] (ORC-456) [C++] Simplify code logic in RleEncoderV2

2019-01-04 Thread Fang Zheng (JIRA)
Fang Zheng created ORC-456:
--

 Summary: [C++] Simplify code logic in RleEncoderV2
 Key: ORC-456
 URL: https://issues.apache.org/jira/browse/ORC-456
 Project: ORC
  Issue Type: Improvement
  Components: C++
Reporter: Fang Zheng


The code in the RleEncoderV2::write() function (lines 132-146) is suboptimal. 

When a trailing min repeat run (i.e., 3 identical values) is detected and there 
is an ongoing variable run, the code copies the trailing identical values to a 
buffer named "tailVals", writes out the values that precede them in the literals 
buffer, and then copies the 3 values back to the beginning of the literals buffer.

Since the last 3 values are known to equal the current value passed to 
write(), we do not need to copy them back and forth through the tailVals 
buffer. 
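
A minimal sketch of the simplification, assuming a toy stand-in for the encoder state (`ToyRun`, `flushed`, and `splitTrailingRepeat` are hypothetical names, not members of RleEncoderV2): the values before the trailing repeat are written out, and the buffer is then refilled directly from the current value instead of being restored from a temporary copy.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr size_t MIN_REPEAT = 3;

// Hypothetical stand-in for the encoder state; not RleEncoderV2 itself.
struct ToyRun {
  std::vector<int64_t> literals;  // buffered values, ending in MIN_REPEAT copies of val
  std::vector<int64_t> flushed;   // stand-in for the output stream

  // Precondition: literals holds at least MIN_REPEAT values and the
  // last MIN_REPEAT of them all equal val.
  void splitTrailingRepeat(int64_t val) {
    const auto prefix =
        static_cast<std::ptrdiff_t>(literals.size() - MIN_REPEAT);
    // Write out everything that precedes the trailing repeat.
    flushed.insert(flushed.end(), literals.begin(), literals.begin() + prefix);
    // Restart the buffer from val directly; no tailVals-style copy is needed.
    literals.assign(MIN_REPEAT, val);
  }
};
```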






[jira] [Created] (ORC-445) [C++] Code improvements in RLEV2Util

2018-12-03 Thread Fang Zheng (JIRA)
Fang Zheng created ORC-445:
--

 Summary: [C++] Code improvements in RLEV2Util
 Key: ORC-445
 URL: https://issues.apache.org/jira/browse/ORC-445
 Project: ORC
  Issue Type: Improvement
  Components: C++
Reporter: Fang Zheng


This is a follow-up of ORC-444. The following functions in RLEV2Util.hh can be 
optimized by replacing the if-else statements with direct array lookup:

  inline uint32_t getClosestFixedBits(uint32_t n);
  inline uint32_t getClosestAlignedFixedBits(uint32_t n);
  inline uint32_t encodeBitWidth(uint32_t n);
  inline uint32_t findClosestNumBits(int64_t value);
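
To illustrate the kind of change proposed, here is a sketch for one of the four functions. It assumes the set of bit widths supported by RLE v2 (1 to 24, then 26, 28, 30, 32, 40, 48, 56, 64); the function names below are hypothetical and the table is built from the branching version rather than copied from the ORC sources.

```cpp
#include <array>
#include <cstdint>

// Branching version, rounding n up to the nearest supported bit width.
inline uint32_t closestFixedBitsBranchy(uint32_t n) {
  if (n == 0) return 1;
  if (n <= 24) return n;
  if (n <= 26) return 26;
  if (n <= 28) return 28;
  if (n <= 30) return 30;
  if (n <= 32) return 32;
  if (n <= 40) return 40;
  if (n <= 48) return 48;
  if (n <= 56) return 56;
  return 64;
}

// Table built once from the branching version; the index is the
// requested width in bits (0..64).
inline const std::array<uint8_t, 65>& closestFixedBitsTable() {
  static const std::array<uint8_t, 65> table = [] {
    std::array<uint8_t, 65> t{};
    for (uint32_t n = 0; n <= 64; ++n) {
      t[n] = static_cast<uint8_t>(closestFixedBitsBranchy(n));
    }
    return t;
  }();
  return table;
}

// Lookup version: one bounds check and one array read.
inline uint32_t closestFixedBitsLookup(uint32_t n) {
  return n <= 64 ? closestFixedBitsTable()[n] : 64;
}
```

Building the table from the branching function keeps the two versions provably in agreement; a hand-written constant array would avoid the one-time initialization at the cost of that cross-check.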






[jira] [Created] (ORC-443) [C++] Code improvements in ColumnWriter

2018-11-29 Thread Fang Zheng (JIRA)
Fang Zheng created ORC-443:
--

 Summary: [C++] Code improvements in ColumnWriter
 Key: ORC-443
 URL: https://issues.apache.org/jira/browse/ORC-443
 Project: ORC
  Issue Type: Improvement
  Components: C++
Reporter: Fang Zheng


A few changes to ColumnWriter and its derived classes:

1. In the add() function, reorder the code to verify input parameters before 
modifying any internal state.

2. In the add() function, move the calls to colIndexStatistics->increase(1) out 
of the loop. Many of those are virtual function calls.
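
The second change can be sketched as below, under stated assumptions: `Stats`, `CountingStats`, and `addBatch` are hypothetical stand-ins rather than the real ColumnWriter classes. The loop counts present values in a local variable and makes a single virtual call per batch.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical statistics interface with a virtual increase().
struct Stats {
  virtual ~Stats() = default;
  virtual void increase(uint64_t n) = 0;
};

// Test double that records the total and the number of virtual calls.
struct CountingStats : Stats {
  uint64_t count = 0;
  uint64_t calls = 0;
  void increase(uint64_t n) override {
    count += n;
    ++calls;
  }
};

// notNull may be nullptr, meaning every value in the batch is present.
void addBatch(Stats& stats, const char* notNull, size_t numValues) {
  uint64_t present = 0;
  for (size_t i = 0; i < numValues; ++i) {
    if (notNull == nullptr || notNull[i]) {
      ++present;  // per-value work would go here
    }
  }
  stats.increase(present);  // one virtual call per batch, not one per value
}
```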





[jira] [Created] (ORC-442) [C++] Code improvements in Statistics and Writer

2018-11-29 Thread Fang Zheng (JIRA)
Fang Zheng created ORC-442:
--

 Summary: [C++] Code improvements in Statistics and Writer
 Key: ORC-442
 URL: https://issues.apache.org/jira/browse/ORC-442
 Project: ORC
  Issue Type: Improvement
  Components: C++
Reporter: Fang Zheng


A few code changes in Statistics and Writer classes:

1. Change StatisticsImpl to use vector instead of list for storing 
ColumnStatistics. Because the required operations are push_back() in ctor, 
iteration in dtor, and random element access in getColumnStatistics(), and list 
does not support random access in constant time, vector would be more 
appropriate than list.

2. InternalBooleanStatistics is currently typedef-ed as 
InternalStatisticsImpl<uint64_t>. Since min/max/sum does not apply to 
BooleanColumnStatistics, we should define InternalBooleanStatistics to be 
InternalStatisticsImpl<char> to save 21 bytes per instance.

3. Misc. changes to ColumnWriter.hh, Writer.cc, Compression.hh, and 
Statistics.hh to fix typos in Doxygen and reduce object copies.

Please see PR for details.
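
The access-pattern argument in item 1 can be sketched as follows. `ColStats` and the two accessor functions are hypothetical stand-ins, not the StatisticsImpl code: reaching element i of a std::list walks i nodes, while std::vector indexes in constant time.

```cpp
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <list>
#include <vector>

// Hypothetical stand-in for a per-column statistics object.
struct ColStats {
  uint64_t numValues;
};

// std::list: std::next advances node by node, so access is linear in i.
inline const ColStats& getFromList(const std::list<ColStats>& stats, size_t i) {
  return *std::next(stats.begin(), static_cast<std::ptrdiff_t>(i));
}

// std::vector: operator[] is a constant-time index into contiguous storage.
inline const ColStats& getFromVector(const std::vector<ColStats>& stats,
                                     size_t i) {
  return stats[i];
}
```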






[jira] [Created] (ORC-431) Fix typo in exception message and simplify code logic

2018-11-01 Thread Fang Zheng (JIRA)
Fang Zheng created ORC-431:
--

 Summary: Fix typo in exception message and simplify code logic
 Key: ORC-431
 URL: https://issues.apache.org/jira/browse/ORC-431
 Project: ORC
  Issue Type: Improvement
  Components: C++
Reporter: Fang Zheng


1. Fix typo in the exception message in WriterOptions::setFileVersion(): 
"Unpoorted" should be "Unsupported".

2. Simplify some code in Writer.cc and OutputStream.cc to avoid re-computation.

3. Simplify a condition check in WriterImpl::buildFooterType().





[jira] [Created] (ORC-429) Refactor code in TypeImpl.cc

2018-10-31 Thread Fang Zheng (JIRA)
Fang Zheng created ORC-429:
--

 Summary: Refactor code in TypeImpl.cc
 Key: ORC-429
 URL: https://issues.apache.org/jira/browse/ORC-429
 Project: ORC
  Issue Type: Improvement
Reporter: Fang Zheng


I propose two changes to the code in TypeImpl.cc:
 
1. In convertType() function: in the case of proto::Type_Kind_STRUCT, two 
vectors are created but never used. They shall be removed.

2. In TypeImpl::parseType() function: the function calls input.substr() to copy 
the substring before parsing it. This string copy can be avoided by directly 
parsing on the input string. Please see pull request for details.
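
A small sketch of parsing in place, assuming a hypothetical helper `parseUInt` (not a function in TypeImpl.cc): the caller passes a half-open index range into the original string, so no substring is allocated.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: parses the decimal digits in input[start, end)
// in place, without creating a substring. Overflow is not checked.
uint64_t parseUInt(const std::string& input, size_t start, size_t end) {
  if (start >= end || end > input.size()) {
    throw std::logic_error("empty or invalid range");
  }
  uint64_t result = 0;
  for (size_t i = start; i < end; ++i) {
    const char c = input[i];
    if (c < '0' || c > '9') {
      throw std::logic_error("not a digit");
    }
    result = result * 10 + static_cast<uint64_t>(c - '0');
  }
  return result;
}
```

For a type string such as "decimal(10,2)", the caller would locate the parenthesis and comma with std::string::find and pass those positions as the range bounds.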






[jira] [Created] (ORC-428) Use ORC_UNIQUE_PTR consistently in OrcFile, OrcHdfsFile, and Writer

2018-10-30 Thread Fang Zheng (JIRA)
Fang Zheng created ORC-428:
--

 Summary: Use ORC_UNIQUE_PTR consistently in OrcFile, OrcHdfsFile, 
and Writer
 Key: ORC-428
 URL: https://issues.apache.org/jira/browse/ORC-428
 Project: ORC
  Issue Type: Bug
  Components: C++
Reporter: Fang Zheng


In OrcFile.hh, the declarations of readLocalFile() and four other functions 
return ORC_UNIQUE_PTR:

ORC_UNIQUE_PTR<InputStream> readLocalFile(const std::string& path);

ORC_UNIQUE_PTR<InputStream> readHdfsFile(const std::string& path);

ORC_UNIQUE_PTR<Reader> createReader(ORC_UNIQUE_PTR<InputStream> stream,
                                    const ReaderOptions& options);

ORC_UNIQUE_PTR<Writer> createWriter(const Type& type, OutputStream* stream,
                                    const WriterOptions& options);

However, these functions' definitions all return std::unique_ptr. On a system 
where ORC_UNIQUE_PTR is defined as std::auto_ptr rather than std::unique_ptr, 
the declarations and the definitions are inconsistent.
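
The consistent pattern can be sketched as follows. `DEMO_UNIQUE_PTR`, `DemoStream`, and `openDemoStream` are hypothetical names used only for illustration; the point is that the declaration and the definition both spell the return type through the same macro, so they agree whichever pointer type the macro expands to.

```cpp
#include <memory>
#include <string>

// Hypothetical macro standing in for ORC_UNIQUE_PTR. A build could
// predefine it to a different smart pointer template.
#ifndef DEMO_UNIQUE_PTR
#define DEMO_UNIQUE_PTR std::unique_ptr
#endif

// Hypothetical stream type.
struct DemoStream {
  std::string path;
};

// Declaration (header side).
DEMO_UNIQUE_PTR<DemoStream> openDemoStream(const std::string& path);

// Definition (source side) uses the same macro for the return type,
// so it cannot drift from the declaration.
DEMO_UNIQUE_PTR<DemoStream> openDemoStream(const std::string& path) {
  return DEMO_UNIQUE_PTR<DemoStream>(new DemoStream{path});
}
```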

 





[jira] [Created] (ORC-427) Fix errors in ORC C++ public API Doxygen documentation

2018-10-30 Thread Fang Zheng (JIRA)
Fang Zheng created ORC-427:
--

 Summary: Fix errors in ORC C++ public API Doxygen documentation
 Key: ORC-427
 URL: https://issues.apache.org/jira/browse/ORC-427
 Project: ORC
  Issue Type: Bug
  Components: C++, documentation
Reporter: Fang Zheng


There are a few typos, grammar and formatting errors in the C++ public API 
Doxygen documentation. For example, there is a broken sentence in 
createReader() function in OrcFile.hh:

"Create a reader to the for the ORC file."

Please see the pull request for details.





[jira] [Created] (ORC-426) Errors in ORC Specification

2018-10-29 Thread Fang Zheng (JIRA)
Fang Zheng created ORC-426:
--

 Summary: Errors in ORC Specification
 Key: ORC-426
 URL: https://issues.apache.org/jira/browse/ORC-426
 Project: ORC
  Issue Type: Bug
  Components: documentation
Reporter: Fang Zheng


There are some errors in the ORC format specifications:

1. In specification/ORCv1.md and specification/ORCv2.md, the following sentence 
appears twice in the description of “Patched Base”:

Data values (W * L bits padded to the byte) - A sequence of W bit positive
  values that are added to the base value.

2. In specification/ORCv0.md, specification/ORCv1.md, and 
specification/ORCv2.md, there is an error in the description of “Map Columns”:

Maps are encoded as the PRESENT stream and a length stream with number
of items in each list.

The last word “list” should be changed to “map”.

3. In specification/ORCv1.md and specification/ORCv2.md, the word 
“BloomFilterEntry” should be changed to “bloom filter entry”, as 
“BloomFilterEntry” does not exist in the source code or ProtocolBuffer 
definition.



