[jira] [Commented] (ORC-343) Enable C++ writer to support RleV2

2018-06-14 Thread Lefty Leverenz (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513346#comment-16513346
 ] 

Lefty Leverenz commented on ORC-343:


Okay, thanks.

> Enable C++ writer to support RleV2
> ----------------------------------
>
>                 Key: ORC-343
>                 URL: https://issues.apache.org/jira/browse/ORC-343
>             Project: ORC
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Yurui Zhou
>            Priority: Major
>
> Currently only the Java implementation supports RleV2 encoding; the C++
> implementation supports only RleV2 decoding.
> This issue aims to enable the C++ writer to support RleV2 encoding.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-343) Enable C++ writer to support RleV2

2018-06-14 Thread Gang Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513316#comment-16513316
 ] 

Gang Wu commented on ORC-343:
-

[~leftylev] I don't think it is necessary. This is not a new feature; it just
brings the C++ writer to parity with the Java one.



[jira] [Commented] (ORC-343) Enable C++ writer to support RleV2

2018-06-13 Thread Lefty Leverenz (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511849#comment-16511849
 ] 

Lefty Leverenz commented on ORC-343:


Should this be documented in the wiki?



[jira] [Commented] (ORC-343) Enable C++ writer to support RleV2

2018-06-12 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16510361#comment-16510361
 ] 

ASF GitHub Bot commented on ORC-343:


Github user asfgit closed the pull request at:

https://github.com/apache/orc/pull/273




[jira] [Commented] (ORC-343) Enable C++ writer to support RleV2

2018-06-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509197#comment-16509197
 ] 

ASF GitHub Bot commented on ORC-343:


Github user wgtmac commented on the issue:

https://github.com/apache/orc/pull/273
  
Thanks everyone for reviewing this!

Does anyone have any comment? If not, I will merge tomorrow.




[jira] [Commented] (ORC-343) Enable C++ writer to support RleV2

2018-06-07 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16504701#comment-16504701
 ] 

ASF GitHub Bot commented on ORC-343:


Github user majetideepak commented on the issue:

https://github.com/apache/orc/pull/273
  
+1 LGTM




[jira] [Commented] (ORC-343) Enable C++ writer to support RleV2

2018-06-04 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16500164#comment-16500164
 ] 

ASF GitHub Bot commented on ORC-343:


Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/273#discussion_r192729045
  
--- Diff: c++/test/TestRleEncoder.cc ---
@@ -0,0 +1,243 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include 
+
+#include "MemoryOutputStream.hh"
+#include "RLEv1.hh"
+
+#include "wrap/orc-proto-wrapper.hh"
+#include "wrap/gtest-wrapper.h"
+
+namespace orc {
+
+  const int DEFAULT_MEM_STREAM_SIZE = 1024 * 1024; // 1M
+
+  void generateData(
+                     uint64_t numValues,
+                     int64_t start,
+                     int64_t delta,
+                     bool random,
+                     int64_t* data,
+                     uint64_t numNulls = 0,
+                     char* notNull = nullptr) {
+    if (numNulls != 0 && notNull != nullptr) {
+      memset(notNull, 1, numValues);
+      while (numNulls > 0) {
+        uint64_t pos = static_cast<uint64_t>(std::rand()) % numValues;
+        if (notNull[pos]) {
+          notNull[pos] = static_cast<char>(0);
+          --numNulls;
+        }
+      }
+    }
+
+    for (uint64_t i = 0; i < numValues; ++i) {
+      if (notNull == nullptr || notNull[i])
+      {
+        if (!random) {
+          data[i] = start + delta * static_cast<int64_t>(i);
+        } else {
+          data[i] = std::rand();
+        }
+      }
+    }
+  }
+
+  void decodeAndVerify(
+                       RleVersion version,
+                       const MemoryOutputStream& memStream,
+                       int64_t * data,
+                       uint64_t numValues,
+                       const char* notNull,
+                       bool isSinged) {
+    std::unique_ptr<RleDecoder> decoder = createRleDecoder(
+            std::unique_ptr<SeekableArrayInputStream>(new SeekableArrayInputStream(
+                    memStream.getData(),
+                    memStream.getLength())),
+            isSinged, version, *getDefaultPool());
+
+    int64_t* decodedData = new int64_t[numValues];
+    decoder->next(decodedData, numValues, notNull);
+
+    for (uint64_t i = 0; i < numValues; ++i) {
+      if (!notNull || notNull[i]) {
+        EXPECT_EQ(data[i], decodedData[i]);
+      }
+    }
+
+    delete [] decodedData;
+  }
+
+  std::unique_ptr<RleEncoder> getEncoder(RleVersion version,
+                                        MemoryOutputStream& memStream,
+                                        bool isSigned)
+  {
+    MemoryPool * pool = getDefaultPool();
+
+    return createRleEncoder(
+            std::unique_ptr<BufferedOutputStream>(
+                    new BufferedOutputStream(*pool, &memStream, 500 * 1024, 1024)),
+            isSigned, version, *pool, true);
--- End diff --

Can we template these tests so that they also cover `alignedBitpacking = false`?




[jira] [Commented] (ORC-343) Enable C++ writer to support RleV2

2018-06-03 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16499674#comment-16499674
 ] 

ASF GitHub Bot commented on ORC-343:


Github user yuruiz commented on a diff in the pull request:

https://github.com/apache/orc/pull/273#discussion_r192614780
  
--- Diff: c++/src/RLEv2.hh ---
@@ -25,13 +25,89 @@
 
 #include 
 
+#define MIN_REPEAT 3
+#define HIST_LEN 32
 namespace orc {
 
-class RleDecoderV2 : public RleDecoder {
+struct FixedBitSizes {
+    enum FBS {
+        ONE = 0, TWO, THREE, FOUR, FIVE, SIX, SEVEN, EIGHT, NINE, TEN, ELEVEN, TWELVE,
+        THIRTEEN, FOURTEEN, FIFTEEN, SIXTEEN, SEVENTEEN, EIGHTEEN, NINETEEN,
+        TWENTY, TWENTYONE, TWENTYTWO, TWENTYTHREE, TWENTYFOUR, TWENTYSIX,
+        TWENTYEIGHT, THIRTY, THIRTYTWO, FORTY, FORTYEIGHT, FIFTYSIX, SIXTYFOUR, SIZE
+    };
+};
+
+enum EncodingType { SHORT_REPEAT=0, DIRECT=1, PATCHED_BASE=2, DELTA=3 };
+
+struct EncodingOption {
+  EncodingType encoding;
+  int64_t fixedDelta;
+  int64_t gapVsPatchListCount;
+  int64_t zigzagLiteralsCount;
+  int64_t baseRedLiteralsCount;
+  int64_t adjDeltasCount;
+  uint32_t zzBits90p;
+  uint32_t zzBits100p;
+  uint32_t brBits95p;
+  uint32_t brBits100p;
+  uint32_t bitsDeltaMax;
+  uint32_t patchWidth;
+  uint32_t patchGapWidth;
+  uint32_t patchLength;
+  int64_t min;
+  bool isFixedDelta;
+};
+
+class RleEncoderV2 : public RleEncoder {
 public:
+    RleEncoderV2(std::unique_ptr<BufferedOutputStream> outStream, bool hasSigned, bool alignBitPacking = true);
--- End diff --

Done




[jira] [Commented] (ORC-343) Enable C++ writer to support RleV2

2018-06-03 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16499456#comment-16499456
 ] 

ASF GitHub Bot commented on ORC-343:


Github user majetideepak commented on the issue:

https://github.com/apache/orc/pull/273
  
Overall, the PR looks good to me apart from one minor change requested. This
is an important patch to align the C++ and Java implementations. Thanks again
for working on this!




[jira] [Commented] (ORC-343) Enable C++ writer to support RleV2

2018-05-31 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496146#comment-16496146
 ] 

ASF GitHub Bot commented on ORC-343:


Github user yuruiz commented on a diff in the pull request:

https://github.com/apache/orc/pull/273#discussion_r191995987
  
--- Diff: c++/src/RLE.hh ---
@@ -68,7 +76,24 @@ namespace orc {
      * record current position
      * @param recorder use the recorder to record current positions
      */
-    virtual void recordPosition(PositionRecorder* recorder) const = 0;
+    virtual void recordPosition(PositionRecorder* recorder) const;
+
+  protected:
+    std::unique_ptr<BufferedOutputStream> outputStream;
+    size_t bufferPosition;
+    size_t bufferLength;
+    size_t numLiterals;
+    int64_t* literals;
+    bool isSigned;
+    char* buffer;
--- End diff --

Added initialisation to the constructor so that it is initialised to nullptr.
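
A minimal sketch of the kind of fix described here, assuming a stripped-down stand-in for the encoder base class. The class name `EncoderBaseSketch` and its accessors are hypothetical and only the member names come from the diff. The raw pointer members are set to nullptr in the constructor's initializer list so that they never hold an indeterminate value.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical, simplified stand-in for the encoder base class in the diff.
// Only the member names are taken from the diff; everything else is assumed.
class EncoderBaseSketch {
public:
  explicit EncoderBaseSketch(bool hasSigned)
      : bufferPosition(0),
        bufferLength(0),
        numLiterals(0),
        literals(nullptr),   // raw pointers start out as nullptr
        isSigned(hasSigned),
        buffer(nullptr) {}

  // Small accessors so the initial state can be checked from outside.
  bool hasBuffer() const { return buffer != nullptr; }
  bool hasLiterals() const { return literals != nullptr; }
  bool signedValues() const { return isSigned; }
  std::size_t pendingLiterals() const { return numLiterals; }

protected:
  // Declared in the same order as the initializer list above.
  std::size_t bufferPosition;
  std::size_t bufferLength;
  std::size_t numLiterals;
  int64_t* literals;
  bool isSigned;
  char* buffer;
};
```

With this layout, code that later checks `buffer` against nullptr before use behaves predictably even if no buffer has been requested yet.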




[jira] [Commented] (ORC-343) Enable C++ writer to support RleV2

2018-05-30 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495758#comment-16495758
 ] 

ASF GitHub Bot commented on ORC-343:


Github user jamesclampffer commented on a diff in the pull request:

https://github.com/apache/orc/pull/273#discussion_r191933565
  
--- Diff: c++/src/ColumnWriter.cc ---
@@ -1675,7 +1677,9 @@ namespace orc {
   void ListColumnWriter::getColumnEncoding(
     std::vector<proto::ColumnEncoding>& encodings) const {
     proto::ColumnEncoding encoding;
-    encoding.set_kind(proto::ColumnEncoding_Kind_DIRECT);
+    encoding.set_kind(rleVersion == RleVersion_1 ?
--- End diff --

You could clean up the initializer list and avoid long-term version
mismatch issues if you add a helper function that maps RleVersion_1 to
ColumnEncoding_Kind_DIRECT and RleVersion_2 to ColumnEncoding_Kind_DIRECT_V2
and throws on any other value.
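
A rough sketch of such a helper, under stated assumptions: the two enums below are local stand-ins for the ORC and protobuf types named in the comment, and the function name `rleVersionToDirectKind` is hypothetical. The enum is given a fixed underlying type here only so that an out-of-range value is well defined when exercising the error path.

```cpp
#include <stdexcept>

// Local stand-ins (assumptions) for the enums named in the review comment.
enum RleVersion : int { RleVersion_1 = 0, RleVersion_2 = 1 };
enum ColumnEncodingKind : int {
  ColumnEncoding_Kind_DIRECT,
  ColumnEncoding_Kind_DIRECT_V2
};

// Hypothetical helper: the single place that maps an RLE version to the
// matching DIRECT encoding kind, throwing on anything it does not know.
inline ColumnEncodingKind rleVersionToDirectKind(RleVersion version) {
  switch (version) {
    case RleVersion_1:
      return ColumnEncoding_Kind_DIRECT;
    case RleVersion_2:
      return ColumnEncoding_Kind_DIRECT_V2;
  }
  throw std::invalid_argument("Invalid RLE version");
}
```

Each column writer could then call the helper instead of repeating the ternary, so a future RLE version only needs to be handled in one switch.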




[jira] [Commented] (ORC-343) Enable C++ writer to support RleV2

2018-05-30 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495757#comment-16495757
 ] 

ASF GitHub Bot commented on ORC-343:


Github user jamesclampffer commented on a diff in the pull request:

https://github.com/apache/orc/pull/273#discussion_r191935614
  
--- Diff: c++/src/RLE.hh ---
@@ -68,7 +76,24 @@ namespace orc {
      * record current position
      * @param recorder use the recorder to record current positions
      */
-    virtual void recordPosition(PositionRecorder* recorder) const = 0;
+    virtual void recordPosition(PositionRecorder* recorder) const;
+
+  protected:
+    std::unique_ptr<BufferedOutputStream> outputStream;
+    size_t bufferPosition;
+    size_t bufferLength;
+    size_t numLiterals;
+    int64_t* literals;
+    bool isSigned;
+    char* buffer;
--- End diff --

Is this initialized anywhere?




[jira] [Commented] (ORC-343) Enable C++ writer to support RleV2

2018-05-30 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495756#comment-16495756
 ] 

ASF GitHub Bot commented on ORC-343:


Github user jamesclampffer commented on a diff in the pull request:

https://github.com/apache/orc/pull/273#discussion_r191932328
  
--- Diff: c++/include/orc/Writer.hh ---
@@ -38,6 +38,11 @@ namespace orc {
 CompressionStrategy_COMPRESSION
   };
 
+  enum RleVersion {
+RleVersion_1,
--- End diff --

You may want to explicitly assign values here to ensure the addition of new 
encodings or refactoring work can't unintentionally change the underlying 
integer values.
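
As an illustration of this suggestion (a sketch only; the specific values 0 and 1 are an assumption, not taken from the diff), pinning each enumerator explicitly means that reordering or inserting entries later cannot silently renumber the existing ones:

```cpp
// Sketch: each enumerator carries an explicit value, so adding a new
// version later cannot shift the integers already in use.
enum RleVersion {
  RleVersion_1 = 0,
  RleVersion_2 = 1
};

// Compile-time checks that document the values callers may rely on.
static_assert(RleVersion_1 == 0, "RleVersion_1 must stay 0");
static_assert(RleVersion_2 == 1, "RleVersion_2 must stay 1");
```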




[jira] [Commented] (ORC-343) Enable C++ writer to support RleV2

2018-05-30 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495390#comment-16495390
 ] 

ASF GitHub Bot commented on ORC-343:


Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/273#discussion_r191832950
  
--- Diff: c++/src/Writer.cc ---
@@ -122,9 +127,17 @@ namespace orc {
   }
 
   WriterOptions& WriterOptions::setFileVersion(const FileVersion& version) {
-    // Only Hive_0_11 version is supported currently
-    if (version.getMajor() == 0 && version.getMinor() == 11) {
+    // Only Hive_0_11 and Hive_0_12 version are supported currently
+    if (version.getMajor() == 0 && (version.getMinor() == 11 || version.getMinor() == 12)) {
--- End diff --

My suggestion is to use this logic to implement 
`WriterOptions::getRleVersion()`.
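
One possible shape for that logic, sketched with hypothetical stand-in types (`FileVersionSketch` and `rleVersionFor` are illustrative names, not the ORC API): the RLE version is derived from the file version rather than stored as a separate option, with 0.11 files using RLE v1 and later versions using RLE v2.

```cpp
#include <cstdint>

enum RleVersion { RleVersion_1 = 0, RleVersion_2 = 1 };

// Hypothetical minimal file-version type; the real orc::FileVersion differs.
struct FileVersionSketch {
  uint32_t majorVersion;
  uint32_t minorVersion;
};

// Assumed mapping: Hive 0.11 files use RLE v1, anything newer uses RLE v2.
inline RleVersion rleVersionFor(const FileVersionSketch& version) {
  const bool isHive011 =
      version.majorVersion == 0 && version.minorVersion == 11;
  return isHive011 ? RleVersion_1 : RleVersion_2;
}
```

Deriving the value this way leaves the file version as the only user-facing knob, so the two settings cannot disagree.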




[jira] [Commented] (ORC-343) Enable C++ writer to support RleV2

2018-05-30 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495379#comment-16495379
 ] 

ASF GitHub Bot commented on ORC-343:


Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/273#discussion_r191830703
  
--- Diff: c++/src/Writer.cc ---
@@ -38,9 +38,10 @@ namespace orc {
 FileVersion fileVersion;
 double dictionaryKeySizeThreshold;
 bool enableIndex;
+RleVersion rleVersion;
--- End diff --

To be clear, do we need this `RleVersion` here and in the `WriterOptions`?




[jira] [Commented] (ORC-343) Enable C++ writer to support RleV2

2018-05-30 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495373#comment-16495373
 ] 

ASF GitHub Bot commented on ORC-343:


Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/273#discussion_r191829402
  
--- Diff: c++/src/Writer.cc ---
@@ -38,9 +38,10 @@ namespace orc {
 FileVersion fileVersion;
 double dictionaryKeySizeThreshold;
 bool enableIndex;
+RleVersion rleVersion;
--- End diff --

I think the file version should determine the `RleVersion`. Refer to
`isNewWriteFormat` and `isDirectV2` on the Java side.





[jira] [Commented] (ORC-343) Enable C++ writer to support RleV2

2018-05-30 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495320#comment-16495320
 ] 

ASF GitHub Bot commented on ORC-343:


Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/273#discussion_r191816749
  
--- Diff: c++/test/TestWriter.cc ---
@@ -47,7 +47,6 @@ namespace orc {
   const Type& type,
   MemoryPool* memoryPool,
   OutputStream* stream,
-  RleVersion rleVersion,
   FileVersion version = FileVersion(0, 12)){
--- End diff --

`FileVersion::v_0_12()`




[jira] [Commented] (ORC-343) Enable C++ writer to support RleV2

2018-05-30 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495317#comment-16495317
 ] 

ASF GitHub Bot commented on ORC-343:


Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/273#discussion_r191816344
  
--- Diff: c++/test/TestWriter.cc ---
@@ -139,7 +136,6 @@ namespace orc {
   *type,
   pool,
   ,
-  rleVersion,
   FileVersion(0, 11));
--- End diff --

`FileVersion::v_0_11()`




[jira] [Commented] (ORC-343) Enable C++ writer to support RleV2

2018-05-30 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495311#comment-16495311
 ] 

ASF GitHub Bot commented on ORC-343:


Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/273#discussion_r191815756
  
--- Diff: c++/test/TestWriter.cc ---
@@ -1174,5 +1170,5 @@ namespace orc {
     }
   }
 
-  INSTANTIATE_TEST_CASE_P(OrcTest, WriterTest, Values(RleVersion_1, RleVersion_2));
+  INSTANTIATE_TEST_CASE_P(OrcTest, WriterTest, Values(FileVersion::v_0_11(), FileVersion::v_0_11()));
--- End diff --

Should be `Values(FileVersion::v_0_11(), FileVersion::v_0_12()))`




[jira] [Commented] (ORC-343) Enable C++ writer to support RleV2

2018-05-29 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16494538#comment-16494538
 ] 

ASF GitHub Bot commented on ORC-343:


Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/273#discussion_r191615840
  
--- Diff: c++/src/RleEncoderV2.cc ---
@@ -0,0 +1,768 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with option work for additional information
+ * regarding copyright ownership.  The ASF licenses option file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use option file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "Adaptor.hh"
+#include "Compression.hh"
+#include "RLEv2.hh"
+#include "RLEV2Util.hh"
+
+#define MAX_LITERAL_SIZE 512
+#define MAX_SHORT_REPEAT_LENGTH 10
+
+namespace orc {
+
+/**
+ * Compute the bits required to represent pth percentile value
+ * @param data - array
+ * @param p - percentile value (>=0.0 to <=1.0)
+ * @return pth percentile bits
+ */
+uint32_t RleEncoderV2::percentileBits(int64_t* data, size_t offset, size_t length, double p, bool reuseHist) {
+    if ((p > 1.0) || (p <= 0.0)) {
+        throw InvalidArgument("Invalid p value: " + std::to_string(p));
+    }
+
+    if (!reuseHist) {
+        // histogram that store the encoded bit requirement for each values.
+        // maximum number of bits that can encoded is 32 (refer FixedBitSizes)
+        memset(histgram, 0, 32 * sizeof(int32_t));
--- End diff --

Good point! Should we just add `FixedBitSizes::SIZE` as another element and 
use it then? 
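
A small sketch of that idea, assuming the enumerator list shown in the RLEv2.hh diff elsewhere in this thread; the `HistogramSketch` wrapper is hypothetical. A trailing `SIZE` enumerator takes the value 32 automatically, so the histogram can be sized and cleared from the enum instead of a hard-coded literal.

```cpp
#include <cstdint>
#include <cstring>

struct FixedBitSizes {
  enum FBS {
    ONE = 0, TWO, THREE, FOUR, FIVE, SIX, SEVEN, EIGHT, NINE, TEN, ELEVEN,
    TWELVE, THIRTEEN, FOURTEEN, FIFTEEN, SIXTEEN, SEVENTEEN, EIGHTEEN,
    NINETEEN, TWENTY, TWENTYONE, TWENTYTWO, TWENTYTHREE, TWENTYFOUR,
    TWENTYSIX, TWENTYEIGHT, THIRTY, THIRTYTWO, FORTY, FORTYEIGHT, FIFTYSIX,
    SIXTYFOUR,
    SIZE  // sentinel: equals the number of bit-width entries above
  };
};

static_assert(FixedBitSizes::SIZE == 32, "32 fixed bit widths are expected");

// Hypothetical holder for the histogram used by percentileBits.
struct HistogramSketch {
  int32_t histgram[FixedBitSizes::SIZE];

  // Clears every slot; the size follows the enum, not a magic number.
  void reset() { std::memset(histgram, 0, sizeof(histgram)); }
};
```

If another bit width were ever added before `SIZE`, both the array and the `memset` would pick up the new length without further edits.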




[jira] [Commented] (ORC-343) Enable C++ writer to support RleV2

2018-05-29 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16493240#comment-16493240
 ] 

ASF GitHub Bot commented on ORC-343:


Github user yuruiz commented on a diff in the pull request:

https://github.com/apache/orc/pull/273#discussion_r191340586
  
--- Diff: c++/src/RleEncoderV2.cc ---
@@ -0,0 +1,768 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with option work for additional information
+ * regarding copyright ownership.  The ASF licenses option file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use option file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "Adaptor.hh"
+#include "Compression.hh"
+#include "RLEv2.hh"
+#include "RLEV2Util.hh"
+
+#define MAX_LITERAL_SIZE 512
+#define MAX_SHORT_REPEAT_LENGTH 10
+
+namespace orc {
+
+/**
+ * Compute the bits required to represent pth percentile value
+ * @param data - array
+ * @param p - percentile value (>=0.0 to <=1.0)
+ * @return pth percentile bits
+ */
+uint32_t RleEncoderV2::percentileBits(int64_t* data, size_t offset, size_t length, double p, bool reuseHist) {
+    if ((p > 1.0) || (p <= 0.0)) {
+        throw InvalidArgument("Invalid p value: " + std::to_string(p));
+    }
+
+    if (!reuseHist) {
+        // histogram that store the encoded bit requirement for each values.
+        // maximum number of bits that can encoded is 32 (refer FixedBitSizes)
+        memset(histgram, 0, 32 * sizeof(int32_t));
--- End diff --

But `FixedBitSizes::LAST` is equal to 31 here.




[jira] [Commented] (ORC-343) Enable C++ writer to support RleV2

2018-05-29 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16493228#comment-16493228
 ] 

ASF GitHub Bot commented on ORC-343:


Github user yuruiz commented on a diff in the pull request:

https://github.com/apache/orc/pull/273#discussion_r191337994
  
--- Diff: c++/src/CMakeLists.txt ---
@@ -179,15 +179,15 @@ set(SOURCE_FILES
   OrcFile.cc
   Reader.cc
   RLEv1.cc
-  RLEv2.cc
+  RleDecoderV2.cc
+  RleEncoderV2.cc
--- End diff --

I prefer to split the encoder and decoder into two files for V2 simply
because the code has grown so large that it would be very difficult to
navigate if they were combined into a single file.




[jira] [Commented] (ORC-343) Enable C++ writer to support RleV2

2018-05-29 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16493227#comment-16493227
 ] 

ASF GitHub Bot commented on ORC-343:


Github user yuruiz commented on a diff in the pull request:

https://github.com/apache/orc/pull/273#discussion_r191337473
  
--- Diff: c++/src/Writer.cc ---
@@ -38,9 +38,10 @@ namespace orc {
 FileVersion fileVersion;
 double dictionaryKeySizeThreshold;
 bool enableIndex;
+RleVersion rleVersion;
 
 WriterOptionsPrivate() :
-fileVersion(0, 11) { // default to Hive_0_11
+fileVersion(0, 12) { // default to Hive_0_12
--- End diff --

I can do that after PR #274 is checked in.




[jira] [Commented] (ORC-343) Enable C++ writer to support RleV2

2018-05-29 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16493225#comment-16493225
 ] 

ASF GitHub Bot commented on ORC-343:


Github user yuruiz commented on the issue:

https://github.com/apache/orc/pull/273
  
[benchmark.xlsx](https://github.com/apache/orc/files/2047317/benchmark.xlsx)
RleV2 benchmark




[jira] [Commented] (ORC-343) Enable C++ writer to support RleV2

2018-05-28 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16493046#comment-16493046
 ] 

ASF GitHub Bot commented on ORC-343:


Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/273#discussion_r191290786
  
--- Diff: c++/src/RleEncoderV2.cc ---
@@ -0,0 +1,768 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with option work for additional information
+ * regarding copyright ownership.  The ASF licenses option file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use option file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "Adaptor.hh"
+#include "Compression.hh"
+#include "RLEv2.hh"
+#include "RLEV2Util.hh"
+
+#define MAX_LITERAL_SIZE 512
+#define MAX_SHORT_REPEAT_LENGTH 10
+
+namespace orc {
+
+/**
+ * Compute the bits required to represent pth percentile value
+ * @param data - array
+ * @param p - percentile value (=0.0 to =1.0)
--- End diff --

`>=0.0 to <=1.0`




[jira] [Commented] (ORC-343) Enable C++ writer to support RleV2

2018-05-28 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16493047#comment-16493047
 ] 

ASF GitHub Bot commented on ORC-343:


Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/273#discussion_r191291625
  
--- Diff: c++/src/RleEncoderV2.cc ---
@@ -0,0 +1,768 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with option work for additional information
+ * regarding copyright ownership.  The ASF licenses option file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use option file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "Adaptor.hh"
+#include "Compression.hh"
+#include "RLEv2.hh"
+#include "RLEV2Util.hh"
+
+#define MAX_LITERAL_SIZE 512
+#define MAX_SHORT_REPEAT_LENGTH 10
+
+namespace orc {
+
+/**
+ * Compute the bits required to represent pth percentile value
+ * @param data - array
+ * @param p - percentile value (=0.0 to =1.0)
+ * @return pth percentile bits
+ */
+uint32_t RleEncoderV2::percentileBits(int64_t* data, size_t offset, size_t 
length, double p, bool reuseHist) {
+if ((p > 1.0) || (p <= 0.0)) {
+throw InvalidArgument("Invalid p value: " + std::to_string(p));
+}
+
+if (!reuseHist) {
+// histogram that store the encoded bit requirement for each 
values.
+// maximum number of bits that can encoded is 32 (refer 
FixedBitSizes)
+memset(histgram, 0, 32 * sizeof(int32_t));
+// compute the histogram
+for(size_t i = offset; i < (offset + length); i++) {
+uint32_t idx = encodeBitWidth(findClosestNumBits(data[i]));
+histgram[idx] += 1;
+}
+}
+
+int32_t perLen = static_cast<int32_t>(static_cast<double>(length) * (1.0 - p));
+
+// return the bits required by pth percentile length
+for(int32_t i = HIST_LEN - 1; i >= 0; i--) {
+perLen -= histgram[i];
+if (perLen < 0) {
+return decodeBitWidth(static_cast<uint32_t>(i));
+}
+}
+
--- End diff --

extra line
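
For readers following the diff, the idea behind `percentileBits` can be sketched in isolation. This is a simplified illustration, not the code from the PR: it buckets values by their exact bit count instead of going through `encodeBitWidth`/`decodeBitWidth`, and the names `bitsNeeded` and `percentileBitWidth` are hypothetical. It also assumes the valid range for `p` is (0.0, 1.0], matching the check in the diff.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Number of bits needed to represent v (0 needs 0 bits in this sketch).
inline uint32_t bitsNeeded(uint64_t v) {
  uint32_t count = 0;
  while (v != 0) {
    ++count;
    v >>= 1;
  }
  return count;
}

// Smallest bit width that covers at least a fraction p of the values.
// Walks the histogram from the widest bucket down, discarding up to
// floor(n * (1 - p)) of the largest values as outliers.
inline uint32_t percentileBitWidth(const std::vector<uint64_t>& values,
                                   double p) {
  if (!(p > 0.0 && p <= 1.0)) {
    throw std::invalid_argument("p must be in (0.0, 1.0]");
  }
  std::array<uint32_t, 65> hist{};  // one bucket per width 0..64
  for (uint64_t v : values) {
    ++hist[bitsNeeded(v)];
  }
  int64_t allowedOutliers = static_cast<int64_t>(
      static_cast<double>(values.size()) * (1.0 - p));
  for (int width = 64; width >= 0; --width) {
    allowedOutliers -= hist[static_cast<std::size_t>(width)];
    if (allowedOutliers < 0) {
      return static_cast<uint32_t>(width);
    }
  }
  return 0;  // only reached for an empty input
}
```

With the values {1, 2, 3, 1000}, `p = 1.0` needs 10 bits because 1000 must fit, while `p = 0.75` allows one outlier and drops to 2 bits.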




[jira] [Commented] (ORC-343) Enable C++ writer to support RleV2

2018-05-28 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16493044#comment-16493044
 ] 

ASF GitHub Bot commented on ORC-343:


Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/273#discussion_r191291163
  
--- Diff: c++/src/RLEv2.hh ---
@@ -25,13 +25,89 @@
 
 #include 
 
+#define MIN_REPEAT 3
+#define HIST_LEN 32
 namespace orc {
 
-class RleDecoderV2 : public RleDecoder {
+struct FixedBitSizes {
+enum FBS {
+ONE = 0, TWO, THREE, FOUR, FIVE, SIX, SEVEN, EIGHT, NINE, TEN, 
ELEVEN, TWELVE,
+THIRTEEN, FOURTEEN, FIFTEEN, SIXTEEN, SEVENTEEN, EIGHTEEN, 
NINETEEN,
+TWENTY, TWENTYONE, TWENTYTWO, TWENTYTHREE, TWENTYFOUR, TWENTYSIX,
+TWENTYEIGHT, THIRTY, THIRTYTWO, FORTY, FORTYEIGHT, FIFTYSIX, 
SIXTYFOUR
--- End diff --

can you add another element `LAST=SIXTYFOUR` towards the end?




[jira] [Commented] (ORC-343) Enable C++ writer to support RleV2

2018-05-28 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16493045#comment-16493045
 ] 

ASF GitHub Bot commented on ORC-343:


Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/273#discussion_r191291449
  
--- Diff: c++/src/RleEncoderV2.cc ---
@@ -0,0 +1,768 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with option work for additional information
+ * regarding copyright ownership.  The ASF licenses option file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use option file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "Adaptor.hh"
+#include "Compression.hh"
+#include "RLEv2.hh"
+#include "RLEV2Util.hh"
+
+#define MAX_LITERAL_SIZE 512
+#define MAX_SHORT_REPEAT_LENGTH 10
+
+namespace orc {
+
+/**
+ * Compute the bits required to represent pth percentile value
+ * @param data - array
+ * @param p - percentile value (=0.0 to =1.0)
+ * @return pth percentile bits
+ */
+uint32_t RleEncoderV2::percentileBits(int64_t* data, size_t offset, size_t 
length, double p, bool reuseHist) {
+if ((p > 1.0) || (p <= 0.0)) {
+throw InvalidArgument("Invalid p value: " + std::to_string(p));
+}
+
+if (!reuseHist) {
+// histogram that store the encoded bit requirement for each 
values.
+// maximum number of bits that can encoded is 32 (refer 
FixedBitSizes)
+memset(histgram, 0, 32 * sizeof(int32_t));
--- End diff --

Use `FixedBitSizes::LAST` instead of 32?
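
One detail worth spelling out if `LAST` is adopted: since `LAST` would alias `SIXTYFOUR`, its value is 31, so a histogram covering every enumerator needs `LAST + 1` slots. A minimal sketch, assuming the enum layout shown in the `RLEv2.hh` diff; `kHistLen` and `resetHistogram` are hypothetical names used only for illustration:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Stand-in for the enum in the diff, with the proposed LAST alias appended.
struct FixedBitSizes {
  enum FBS {
    ONE = 0, TWO, THREE, FOUR, FIVE, SIX, SEVEN, EIGHT, NINE, TEN,
    ELEVEN, TWELVE, THIRTEEN, FOURTEEN, FIFTEEN, SIXTEEN, SEVENTEEN,
    EIGHTEEN, NINETEEN, TWENTY, TWENTYONE, TWENTYTWO, TWENTYTHREE,
    TWENTYFOUR, TWENTYSIX, TWENTYEIGHT, THIRTY, THIRTYTWO, FORTY,
    FORTYEIGHT, FIFTYSIX, SIXTYFOUR,
    LAST = SIXTYFOUR
  };
};

// LAST is the index of the final bucket, so the bucket count is LAST + 1.
constexpr std::size_t kHistLen = FixedBitSizes::LAST + 1;

// Zero the whole histogram without repeating a magic number.
inline void resetHistogram(int32_t (&hist)[kHistLen]) {
  std::memset(hist, 0, sizeof(hist));
}
```

Deriving the length from the enum keeps the `memset` size and the loop bound in step if a bit width is ever added.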




[jira] [Commented] (ORC-343) Enable C++ writer to support RleV2

2018-05-28 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16492934#comment-16492934
 ] 

ASF GitHub Bot commented on ORC-343:


Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/273#discussion_r191275478
  
--- Diff: c++/src/Writer.cc ---
@@ -38,9 +38,10 @@ namespace orc {
 FileVersion fileVersion;
 double dictionaryKeySizeThreshold;
 bool enableIndex;
+RleVersion rleVersion;
 
 WriterOptionsPrivate() :
-fileVersion(0, 11) { // default to Hive_0_11
+fileVersion(0, 12) { // default to Hive_0_12
--- End diff --

We should use the static constants proposed in PR 
https://github.com/apache/orc/pull/274 moving forward.




[jira] [Commented] (ORC-343) Enable C++ writer to support RleV2

2018-05-28 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16492933#comment-16492933
 ] 

ASF GitHub Bot commented on ORC-343:


Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/273#discussion_r191269083
  
--- Diff: c++/src/CMakeLists.txt ---
@@ -179,15 +179,15 @@ set(SOURCE_FILES
   OrcFile.cc
   Reader.cc
   RLEv1.cc
-  RLEv2.cc
+  RleDecoderV2.cc
+  RleEncoderV2.cc
--- End diff --

We split the Encoder and Decoder into two files for V2 and not for V1. Can 
we combine them into a single file for V2 as well?




[jira] [Commented] (ORC-343) Enable C++ writer to support RleV2

2018-05-28 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16492932#comment-16492932
 ] 

ASF GitHub Bot commented on ORC-343:


Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/273#discussion_r191271607
  
--- Diff: c++/src/RleDecoderV2.cc ---
@@ -1,10 +1,10 @@
 /**
  * Licensed to the Apache Software Foundation (ASF) under one
  * or more contributor license agreements.  See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership.  The ASF licenses this file
+ * distributed with option work for additional information
--- End diff --

The Apache license header must not change.




[jira] [Commented] (ORC-343) Enable C++ writer to support RleV2

2018-05-28 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16492912#comment-16492912
 ] 

ASF GitHub Bot commented on ORC-343:


Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/273#discussion_r191267866
  
--- Diff: c++/include/orc/Writer.hh ---
@@ -164,6 +169,16 @@ namespace orc {
  */
 std::ostream * getErrorStream() const;
 
+/**
+ * Set the RLE version.
+ */
+WriterOptions& setRleVersion(RleVersion version);
--- End diff --

`WriterOptions& setRleVersion(const RleVersion& version);`
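
For context, the setter under discussion follows the chained-options style visible in the diff, where each setter returns a reference to the options object. The sketch below only illustrates that pattern; `SketchRleVersion` and `SketchWriterOptions` are hypothetical stand-ins, not the ORC types, and the sketch takes the enum by value (the signature suggested above, with `const RleVersion&`, chains the same way).

```cpp
// Hypothetical enum standing in for the RleVersion type in the diff.
enum class SketchRleVersion { V1, V2 };

// Hypothetical options class showing the chained-setter pattern.
class SketchWriterOptions {
 public:
  // Returning *this lets callers chain several setters in one expression.
  SketchWriterOptions& setRleVersion(SketchRleVersion version) {
    rleVersion_ = version;
    return *this;
  }

  SketchWriterOptions& setEnableIndex(bool enable) {
    enableIndex_ = enable;
    return *this;
  }

  SketchRleVersion getRleVersion() const { return rleVersion_; }
  bool getEnableIndex() const { return enableIndex_; }

 private:
  SketchRleVersion rleVersion_ = SketchRleVersion::V1;
  bool enableIndex_ = true;
};
```

A caller would then write something like `options.setRleVersion(SketchRleVersion::V2).setEnableIndex(false);`.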




[jira] [Commented] (ORC-343) Enable C++ writer to support RleV2

2018-05-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16492232#comment-16492232
 ] 

ASF GitHub Bot commented on ORC-343:


Github user yuruiz commented on a diff in the pull request:

https://github.com/apache/orc/pull/273#discussion_r191102909
  
--- Diff: c++/src/RLE.cc ---
@@ -64,4 +66,55 @@ namespace orc {
 }
   }
 
+  void RleEncoder::add(const int64_t* data, uint64_t numValues,
+ const char* notNull) {
+for (uint64_t i = 0; i < numValues; ++i) {
+  if (!notNull || notNull[i]) {
+write(data[i]);
+  }
+}
+  }
+
+  void RleEncoder::writeVslong(int64_t val) {
+writeVulong((val << 1) ^ (val >> 63));
+  }
+
+  void RleEncoder::writeVulong(int64_t val) {
+while (true) {
+  if ((val & ~0x7f) == 0) {
+writeByte(static_cast<char>(val));
+return;
+  } else {
+writeByte(static_cast<char>(0x80 | (val & 0x7f)));
+// cast val to unsigned so as to force 0-fill right shift
+val = (static_cast<uint64_t>(val) >> 7);
+  }
+}
+  }
+
+  void RleEncoder::writeByte(char c) {
+if (bufferPosition == bufferLength) {
+  int addedSize = 0;
+  if (!outputStream->Next(reinterpret_cast<void**>(&buffer), &addedSize)) {
+throw std::bad_alloc();
+  }
+  bufferPosition = 0;
+  bufferLength = static_cast<size_t>(addedSize);
+}
+buffer[bufferPosition++] = c;
+  }
+
+  void RleEncoder::recordPosition(PositionRecorder* recorder) const {
--- End diff --

This method has existed for a while; removing it would require wide-ranging 
refactoring, which is not the purpose of this PR.
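
As background on the `writeVslong`/`writeVulong` pair in the diff above: the first applies a zigzag mapping so that small negative numbers become small unsigned numbers, and the second emits the result seven bits at a time with a continuation bit. A standalone sketch of the same technique, writing into a `std::vector` instead of the encoder's output stream; the function names here are hypothetical and this is not the PR's code:

```cpp
#include <cstdint>
#include <vector>

// Zigzag mapping: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
// Written without a signed right shift so the result does not depend on
// how the compiler shifts negative values.
inline uint64_t zigzagEncode(int64_t val) {
  const uint64_t signMask = val < 0 ? ~uint64_t{0} : uint64_t{0};
  return (static_cast<uint64_t>(val) << 1) ^ signMask;
}

// Base-128 varint: low 7 bits first, high bit set on every byte but the last.
inline void appendVarint(std::vector<uint8_t>& out, uint64_t val) {
  while (val >= 0x80) {
    out.push_back(static_cast<uint8_t>(0x80 | (val & 0x7f)));
    val >>= 7;
  }
  out.push_back(static_cast<uint8_t>(val));
}

// Signed variant: zigzag first, then the unsigned varint.
inline void appendSignedVarint(std::vector<uint8_t>& out, int64_t val) {
  appendVarint(out, zigzagEncode(val));
}
```

For example, the unsigned value 300 is emitted as the two bytes `0xac 0x02`, and the signed value -1 as the single byte `0x01`.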




[jira] [Commented] (ORC-343) Enable C++ writer to support RleV2

2018-05-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16491059#comment-16491059
 ] 

ASF GitHub Bot commented on ORC-343:


Github user majetideepak commented on the issue:

https://github.com/apache/orc/pull/273
  
@xndai  and @yuruiz  thanks for contributing this code. I will take a look 
at this.




[jira] [Commented] (ORC-343) Enable C++ writer to support RleV2

2018-05-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16490972#comment-16490972
 ] 

ASF GitHub Bot commented on ORC-343:


Github user xndai commented on the issue:

https://github.com/apache/orc/pull/273
  
@majetideepak this is the RLEv2 change that was promised.

@yuruiz Could you also include some perf data obtained from offline testing?




[jira] [Commented] (ORC-343) Enable C++ writer to support RleV2

2018-05-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16490840#comment-16490840
 ] 

ASF GitHub Bot commented on ORC-343:


Github user rip-nsk commented on a diff in the pull request:

https://github.com/apache/orc/pull/273#discussion_r190921533
  
--- Diff: c++/src/RLEV2Util.hh ---
@@ -0,0 +1,145 @@
+/**
+* Licensed to the Apache Software Foundation (ASF) under one
+* or more contributor license agreements.  See the NOTICE file
+* distributed with this work for additional information
+* regarding copyright ownership.  The ASF licenses this file
+* to you under the Apache License, Version 2.0 (the
+* "License"); you may not use this file except in compliance
+* with the License.  You may obtain a copy of the License at
+*
+* http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+#ifndef ORC_RLEV2UTIL_HH
+#define ORC_RLEV2UTIL_HH
+
+#include "RLEv2.hh"
+
+namespace orc {
+  inline uint32_t decodeBitWidth(uint32_t n) {
+if (n <= FixedBitSizes::TWENTYFOUR) {
+  return n + 1;
+} else if (n == FixedBitSizes::TWENTYSIX) {
+  return 26;
+} else if (n == FixedBitSizes::TWENTYEIGHT) {
+  return 28;
+} else if (n == FixedBitSizes::THIRTY) {
+  return 30;
+} else if (n == FixedBitSizes::THIRTYTWO) {
+  return 32;
+} else if (n == FixedBitSizes::FORTY) {
+  return 40;
+} else if (n == FixedBitSizes::FORTYEIGHT) {
+  return 48;
+} else if (n == FixedBitSizes::FIFTYSIX) {
+  return 56;
+} else {
+  return 64;
+}
+  }
+
+  inline uint32_t getClosestFixedBits(uint32_t n) {
+if (n == 0) {
+  return 1;
+}
+
+if (n >= 1 && n <= 24) {
+  return n;
+} else if (n <= 26) {
+  return 26;
+} else if (n <= 28) {
+  return 28;
+} else if (n <= 30) {
+  return 30;
+} else if (n <= 32) {
+  return 32;
+} else if (n <= 40) {
+  return 40;
+} else if (n <= 48) {
+  return 48;
+} else if (n <= 56) {
+  return 56;
+} else {
+  return 64;
+}
+  }
+
+  inline uint32_t getClosestAlignedFixedBits(uint32_t n) {
+if (n == 0 ||  n == 1) {
+  return 1;
+} else if (n <= 2) {
+  return 2;
+} else if (n <= 4) {
+  return 4;
+} else if (n <= 8) {
+  return 8;
+} else if (n <= 16) {
+  return 16;
+} else if (n <= 24) {
+  return 24;
+} else if (n <= 32) {
+  return 32;
+} else if (n <= 40) {
+  return 40;
+} else if (n <= 48) {
+  return 48;
+} else if (n <= 56) {
+  return 56;
+} else {
+  return 64;
+}
+  }
+
+  inline uint32_t encodeBitWidth(uint32_t n) {
+n = getClosestFixedBits(n);
+
+if (n >= 1 && n <= 24) {
+  return n - 1;
+} else if (n <= 26) {
+  return FixedBitSizes::TWENTYSIX;
+} else if (n <= 28) {
+  return FixedBitSizes::TWENTYEIGHT;
+} else if (n <= 30) {
+  return FixedBitSizes::THIRTY;
+} else if (n <= 32) {
+  return FixedBitSizes::THIRTYTWO;
+} else if (n <= 40) {
+  return FixedBitSizes::FORTY;
+} else if (n <= 48) {
+  return FixedBitSizes::FORTYEIGHT;
+} else if (n <= 56) {
+  return FixedBitSizes::FIFTYSIX;
+} else {
+  return FixedBitSizes::SIXTYFOUR;
+}
+  }
+
+  inline uint32_t findClosestNumBits(int64_t value) {
+if (value < 0) {
+  return getClosestFixedBits(64);
+}
+
+uint32_t count = 0;
+while (value != 0) {
+  count++;
+  value = value >> 1;
+}
+return getClosestFixedBits(count);
+  }
+
+  inline bool isSafeSubtract(long left, long right) {
--- End diff --

s/long/int64_t
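
The motivation for the substitution: on 64-bit Windows `long` is 32 bits wide, so a helper declared with `long` would truncate `int64_t` arguments there. The body of `isSafeSubtract` is not shown in the diff, so the following is only one possible way to write such a check with `int64_t`, not the PR's implementation. It tests the bounds before subtracting, because signed overflow is undefined behaviour in C++.

```cpp
#include <cstdint>
#include <limits>

// Returns true when left - right fits in int64_t.
// The comparison is rearranged so that no intermediate result can overflow.
inline bool isSafeSubtract(int64_t left, int64_t right) {
  if (right > 0) {
    // left - right could only underflow; min + right cannot overflow here.
    return left >= std::numeric_limits<int64_t>::min() + right;
  }
  // right <= 0: left - right could only overflow; max + right cannot overflow.
  return left <= std::numeric_limits<int64_t>::max() + right;
}
```

With this form, `isSafeSubtract(INT64_MIN, 1)` is false, and values beyond the 32-bit range, such as 5000000000 and -5000000000, are handled the same on every platform.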



[jira] [Commented] (ORC-343) Enable C++ writer to support RleV2

2018-05-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16490839#comment-16490839
 ] 

ASF GitHub Bot commented on ORC-343:


Github user rip-nsk commented on a diff in the pull request:

https://github.com/apache/orc/pull/273#discussion_r190921029
  
--- Diff: c++/src/RleEncoderV2.cc ---
@@ -0,0 +1,768 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with option work for additional information
+ * regarding copyright ownership.  The ASF licenses option file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use option file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "Adaptor.hh"
+#include "Compression.hh"
+#include "RLEv2.hh"
+#include "RLEV2Util.hh"
+
+#define MAX_LITERAL_SIZE 512
+#define MAX_SHORT_REPEAT_LENGTH 10
+
+namespace orc {
+
+/**
+ * Compute the bits required to represent pth percentile value
+ * @param data - array
+ * @param p - percentile value (=0.0 to =1.0)
+ * @return pth percentile bits
+ */
+uint32_t RleEncoderV2::percentileBits(int64_t* data, size_t offset, size_t 
length, double p, bool reuseHist) {
+if ((p > 1.0) || (p <= 0.0)) {
+throw InvalidArgument("Invalid p value: " + std::to_string(p));
+}
+
+if (!reuseHist) {
+// histogram that store the encoded bit requirement for each 
values.
+// maximum number of bits that can encoded is 32 (refer 
FixedBitSizes)
+memset(histgram, 0, 32 * sizeof(int32_t));
+// compute the histogram
+for(size_t i = offset; i < (offset + length); i++) {
+uint32_t idx = encodeBitWidth(findClosestNumBits(data[i]));
+histgram[idx] += 1;
+}
+}
+
+int32_t perLen = static_cast<int32_t>(static_cast<double>(length) * (1.0 - p));
+
+// return the bits required by pth percentile length
+for(int32_t i = HIST_LEN - 1; i >= 0; i--) {
+perLen -= histgram[i];
+if (perLen < 0) {
+return decodeBitWidth(static_cast<uint32_t>(i));
+}
+}
+
+return 0;
+}
+
+RleEncoderV2::RleEncoderV2(std::unique_ptr outStream,
+   bool hasSigned, bool alignBitPacking) :
+RleEncoder(std::move(outStream), hasSigned),
+alignedBitPacking(alignBitPacking),
+prevDelta(0){
+literals = new int64_t[MAX_LITERAL_SIZE];
+gapVsPatchList = new int64_t[MAX_LITERAL_SIZE];
+zigzagLiterals = new int64_t[MAX_LITERAL_SIZE];
+baseRedLiterals = new int64_t[MAX_LITERAL_SIZE];
+adjDeltas = new int64_t[MAX_LITERAL_SIZE];
+}
+
+void RleEncoderV2::write(int64_t val) {
+if(numLiterals == 0) {
+initializeLiterals(val);
+return;
+}
+
+if(numLiterals == 1) {
+prevDelta = val - literals[0];
+literals[numLiterals++] = val;
+
+if(val == literals[0]) {
+fixedRunLength = 2;
+variableRunLength = 0;
+} else {
+fixedRunLength = 0;
+variableRunLength = 2;
+}
+return;
+}
+
+int64_t currentDelta = val - literals[numLiterals - 1];
+EncodingOption option = {};
+if (prevDelta == 0 && currentDelta == 0) {
+// case 1: fixed delta run
+literals[numLiterals++] = val;
+
+if (variableRunLength > 0) {
+// if variable run is non-zero then we are seeing repeating
+// values at the end of variable run in which case fixed Run
+// length is 2
+fixedRunLength = 2;
+}
+fixedRunLength++;
+
+// if fixed run met the minimum condition and if variable
+// run is non-zero then flush the variable run and shift the
+// tail fixed runs to start of the buffer
+if (fixedRunLength >= MIN_REPEAT && variableRunLength > 0) {
+numLiterals -= MIN_REPEAT;
+variableRunLength -= (MIN_REPEAT - 1);
+
+int64_t tailVals[MIN_REPEAT] = {0};
+
+  

[jira] [Commented] (ORC-343) Enable C++ writer to support RleV2

2018-05-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16490838#comment-16490838
 ] 

ASF GitHub Bot commented on ORC-343:


Github user rip-nsk commented on a diff in the pull request:

https://github.com/apache/orc/pull/273#discussion_r190920599
  
--- Diff: c++/src/RleEncoderV2.cc ---
@@ -0,0 +1,768 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with option work for additional information
+ * regarding copyright ownership.  The ASF licenses option file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use option file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "Adaptor.hh"
+#include "Compression.hh"
+#include "RLEv2.hh"
+#include "RLEV2Util.hh"
+
+#define MAX_LITERAL_SIZE 512
+#define MAX_SHORT_REPEAT_LENGTH 10
+
+namespace orc {
+
+/**
+ * Compute the bits required to represent pth percentile value
+ * @param data - array
+ * @param p - percentile value (=0.0 to =1.0)
+ * @return pth percentile bits
+ */
+uint32_t RleEncoderV2::percentileBits(int64_t* data, size_t offset, size_t 
length, double p, bool reuseHist) {
+if ((p > 1.0) || (p <= 0.0)) {
+throw InvalidArgument("Invalid p value: " + std::to_string(p));
+}
+
+if (!reuseHist) {
+// histogram that store the encoded bit requirement for each 
values.
+// maximum number of bits that can encoded is 32 (refer 
FixedBitSizes)
+memset(histgram, 0, 32 * sizeof(int32_t));
+// compute the histogram
+for(size_t i = offset; i < (offset + length); i++) {
+uint32_t idx = encodeBitWidth(findClosestNumBits(data[i]));
+histgram[idx] += 1;
+}
+}
+
+int32_t perLen = static_cast<int32_t>(static_cast<double>(length) * (1.0 - p));
+
+// return the bits required by pth percentile length
+for(int32_t i = HIST_LEN - 1; i >= 0; i--) {
+perLen -= histgram[i];
+if (perLen < 0) {
+return decodeBitWidth(static_cast<uint32_t>(i));
+}
+}
+
+return 0;
+}
+
+RleEncoderV2::RleEncoderV2(std::unique_ptr outStream,
+   bool hasSigned, bool alignBitPacking) :
+RleEncoder(std::move(outStream), hasSigned),
+alignedBitPacking(alignBitPacking),
+prevDelta(0){
+literals = new int64_t[MAX_LITERAL_SIZE];
+gapVsPatchList = new int64_t[MAX_LITERAL_SIZE];
+zigzagLiterals = new int64_t[MAX_LITERAL_SIZE];
+baseRedLiterals = new int64_t[MAX_LITERAL_SIZE];
+adjDeltas = new int64_t[MAX_LITERAL_SIZE];
+}
+
+void RleEncoderV2::write(int64_t val) {
+if(numLiterals == 0) {
+initializeLiterals(val);
+return;
+}
+
+if(numLiterals == 1) {
+prevDelta = val - literals[0];
+literals[numLiterals++] = val;
+
+if(val == literals[0]) {
+fixedRunLength = 2;
+variableRunLength = 0;
+} else {
+fixedRunLength = 0;
+variableRunLength = 2;
+}
+return;
+}
+
+int64_t currentDelta = val - literals[numLiterals - 1];
+EncodingOption option = {};
+if (prevDelta == 0 && currentDelta == 0) {
+// case 1: fixed delta run
+literals[numLiterals++] = val;
+
+if (variableRunLength > 0) {
+// if variable run is non-zero then we are seeing repeating
+// values at the end of variable run in which case fixed Run
+// length is 2
+fixedRunLength = 2;
+}
+fixedRunLength++;
+
+// if fixed run met the minimum condition and if variable
+// run is non-zero then flush the variable run and shift the
+// tail fixed runs to start of the buffer
+if (fixedRunLength >= MIN_REPEAT && variableRunLength > 0) {
+numLiterals -= MIN_REPEAT;
+variableRunLength -= (MIN_REPEAT - 1);
+
+int64_t tailVals[MIN_REPEAT] = {0};
+
+  

[jira] [Commented] (ORC-343) Enable C++ writer to support RleV2

2018-05-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16490837#comment-16490837
 ] 

ASF GitHub Bot commented on ORC-343:


Github user rip-nsk commented on the issue:

https://github.com/apache/orc/pull/273
  
  C:\projects\orc\c++\src\RleEncoderV2.cc(187): warning C4334: '<<': result 
of 32-bit shift implicitly converted to 64 bits (was 64-bit shift intended?) 
[C:\projects\orc\build\c++\src\orc.vcxproj]
  C:\projects\orc\c++\src\RleEncoderV2.cc(203): warning C4334: '<<': result 
of 32-bit shift implicitly converted to 64 bits (was 64-bit shift intended?) 
[C:\projects\orc\build\c++\src\orc.vcxproj]
  C:\projects\orc\c++\src\RleEncoderV2.cc(334): warning C4244: 'argument': 
conversion from 'int64_t' to 'long', possible loss of data 
[C:\projects\orc\build\c++\src\orc.vcxproj]
  C:\projects\orc\c++\src\RleEncoderV2.cc(735): warning C4244: 
'initializing': conversion from 'int64_t' to 'long', possible
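
For context, C4334 is MSVC's warning for a shift whose left operand is a 32-bit int while the result is used as a 64-bit value, and C4244 flags an implicit narrowing such as int64_t to long (long is 32 bits under MSVC). The sketch below shows one common way to address each. `lowBitsMask` and `narrowToLong` are hypothetical helpers for illustration only, not code from RleEncoderV2.cc, and the right fix at the flagged lines may differ.

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>

// Hypothetical helper: mask with the low `bits` bits set, for 0 < bits < 64.
inline uint64_t lowBitsMask(uint32_t bits) {
  // `1 << bits` would shift a 32-bit int and only then widen the result,
  // which is the pattern C4334 reports. Widening the operand first keeps
  // the whole shift in 64 bits.
  return (uint64_t{1} << bits) - 1;
}

// Hypothetical helper: narrow a 64-bit value for an API that takes `long`.
// The range check and explicit cast make the narrowing that C4244 reports
// intentional and safe.
inline long narrowToLong(int64_t v) {
  if (v < std::numeric_limits<long>::min() ||
      v > std::numeric_limits<long>::max()) {
    throw std::out_of_range("value does not fit in long");
  }
  return static_cast<long>(v);
}
```

Where the callee can be changed, taking int64_t directly avoids the narrowing altogether.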




[jira] [Commented] (ORC-343) Enable C++ writer to support RleV2

2018-05-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16490300#comment-16490300
 ] 

ASF GitHub Bot commented on ORC-343:


Github user wgtmac commented on a diff in the pull request:

https://github.com/apache/orc/pull/273#discussion_r190801085
  
--- Diff: c++/src/RLE.hh ---
@@ -68,7 +76,24 @@ namespace orc {
  * record current position
  * @param recorder use the recorder to record current positions
  */
-virtual void recordPosition(PositionRecorder* recorder) const = 0;
+virtual void recordPosition(PositionRecorder* recorder) const;
+
+  protected:
+std::unique_ptr<BufferedOutputStream> outputStream;
+size_t bufferPosition;
+size_t bufferLength;
+size_t numLiterals;
+int64_t* literals;
+bool isSigned;
+char* buffer;
+
+virtual void write(int64_t val) = 0;
+
+virtual void writeByte(char c);
+
+virtual void writeVulong(int64_t val);
+
+virtual void writeVslong(int64_t val);  protected:
--- End diff --

remove protected
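
As a rough illustration of the shape this header change gives the encoder (a base class that owns the shared loop and leaves `write` to each RLE version), here is a minimal sketch. `EncoderSketch` and `CollectingEncoder` are hypothetical names, and the buffering members from the diff are left out. The null-mask loop follows the `add` function in the accompanying RLE.cc change.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the encoder base class: it skips nulls and
// forwards every non-null value to the subclass-specific write().
class EncoderSketch {
 public:
  virtual ~EncoderSketch() = default;

  // notNull may be nullptr, meaning every value is present.
  void add(const int64_t* data, uint64_t numValues, const char* notNull) {
    for (uint64_t i = 0; i < numValues; ++i) {
      if (!notNull || notNull[i]) {
        write(data[i]);
      }
    }
  }

 protected:
  // Each encoding (RLE v1, RLE v2, ...) would supply its own write().
  virtual void write(int64_t val) = 0;
};

// Toy subclass that records what it is asked to encode.
class CollectingEncoder : public EncoderSketch {
 public:
  std::vector<int64_t> seen;

 protected:
  void write(int64_t val) override { seen.push_back(val); }
};
```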




[jira] [Commented] (ORC-343) Enable C++ writer to support RleV2

2018-05-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16490301#comment-16490301
 ] 

ASF GitHub Bot commented on ORC-343:


Github user wgtmac commented on a diff in the pull request:

https://github.com/apache/orc/pull/273#discussion_r190800936
  
--- Diff: c++/src/RLE.cc ---
@@ -64,4 +66,55 @@ namespace orc {
 }
   }
 
+  void RleEncoder::add(const int64_t* data, uint64_t numValues,
+ const char* notNull) {
+for (uint64_t i = 0; i < numValues; ++i) {
+  if (!notNull || notNull[i]) {
+write(data[i]);
+  }
+}
+  }
+
+  void RleEncoder::writeVslong(int64_t val) {
+writeVulong((val << 1) ^ (val >> 63));
+  }
+
+  void RleEncoder::writeVulong(int64_t val) {
+while (true) {
+  if ((val & ~0x7f) == 0) {
+writeByte(static_cast<char>(val));
+return;
+  } else {
+writeByte(static_cast<char>(0x80 | (val & 0x7f)));
+// cast val to unsigned so as to force 0-fill right shift
+val = (static_cast<uint64_t>(val) >> 7);
+  }
+}
+  }
+
+  void RleEncoder::writeByte(char c) {
+if (bufferPosition == bufferLength) {
+  int addedSize = 0;
+  if (!outputStream->Next(reinterpret_cast<void**>(&buffer), &addedSize)) {
+throw std::bad_alloc();
+  }
+  bufferPosition = 0;
+  bufferLength = static_cast<size_t>(addedSize);
+}
+buffer[bufferPosition++] = c;
+  }
+
+  void RleEncoder::recordPosition(PositionRecorder* recorder) const {
--- End diff --

We haven't added support for writing the index stream yet. Remove this 
function for now; it should go in a separate change.
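
For readers following the diff, writeVslong and writeVulong combine zigzag mapping with a base-128 varint. Below is a standalone sketch of the same technique that writes into a std::vector instead of the encoder's output buffer. `zigzag`, `appendVulong` and `appendVslong` are illustrative names, not part of the patch.

```cpp
#include <cstdint>
#include <vector>

// Zigzag-map a signed value so small magnitudes become small unsigned
// values: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
// The sign mask is built explicitly rather than with `val >> 63`, so the
// result does not depend on how the compiler shifts negative numbers.
inline uint64_t zigzag(int64_t val) {
  const uint64_t signMask = val < 0 ? ~uint64_t{0} : uint64_t{0};
  return (static_cast<uint64_t>(val) << 1) ^ signMask;
}

// Append `val` as a base-128 varint: 7 payload bits per byte, lowest bits
// first, with the high bit set on every byte except the last.
inline void appendVulong(std::vector<uint8_t>& out, uint64_t val) {
  while (val >= 0x80) {
    out.push_back(static_cast<uint8_t>(0x80 | (val & 0x7f)));
    val >>= 7;  // unsigned shift, so the top bits are zero-filled
  }
  out.push_back(static_cast<uint8_t>(val));
}

// Signed variant: zigzag first, then varint.
inline void appendVslong(std::vector<uint8_t>& out, int64_t val) {
  appendVulong(out, zigzag(val));
}
```

With this sketch, 300 encodes to the two bytes 0xAC 0x02, and -1 encodes to the single byte 0x01.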




[jira] [Commented] (ORC-343) Enable C++ writer to support RleV2

2018-05-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16490302#comment-16490302
 ] 

ASF GitHub Bot commented on ORC-343:


Github user wgtmac commented on a diff in the pull request:

https://github.com/apache/orc/pull/273#discussion_r190801522
  
--- Diff: c++/src/CMakeLists.txt ---
@@ -179,15 +179,17 @@ set(SOURCE_FILES
   OrcFile.cc
   Reader.cc
   RLEv1.cc
-  RLEv2.cc
+  RLEv2.hh
--- End diff --

remove these two .hh files.




[jira] [Commented] (ORC-343) Enable C++ writer to support RleV2

2018-05-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16490196#comment-16490196
 ] 

ASF GitHub Bot commented on ORC-343:


GitHub user yuruiz opened a pull request:

https://github.com/apache/orc/pull/273

ORC-343 Enable C++ writer to support RleV2

1. Port RleV2 implementation from Java to C++
2. Add RleV2 relevant tests to C++

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yuruiz/orc dev/yuruiz/RLEv2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/orc/pull/273.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #273


commit 3fba0988cbd7f324c5d21ae67f8cd50156957886
Author: Yurui Zhou 
Date:   2018-03-09T05:23:37Z

ORC-343 Enable C++ writer to support RleV2
1. Port RleV2 implementation from Java to C++
2. Add RleV2 relevant tests to C++



