[GitHub] orc issue #301: ORC-395: Support ZSTD in C++ writer/reader

2018-09-05 Thread majetideepak
Github user majetideepak commented on the issue:

https://github.com/apache/orc/pull/301
  
@wgtmac  thanks for these changes! I will take a look at both patches by 
end of tomorrow.


---


[GitHub] orc issue #301: ORC-395: Support ZSTD in C++ writer/reader

2018-08-19 Thread majetideepak
Github user majetideepak commented on the issue:

https://github.com/apache/orc/pull/301
  
@xndai I am not very familiar with the Java side. Looking at the ZSTD 
support in parquet-mr, it looks like they only added the ZSTD support but 
skipped the tests. The parquet-mr library has not upgraded the Hadoop library 
either.

https://github.com/apache/parquet-mr/blob/9fa86cca1af7dabc21701247efd89f6085945bd2/pom.xml#L80
 
https://github.com/apache/parquet-mr/commit/132b2a8c553bdcfd445e88680beac6f225c50ac4#diff-6a038e86a0fc009909af954b3589cd95R159
Can we do something like that here?


---


[GitHub] orc issue #301: ORC-395: Support ZSTD in C++ writer/reader

2018-08-17 Thread majetideepak
Github user majetideepak commented on the issue:

https://github.com/apache/orc/pull/301
  
@wgtmac  I will take a look at this today. Do you know what the current 
behavior is on the Java side with this compression format?


---


[GitHub] orc pull request #300: [ORC-394][C++] Add addUserMetadata() function to C++ ...

2018-08-10 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/300#discussion_r209222998
  
--- Diff: c++/test/TestWriter.cc ---
@@ -1170,5 +1170,52 @@ namespace orc {
 }
   }
 
+  TEST_P(WriterTest, writeUserMetadata) {
--- End diff --

can you add the user metadata check to an existing test?


---


[GitHub] orc pull request #296: [ORC-391][c++] parseType does not accept underscore i...

2018-07-30 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/296#discussion_r206153654
  
--- Diff: tools/test/TestCSVFileImport.cc ---
@@ -53,3 +53,33 @@ TEST (TestCSVFileImport, test10rows) {
   EXPECT_EQ(expected, output);
   EXPECT_EQ("", error);
 }
+
+TEST (TestCSVFileImport, test10rows_underscore) {
+  // create an ORC file from importing the CSV file
+  const std::string pgm1 = findProgram("tools/src/csv-import");
+  const std::string csvFile = 
findExample("TestCSVFileImport.test10rows.csv");
+  const std::string orcFile = "/tmp/test_csv_import_test_10_rows.orc";
+  const std::string schema = "struct<_a:bigint,b_:string,c:double>";
--- End diff --

Thanks for the test! Can you rename `c` to `c_col`?


---


[GitHub] orc issue #293: ORC-388: Fix isSafeSubtract to use logic operator instead of...

2018-07-29 Thread majetideepak
Github user majetideepak commented on the issue:

https://github.com/apache/orc/pull/293
  
+1 LGTM.
The logical operator also enables short-circuit evaluation.
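
To illustrate the short-circuit point, here is a minimal C++ sketch of an overflow check for `left - right`. It is not the ORC source: the function body, the unsigned-arithmetic trick, and the comparison against a non-short-circuiting operator such as bitwise `|` are assumptions made for illustration.

```cpp
#include <cstdint>

// Hypothetical overflow check for `left - right` (illustrative, not ORC code).
// The subtraction is done on unsigned values so wrap-around is well defined.
inline bool isSafeSubtract(int64_t left, int64_t right) {
  const uint64_t diff =
      static_cast<uint64_t>(left) - static_cast<uint64_t>(right);
  // Operands with the same sign can never overflow on subtraction.
  const bool sameSign = (left ^ right) >= 0;
  // With `||`, the second test runs only when the signs differ; a bitwise
  // `|` would always evaluate both sides.
  return sameSign || ((left ^ static_cast<int64_t>(diff)) >= 0);
}
```

The result is the same with either operator; the logical form just skips the second comparison in the common same-sign case and states the intent more directly.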


---


[GitHub] orc issue #296: [c++] column/field name can take underline

2018-07-29 Thread majetideepak
Github user majetideepak commented on the issue:

https://github.com/apache/orc/pull/296
  
JIRA name and a test please!


---


[GitHub] orc issue #285: ORC-371 Disable libhdfspp build if dependencies are missing

2018-06-27 Thread majetideepak
Github user majetideepak commented on the issue:

https://github.com/apache/orc/pull/285
  
+1 LGTM. I will check this in by end of today! Thanks.


---


[GitHub] orc issue #289: ORC-384 fix memory leak when loading non-ORC files

2018-06-27 Thread majetideepak
Github user majetideepak commented on the issue:

https://github.com/apache/orc/pull/289
  
+1 LGTM. I will check this in by end of today.


---


[GitHub] orc issue #275: ORC-371: [C++] Disable Libhdfspp build when Cyrus SASL is no...

2018-06-18 Thread majetideepak
Github user majetideepak commented on the issue:

https://github.com/apache/orc/pull/275
  
Thanks for reporting this bug. I will push a patch to fix this.


---


[GitHub] orc issue #282: ORC-377: [c++] Add SnappyCompressionStream

2018-06-13 Thread majetideepak
Github user majetideepak commented on the issue:

https://github.com/apache/orc/pull/282
  
Can you add some tests for this?


---


[GitHub] orc issue #273: ORC-343 Enable C++ writer to support RleV2

2018-06-07 Thread majetideepak
Github user majetideepak commented on the issue:

https://github.com/apache/orc/pull/273
  
+1 LGTM


---


[GitHub] orc pull request #277: ORC-372: Enable valgrind for C++ travis-ci tests

2018-06-05 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/277#discussion_r193159979
  
--- Diff: c++/src/Compression.cc ---
@@ -199,13 +199,16 @@ namespace orc {
   uint64_t blockSize,
   MemoryPool& pool);
 
+~ZlibCompressionStream() { end(); }
--- End diff --

Thanks!


---


[GitHub] orc pull request #277: ORC-372: Enable valgrind for C++ travis-ci tests

2018-06-04 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/277#discussion_r192919379
  
--- Diff: c++/src/TypeImpl.cc ---
@@ -258,31 +258,34 @@ namespace orc {
 
 case STRUCT: {
   StructVectorBatch *result = new StructVectorBatch(capacity, 
memoryPool);
+  std::unique_ptr return_value = 
std::unique_ptr(result);
--- End diff --

we need `result` of type `StructVectorBatch` to access fields on line 263
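
A minimal sketch of the pattern being asked for, assuming simplified stand-in classes (the real ORC batch types carry more state and take a memory pool): the typed raw pointer is kept for member access while a `unique_ptr` to the base class takes ownership right away.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Stand-in types for illustration only.
struct ColumnVectorBatch {
  explicit ColumnVectorBatch(uint64_t cap) : capacity(cap) {}
  virtual ~ColumnVectorBatch() = default;
  uint64_t capacity;
};

struct StructVectorBatch : ColumnVectorBatch {
  using ColumnVectorBatch::ColumnVectorBatch;
  std::vector<std::unique_ptr<ColumnVectorBatch>> fields;
};

// Hypothetical helper: builds a struct batch with `numFields` children.
std::unique_ptr<ColumnVectorBatch> createStructBatch(uint64_t capacity,
                                                     std::size_t numFields) {
  // Keep the derived-type pointer so `fields` stays reachable...
  StructVectorBatch* result = new StructVectorBatch(capacity);
  // ...while the base-typed unique_ptr owns the object, so nothing leaks
  // if a child allocation below throws.
  std::unique_ptr<ColumnVectorBatch> owner(result);
  for (std::size_t i = 0; i < numFields; ++i) {
    result->fields.push_back(
        std::unique_ptr<ColumnVectorBatch>(new ColumnVectorBatch(capacity)));
  }
  return owner;
}
```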


---


[GitHub] orc issue #277: ORC-372: Enable valgrind for C++ travis-ci tests

2018-06-04 Thread majetideepak
Github user majetideepak commented on the issue:

https://github.com/apache/orc/pull/277
  
@wgtmac, @xndai  can you take a look at this patch? ZLIB compression code 
seems to be leaking memory. Thanks!


---


[GitHub] orc pull request #273: ORC-343 Enable C++ writer to support RleV2

2018-06-04 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/273#discussion_r192729045
  
--- Diff: c++/test/TestRleEncoder.cc ---
@@ -0,0 +1,243 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include 
+
+#include "MemoryOutputStream.hh"
+#include "RLEv1.hh"
+
+#include "wrap/orc-proto-wrapper.hh"
+#include "wrap/gtest-wrapper.h"
+
+namespace orc {
+
+  const int DEFAULT_MEM_STREAM_SIZE = 1024 * 1024; // 1M
+
+  void generateData(
+ uint64_t numValues,
+ int64_t start,
+ int64_t delta,
+ bool random,
+ int64_t* data,
+ uint64_t numNulls = 0,
+ char* notNull = nullptr) {
+if (numNulls != 0 && notNull != nullptr) {
+  memset(notNull, 1, numValues);
+  while (numNulls > 0) {
+uint64_t pos = static_cast(std::rand()) % numValues;
+if (notNull[pos]) {
+  notNull[pos] = static_cast(0);
+  --numNulls;
+}
+  }
+}
+
+for (uint64_t i = 0; i < numValues; ++i) {
+  if (notNull == nullptr || notNull[i])
+  {
+if (!random) {
+  data[i] = start + delta * static_cast(i);
+} else {
+  data[i] = std::rand();
+}
+  }
+}
+  }
+
+  void decodeAndVerify(
+   RleVersion version,
+   const MemoryOutputStream& memStream,
+   int64_t * data,
+   uint64_t numValues,
+   const char* notNull,
+   bool isSinged) {
+std::unique_ptr decoder = createRleDecoder(
+std::unique_ptr(new 
SeekableArrayInputStream(
+memStream.getData(),
+memStream.getLength())),
+isSinged, version, *getDefaultPool());
+
+int64_t* decodedData = new int64_t[numValues];
+decoder->next(decodedData, numValues, notNull);
+
+for (uint64_t i = 0; i < numValues; ++i) {
+  if (!notNull || notNull[i]) {
+EXPECT_EQ(data[i], decodedData[i]);
+  }
+}
+
+delete [] decodedData;
+  }
+
+  std::unique_ptr getEncoder(RleVersion version,
+MemoryOutputStream& memStream,
+bool isSigned)
+  {
+MemoryPool * pool = getDefaultPool();
+
+return createRleEncoder(
+std::unique_ptr(
+new BufferedOutputStream(*pool, , 500 * 
1024, 1024)),
+isSigned, version, *pool, true);
--- End diff --

Can we parameterize these tests to also cover `alignedBitpacking = false`?


---


[GitHub] orc issue #277: ORC-372: Enable valgrind for C++ travis-ci tests

2018-06-03 Thread majetideepak
Github user majetideepak commented on the issue:

https://github.com/apache/orc/pull/277
  
There are indeed valgrind failures. I will push a followup patch to fix 
these.


---


[GitHub] orc pull request #277: ORC-372: Enable valgrind for C++ travis-ci tests

2018-06-03 Thread majetideepak
GitHub user majetideepak opened a pull request:

https://github.com/apache/orc/pull/277

ORC-372: Enable valgrind for C++ travis-ci tests



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/majetideepak/orc ORC-372

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/orc/pull/277.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #277


commit 621f6467ace049c90b3044e67c274f9d276b3a0d
Author: Deepak Majeti 
Date:   2018-06-03T16:52:39Z

ORC-372: Enable valgrind for C++ travis-ci tests




---


[GitHub] orc issue #273: ORC-343 Enable C++ writer to support RleV2

2018-06-03 Thread majetideepak
Github user majetideepak commented on the issue:

https://github.com/apache/orc/pull/273
  
The PR looks overall good to me apart from a minor change requested. This 
is an important patch to align the C++ and Java implementations. Thanks again 
for working on this!


---


[GitHub] orc pull request #273: ORC-343 Enable C++ writer to support RleV2

2018-06-03 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/273#discussion_r192593861
  
--- Diff: c++/src/RLEv2.hh ---
@@ -25,13 +25,89 @@
 
 #include 
 
+#define MIN_REPEAT 3
+#define HIST_LEN 32
 namespace orc {
 
-class RleDecoderV2 : public RleDecoder {
+struct FixedBitSizes {
+enum FBS {
+ONE = 0, TWO, THREE, FOUR, FIVE, SIX, SEVEN, EIGHT, NINE, TEN, 
ELEVEN, TWELVE,
+THIRTEEN, FOURTEEN, FIFTEEN, SIXTEEN, SEVENTEEN, EIGHTEEN, 
NINETEEN,
+TWENTY, TWENTYONE, TWENTYTWO, TWENTYTHREE, TWENTYFOUR, TWENTYSIX,
+TWENTYEIGHT, THIRTY, THIRTYTWO, FORTY, FORTYEIGHT, FIFTYSIX, 
SIXTYFOUR, SIZE
+};
+};
+
+enum EncodingType { SHORT_REPEAT=0, DIRECT=1, PATCHED_BASE=2, DELTA=3 };
+
+struct EncodingOption {
+  EncodingType encoding;
+  int64_t fixedDelta;
+  int64_t gapVsPatchListCount;
+  int64_t zigzagLiteralsCount;
+  int64_t baseRedLiteralsCount;
+  int64_t adjDeltasCount;
+  uint32_t zzBits90p;
+  uint32_t zzBits100p;
+  uint32_t brBits95p;
+  uint32_t brBits100p;
+  uint32_t bitsDeltaMax;
+  uint32_t patchWidth;
+  uint32_t patchGapWidth;
+  uint32_t patchLength;
+  int64_t min;
+  bool isFixedDelta;
+};
+
+class RleEncoderV2 : public RleEncoder {
 public:
+RleEncoderV2(std::unique_ptr outStream, bool 
hasSigned, bool alignBitPacking = true);
--- End diff --

`alignedBitPacking` is always true. Should we add a WriterOption to 
enable/disable it?
Java uses the Encoding Strategy to choose this. C++ currently does not have 
this.
```
java/core/src/java/org/apache/orc/impl/writer/TreeWriterBase.java:144
if (writer.getEncodingStrategy().equals(OrcFile.EncodingStrategy.SPEED)) {
 alignedBitpacking = true;
}
```
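
One possible shape for such an option on the C++ side, as a rough sketch only. `WriterOptionsSketch`, its method names, and the `SPEED` default are assumptions for illustration and not the existing ORC C++ API; the rule itself mirrors the Java snippet above.

```cpp
// Mirrors the two strategies referenced by the Java code above.
enum class EncodingStrategy { SPEED, COMPRESSION };

// Hypothetical options class; the real WriterOptions has no such member yet.
class WriterOptionsSketch {
 public:
  WriterOptionsSketch& setEncodingStrategy(EncodingStrategy strategy) {
    strategy_ = strategy;
    return *this;
  }

  EncodingStrategy getEncodingStrategy() const { return strategy_; }

  // Aligned bit packing is enabled only when optimizing for speed.
  bool getAlignedBitpacking() const {
    return strategy_ == EncodingStrategy::SPEED;
  }

 private:
  EncodingStrategy strategy_ = EncodingStrategy::SPEED;  // assumed default
};
```

The encoder constructor could then take `options.getAlignedBitpacking()` instead of relying on its default argument.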


---


[GitHub] orc issue #275: ORC-371: [C++] Disable Libhdfspp build when Cyrus SASL is no...

2018-06-03 Thread majetideepak
Github user majetideepak commented on the issue:

https://github.com/apache/orc/pull/275
  
Since there is some investigation needed here, I am going to merge this 
patch. We can enable `NO_SASL` build in a later patch. Right now, this is 
causing a build failure by default.


---


[GitHub] orc pull request #275: ORC-371: [C++] Disable Libhdfspp build when Cyrus SAS...

2018-05-30 Thread majetideepak
GitHub user majetideepak opened a pull request:

https://github.com/apache/orc/pull/275

ORC-371: [C++] Disable Libhdfspp build when Cyrus SASL is not found



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/majetideepak/orc ORC-371

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/orc/pull/275.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #275


commit 65c537767474708a91cf5a15cae45acf4fcea552
Author: Deepak Majeti 
Date:   2018-05-30T21:46:08Z

ORC-371: [C++] Disable Libhdfspp build when Cyrus SASL is not found




---


[GitHub] orc pull request #273: ORC-343 Enable C++ writer to support RleV2

2018-05-30 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/273#discussion_r191832950
  
--- Diff: c++/src/Writer.cc ---
@@ -122,9 +127,17 @@ namespace orc {
   }
 
   WriterOptions& WriterOptions::setFileVersion(const FileVersion& version) 
{
-// Only Hive_0_11 version is supported currently
-if (version.getMajor() == 0 && version.getMinor() == 11) {
+// Only Hive_0_11 and Hive_0_12 version are supported currently
+if (version.getMajor() == 0 && (version.getMinor() == 11 || 
version.getMinor() == 12)) {
--- End diff --

My suggestion is to use this logic to implement 
`WriterOptions::getRleVersion()`.


---


[GitHub] orc pull request #273: ORC-343 Enable C++ writer to support RleV2

2018-05-30 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/273#discussion_r191830703
  
--- Diff: c++/src/Writer.cc ---
@@ -38,9 +38,10 @@ namespace orc {
 FileVersion fileVersion;
 double dictionaryKeySizeThreshold;
 bool enableIndex;
+RleVersion rleVersion;
--- End diff --

To be clear, do we need this `RleVersion` here and in the `WriterOptions`?


---


[GitHub] orc pull request #273: ORC-343 Enable C++ writer to support RleV2

2018-05-30 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/273#discussion_r191829402
  
--- Diff: c++/src/Writer.cc ---
@@ -38,9 +38,10 @@ namespace orc {
 FileVersion fileVersion;
 double dictionaryKeySizeThreshold;
 bool enableIndex;
+RleVersion rleVersion;
--- End diff --

I think the file version should determine the `RleVersion`. Refer to 
`isNewWriteFormat` and `isDirectV2` on the Java side.
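
A minimal sketch of deriving the RLE version from the file version, assuming a simplified `FileVersion` class and a hypothetical free function `rleVersionFor` (the real lookup would presumably live in `WriterOptions::getRleVersion()`):

```cpp
#include <cstdint>
#include <stdexcept>

enum RleVersion { RleVersion_1, RleVersion_2 };

// Simplified stand-in for orc::FileVersion.
class FileVersion {
 public:
  FileVersion(uint32_t majorVersion, uint32_t minorVersion)
      : major_(majorVersion), minor_(minorVersion) {}

  static const FileVersion& v_0_11() {
    static const FileVersion version(0, 11);
    return version;
  }
  static const FileVersion& v_0_12() {
    static const FileVersion version(0, 12);
    return version;
  }

  uint32_t getMajor() const { return major_; }
  uint32_t getMinor() const { return minor_; }

  bool operator==(const FileVersion& other) const {
    return major_ == other.major_ && minor_ == other.minor_;
  }

 private:
  uint32_t major_;
  uint32_t minor_;
};

// Hypothetical helper: Hive 0.11 files use RLE v1, Hive 0.12 files use v2.
inline RleVersion rleVersionFor(const FileVersion& version) {
  if (version == FileVersion::v_0_11()) return RleVersion_1;
  if (version == FileVersion::v_0_12()) return RleVersion_2;
  throw std::logic_error("Unsupported file version");
}
```

With this, `WriterOptionsPrivate` would not need its own `rleVersion` member, and the two settings could not drift apart.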



---


[GitHub] orc pull request #273: ORC-343 Enable C++ writer to support RleV2

2018-05-30 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/273#discussion_r191816749
  
--- Diff: c++/test/TestWriter.cc ---
@@ -47,7 +47,6 @@ namespace orc {
   const Type& type,
   MemoryPool* memoryPool,
   OutputStream* stream,
-  RleVersion rleVersion,
   FileVersion version = FileVersion(0, 
12)){
--- End diff --

`FileVersion::v_0_12()`


---


[GitHub] orc pull request #273: ORC-343 Enable C++ writer to support RleV2

2018-05-30 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/273#discussion_r191816344
  
--- Diff: c++/test/TestWriter.cc ---
@@ -139,7 +136,6 @@ namespace orc {
   *type,
   pool,
   ,
-  rleVersion,
   FileVersion(0, 11));
--- End diff --

`FileVersion::v_0_11()`


---


[GitHub] orc pull request #273: ORC-343 Enable C++ writer to support RleV2

2018-05-30 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/273#discussion_r191815756
  
--- Diff: c++/test/TestWriter.cc ---
@@ -1174,5 +1170,5 @@ namespace orc {
 }
   }
 
-  INSTANTIATE_TEST_CASE_P(OrcTest, WriterTest, Values(RleVersion_1, 
RleVersion_2));
+  INSTANTIATE_TEST_CASE_P(OrcTest, WriterTest, 
Values(FileVersion::v_0_11(), FileVersion::v_0_11()));
--- End diff --

Should be `Values(FileVersion::v_0_11(), FileVersion::v_0_12()))`


---


[GitHub] orc pull request #273: ORC-343 Enable C++ writer to support RleV2

2018-05-29 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/273#discussion_r191615840
  
--- Diff: c++/src/RleEncoderV2.cc ---
@@ -0,0 +1,768 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with option work for additional information
+ * regarding copyright ownership.  The ASF licenses option file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use option file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "Adaptor.hh"
+#include "Compression.hh"
+#include "RLEv2.hh"
+#include "RLEV2Util.hh"
+
+#define MAX_LITERAL_SIZE 512
+#define MAX_SHORT_REPEAT_LENGTH 10
+
+namespace orc {
+
+/**
+ * Compute the bits required to represent pth percentile value
+ * @param data - array
+ * @param p - percentile value (=0.0 to =1.0)
+ * @return pth percentile bits
+ */
+uint32_t RleEncoderV2::percentileBits(int64_t* data, size_t offset, size_t 
length, double p, bool reuseHist) {
+if ((p > 1.0) || (p <= 0.0)) {
+throw InvalidArgument("Invalid p value: " + std::to_string(p));
+}
+
+if (!reuseHist) {
+// histogram that store the encoded bit requirement for each 
values.
+// maximum number of bits that can encoded is 32 (refer 
FixedBitSizes)
+memset(histgram, 0, 32 * sizeof(int32_t));
--- End diff --

Good point! Should we just add `FixedBitSizes::SIZE` as another element and 
use it then? 
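
A sketch of the sentinel idea, assuming the enumerator list from `RLEv2.hh` and a hypothetical `HistogramSketch` holder. Because `SIZE` is the enumerator after `SIXTYFOUR`, it equals the number of bit widths (32), whereas an alias such as `LAST = SIXTYFOUR` would evaluate to 31.

```cpp
#include <cstdint>
#include <cstring>

struct FixedBitSizes {
  enum FBS {
    ONE = 0, TWO, THREE, FOUR, FIVE, SIX, SEVEN, EIGHT, NINE, TEN,
    ELEVEN, TWELVE, THIRTEEN, FOURTEEN, FIFTEEN, SIXTEEN, SEVENTEEN,
    EIGHTEEN, NINETEEN, TWENTY, TWENTYONE, TWENTYTWO, TWENTYTHREE,
    TWENTYFOUR, TWENTYSIX, TWENTYEIGHT, THIRTY, THIRTYTWO, FORTY,
    FORTYEIGHT, FIFTYSIX, SIXTYFOUR,
    SIZE  // sentinel: counts the enumerators above
  };
};

// Hypothetical holder for the percentile histogram.
struct HistogramSketch {
  int32_t histogram[FixedBitSizes::SIZE];

  // Zero every bucket; sizeof tracks the array length automatically.
  void reset() { std::memset(histogram, 0, sizeof(histogram)); }
};
```

Sizing the array and the `memset` from the same constant means adding a bit width later does not require hunting for a literal `32`.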


---


[GitHub] orc pull request #273: ORC-343 Enable C++ writer to support RleV2

2018-05-28 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/273#discussion_r191291449
  
--- Diff: c++/src/RleEncoderV2.cc ---
@@ -0,0 +1,768 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with option work for additional information
+ * regarding copyright ownership.  The ASF licenses option file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use option file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "Adaptor.hh"
+#include "Compression.hh"
+#include "RLEv2.hh"
+#include "RLEV2Util.hh"
+
+#define MAX_LITERAL_SIZE 512
+#define MAX_SHORT_REPEAT_LENGTH 10
+
+namespace orc {
+
+/**
+ * Compute the bits required to represent pth percentile value
+ * @param data - array
+ * @param p - percentile value (=0.0 to =1.0)
+ * @return pth percentile bits
+ */
+uint32_t RleEncoderV2::percentileBits(int64_t* data, size_t offset, size_t 
length, double p, bool reuseHist) {
+if ((p > 1.0) || (p <= 0.0)) {
+throw InvalidArgument("Invalid p value: " + std::to_string(p));
+}
+
+if (!reuseHist) {
+// histogram that store the encoded bit requirement for each 
values.
+// maximum number of bits that can encoded is 32 (refer 
FixedBitSizes)
+memset(histgram, 0, 32 * sizeof(int32_t));
--- End diff --

Use `FixedBitSizes::LAST` instead of 32?


---


[GitHub] orc pull request #273: ORC-343 Enable C++ writer to support RleV2

2018-05-28 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/273#discussion_r191291163
  
--- Diff: c++/src/RLEv2.hh ---
@@ -25,13 +25,89 @@
 
 #include 
 
+#define MIN_REPEAT 3
+#define HIST_LEN 32
 namespace orc {
 
-class RleDecoderV2 : public RleDecoder {
+struct FixedBitSizes {
+enum FBS {
+ONE = 0, TWO, THREE, FOUR, FIVE, SIX, SEVEN, EIGHT, NINE, TEN, 
ELEVEN, TWELVE,
+THIRTEEN, FOURTEEN, FIFTEEN, SIXTEEN, SEVENTEEN, EIGHTEEN, 
NINETEEN,
+TWENTY, TWENTYONE, TWENTYTWO, TWENTYTHREE, TWENTYFOUR, TWENTYSIX,
+TWENTYEIGHT, THIRTY, THIRTYTWO, FORTY, FORTYEIGHT, FIFTYSIX, 
SIXTYFOUR
--- End diff --

can you add another element `LAST=SIXTYFOUR` towards the end?


---


[GitHub] orc pull request #273: ORC-343 Enable C++ writer to support RleV2

2018-05-28 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/273#discussion_r191275478
  
--- Diff: c++/src/Writer.cc ---
@@ -38,9 +38,10 @@ namespace orc {
 FileVersion fileVersion;
 double dictionaryKeySizeThreshold;
 bool enableIndex;
+RleVersion rleVersion;
 
 WriterOptionsPrivate() :
-fileVersion(0, 11) { // default to Hive_0_11
+fileVersion(0, 12) { // default to Hive_0_12
--- End diff --

We should use the static constants proposed in PR 
https://github.com/apache/orc/pull/274 moving forward.


---


[GitHub] orc pull request #273: ORC-343 Enable C++ writer to support RleV2

2018-05-28 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/273#discussion_r191269083
  
--- Diff: c++/src/CMakeLists.txt ---
@@ -179,15 +179,15 @@ set(SOURCE_FILES
   OrcFile.cc
   Reader.cc
   RLEv1.cc
-  RLEv2.cc
+  RleDecoderV2.cc
+  RleEncoderV2.cc
--- End diff --

We split the Encoder and Decoder into two files for V2 and not for V1. Can 
we combine them into a single file for V2 as well?


---


[GitHub] orc pull request #273: ORC-343 Enable C++ writer to support RleV2

2018-05-28 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/273#discussion_r191271607
  
--- Diff: c++/src/RleDecoderV2.cc ---
@@ -1,10 +1,10 @@
 /**
  * Licensed to the Apache Software Foundation (ASF) under one
  * or more contributor license agreements.  See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership.  The ASF licenses this file
+ * distributed with option work for additional information
--- End diff --

The Apache license header must not change.


---


[GitHub] orc pull request #273: ORC-343 Enable C++ writer to support RleV2

2018-05-28 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/273#discussion_r191267866
  
--- Diff: c++/include/orc/Writer.hh ---
@@ -164,6 +169,16 @@ namespace orc {
  */
 std::ostream * getErrorStream() const;
 
+/**
+ * Set the RLE version.
+ */
+WriterOptions& setRleVersion(RleVersion version);
--- End diff --

`WriterOptions& setRleVersion(const RleVersion& version);`


---


[GitHub] orc issue #274: ORC-368:[C++] Reader must return default version 0.11 instea...

2018-05-26 Thread majetideepak
Github user majetideepak commented on the issue:

https://github.com/apache/orc/pull/274
  
+1 LGTM


---


[GitHub] orc issue #273: ORC-343 Enable C++ writer to support RleV2

2018-05-25 Thread majetideepak
Github user majetideepak commented on the issue:

https://github.com/apache/orc/pull/273
  
@xndai  and @yuruiz  thanks for contributing this code. I will take a look 
at this.


---


[GitHub] orc issue #265: ORC-334: [C++] Add AppVeyor support for integration on windo...

2018-05-16 Thread majetideepak
Github user majetideepak commented on the issue:

https://github.com/apache/orc/pull/265
  
Added an INFRA ticket to do this here 
https://issues.apache.org/jira/browse/INFRA-16535


---


[GitHub] orc issue #265: ORC-334: [C++] Add AppVeyor support for integration on windo...

2018-05-14 Thread majetideepak
Github user majetideepak commented on the issue:

https://github.com/apache/orc/pull/265
  
+1 LGTM


---


[GitHub] orc pull request #265: ORC-334: [C++] Add AppVeyor support for integration o...

2018-05-14 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/265#discussion_r187922554
  
--- Diff: c++/src/Timezone.cc ---
@@ -710,7 +710,11 @@ namespace orc {
* Get the local timezone.
*/
   const Timezone& getLocalTimezone() {
+#ifdef _MSC_VER
+return getTimezoneByName("UTC");
--- End diff --

I also think that it is better to leave the conversion to the 
client/customer. We should ideally change the conversion for *nix systems but 
cannot due to backward compatibility. We should be able to converge both Java 
and C++ timestamps in ORC 2.0.
For Windows, since this is the first official build support, we should be 
okay to use "UTC" and document this behavior.
I will merge this PR by the end of today if there are no objections.


---


[GitHub] orc pull request #265: ORC-334: [C++] Add AppVeyor support for integration o...

2018-05-07 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/265#discussion_r186409327
  
--- Diff: c++/src/Timezone.cc ---
@@ -710,7 +710,11 @@ namespace orc {
* Get the local timezone.
*/
   const Timezone& getLocalTimezone() {
+#ifdef _MSC_VER
+return getTimezoneByName("UTC");
--- End diff --

Can you comment on why we use `UTC` on Windows instead of 
`LOCAL_TIMEZONE`?


---


[GitHub] orc pull request #245: ORC-161: Proposal for new decimal encodings and stati...

2018-04-12 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r181155751
  
--- Diff: site/_docs/encodings.md ---
@@ -123,6 +127,41 @@ DIRECT_V2 | PRESENT | Yes  | Boolean 
RLE
   | DATA| No   | Unbounded base 128 varints
   | SECONDARY   | No   | Unsigned Integer RLE v2
 
+In ORC 2.0, DECIMAL and DECIMAL_V2 encodings are introduced and scale
+stream is totally removed as all decimal values use the same scale.
+There are two difference cases: precision<=18 and precision>18.
+
+### Decimal Encoding for precision <= 18
+
+When precision is no greater than 18, decimal values can be fully
+represented by 64-bit signed integers which are stored in DATA stream
+and use signed integer RLE.
+
+Encoding  | Stream Kind | Optional | Contents
+: | :-- | :--- | :---
+DECIMAL   | PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v1
+DECIMAL_V2| PRESENT | Yes  | Boolean RLE
+  | DATA| No   | Signed Integer RLE v2
--- End diff --

@xndai Vertica is interested in getting RLE v2 for C++ as well. Do you 
think we can collaborate on getting this in quickly?


---


[GitHub] orc pull request #245: ORC-161: Proposal for new decimal encodings and stati...

2018-04-12 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/245#discussion_r181096194
  
--- Diff: site/_docs/file-tail.md ---
@@ -249,12 +249,25 @@ For booleans, the statistics include the count of 
false and true values.
 }
 ```
 
-For decimals, the minimum, maximum, and sum are stored.
+For decimals, the minimum, maximum, and sum are stored. In ORC 2.0,
+string representation is deprecated and DecimalStatistics uses integers
+which have better performance.
 
 ```message DecimalStatistics {
  optional string minimum = 1;
  optional string maximum = 2;
  optional string sum = 3;
+  message Int128 {
+   repeated sint64 highBits = 1;
+   repeated uint64 lowBits = 2;
--- End diff --

shouldn't this be sint64 as well since we are using uint64 for the 
SECONDARY stream?


---


[GitHub] orc pull request #243: Update the site with more information about developin...

2018-04-10 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/243#discussion_r180608621
  
--- Diff: site/develop/committers.md ---
@@ -0,0 +1,62 @@
+---
+layout: page
+title: Project Members
+---
+
+## Project Members
+
+{% comment %}
+please sort by Apache Id
+{% endcomment %}
+Name| Apache Id| Role
+:-- | :--- | :---
+Aliaksei Sandryhaila| asandryh | PMC
+Chris Douglas   | cdouglas | PMC
+Chinna Rao Lalam| chinnaraol   | Committer
+Chaoyu Tang | ctang| Committer
+Carl Steinbach  | cws  | Committer
+Daniel Dai  | daijy| Committer
+Deepak Majeti   | mdeepak  | PMC
+Eugene Koifman  | ekoifman | PMC
+Gang Wu | gangwu   | Committer
+Alan Gates  | gates| PMC
+Gopal Vijayaraghavan| gopalv   | PMC
+Gunther Hagleitner  | gunther  | Committer
+Ashutosh Chauhan| hashutosh| Committer
+Jesus Camacho Rodriguez | jcamacho | Committer
+Jason Dere  | jdere| Committer
+Jimmy Xiang | jxiang   | Committer
+Kevin Wilfong   | kevinwilfong | Committer
+Lars Francke| larsfrancke  | Committer
+Lefty Leverenz  | leftyl   | PMC
+Rui Li  | lirui| Committer
+Mithun Radhakrishnan| mithun   | Committer
+Matthew McCline | mmccline | Committer
+Naveen Gangam   | ngangam  | Committer
+Owen O'Malley   | omalley  | PMC
+Prasanth Jayachandran   | prasanthj| PMC
+Pengcheng Xiong | pxiong   | Committer
+Rajesh Balamohan| rbalamohan   | Committer
+Sergey Shelukhin| sershe   | Committer
+Sergio Pena | spena| Committer
+Siddharth Seth  | sseth| Committer
+Stephen Walkauskas  | swalkaus | Committer
+Vaibhav Gumashta| vgumashta| Committer
+Wei Zheng   | weiz | Committer
+Xiening Dai | xndai| Committer
+Xuefu Zhang | xuefu| Committer
+Ferdinand Xu| xuf  | Committer
+Yongzhi Chen| ychena   | Committer
+Aihua Xu| zihuaxu  | Committer
+
+Companies with employees that are committers:
+
+* Alibaba
+* Cloudera
+* Facebook
+* Hewlett Packard Enterprise
--- End diff --

Can we add Vertica instead of HPE? 


---


[GitHub] orc issue #233: ORC-322: [C++] Fix writing & reading timestamp

2018-04-03 Thread majetideepak
Github user majetideepak commented on the issue:

https://github.com/apache/orc/pull/233
  
You are right about the last commit. It is redundant! I will remove that 
commit and merge this. Thanks!


---


[GitHub] orc issue #233: ORC-322: [C++] Fix writing & reading timestamp

2018-04-03 Thread majetideepak
Github user majetideepak commented on the issue:

https://github.com/apache/orc/pull/233
  
Can we extend this by adding a `WriterOption` to provide a timezone name 
(default to "GMT")?
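
A rough sketch of what such an option could look like. The class and method names here are hypothetical and not part of the current ORC C++ API; only the "GMT" default comes from the suggestion above.

```cpp
#include <string>

// Hypothetical writer option carrying the writer timezone name.
class TimezoneWriterOptionsSketch {
 public:
  TimezoneWriterOptionsSketch& setTimezoneName(const std::string& zone) {
    timezoneName_ = zone;
    return *this;
  }

  const std::string& getTimezoneName() const { return timezoneName_; }

 private:
  std::string timezoneName_ = "GMT";  // default proposed in this comment
};
```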


---


[GitHub] orc issue #233: ORC-322: [C++] Fix writing & reading timestamp

2018-03-22 Thread majetideepak
Github user majetideepak commented on the issue:

https://github.com/apache/orc/pull/233
  
I also think that we should adjust the value being read from the C++ reader 
w.r.t. the reader and writer timezones (if they are different) like the Java 
reader implementation. The current C++ behavior is definitely inconsistent with 
Java (which does things the right way).


---


[GitHub] orc issue #233: ORC-322: [C++] Fix writing & reading timestamp

2018-03-19 Thread majetideepak
Github user majetideepak commented on the issue:

https://github.com/apache/orc/pull/233
  
@wgtmac and @stiga-huang you are right that the C++ and Java writers must 
write the same value to a file for a given input timestamp value. It looks like 
the Java side writes the provided timestamp values as is, in local time (no 
conversion), and writes the writer timezone in the footer (however, stats are in 
UTC). We must do the same for the C++ writer as well, if it does not already.

ORC-10 adds GMT offset when reading the values back. Therefore, the C++ 
reader always returns values in UTC. The current behavior of ORC reader for 
timestamp values is the same as SQL `TimestampTz`.
To get the same values back (aka SQL `Timestamp`), you need to convert the 
values read back to local time.

If you read a timestamp column from an ORC file and plan to write it 
immediately, you must first convert the values to the local time before writing.


---


[GitHub] orc issue #225: ORC-313,ORC-317: Check types in footer

2018-03-15 Thread majetideepak
Github user majetideepak commented on the issue:

https://github.com/apache/orc/pull/225
  
@stiga-huang  I will look into these this week. Can you fix the title of 
this PR? Thanks!


---


[GitHub] orc issue #225: ORC-313,ORC-317: Check types in footer

2018-03-11 Thread majetideepak
Github user majetideepak commented on the issue:

https://github.com/apache/orc/pull/225
  
@stiga-huang can you open a separate PR for each JIRA? Thanks.


---


[GitHub] orc issue #216: ORC-284: [C++] add missing tests for C++ tools

2018-02-11 Thread majetideepak
Github user majetideepak commented on the issue:

https://github.com/apache/orc/pull/216
  
+1 LGTM. Thanks @wgtmac !


---


[GitHub] orc issue #199: ORC-276: [C++] Create a simple tool to import CSV files

2018-01-11 Thread majetideepak
Github user majetideepak commented on the issue:

https://github.com/apache/orc/pull/199
  
Looks good! Thank you!


---


[GitHub] orc issue #199: ORC-276: [C++] Create a simple tool to import CSV files

2017-12-21 Thread majetideepak
Github user majetideepak commented on the issue:

https://github.com/apache/orc/pull/199
  
I just realized we need to add tests for each tool as well.  `/tools/test` 
has some examples.
Some of the tools are missing tests as well. I will file a JIRA to cover 
those.
Sorry for not noticing this earlier.


---


[GitHub] orc pull request #199: ORC-276: [C++] Create a simple tool to import CSV fil...

2017-12-21 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/199#discussion_r158422184
  
--- Diff: tools/src/CSVFileImport.cc ---
@@ -0,0 +1,476 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "orc/Exceptions.hh"
+#include "orc/OrcFile.hh"
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static char gDelimiter = ',';
+
+// extract one column raw text from one line
+std::string extractColumn(std::string s, uint64_t colIndex) {
+  uint64_t col = 0;
+  size_t start = 0;
+  size_t end = s.find(gDelimiter);
+  while (col < colIndex && end != std::string::npos) {
+start = end + 1;
+end = s.find(gDelimiter, start);
+++col;
+  }
+  return col == colIndex ? s.substr(start, end - start) : "";
+}
+
+static const char* GetDate(void) {
+  static char buf[200];
+  time_t t = time(NULL);
+  struct tm* p = localtime(&t);
+  strftime(buf, sizeof(buf), "[%Y-%m-%d %H:%M:%S]", p);
+  return buf;
+}
+
+void fillLongValues(const std::vector<std::string>& data,
+orc::ColumnVectorBatch* batch,
+uint64_t numValues,
+uint64_t colIndex) {
+  orc::LongVectorBatch* longBatch =
+dynamic_cast<orc::LongVectorBatch*>(batch);
+  bool hasNull = false;
+  for (uint64_t i = 0; i < numValues; ++i) {
+std::string col = extractColumn(data[i], colIndex);
+if (col.empty()) {
+  batch->notNull[i] = 0;
+  hasNull = true;
+} else {
+  batch->notNull[i] = 1;
+  longBatch->data[i] = atoll(col.c_str());
+}
+  }
+  longBatch->hasNulls = hasNull;
+  longBatch->numElements = numValues;
+}
+
+void fillStringValues(const std::vector<std::string>& data,
+  orc::ColumnVectorBatch* batch,
+  uint64_t numValues,
+  uint64_t colIndex,
+  orc::DataBuffer<char>& buffer,
+  uint64_t& offset) {
+  orc::StringVectorBatch* stringBatch =
+dynamic_cast<orc::StringVectorBatch*>(batch);
+  bool hasNull = false;
+  for (uint64_t i = 0; i < numValues; ++i) {
+std::string col = extractColumn(data[i], colIndex);
+if (col.empty()) {
+  batch->notNull[i] = 0;
+  hasNull = true;
+} else {
+  batch->notNull[i] = 1;
+  if (buffer.size() - offset < col.size()) {
+buffer.reserve(buffer.size() * 2);
+  }
+  memcpy(buffer.data() + offset,
+ col.c_str(),
+ col.size());
+  stringBatch->data[i] = buffer.data() + offset;
+  stringBatch->length[i] = static_cast<int64_t>(col.size());
+  offset += col.size();
+}
+  }
+  stringBatch->hasNulls = hasNull;
+  stringBatch->numElements = numValues;
+}
+
+void fillDoubleValues(const std::vector<std::string>& data,
+  orc::ColumnVectorBatch* batch,
+  uint64_t numValues,
+  uint64_t colIndex) {
+  orc::DoubleVectorBatch* dblBatch =
+dynamic_cast<orc::DoubleVectorBatch*>(batch);
+  bool hasNull = false;
+  for (uint64_t i = 0; i < numValues; ++i) {
+std::string col = extractColumn(data[i], colIndex);
+if (col.empty()) {
+  batch->notNull[i] = 0;
+  hasNull = true;
+} else {
+  batch->notNull[i] = 1;
+  dblBatch->data[i] = atof(col.c_str());
+}
+  }
+  dblBatch->hasNulls = hasNull;
+  dblBatch->numElements = numValues;
+}
+
+// parse fixed point decimal numbers
+void fillDecimalValues(const std::vector<std::string>& data,
+   orc::ColumnVectorBatch* batch,
+   

[GitHub] orc pull request #204: ORC-283: Enable the cmake build to pick specified pat...

2017-12-21 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/204#discussion_r158269151
  
--- Diff: cmake_modules/FindGTest.cmake ---
@@ -28,7 +28,7 @@ find_path (GTEST_INCLUDE_DIR gmock/gmock.h HINTS
   NO_DEFAULT_PATH
   PATH_SUFFIXES "include")
 
-find_library (GTEST_LIBRARIES NAMES gmock PATHS
+find_library (GTEST_LIBRARIES NAMES gmock HINTS
--- End diff --

`HINTS` is apt here. `PATHS` must only be used for hardcoded guesses.
https://cmake.org/cmake/help/v3.0/command/find_library.html


---


[GitHub] orc pull request #204: ORC-283: Enable the cmake build to pick specified lib...

2017-12-21 Thread majetideepak
GitHub user majetideepak opened a pull request:

https://github.com/apache/orc/pull/204

ORC-283: Enable the cmake build to pick specified libraries over the …

…default libraries

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/majetideepak/orc ORC-283

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/orc/pull/204.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #204


commit 1a34380e7eb6323e3f0b92b377943f1ba5562d5e
Author: Deepak Majeti <mdeepak@...>
Date:   2017-12-21T12:17:12Z

ORC-283: Enable the cmake build to pick specified libraries over the 
default libraries




---


[GitHub] orc pull request #199: ORC-276: [C++] Create a simple tool to import CSV fil...

2017-12-10 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/199#discussion_r155629237
  
--- Diff: tools/src/CSVFileImport.cc ---
@@ -0,0 +1,436 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "orc/Exceptions.hh"
+#include "orc/OrcFile.hh"
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static char gDelimiter = ',';
+
+std::string extractColumn(std::string s, uint64_t colIndex) {
+  uint64_t col = 0;
+  size_t start = 0;
+  size_t end = s.find(gDelimiter);
+  while (col < colIndex && end != std::string::npos) {
+start = end + 1;
+end = s.find(gDelimiter, start);
+++col;
+  }
+  return col == colIndex ? s.substr(start, end - start) : "";
+}
+
+static const char* GetDate(void)
+{
+  static char buf[200];
+  time_t t = time(NULL);
+  struct tm* p = localtime(&t);
+  strftime(buf, sizeof(buf), "[%Y-%m-%d %H:%M:%S]", p);
+  return buf;
+}
+
+void fillLongValues(const std::vector<std::string>& data,
+orc::ColumnVectorBatch* batch,
+uint64_t numValues,
+uint64_t colIndex) {
+  orc::LongVectorBatch* longBatch =
+dynamic_cast<orc::LongVectorBatch*>(batch);
+  bool hasNull = false;
+  for (uint64_t i = 0; i < numValues; ++i) {
+std::string col = extractColumn(data[i], colIndex);
+if (col.empty()) {
+  batch->notNull[i] = 0;
+  hasNull = true;
+} else {
+  batch->notNull[i] = 1;
+  longBatch->data[i] = atoll(col.c_str());
+}
+  }
+  longBatch->hasNulls = hasNull;
+  longBatch->numElements = numValues;
+}
+
+void fillStringValues(const std::vector<std::string>& data,
+  orc::ColumnVectorBatch* batch,
+  uint64_t numValues,
+  uint64_t colIndex,
+  orc::DataBuffer<char>& buffer,
+  uint64_t& offset) {
+  orc::StringVectorBatch* stringBatch =
+dynamic_cast<orc::StringVectorBatch*>(batch);
+  bool hasNull = false;
+  for (uint64_t i = 0; i < numValues; ++i) {
+std::string col = extractColumn(data[i], colIndex);
+if (col.empty()) {
+  batch->notNull[i] = 0;
+  hasNull = true;
+} else {
+  batch->notNull[i] = 1;
+  if (buffer.size() - offset < col.size()) {
+buffer.reserve(buffer.size() * 2);
+  }
+  memcpy(buffer.data() + offset,
+ col.c_str(),
+ col.size());
+  stringBatch->data[i] = buffer.data() + offset;
+  stringBatch->length[i] = static_cast<int64_t>(col.size());
+  offset += col.size();
+}
+  }
+  stringBatch->hasNulls = hasNull;
+  stringBatch->numElements = numValues;
+}
+
+void fillDoubleValues(const std::vector<std::string>& data,
+  orc::ColumnVectorBatch* batch,
+  uint64_t numValues,
+  uint64_t colIndex) {
+  orc::DoubleVectorBatch* dblBatch =
+dynamic_cast<orc::DoubleVectorBatch*>(batch);
+  bool hasNull = false;
+  for (uint64_t i = 0; i < numValues; ++i) {
+std::string col = extractColumn(data[i], colIndex);
+if (col.empty()) {
+  batch->notNull[i] = 0;
+  hasNull = true;
+} else {
+  batch->notNull[i] = 1;
+  dblBatch->data[i] = atof(col.c_str());
+}
+  }
+  dblBatch->hasNulls = hasNull;
+  dblBatch->numElements = numValues;
+}
+
+// parse fixed point decimal numbers
+void fillDecimalValues(const std::vector<std::string>& data,
+   orc::ColumnVectorBatch* batch,
+   uint64_t numValues,
+   

[GitHub] orc pull request #199: ORC-276: [C++] Create a simple tool to import CSV fil...

2017-12-10 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/199#discussion_r155995744
  
--- Diff: tools/src/CSVFileImport.cc ---
@@ -0,0 +1,436 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "orc/Exceptions.hh"
+#include "orc/OrcFile.hh"
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static char gDelimiter = ',';
+
+std::string extractColumn(std::string s, uint64_t colIndex) {
+  uint64_t col = 0;
+  size_t start = 0;
+  size_t end = s.find(gDelimiter);
+  while (col < colIndex && end != std::string::npos) {
+start = end + 1;
+end = s.find(gDelimiter, start);
+++col;
+  }
+  return col == colIndex ? s.substr(start, end - start) : "";
+}
+
+static const char* GetDate(void)
+{
+  static char buf[200];
+  time_t t = time(NULL);
+  struct tm* p = localtime(&t);
+  strftime(buf, sizeof(buf), "[%Y-%m-%d %H:%M:%S]", p);
+  return buf;
+}
+
+void fillLongValues(const std::vector<std::string>& data,
+orc::ColumnVectorBatch* batch,
+uint64_t numValues,
+uint64_t colIndex) {
+  orc::LongVectorBatch* longBatch =
+dynamic_cast<orc::LongVectorBatch*>(batch);
+  bool hasNull = false;
+  for (uint64_t i = 0; i < numValues; ++i) {
+std::string col = extractColumn(data[i], colIndex);
+if (col.empty()) {
+  batch->notNull[i] = 0;
+  hasNull = true;
+} else {
+  batch->notNull[i] = 1;
+  longBatch->data[i] = atoll(col.c_str());
+}
+  }
+  longBatch->hasNulls = hasNull;
+  longBatch->numElements = numValues;
+}
+
+void fillStringValues(const std::vector<std::string>& data,
+  orc::ColumnVectorBatch* batch,
+  uint64_t numValues,
+  uint64_t colIndex,
+  orc::DataBuffer<char>& buffer,
+  uint64_t& offset) {
+  orc::StringVectorBatch* stringBatch =
+dynamic_cast<orc::StringVectorBatch*>(batch);
+  bool hasNull = false;
+  for (uint64_t i = 0; i < numValues; ++i) {
+std::string col = extractColumn(data[i], colIndex);
+if (col.empty()) {
+  batch->notNull[i] = 0;
+  hasNull = true;
+} else {
+  batch->notNull[i] = 1;
+  if (buffer.size() - offset < col.size()) {
+buffer.reserve(buffer.size() * 2);
+  }
+  memcpy(buffer.data() + offset,
+ col.c_str(),
+ col.size());
+  stringBatch->data[i] = buffer.data() + offset;
+  stringBatch->length[i] = static_cast<int64_t>(col.size());
+  offset += col.size();
+}
+  }
+  stringBatch->hasNulls = hasNull;
+  stringBatch->numElements = numValues;
+}
+
+void fillDoubleValues(const std::vector<std::string>& data,
+  orc::ColumnVectorBatch* batch,
+  uint64_t numValues,
+  uint64_t colIndex) {
+  orc::DoubleVectorBatch* dblBatch =
+dynamic_cast<orc::DoubleVectorBatch*>(batch);
+  bool hasNull = false;
+  for (uint64_t i = 0; i < numValues; ++i) {
+std::string col = extractColumn(data[i], colIndex);
+if (col.empty()) {
+  batch->notNull[i] = 0;
+  hasNull = true;
+} else {
+  batch->notNull[i] = 1;
+  dblBatch->data[i] = atof(col.c_str());
+}
+  }
+  dblBatch->hasNulls = hasNull;
+  dblBatch->numElements = numValues;
+}
+
+// parse fixed point decimal numbers
+void fillDecimalValues(const std::vector<std::string>& data,
+   orc::ColumnVectorBatch* batch,
+   uint64_t numValues,
+   

[GitHub] orc pull request #199: ORC-276: [C++] Create a simple tool to import CSV fil...

2017-12-10 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/199#discussion_r155629518
  
--- Diff: tools/src/CSVFileImport.cc ---
@@ -0,0 +1,436 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "orc/Exceptions.hh"
+#include "orc/OrcFile.hh"
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static char gDelimiter = ',';
+
+std::string extractColumn(std::string s, uint64_t colIndex) {
+  uint64_t col = 0;
+  size_t start = 0;
+  size_t end = s.find(gDelimiter);
+  while (col < colIndex && end != std::string::npos) {
+start = end + 1;
+end = s.find(gDelimiter, start);
+++col;
+  }
+  return col == colIndex ? s.substr(start, end - start) : "";
+}
+
+static const char* GetDate(void)
+{
+  static char buf[200];
+  time_t t = time(NULL);
+  struct tm* p = localtime(&t);
+  strftime(buf, sizeof(buf), "[%Y-%m-%d %H:%M:%S]", p);
+  return buf;
+}
+
+void fillLongValues(const std::vector<std::string>& data,
+orc::ColumnVectorBatch* batch,
+uint64_t numValues,
+uint64_t colIndex) {
+  orc::LongVectorBatch* longBatch =
+dynamic_cast<orc::LongVectorBatch*>(batch);
+  bool hasNull = false;
+  for (uint64_t i = 0; i < numValues; ++i) {
+std::string col = extractColumn(data[i], colIndex);
+if (col.empty()) {
+  batch->notNull[i] = 0;
+  hasNull = true;
+} else {
+  batch->notNull[i] = 1;
+  longBatch->data[i] = atoll(col.c_str());
+}
+  }
+  longBatch->hasNulls = hasNull;
+  longBatch->numElements = numValues;
+}
+
+void fillStringValues(const std::vector<std::string>& data,
+  orc::ColumnVectorBatch* batch,
+  uint64_t numValues,
+  uint64_t colIndex,
+  orc::DataBuffer<char>& buffer,
+  uint64_t& offset) {
+  orc::StringVectorBatch* stringBatch =
+dynamic_cast<orc::StringVectorBatch*>(batch);
+  bool hasNull = false;
+  for (uint64_t i = 0; i < numValues; ++i) {
+std::string col = extractColumn(data[i], colIndex);
+if (col.empty()) {
+  batch->notNull[i] = 0;
+  hasNull = true;
+} else {
+  batch->notNull[i] = 1;
+  if (buffer.size() - offset < col.size()) {
+buffer.reserve(buffer.size() * 2);
+  }
+  memcpy(buffer.data() + offset,
+ col.c_str(),
+ col.size());
+  stringBatch->data[i] = buffer.data() + offset;
+  stringBatch->length[i] = static_cast<int64_t>(col.size());
+  offset += col.size();
+}
+  }
+  stringBatch->hasNulls = hasNull;
+  stringBatch->numElements = numValues;
+}
+
+void fillDoubleValues(const std::vector<std::string>& data,
+  orc::ColumnVectorBatch* batch,
+  uint64_t numValues,
+  uint64_t colIndex) {
+  orc::DoubleVectorBatch* dblBatch =
+dynamic_cast<orc::DoubleVectorBatch*>(batch);
+  bool hasNull = false;
+  for (uint64_t i = 0; i < numValues; ++i) {
+std::string col = extractColumn(data[i], colIndex);
+if (col.empty()) {
+  batch->notNull[i] = 0;
+  hasNull = true;
+} else {
+  batch->notNull[i] = 1;
+  dblBatch->data[i] = atof(col.c_str());
+}
+  }
+  dblBatch->hasNulls = hasNull;
+  dblBatch->numElements = numValues;
+}
+
+// parse fixed point decimal numbers
+void fillDecimalValues(const std::vector<std::string>& data,
+   orc::ColumnVectorBatch* batch,
+   uint64_t numValues,
+   

[GitHub] orc pull request #199: ORC-276: [C++] Create a simple tool to import CSV fil...

2017-12-10 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/199#discussion_r155629814
  
--- Diff: tools/src/CSVFileImport.cc ---
@@ -0,0 +1,436 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "orc/Exceptions.hh"
+#include "orc/OrcFile.hh"
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static char gDelimiter = ',';
+
+std::string extractColumn(std::string s, uint64_t colIndex) {
+  uint64_t col = 0;
+  size_t start = 0;
+  size_t end = s.find(gDelimiter);
+  while (col < colIndex && end != std::string::npos) {
+start = end + 1;
+end = s.find(gDelimiter, start);
+++col;
+  }
+  return col == colIndex ? s.substr(start, end - start) : "";
+}
+
+static const char* GetDate(void)
+{
+  static char buf[200];
+  time_t t = time(NULL);
+  struct tm* p = localtime(&t);
+  strftime(buf, sizeof(buf), "[%Y-%m-%d %H:%M:%S]", p);
+  return buf;
+}
+
+void fillLongValues(const std::vector<std::string>& data,
+orc::ColumnVectorBatch* batch,
+uint64_t numValues,
+uint64_t colIndex) {
+  orc::LongVectorBatch* longBatch =
+dynamic_cast<orc::LongVectorBatch*>(batch);
+  bool hasNull = false;
+  for (uint64_t i = 0; i < numValues; ++i) {
+std::string col = extractColumn(data[i], colIndex);
+if (col.empty()) {
+  batch->notNull[i] = 0;
+  hasNull = true;
+} else {
+  batch->notNull[i] = 1;
+  longBatch->data[i] = atoll(col.c_str());
+}
+  }
+  longBatch->hasNulls = hasNull;
+  longBatch->numElements = numValues;
+}
+
+void fillStringValues(const std::vector<std::string>& data,
+  orc::ColumnVectorBatch* batch,
+  uint64_t numValues,
+  uint64_t colIndex,
+  orc::DataBuffer<char>& buffer,
+  uint64_t& offset) {
+  orc::StringVectorBatch* stringBatch =
+dynamic_cast<orc::StringVectorBatch*>(batch);
+  bool hasNull = false;
+  for (uint64_t i = 0; i < numValues; ++i) {
+std::string col = extractColumn(data[i], colIndex);
+if (col.empty()) {
+  batch->notNull[i] = 0;
+  hasNull = true;
+} else {
+  batch->notNull[i] = 1;
+  if (buffer.size() - offset < col.size()) {
+buffer.reserve(buffer.size() * 2);
+  }
+  memcpy(buffer.data() + offset,
+ col.c_str(),
+ col.size());
+  stringBatch->data[i] = buffer.data() + offset;
+  stringBatch->length[i] = static_cast<int64_t>(col.size());
+  offset += col.size();
+}
+  }
+  stringBatch->hasNulls = hasNull;
+  stringBatch->numElements = numValues;
+}
+
+void fillDoubleValues(const std::vector<std::string>& data,
+  orc::ColumnVectorBatch* batch,
+  uint64_t numValues,
+  uint64_t colIndex) {
+  orc::DoubleVectorBatch* dblBatch =
+dynamic_cast<orc::DoubleVectorBatch*>(batch);
+  bool hasNull = false;
+  for (uint64_t i = 0; i < numValues; ++i) {
+std::string col = extractColumn(data[i], colIndex);
+if (col.empty()) {
+  batch->notNull[i] = 0;
+  hasNull = true;
+} else {
+  batch->notNull[i] = 1;
+  dblBatch->data[i] = atof(col.c_str());
+}
+  }
+  dblBatch->hasNulls = hasNull;
+  dblBatch->numElements = numValues;
+}
+
+// parse fixed point decimal numbers
+void fillDecimalValues(const std::vector<std::string>& data,
+   orc::ColumnVectorBatch* batch,
+   uint64_t numValues,
+   

[GitHub] orc issue #196: ORC-270: fix target_link_libraries for tool-test

2017-12-02 Thread majetideepak
Github user majetideepak commented on the issue:

https://github.com/apache/orc/pull/196
  
Thanks for the patch!


---


[GitHub] orc pull request #195: ORC-269: cmake fails when PROTOBUF_HOME set and libhd...

2017-11-29 Thread majetideepak
GitHub user majetideepak opened a pull request:

https://github.com/apache/orc/pull/195

ORC-269: cmake fails when PROTOBUF_HOME set and libhdfs is built



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/majetideepak/orc ORC-269

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/orc/pull/195.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #195


commit b4fecf9bdd84cd926d6af8e65903f876ebb3b11f
Author: Deepak Majeti <deepak.maj...@hpe.com>
Date:   2017-11-30T01:46:36Z

ORC-269: cmake fails when PROTOBUF_HOME set and libhdfs is built




---


[GitHub] orc pull request #191: ORC-265: [C++] Add documentation for C++ build suppor...

2017-11-28 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/191#discussion_r153647593
  
--- Diff: site/_docs/building.md ---
@@ -70,3 +70,25 @@ To build:
 % mvn package
 ~~~
 
+## Building just C++
+
+~~~ shell
+% mkdir build
+% cd build
+% cmake .. -DBUILD_JAVA=OFF
+% make package test-out
+~~~
+
+## Specify third-party libraries for C++ build
+
+~~~ shell
+% mkdir build
+% cd build
+% GTEST_HOME= \
+  SNAPPY_HOME= \
+  ZLIB_HOME= \
+  LZ4_HOME= \
+  PROTOBUF_HOME= \
+  cmake  .. -DBUILD_JAVA=OFF
+% make package test-out
--- End diff --

That is correct. The recent changes made did put the dependency on CMake 
variables. Fixed the comments.


---


[GitHub] orc pull request #191: ORC-265: [C++] Add documentation for C++ build suppor...

2017-11-28 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/191#discussion_r153607648
  
--- Diff: site/_docs/building.md ---
@@ -70,3 +70,25 @@ To build:
 % mvn package
 ~~~
 
+## Building just C++
+
+~~~ shell
+% mkdir build
+% cd build
+% cmake .. -DBUILD_JAVA=OFF
+% make package test-out
+~~~
+
+## Specify third-party libraries for C++ build
+
+~~~ shell
+% mkdir build
+% cd build
+% GTEST_HOME= \
+  SNAPPY_HOME= \
+  ZLIB_HOME= \
+  LZ4_HOME= \
+  PROTOBUF_HOME= \
+  cmake  .. -DBUILD_JAVA=OFF
+% make package test-out
--- End diff --

@wgtmac  That is correct. We manually set it in travis-ci testing as well.
 ``cmake -DOPENSSL_ROOT_DIR=`brew --prefix openssl` ..``
I will check with Anatoli Shein whether we can fix this or add documentation 
here. Thanks!


---


[GitHub] orc pull request #192: Cleanup cmake scripts

2017-11-22 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/192#discussion_r152657567
  
--- Diff: cmake_modules/ThirdpartyToolchain.cmake ---
@@ -41,32 +68,35 @@ if (NOT SNAPPY_FOUND)
 LOG_BUILD 1
 LOG_INSTALL 1
 BUILD_BYPRODUCTS "${SNAPPY_STATIC_LIB}")
+
+  set (SNAPPY_VENDORED TRUE)
 endif ()
-include_directories (SYSTEM ${SNAPPY_INCLUDE_DIRS})
+
+include_directories (SYSTEM ${SNAPPY_INCLUDE_DIR})
 add_library (snappy STATIC IMPORTED)
 set_target_properties (snappy PROPERTIES IMPORTED_LOCATION 
${SNAPPY_STATIC_LIB})
-set (SNAPPY_LIBRARIES snappy)
-add_dependencies (snappy snappy_ep)
-install(DIRECTORY ${SNAPPY_PREFIX}/lib DESTINATION .
--- End diff --

AFAIK `cpack` depends on `install` as well to make a package. So I would 
vote for your other option of using `INSTALL_THIRDPARTY_LIBS=on`.


---


[GitHub] orc pull request #192: Cleanup cmake scripts

2017-11-22 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/192#discussion_r152654881
  
--- Diff: cmake_modules/ThirdpartyToolchain.cmake ---
@@ -10,19 +10,46 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
+set (LZ4_VERSION "1.7.5")
+set (SNAPPY_VERSION "1.1.4")
+set (ZLIB_VERSION "1.2.11")
+set (GTEST_VERSION "1.8.0")
+set (PROTOBUF_VERSION "2.6.0")
+
 set (THIRDPARTY_DIR "${CMAKE_BINARY_DIR}/c++/libs/thirdparty")
 
 string(TOUPPER ${CMAKE_BUILD_TYPE} UPPERCASE_BUILD_TYPE)
 
+if (DEFINED ENV{SNAPPY_HOME})
+  set (SNAPPY_HOME "$ENV{SNAPPY_HOME}")
+endif ()
+
+if (DEFINED ENV{ZLIB_HOME})
+  set (ZLIB_HOME "$ENV{ZLIB_HOME}")
+endif ()
+
+if (DEFINED ENV{LZ4_HOME})
+  set (LZ4_HOME "$ENV{LZ4_HOME}")
+endif ()
+
+if (DEFINED ENV{PROTOBUF_HOME})
+  set (PROTOBUF_HOME "$ENV{PROTOBUF_HOME}")
+endif ()
+
+if (DEFINED ENV{GTEST_HOME})
+  set (GTEST_HOME "$ENV{GTEST_HOME}")
+endif ()
+
 # --
 # Snappy
 
-set (SNAPPY_HOME "$ENV{SNAPPY_HOME}")
-find_package (Snappy)
-if (NOT SNAPPY_FOUND)
+if (NOT "${SNAPPY_HOME}" STREQUAL "")
+  find_package (Snappy REQUIRED)
+  set(SNAPPY_VENDORED FALSE)
+else ()
--- End diff --

I would like the library to be vendored in the case where `SNAPPY_HOME` is 
set, but `SNAPPY_FOUND` is inferred as `false`.


---


[GitHub] orc pull request #191: ORC-265: [C++] Add documentation for C++ build suppor...

2017-11-15 Thread majetideepak
GitHub user majetideepak opened a pull request:

https://github.com/apache/orc/pull/191

ORC-265: [C++] Add documentation for C++ build support



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/majetideepak/orc ORC-265

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/orc/pull/191.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #191


commit f13da19acdbf27a9310578b97ec133d11f0acc28
Author: Deepak Majeti <deepak.maj...@hpe.com>
Date:   2017-11-15T22:56:56Z

ORC-265: [C++] Add documentation for C++ build support




---


[GitHub] orc issue #170: ORC-207: [C++] Enable users the ability to provide their own...

2017-11-07 Thread majetideepak
Github user majetideepak commented on the issue:

https://github.com/apache/orc/pull/170
  
@jcrist  Will review that PR. Thanks!


---


[GitHub] orc issue #149: ORC-224: Implement column writers of primitive types

2017-11-06 Thread majetideepak
Github user majetideepak commented on the issue:

https://github.com/apache/orc/pull/149
  
@wgtmac Sorry for the delay. I was tied up last week. I will definitely 
work on this today/tomorrow.


---


[GitHub] orc issue #183: ORC-258: [C++] Incorrect Decimal constructor

2017-10-31 Thread majetideepak
Github user majetideepak commented on the issue:

https://github.com/apache/orc/pull/183
  
@wgtmac will do. I will make another pass today and merge it. Thanks!


---


[GitHub] orc issue #170: ORC-207: [C++] Enable users the ability to provide their own...

2017-09-26 Thread majetideepak
Github user majetideepak commented on the issue:

https://github.com/apache/orc/pull/170
  
In this PR, `FindProtobuf.cmake` checks if the specified path 
`PROTOBUF_HOME` contains the required headers and sets `PROTOBUF_INCLUDE_DIRS`, 
checks for the library and sets `PROTOBUF_LIBRARIES`, and checks for the 
executable and sets `PROTOBUF_EXECUTABLE`. If even one of them is missing from 
the user-specified location, the protobuf library gets downloaded. In a future 
patch, we can extend the `FindX.cmake` modules to check default system paths 
(except for protobuf). But by default, we should only download.


---


[GitHub] orc issue #170: ORC-207: [C++] Enable users the ability to provide their own...

2017-09-26 Thread majetideepak
Github user majetideepak commented on the issue:

https://github.com/apache/orc/pull/170
  
I did not understand the first point. `PROTOBUF_HOME` must be explicitly 
set to use the user specified version. Otherwise, the library gets downloaded.

Will remove the travis-ci test.
Thanks!


---


[GitHub] orc pull request #170: ORC-207: [C++] Enable users the ability to provide th...

2017-09-19 Thread majetideepak
GitHub user majetideepak opened a pull request:

https://github.com/apache/orc/pull/170

ORC-207: [C++] Enable users the ability to provide their own thirdparty 
libraries



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/majetideepak/orc ORC-207

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/orc/pull/170.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #170


commit aba65fc12641cf30098c850bf29ca7032023e66a
Author: Deepak Majeti <deepak.maj...@hpe.com>
Date:   2017-09-19T18:37:29Z

ORC-207: Enable users the ability to provide third-party libraries




---


[GitHub] orc issue #168: Install missing Statistics.hh header file

2017-09-19 Thread majetideepak
Github user majetideepak commented on the issue:

https://github.com/apache/orc/pull/168
  
Can you file a JIRA and append the JIRA number to the PR? Thanks.


---


[GitHub] orc issue #151: ORC-226 Support getWriterId in c++ reader interface

2017-09-13 Thread majetideepak
Github user majetideepak commented on the issue:

https://github.com/apache/orc/pull/151
  
@xndai  can you please squash your commits? 


---


[GitHub] orc pull request #149: ORC-224: Implement column writers of primitive types

2017-08-21 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/149#discussion_r134319378
  
--- Diff: c++/test/TestWriter.cc ---
@@ -209,5 +209,612 @@ namespace orc {
 }
 EXPECT_FALSE(rowReader->next(*batch));
   }
-}
 
+  TEST(Writer, writeStringAndBinaryColumn) {
--- End diff --

See google typed tests. Those should help.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] orc pull request #149: ORC-224: Implement column writers of primitive types

2017-08-21 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/149#discussion_r134308560
  
--- Diff: c++/src/ColumnWriter.cc ---
@@ -468,25 +472,1099 @@ namespace orc {
 rleEncoder->recordPosition(rowIndexPosition.get());
   }
 
-  std::unique_ptr buildWriter(
-const Type& type,
-const StreamsFactory& factory,
-const WriterOptions& options) {
-switch (static_cast(type.getKind())) {
-  case STRUCT:
-return std::unique_ptr(
-  new StructColumnWriter(
- type,
- factory,
- options));
-  case INT:
-  case LONG:
-  case SHORT:
-return std::unique_ptr(
-  new IntegerColumnWriter(
-  type,
-  factory,
-  options));
+  class ByteColumnWriter : public ColumnWriter {
+  public:
+ByteColumnWriter(const Type& type,
+ const StreamsFactory& factory,
+ const WriterOptions& options);
+
+virtual void add(ColumnVectorBatch& rowBatch,
+ uint64_t offset,
+ uint64_t numValues) override;
+
+virtual void flush(std::vector& streams) override;
+
+virtual uint64_t getEstimatedSize() const override;
+
+virtual void getColumnEncoding(
+std::vector& encodings) const override;
+
+virtual void recordPosition() const override;
+
+  private:
+std::unique_ptr byteRleEncoder;
+  };
+
+  ByteColumnWriter::ByteColumnWriter(
+const Type& type,
+const StreamsFactory& factory,
+const WriterOptions& options) :
+ ColumnWriter(type, factory, options) {
+std::unique_ptr dataStream =
+  factory.createStream(proto::Stream_Kind_DATA);
+byteRleEncoder = createByteRleEncoder(std::move(dataStream));
+
+if (enableIndex) {
+  recordPosition();
+}
+  }
+
+  void ByteColumnWriter::add(ColumnVectorBatch& rowBatch,
+ uint64_t offset,
+ uint64_t numValues) {
+ColumnWriter::add(rowBatch, offset, numValues);
+
+LongVectorBatch& byteBatch =
+   dynamic_cast<LongVectorBatch&>(rowBatch);
+
+int64_t* data = byteBatch.data.data() + offset;
+const char* notNull = byteBatch.hasNulls ?
+  byteBatch.notNull.data() + offset : nullptr;
+
+char* byteData = reinterpret_cast<char*>(data);
+for (uint64_t i = 0; i < numValues; ++i) {
+  byteData[i] = static_cast(data[i]);
+}
+byteRleEncoder->add(byteData, numValues, notNull);
+
+IntegerColumnStatisticsImpl* intStats =
+  dynamic_cast<IntegerColumnStatisticsImpl*>(colIndexStatistics.get());
+bool hasNull = false;
+for (uint64_t i = 0; i < numValues; ++i) {
+  if (notNull == nullptr || notNull[i]) {
+intStats->increase(1);
+intStats->update(static_cast(byteData[i]), 1);
+  } else if (!hasNull) {
+hasNull = true;
+  }
+}
+intStats->setHasNull(hasNull);
+  }
+
+  void ByteColumnWriter::flush(std::vector& streams) {
+ColumnWriter::flush(streams);
+
+proto::Stream stream;
+stream.set_kind(proto::Stream_Kind_DATA);
+stream.set_column(static_cast(columnId));
+stream.set_length(byteRleEncoder->flush());
+streams.push_back(stream);
+  }
+
+  uint64_t ByteColumnWriter::getEstimatedSize() const {
+uint64_t size = ColumnWriter::getEstimatedSize();
+size += byteRleEncoder->getBufferSize();
+return size;
+  }
+
+  void ByteColumnWriter::getColumnEncoding(
+std::vector& encodings) const {
+proto::ColumnEncoding encoding;
+encoding.set_kind(proto::ColumnEncoding_Kind_DIRECT);
+encoding.set_dictionarysize(0);
+encodings.push_back(encoding);
+  }
+
+  void ByteColumnWriter::recordPosition() const {
+ColumnWriter::recordPosition();
+byteRleEncoder->recordPosition(rowIndexPosition.get());
+  }
+
+  class BooleanColumnWriter : public ColumnWriter {
+  public:
+BooleanColumnWriter(const Type

[GitHub] orc pull request #149: ORC-224: Implement column writers of primitive types

2017-08-21 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/149#discussion_r134308863
  
--- Diff: c++/src/ColumnWriter.cc ---

[GitHub] orc pull request #149: ORC-224: Implement column writers of primitive types

2017-08-21 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/149#discussion_r134304957
  
--- Diff: c++/src/ColumnWriter.cc ---

[GitHub] orc pull request #149: ORC-224: Implement column writers of primitive types

2017-08-21 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/149#discussion_r134309488
  
--- Diff: c++/src/ColumnWriter.cc ---

[GitHub] orc pull request #149: ORC-224: Implement column writers of primitive types

2017-08-21 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/149#discussion_r134301225
  
--- Diff: c++/src/ColumnWriter.cc ---

[GitHub] orc pull request #149: ORC-224: Implement column writers of primitive types

2017-08-21 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/149#discussion_r134311129
  
--- Diff: c++/test/TestWriter.cc ---
@@ -209,5 +209,612 @@ namespace orc {
 }
 EXPECT_FALSE(rowReader->next(*batch));
   }
-}
 
+  TEST(Writer, writeStringAndBinaryColumn) {
--- End diff --

These tests can definitely be improved by using a templated test class.
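
One way to read this suggestion is sketched below. It is a minimal illustration 
in plain C++ and not code from the PR: `roundTrip` is a hypothetical stand-in 
for writing a column and reading it back, and `RoundTripTest` and 
`checkAllIntegerWidths` are names invented for the example. GoogleTest's typed 
tests (`TYPED_TEST`) offer the same idea with fixture registration.

```cpp
#include <cstdint>
#include <limits>
#include <vector>

// Stand-in for "write a column, then read it back". A real test would go
// through the ORC writer and reader; this helper only copies the values.
template <typename T>
std::vector<T> roundTrip(const std::vector<T>& values) {
  return std::vector<T>(values.begin(), values.end());
}

// One templated test class replaces a hand-written test per column type.
template <typename T>
class RoundTripTest {
 public:
  // Returns true when boundary values survive the round trip unchanged.
  static bool run() {
    const std::vector<T> input = {std::numeric_limits<T>::min(), T(0), T(1),
                                  std::numeric_limits<T>::max()};
    return roundTrip(input) == input;
  }
};

// Instantiate the same check for every element width of interest.
bool checkAllIntegerWidths() {
  return RoundTripTest<int8_t>::run() && RoundTripTest<int16_t>::run() &&
         RoundTripTest<int32_t>::run() && RoundTripTest<int64_t>::run();
}
```

Adding coverage for another element type then means adding one instantiation 
rather than copying a whole test body.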


---


[GitHub] orc pull request #149: ORC-224: Implement column writers of primitive types

2017-08-21 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/149#discussion_r134310213
  
--- Diff: c++/src/ColumnWriter.cc ---

[GitHub] orc issue #134: Orc 17

2017-08-16 Thread majetideepak
Github user majetideepak commented on the issue:

https://github.com/apache/orc/pull/134
  
Can you check whether `thread_local` is supported by the platform, and 
disable libhdfspp on platforms where it is not?


---


[GitHub] orc issue #134: Orc 17

2017-08-15 Thread majetideepak
Github user majetideepak commented on the issue:

https://github.com/apache/orc/pull/134
  
I am adding a Travis test for OS X. It should make OS X issues easier to catch.


---


[GitHub] orc issue #149: ORC-224: Implement column writers of primitive types

2017-08-15 Thread majetideepak
Github user majetideepak commented on the issue:

https://github.com/apache/orc/pull/149
  
@wgtmac Sorry this is taking longer than expected. I will definitely try to 
complete this as soon as possible.


---


[GitHub] orc pull request #155: ORC-227: Fix ExternalProject_Add

2017-08-14 Thread majetideepak
GitHub user majetideepak opened a pull request:

https://github.com/apache/orc/pull/155

ORC-227: Fix ExternalProject_Add

@omalley This commit was left out of the previous PR.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/majetideepak/orc ORC-227

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/orc/pull/155.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #155


commit bceaecd8d0195e9ef27adbdf41515f1900730652
Author: Deepak Majeti <deepak.maj...@hpe.com>
Date:   2017-08-10T14:37:32Z

Fix ExternalProject_Add




---


[GitHub] orc pull request #152: ORC-227: [C++] Fix docker failure due to ExternalProj...

2017-08-10 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/152#discussion_r132529457
  
--- Diff: c++/src/ByteRLE.cc ---
@@ -26,9 +26,9 @@
 
 namespace orc {
 
-  const size_t MINIMUM_REPEAT = 3;
-  const size_t MAXIMUM_REPEAT = 127 + MINIMUM_REPEAT;
-  const size_t MAX_LITERAL_SIZE = 128;
+  const int MINIMUM_REPEAT = 3;
+  const int MAXIMUM_REPEAT = 127 + MINIMUM_REPEAT;
+  const int MAX_LITERAL_SIZE = 128;
--- End diff --

I tried the CentOS 6 Docker image. The warnings seem to be correct.
```
cc1plus: warnings being treated as errors
/root/orc/c++/src/ByteRLE.cc: In member function ‘void orc::ByteRleEncoderImpl::write(char)’:
/root/orc/c++/src/ByteRLE.cc:152: error: comparison between signed and unsigned integer expressions
/root/orc/c++/src/ByteRLE.cc:166: error: comparison between signed and unsigned integer expressions
/root/orc/c++/src/ByteRLE.cc:167: error: comparison between signed and unsigned integer expressions
/root/orc/c++/src/ByteRLE.cc:179: error: comparison between signed and unsigned integer expressions
make[2]: *** [c++/src/CMakeFiles/orc.dir/ByteRLE.cc.o] Error 1
make[1]: *** [c++/src/CMakeFiles/orc.dir/all] Error 2
```
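
The compiler output above comes from comparing a signed value with a `size_t` 
constant. A minimal sketch of why that is reported, and why `int` constants 
avoid it, follows. The function and constant names are hypothetical and are not 
taken from `ByteRLE.cc`; the explicit cast spells out the conversion the 
compiler would otherwise apply implicitly (GCC reports the implicit form under 
`-Wsign-compare`).

```cpp
#include <cstddef>

// Same value as the constant in the diff above, in both signedness flavors.
// Both names are made up for this sketch.
const std::size_t MINIMUM_REPEAT_UNSIGNED = 3;
const int MINIMUM_REPEAT_SIGNED = 3;

// Mixed comparison: `count` is converted to size_t first, so a negative
// count wraps around to a very large positive value.
bool reachedMinimumMixed(int count) {
  return static_cast<std::size_t>(count) >= MINIMUM_REPEAT_UNSIGNED;
}

// Signed comparison: both operands are int, so no conversion takes place
// and negative counts compare as expected.
bool reachedMinimumSigned(int count) {
  return count >= MINIMUM_REPEAT_SIGNED;
}
```

For a count of `-1`, `reachedMinimumMixed` returns true while 
`reachedMinimumSigned` returns false. For non-negative counts the two agree.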


---


[GitHub] orc pull request #152: ORC-227: [C++] Fix docker failure due to ExternalProj...

2017-08-10 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/152#discussion_r132477368
  
--- Diff: c++/src/ByteRLE.cc ---
@@ -26,9 +26,9 @@
 
 namespace orc {
 
-  const size_t MINIMUM_REPEAT = 3;
-  const size_t MAXIMUM_REPEAT = 127 + MINIMUM_REPEAT;
-  const size_t MAX_LITERAL_SIZE = 128;
+  const int MINIMUM_REPEAT = 3;
+  const int MAXIMUM_REPEAT = 127 + MINIMUM_REPEAT;
+  const int MAX_LITERAL_SIZE = 128;
--- End diff --

@xndai @wgtmac These changes are required to avoid signed/unsigned 
comparison warnings.


---


[GitHub] orc issue #149: ORC-224: Implement column writers of primitive types

2017-08-10 Thread majetideepak
Github user majetideepak commented on the issue:

https://github.com/apache/orc/pull/149
  
@wgtmac I will take a look at this today/tomorrow. Thanks!


---


[GitHub] orc pull request #143: [C++] Remove gmock and protobuf libraries from source...

2017-07-27 Thread majetideepak
GitHub user majetideepak opened a pull request:

https://github.com/apache/orc/pull/143

[C++] Remove gmock and protobuf libraries from source and use 
ExternalProject instead

Similar to ORC-204.
Files modified:
```
c++/CMakeLists.txt
c++/src/CMakeLists.txt
c++/test/CMakeLists.txt
cmake_modules/ThirdpartyToolchain.cmake
tools/test/CMakeLists.txt
CMakeLists.txt
```

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/majetideepak/orc ORC-215

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/orc/pull/143.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #143


commit e92ea33aa33e9a2906941d3165476b836c09af92
Author: Deepak Majeti <deepak.maj...@hpe.com>
Date:   2017-07-27T16:55:40Z

ORC-215: Use ExternalProject_Add for gmock

commit 350e8648e4025498a0cb411825c9dcf2ace89994
Author: Deepak Majeti <deepak.maj...@hpe.com>
Date:   2017-07-27T18:17:26Z

use ExternalProject_Add for protobuf




---


[GitHub] orc issue #142: [ORC-218] Cache timezone information in the library.

2017-07-27 Thread majetideepak
Github user majetideepak commented on the issue:

https://github.com/apache/orc/pull/142
  
Please add another travis-ci test with EMBEDDED_TZ_DB=ON to ensure this 
change gets tested.


---


[GitHub] orc issue #135: ORC-204: Update and use CMake ExternalProject_Add to build c...

2017-07-22 Thread majetideepak
Github user majetideepak commented on the issue:

https://github.com/apache/orc/pull/135
  
Fixed the packaging of the static compression libraries. Snappy 1.1.6 is 
available only as source, and although the build works fine, I could not 
`install` this release. CMake support was introduced, but only shared 
libraries are being built. Snappy 1.1.5 and 1.1.6 are the same. There is no 
performance difference between the current 1.1.4 and 1.1.6.
https://github.com/google/snappy/releases


---


[GitHub] orc pull request #135: Update and use CMake External Project to build compre...

2017-07-10 Thread majetideepak
GitHub user majetideepak opened a pull request:

https://github.com/apache/orc/pull/135

Update and use CMake External Project to build compression libraries

Including the whole source of external libraries adds bloat.
It is also not useful if clients prefer to use their own third-party 
libraries.
In this PR:
1) The libraries are updated to their most recent versions.
2) CMake `ExternalProject_Add` is used to build the libraries.

Files to review:
```
cmake_modules/ThirdpartyToolchain.cmake
c++/libs/CMakeLists.txt
CMakeLists.txt
```
The rest of the changes just delete the bundled library sources.
I will file a JIRA to give users the ability to provide their own 
libraries.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/majetideepak/orc ORC-204

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/orc/pull/135.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #135


commit 0bc85be0a7ba60dc53ab402a5a278503cf98f97d
Author: Deepak Majeti <deepak.maj...@hpe.com>
Date:   2017-07-10T16:43:47Z

Update and use CMake External Project to build compression libraries




---


[GitHub] orc issue #134: Orc 17

2017-07-07 Thread majetideepak
Github user majetideepak commented on the issue:

https://github.com/apache/orc/pull/134
  
Can we add libhdfspp as a tarball and add a line to untar it?
I am going to make these changes for other libraries in the lib folder as 
well.


---


[GitHub] orc issue #128: ORC-178 Implement Basic C++ Writer and Writer Option

2017-06-15 Thread majetideepak
Github user majetideepak commented on the issue:

https://github.com/apache/orc/pull/128
  
@xndai Unfortunately, I don't have access to merge yet.


---


[GitHub] orc pull request #128: ORC-178 Implement Basic C++ Writer and Writer Option

2017-06-07 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/128#discussion_r120745645
  
--- Diff: c++/src/Writer.cc ---
@@ -0,0 +1,659 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "orc/Common.hh"
+#include "orc/OrcFile.hh"
+
+#include "ColumnWriter.hh"
+#include "Timezone.hh"
+
+#include 
+
+namespace orc {
+
+  struct WriterOptionsPrivate {
+uint64_t stripeSize;
+uint64_t blockSize;
+uint64_t rowIndexStride;
+uint64_t bufferSize;
+bool blockPadding;
+CompressionKind compression;
+EncodingStrategy encodingStrategy;
+CompressionStrategy compressionStrategy;
+MemoryPool* memoryPool;
+WriterVersion version;
+double paddingTolerance;
+std::ostream* errorStream;
+RleVersion rleVersion;
+double dictionaryKeySizeThreshold;
+bool enableStats;
+bool enableStrStatsCmp;
+bool enableIndex;
+const Timezone* timezone;
+
+WriterOptionsPrivate() {
+  stripeSize = 64 * 1024 * 1024; // 64M
+  blockSize = 256 * 1024; // 256K
+  rowIndexStride = 1;
+  bufferSize = 4 * 1024 * 1024; // 4M
+  blockPadding = false;
+  compression = CompressionKind_ZLIB;
+  encodingStrategy = EncodingStrategy_SPEED;
+  compressionStrategy = CompressionStrategy_SPEED;
+  memoryPool = getDefaultPool();
+  version = WriterVersion_ORC_135;
+  paddingTolerance = 0.0;
+  errorStream = &std::cerr;
+  rleVersion = RleVersion_1;
+  dictionaryKeySizeThreshold = 0.0;
+  enableStats = true;
+  enableStrStatsCmp = false;
+  enableIndex = true;
+  timezone = ();
+}
+  };
+
+  WriterOptions::WriterOptions():
+privateBits(std::unique_ptr<WriterOptionsPrivate>
+(new WriterOptionsPrivate())) {
+// PASS
+  }
+
+  WriterOptions::WriterOptions(const WriterOptions& rhs):
+privateBits(std::unique_ptr<WriterOptionsPrivate>
+(new WriterOptionsPrivate(*(rhs.privateBits.get())))) {
+// PASS
+  }
+
+  WriterOptions::WriterOptions(WriterOptions& rhs) {
+// swap privateBits with rhs
+WriterOptionsPrivate* l = privateBits.release();
+privateBits.reset(rhs.privateBits.release());
+rhs.privateBits.reset(l);
+  }
+
+  WriterOptions& WriterOptions::operator=(const WriterOptions& rhs) {
+if (this != &rhs) {
+  privateBits.reset(new WriterOptionsPrivate(*(rhs.privateBits.get())));
+}
+return *this;
+  }
+
+  WriterOptions::~WriterOptions() {
+// PASS
+  }
+
+  WriterOptions& WriterOptions::setStripeSize(uint64_t size) {
+privateBits->stripeSize = size;
+return *this;
+  }
+
+  uint64_t WriterOptions::getStripeSize() const {
+return privateBits->stripeSize;
+  }
+
+  WriterOptions& WriterOptions::setBlockSize(uint64_t size) {
+privateBits->blockSize = size;
+return *this;
+  }
+
+  uint64_t WriterOptions::getBlockSize() const {
+return privateBits->blockSize;
+  }
+
+  WriterOptions& WriterOptions::setRowIndexStride(uint64_t stride) {
+privateBits->rowIndexStride = stride;
+return *this;
+  }
+
+  uint64_t WriterOptions::getRowIndexStride() const {
+return privateBits->rowIndexStride;
+  }
+
+  WriterOptions& WriterOptions::setBufferSize(uint64_t size) {
+privateBits->bufferSize = size;
+return *this;
+  }
+
+  uint64_t WriterOptions::getBufferSize() const {
+return privateBits->bufferSize;
+  }
+
+  WriterOptions& WriterOptions::setDictionaryKeySizeThreshold(double val) {
+privateBits->dictionaryKeySizeThreshold = val;
+re
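
The copy constructor and copy assignment quoted above deep-copy a privately held options struct through a `std::unique_ptr`, so each `WriterOptions` object owns its own state. The following is a minimal standalone sketch of that idiom, not the ORC implementation: the names `Options`, `OptionsPrivate`, and `copyIsIndependent` are hypothetical, and only a single field is modelled.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// Hypothetical names for illustration; these are not part of the ORC API.
struct OptionsPrivate {
  uint64_t stripeSize = 64 * 1024 * 1024;  // mirrors the 64M default quoted above
};

class Options {
 public:
  Options() : privateBits(new OptionsPrivate()) {}

  // Deep copy: the new object gets its own OptionsPrivate instance.
  Options(const Options& rhs)
      : privateBits(new OptionsPrivate(*rhs.privateBits)) {}

  Options& operator=(const Options& rhs) {
    if (this != &rhs) {  // guard against self-assignment
      privateBits.reset(new OptionsPrivate(*rhs.privateBits));
    }
    return *this;
  }

  Options& setStripeSize(uint64_t size) {
    privateBits->stripeSize = size;
    return *this;
  }

  uint64_t getStripeSize() const { return privateBits->stripeSize; }

 private:
  std::unique_ptr<OptionsPrivate> privateBits;
};

// Returns true when a copy keeps its own value after the original changes.
bool copyIsIndependent() {
  Options a;
  a.setStripeSize(1024);
  Options b(a);           // deep copy of a's private state
  a.setStripeSize(2048);  // must not affect b
  assert(a.getStripeSize() == 2048);
  return b.getStripeSize() == 1024;
}
```

Without the `this != &rhs` guard, self-assignment would construct the copy from the current state before `reset` frees it, so the guard here mainly avoids a needless allocation.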

[GitHub] orc pull request #128: ORC-178 Implement Basic C++ Writer and Writer Option

2017-06-07 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/128#discussion_r120728928
  
--- Diff: c++/src/Writer.cc ---
@@ -0,0 +1,659 @@

[GitHub] orc pull request #128: ORC-178 Implement Basic C++ Writer and Writer Option

2017-06-07 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/128#discussion_r120729334
  
--- Diff: c++/include/orc/Writer.hh ---
@@ -0,0 +1,294 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_WRITER_HH
+#define ORC_WRITER_HH
+
+#include "orc/Common.hh"
+#include "orc/orc-config.hh"
+#include "orc/Type.hh"
+#include "orc/Vector.hh"
+
+#include 
+#include 
+#include 
+
+namespace orc {
+
+  // classes that hold data members so we can maintain binary compatibility
+  struct WriterOptionsPrivate;
+
+  enum EncodingStrategy {
+EncodingStrategy_SPEED = 0,
+EncodingStrategy_COMPRESSION
+  };
+
+  enum CompressionStrategy {
+CompressionStrategy_SPEED = 0,
+CompressionStrategy_COMPRESSION
+  };
+
+  enum RleVersion {
+RleVersion_1,
+RleVersion_2
+  };
+
+  class Timezone;
+
+  /**
+   * Options for creating a Writer.
+   */
+  class WriterOptions {
+  private:
+ORC_UNIQUE_PTR<WriterOptionsPrivate> privateBits;
+
+  public:
+WriterOptions();
+WriterOptions(const WriterOptions&);
+WriterOptions(WriterOptions&);
+WriterOptions& operator=(const WriterOptions&);
+virtual ~WriterOptions();
+
+/**
+ * Set the stripe size.
+ */
+WriterOptions& setStripeSize(uint64_t size);
+
+/**
+ * Get the stripe size.
+ * @return if not set, return default value.
+ */
+uint64_t getStripeSize() const;
+
+/**
+ * Set the block size.
+ */
+WriterOptions& setBlockSize(uint64_t size);
+
+/**
+ * Get the block size.
+ * @return if not set, return default value.
+ */
+uint64_t getBlockSize() const;
+
+/**
+ * Set row index stride.
+ */
+WriterOptions& setRowIndexStride(uint64_t stride);
+
+/**
+ * Get the index stride size.
+ * @return if not set, return default value.
+ */
+uint64_t getRowIndexStride() const;
+
+/**
+ * Set the buffer size.
+ */
+WriterOptions& setBufferSize(uint64_t size);
+
+/**
+ * Get the buffer size.
+ * @return if not set, return default value.
+ */
+uint64_t getBufferSize() const;
+
+/**
+ * Set the dictionary key size threshold.
+ * 0 to disable dictionary encoding.
+ * 1 to always enable dictionary encoding.
+ */
+WriterOptions& setDictionaryKeySizeThreshold(double val);
+
+/**
+ * Get the dictionary key size threshold.
+ */
+double getDictionaryKeySizeThreshold() const;
+
+/**
+ * Set whether or not to have block padding.
+ */
+WriterOptions& setBlockPadding(bool padding);
+
+/**
+ * Get whether or not to have block padding.
+ * @return if not set, return default value which is false.
+ */
+bool getBlockPadding() const;
+
+/**
+ * Set Run length encoding version
+ */
+WriterOptions& setRleVersion(RleVersion version);
+
+/**
+ * Get Run Length Encoding version
+ */
+RleVersion getRleVersion() const;
+
+/**
+ * Set compression kind.
+ */
+WriterOptions& setCompression(CompressionKind comp);
+
+/**
+ * Get the compression kind.
+ * @return if not set, return default value which is ZLIB.
+ */
+CompressionKind getCompression() const;
+
+/**
+ * Set the encoding strategy.
+ */
+WriterOptions& setEncodingStrategy(EncodingStrategy strategy);
+
+/**
+ * Get the encoding strategy.
+ * @return if 
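
Every setter in the header quoted above returns `WriterOptions&`, which allows calls to be chained, and the getters are documented to return a default when nothing was set. A small sketch of how such an interface is typically used follows. `SketchOptions` and `makeOptions` are hypothetical stand-ins rather than ORC code, and only two of the quoted options are modelled, with defaults taken from the quoted diff (64M stripe size, block padding off).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the WriterOptions interface quoted above.
class SketchOptions {
 public:
  // Each setter returns *this so that calls can be chained.
  SketchOptions& setStripeSize(uint64_t size) {
    stripeSize_ = size;
    return *this;
  }

  SketchOptions& setBlockPadding(bool padding) {
    blockPadding_ = padding;
    return *this;
  }

  // Getters return the default when the value was never set.
  uint64_t getStripeSize() const { return stripeSize_; }
  bool getBlockPadding() const { return blockPadding_; }

 private:
  uint64_t stripeSize_ = 64 * 1024 * 1024;  // 64M, as in the quoted Writer.cc
  bool blockPadding_ = false;               // documented default in the header
};

// Builds a configured object in a single chained expression.
SketchOptions makeOptions() {
  return SketchOptions()
      .setStripeSize(128 * 1024 * 1024)
      .setBlockPadding(true);
}
```

Chaining keeps the configuration in one expression, and options that are never mentioned simply retain their defaults.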

[GitHub] orc pull request #128: ORC-178 Implement Basic C++ Writer and Writer Option

2017-06-06 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/128#discussion_r120483146
  
--- Diff: c++/include/orc/Writer.hh ---
@@ -0,0 +1,294 @@
