[GitHub] orc issue #301: ORC-395: Support ZSTD in C++ writer/reader
Github user majetideepak commented on the issue: https://github.com/apache/orc/pull/301 @wgtmac thanks for these changes! I will take a look at both patches by end of tomorrow. ---
[GitHub] orc issue #301: ORC-395: Support ZSTD in C++ writer/reader
Github user majetideepak commented on the issue: https://github.com/apache/orc/pull/301 @xndai I am not very familiar with the Java side. Looking at the ZSTD support in parquet-mr, it looks like they only added ZSTD support but skipped the tests. The parquet-mr library has not upgraded the hadoop library either. https://github.com/apache/parquet-mr/blob/9fa86cca1af7dabc21701247efd89f6085945bd2/pom.xml#L80 https://github.com/apache/parquet-mr/commit/132b2a8c553bdcfd445e88680beac6f225c50ac4#diff-6a038e86a0fc009909af954b3589cd95R159 Can we do something like that here? ---
[GitHub] orc issue #301: ORC-395: Support ZSTD in C++ writer/reader
Github user majetideepak commented on the issue: https://github.com/apache/orc/pull/301 @wgtmac I will take a look at this today. Do you know what the current behavior is on the Java side with this compression format? ---
[GitHub] orc pull request #300: [ORC-394][C++] Add addUserMetadata() function to C++ ...
Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/300#discussion_r209222998 --- Diff: c++/test/TestWriter.cc --- @@ -1170,5 +1170,52 @@ namespace orc { } } + TEST_P(WriterTest, writeUserMetadata) { --- End diff -- can you add the user metadata check to an existing test? ---
[GitHub] orc pull request #296: [ORC-391][c++] parseType does not accept underscore i...
Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/296#discussion_r206153654 --- Diff: tools/test/TestCSVFileImport.cc --- @@ -53,3 +53,33 @@ TEST (TestCSVFileImport, test10rows) {
   EXPECT_EQ(expected, output);
   EXPECT_EQ("", error);
 }
+
+TEST (TestCSVFileImport, test10rows_underscore) {
+  // create an ORC file from importing the CSV file
+  const std::string pgm1 = findProgram("tools/src/csv-import");
+  const std::string csvFile = findExample("TestCSVFileImport.test10rows.csv");
+  const std::string orcFile = "/tmp/test_csv_import_test_10_rows.orc";
+  const std::string schema = "struct<_a:bigint,b_:string,c:double>";
--- End diff -- Thanks for the test! Can you rename `c` to `c_col`? ---
[GitHub] orc issue #293: ORC-388: Fix isSafeSubtract to use logic operator instead of...
Github user majetideepak commented on the issue: https://github.com/apache/orc/pull/293 +1 LGTM. The logical operator also helps with short-circuit evaluation. ---
[GitHub] orc issue #296: [c++] column/field name can take underline
Github user majetideepak commented on the issue: https://github.com/apache/orc/pull/296 JIRA name and a test please! ---
[GitHub] orc issue #285: ORC-371 Disable libhdfspp build if dependencies are missing
Github user majetideepak commented on the issue: https://github.com/apache/orc/pull/285 +1 LGTM. I will check this in by end of today! Thanks. ---
[GitHub] orc issue #289: ORC-384 fix memory leak when loading non-ORC files
Github user majetideepak commented on the issue: https://github.com/apache/orc/pull/289 +1 LGTM. I will check this in by end of today. ---
[GitHub] orc issue #275: ORC-371: [C++] Disable Libhdfspp build when Cyrus SASL is no...
Github user majetideepak commented on the issue: https://github.com/apache/orc/pull/275 Thanks for reporting this bug. I will push a patch to fix this. ---
[GitHub] orc issue #282: ORC-377: [c++] Add SnappyCompressionStream
Github user majetideepak commented on the issue: https://github.com/apache/orc/pull/282 Can you add some tests for this? ---
[GitHub] orc issue #273: ORC-343 Enable C++ writer to support RleV2
Github user majetideepak commented on the issue: https://github.com/apache/orc/pull/273 +1 LGTM ---
[GitHub] orc pull request #277: ORC-372: Enable valgrind for C++ travis-ci tests
Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/277#discussion_r193159979 --- Diff: c++/src/Compression.cc --- @@ -199,13 +199,16 @@ namespace orc { uint64_t blockSize, MemoryPool& pool); +~ZlibCompressionStream() { end(); } --- End diff -- Thanks! ---
[GitHub] orc pull request #277: ORC-372: Enable valgrind for C++ travis-ci tests
Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/277#discussion_r192919379 --- Diff: c++/src/TypeImpl.cc --- @@ -258,31 +258,34 @@ namespace orc {
    case STRUCT: {
      StructVectorBatch *result = new StructVectorBatch(capacity, memoryPool);
+     std::unique_ptr<ColumnVectorBatch> return_value = std::unique_ptr<ColumnVectorBatch>(result);
--- End diff -- we need `result` of type `StructVectorBatch` to access fields on line 263 ---
[GitHub] orc issue #277: ORC-372: Enable valgrind for C++ travis-ci tests
Github user majetideepak commented on the issue: https://github.com/apache/orc/pull/277 @wgtmac, @xndai can you take a look at this patch? ZLIB compression code seems to be leaking memory. Thanks! ---
[GitHub] orc pull request #273: ORC-343 Enable C++ writer to support RleV2
Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/273#discussion_r192729045 --- Diff: c++/test/TestRleEncoder.cc --- @@ -0,0 +1,243 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include
+
+#include "MemoryOutputStream.hh"
+#include "RLEv1.hh"
+
+#include "wrap/orc-proto-wrapper.hh"
+#include "wrap/gtest-wrapper.h"
+
+namespace orc {
+
+  const int DEFAULT_MEM_STREAM_SIZE = 1024 * 1024; // 1M
+
+  void generateData(
+      uint64_t numValues,
+      int64_t start,
+      int64_t delta,
+      bool random,
+      int64_t* data,
+      uint64_t numNulls = 0,
+      char* notNull = nullptr) {
+    if (numNulls != 0 && notNull != nullptr) {
+      memset(notNull, 1, numValues);
+      while (numNulls > 0) {
+        uint64_t pos = static_cast<uint64_t>(std::rand()) % numValues;
+        if (notNull[pos]) {
+          notNull[pos] = static_cast<char>(0);
+          --numNulls;
+        }
+      }
+    }
+
+    for (uint64_t i = 0; i < numValues; ++i) {
+      if (notNull == nullptr || notNull[i]) {
+        if (!random) {
+          data[i] = start + delta * static_cast<int64_t>(i);
+        } else {
+          data[i] = std::rand();
+        }
+      }
+    }
+  }
+
+  void decodeAndVerify(
+      RleVersion version,
+      const MemoryOutputStream& memStream,
+      int64_t* data,
+      uint64_t numValues,
+      const char* notNull,
+      bool isSigned) {
+    std::unique_ptr<RleDecoder> decoder = createRleDecoder(
+        std::unique_ptr<SeekableInputStream>(new SeekableArrayInputStream(
+            memStream.getData(),
+            memStream.getLength())),
+        isSigned, version, *getDefaultPool());
+
+    int64_t* decodedData = new int64_t[numValues];
+    decoder->next(decodedData, numValues, notNull);
+
+    for (uint64_t i = 0; i < numValues; ++i) {
+      if (!notNull || notNull[i]) {
+        EXPECT_EQ(data[i], decodedData[i]);
+      }
+    }
+
+    delete [] decodedData;
+  }
+
+  std::unique_ptr<RleEncoder> getEncoder(RleVersion version,
+                                         MemoryOutputStream& memStream,
+                                         bool isSigned) {
+    MemoryPool* pool = getDefaultPool();
+
+    return createRleEncoder(
+        std::unique_ptr<BufferedOutputStream>(
+            new BufferedOutputStream(*pool, , 500 * 1024, 1024)),
+        isSigned, version, *pool, true);
--- End diff -- can we template these tests for `alignedBitpacking = false`? ---
[GitHub] orc issue #277: ORC-372: Enable valgrind for C++ travis-ci tests
Github user majetideepak commented on the issue: https://github.com/apache/orc/pull/277 There are indeed valgrind failures. I will push a followup patch to fix these. ---
[GitHub] orc pull request #277: ORC-372: Enable valgrind for C++ travis-ci tests
GitHub user majetideepak opened a pull request: https://github.com/apache/orc/pull/277 ORC-372: Enable valgrind for C++ travis-ci tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/majetideepak/orc ORC-372 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/orc/pull/277.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #277 commit 621f6467ace049c90b3044e67c274f9d276b3a0d Author: Deepak Majeti Date: 2018-06-03T16:52:39Z ORC-372: Enable valgrind for C++ travis-ci tests ---
[GitHub] orc issue #273: ORC-343 Enable C++ writer to support RleV2
Github user majetideepak commented on the issue: https://github.com/apache/orc/pull/273 The PR looks overall good to me apart from a minor change requested. This is an important patch to align the C++ and Java implementations. Thanks again for working on this! ---
[GitHub] orc pull request #273: ORC-343 Enable C++ writer to support RleV2
Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/273#discussion_r192593861 --- Diff: c++/src/RLEv2.hh --- @@ -25,13 +25,89 @@
 #include
+#define MIN_REPEAT 3
+#define HIST_LEN 32
 namespace orc {
-class RleDecoderV2 : public RleDecoder {
+struct FixedBitSizes {
+  enum FBS {
+    ONE = 0, TWO, THREE, FOUR, FIVE, SIX, SEVEN, EIGHT, NINE, TEN, ELEVEN, TWELVE,
+    THIRTEEN, FOURTEEN, FIFTEEN, SIXTEEN, SEVENTEEN, EIGHTEEN, NINETEEN,
+    TWENTY, TWENTYONE, TWENTYTWO, TWENTYTHREE, TWENTYFOUR, TWENTYSIX,
+    TWENTYEIGHT, THIRTY, THIRTYTWO, FORTY, FORTYEIGHT, FIFTYSIX, SIXTYFOUR, SIZE
+  };
+};
+
+enum EncodingType { SHORT_REPEAT=0, DIRECT=1, PATCHED_BASE=2, DELTA=3 };
+
+struct EncodingOption {
+  EncodingType encoding;
+  int64_t fixedDelta;
+  int64_t gapVsPatchListCount;
+  int64_t zigzagLiteralsCount;
+  int64_t baseRedLiteralsCount;
+  int64_t adjDeltasCount;
+  uint32_t zzBits90p;
+  uint32_t zzBits100p;
+  uint32_t brBits95p;
+  uint32_t brBits100p;
+  uint32_t bitsDeltaMax;
+  uint32_t patchWidth;
+  uint32_t patchGapWidth;
+  uint32_t patchLength;
+  int64_t min;
+  bool isFixedDelta;
+};
+
+class RleEncoderV2 : public RleEncoder {
+public:
+  RleEncoderV2(std::unique_ptr<BufferedOutputStream> outStream, bool hasSigned, bool alignBitPacking = true);
--- End diff -- `alignedBitPacking` is always true. Should we add a WriterOption to enable/disable it? Java uses the Encoding Strategy to choose this. C++ currently does not have this.
```
java/core/src/java/org/apache/orc/impl/writer/TreeWriterBase.java:144
if (writer.getEncodingStrategy().equals(OrcFile.EncodingStrategy.SPEED)) {
  alignedBitpacking = true;
}
```
---
[GitHub] orc issue #275: ORC-371: [C++] Disable Libhdfspp build when Cyrus SASL is no...
Github user majetideepak commented on the issue: https://github.com/apache/orc/pull/275 Since there is some investigation needed here, I am going to merge this patch. We can enable `NO_SASL` build in a later patch. Right now, this is causing a build failure by default. ---
[GitHub] orc pull request #275: ORC-371: [C++] Disable Libhdfspp build when Cyrus SAS...
GitHub user majetideepak opened a pull request: https://github.com/apache/orc/pull/275 ORC-371: [C++] Disable Libhdfspp build when Cyrus SASL is not found You can merge this pull request into a Git repository by running: $ git pull https://github.com/majetideepak/orc ORC-371 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/orc/pull/275.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #275 commit 65c537767474708a91cf5a15cae45acf4fcea552 Author: Deepak Majeti Date: 2018-05-30T21:46:08Z ORC-371: [C++] Disable Libhdfspp build when Cyrus SASL is not found ---
[GitHub] orc pull request #273: ORC-343 Enable C++ writer to support RleV2
Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/273#discussion_r191832950 --- Diff: c++/src/Writer.cc --- @@ -122,9 +127,17 @@ namespace orc {
   }
   WriterOptions& WriterOptions::setFileVersion(const FileVersion& version) {
-    // Only Hive_0_11 version is supported currently
-    if (version.getMajor() == 0 && version.getMinor() == 11) {
+    // Only Hive_0_11 and Hive_0_12 version are supported currently
+    if (version.getMajor() == 0 && (version.getMinor() == 11 || version.getMinor() == 12)) {
--- End diff -- My suggestion is to use this logic to implement `WriterOptions::getRleVersion()`. ---
[GitHub] orc pull request #273: ORC-343 Enable C++ writer to support RleV2
Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/273#discussion_r191830703 --- Diff: c++/src/Writer.cc --- @@ -38,9 +38,10 @@ namespace orc { FileVersion fileVersion; double dictionaryKeySizeThreshold; bool enableIndex; +RleVersion rleVersion; --- End diff -- To be clear, do we need this `RleVersion` here and in the `WriterOptions`? ---
[GitHub] orc pull request #273: ORC-343 Enable C++ writer to support RleV2
Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/273#discussion_r191829402 --- Diff: c++/src/Writer.cc --- @@ -38,9 +38,10 @@ namespace orc { FileVersion fileVersion; double dictionaryKeySizeThreshold; bool enableIndex; +RleVersion rleVersion; --- End diff -- I think the file version should determine the `RleVersion`. Refer `isNewWriteFormat` and `isDirectV2` on the Java side. ---
[GitHub] orc pull request #273: ORC-343 Enable C++ writer to support RleV2
Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/273#discussion_r191816749 --- Diff: c++/test/TestWriter.cc --- @@ -47,7 +47,6 @@ namespace orc { const Type& type, MemoryPool* memoryPool, OutputStream* stream, - RleVersion rleVersion, FileVersion version = FileVersion(0, 12)){ --- End diff -- `FileVersion::v_0_12()` ---
[GitHub] orc pull request #273: ORC-343 Enable C++ writer to support RleV2
Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/273#discussion_r191816344 --- Diff: c++/test/TestWriter.cc --- @@ -139,7 +136,6 @@ namespace orc { *type, pool, , - rleVersion, FileVersion(0, 11)); --- End diff -- `FileVersion::v_0_11()` ---
[GitHub] orc pull request #273: ORC-343 Enable C++ writer to support RleV2
Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/273#discussion_r191815756 --- Diff: c++/test/TestWriter.cc --- @@ -1174,5 +1170,5 @@ namespace orc { } } - INSTANTIATE_TEST_CASE_P(OrcTest, WriterTest, Values(RleVersion_1, RleVersion_2)); + INSTANTIATE_TEST_CASE_P(OrcTest, WriterTest, Values(FileVersion::v_0_11(), FileVersion::v_0_11())); --- End diff -- Should be `Values(FileVersion::v_0_11(), FileVersion::v_0_12()))` ---
[GitHub] orc pull request #273: ORC-343 Enable C++ writer to support RleV2
Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/273#discussion_r191615840 --- Diff: c++/src/RleEncoderV2.cc --- @@ -0,0 +1,768 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with option work for additional information
+ * regarding copyright ownership. The ASF licenses option file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use option file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "Adaptor.hh"
+#include "Compression.hh"
+#include "RLEv2.hh"
+#include "RLEV2Util.hh"
+
+#define MAX_LITERAL_SIZE 512
+#define MAX_SHORT_REPEAT_LENGTH 10
+
+namespace orc {
+
+/**
+ * Compute the bits required to represent pth percentile value
+ * @param data - array
+ * @param p - percentile value (=0.0 to =1.0)
+ * @return pth percentile bits
+ */
+uint32_t RleEncoderV2::percentileBits(int64_t* data, size_t offset, size_t length, double p, bool reuseHist) {
+  if ((p > 1.0) || (p <= 0.0)) {
+    throw InvalidArgument("Invalid p value: " + std::to_string(p));
+  }
+
+  if (!reuseHist) {
+    // histogram that store the encoded bit requirement for each values.
+    // maximum number of bits that can encoded is 32 (refer FixedBitSizes)
+    memset(histgram, 0, 32 * sizeof(int32_t));
--- End diff -- Good point! Should we just add `FixedBitSizes::SIZE` as another element and use it then? ---
[GitHub] orc pull request #273: ORC-343 Enable C++ writer to support RleV2
Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/273#discussion_r191291449 --- Diff: c++/src/RleEncoderV2.cc --- @@ -0,0 +1,768 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with option work for additional information
+ * regarding copyright ownership. The ASF licenses option file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use option file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "Adaptor.hh"
+#include "Compression.hh"
+#include "RLEv2.hh"
+#include "RLEV2Util.hh"
+
+#define MAX_LITERAL_SIZE 512
+#define MAX_SHORT_REPEAT_LENGTH 10
+
+namespace orc {
+
+/**
+ * Compute the bits required to represent pth percentile value
+ * @param data - array
+ * @param p - percentile value (=0.0 to =1.0)
+ * @return pth percentile bits
+ */
+uint32_t RleEncoderV2::percentileBits(int64_t* data, size_t offset, size_t length, double p, bool reuseHist) {
+  if ((p > 1.0) || (p <= 0.0)) {
+    throw InvalidArgument("Invalid p value: " + std::to_string(p));
+  }
+
+  if (!reuseHist) {
+    // histogram that store the encoded bit requirement for each values.
+    // maximum number of bits that can encoded is 32 (refer FixedBitSizes)
+    memset(histgram, 0, 32 * sizeof(int32_t));
--- End diff -- Use `FixedBitSizes::LAST` instead of 32? ---
[GitHub] orc pull request #273: ORC-343 Enable C++ writer to support RleV2
Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/273#discussion_r191291163 --- Diff: c++/src/RLEv2.hh --- @@ -25,13 +25,89 @@
 #include
+#define MIN_REPEAT 3
+#define HIST_LEN 32
 namespace orc {
-class RleDecoderV2 : public RleDecoder {
+struct FixedBitSizes {
+  enum FBS {
+    ONE = 0, TWO, THREE, FOUR, FIVE, SIX, SEVEN, EIGHT, NINE, TEN, ELEVEN, TWELVE,
+    THIRTEEN, FOURTEEN, FIFTEEN, SIXTEEN, SEVENTEEN, EIGHTEEN, NINETEEN,
+    TWENTY, TWENTYONE, TWENTYTWO, TWENTYTHREE, TWENTYFOUR, TWENTYSIX,
+    TWENTYEIGHT, THIRTY, THIRTYTWO, FORTY, FORTYEIGHT, FIFTYSIX, SIXTYFOUR
--- End diff -- can you add another element `LAST=SIXTYFOUR` towards the end? ---
[GitHub] orc pull request #273: ORC-343 Enable C++ writer to support RleV2
Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/273#discussion_r191275478 --- Diff: c++/src/Writer.cc --- @@ -38,9 +38,10 @@ namespace orc { FileVersion fileVersion; double dictionaryKeySizeThreshold; bool enableIndex; +RleVersion rleVersion; WriterOptionsPrivate() : -fileVersion(0, 11) { // default to Hive_0_11 +fileVersion(0, 12) { // default to Hive_0_12 --- End diff -- We should use the static constants proposed in PR https://github.com/apache/orc/pull/274 moving forward. ---
[GitHub] orc pull request #273: ORC-343 Enable C++ writer to support RleV2
Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/273#discussion_r191269083 --- Diff: c++/src/CMakeLists.txt --- @@ -179,15 +179,15 @@ set(SOURCE_FILES OrcFile.cc Reader.cc RLEv1.cc - RLEv2.cc + RleDecoderV2.cc + RleEncoderV2.cc --- End diff -- We split the Encoder and Decoder into two files for V2 and not for V1. Can we combine them into a single file for V2 as well? ---
[GitHub] orc pull request #273: ORC-343 Enable C++ writer to support RleV2
Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/273#discussion_r191271607 --- Diff: c++/src/RleDecoderV2.cc --- @@ -1,10 +1,10 @@ /** * Licensed to the Apache Software Foundation (ASF) under one * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file + * distributed with option work for additional information --- End diff -- The Apache license header must not change. ---
[GitHub] orc pull request #273: ORC-343 Enable C++ writer to support RleV2
Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/273#discussion_r191267866 --- Diff: c++/include/orc/Writer.hh --- @@ -164,6 +169,16 @@ namespace orc { */ std::ostream * getErrorStream() const; +/** + * Set the RLE version. + */ +WriterOptions& setRleVersion(RleVersion version); --- End diff -- `WriterOptions& setRleVersion(const RleVersion& version);` ---
[GitHub] orc issue #274: ORC-368:[C++] Reader must return default version 0.11 instea...
Github user majetideepak commented on the issue: https://github.com/apache/orc/pull/274 +1 LGTM ---
[GitHub] orc issue #273: ORC-343 Enable C++ writer to support RleV2
Github user majetideepak commented on the issue: https://github.com/apache/orc/pull/273 @xndai and @yuruiz thanks for contributing this code. I will take a look at this. ---
[GitHub] orc issue #265: ORC-334: [C++] Add AppVeyor support for integration on windo...
Github user majetideepak commented on the issue: https://github.com/apache/orc/pull/265 Added an INFRA ticket to do this here https://issues.apache.org/jira/browse/INFRA-16535 ---
[GitHub] orc issue #265: ORC-334: [C++] Add AppVeyor support for integration on windo...
Github user majetideepak commented on the issue: https://github.com/apache/orc/pull/265 +1 LGTM ---
[GitHub] orc pull request #265: ORC-334: [C++] Add AppVeyor support for integration o...
Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/265#discussion_r187922554 --- Diff: c++/src/Timezone.cc --- @@ -710,7 +710,11 @@ namespace orc { * Get the local timezone. */ const Timezone& getLocalTimezone() { +#ifdef _MSC_VER +return getTimezoneByName("UTC"); --- End diff -- I also think that it is better to leave the conversion to the client/customer. We should ideally change the conversion for Nix* systems but cannot due to backward compatibility. We should be able to converge both Java and C++ timestamps in ORC 2.0. For Windows, since this is the first official build support, we should be okay to use "UTC" and document this behavior. I will merge this PR end of today if there are no objections. ---
[GitHub] orc pull request #265: ORC-334: [C++] Add AppVeyor support for integration o...
Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/265#discussion_r186409327 --- Diff: c++/src/Timezone.cc --- @@ -710,7 +710,11 @@ namespace orc { * Get the local timezone. */ const Timezone& getLocalTimezone() { +#ifdef _MSC_VER +return getTimezoneByName("UTC"); --- End diff -- Can you comment on why we look at `UTC` for windows instead of `LOCAL_TIMEZONE`? ---
[GitHub] orc pull request #245: ORC-161: Proposal for new decimal encodings and stati...
Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/245#discussion_r181155751 --- Diff: site/_docs/encodings.md --- @@ -123,6 +127,41 @@
DIRECT_V2 | PRESENT   | Yes | Boolean RLE
          | DATA      | No  | Unbounded base 128 varints
          | SECONDARY | No  | Unsigned Integer RLE v2
+In ORC 2.0, DECIMAL and DECIMAL_V2 encodings are introduced and the scale
+stream is removed entirely, as all decimal values use the same scale.
+There are two different cases: precision <= 18 and precision > 18.
+
+### Decimal Encoding for precision <= 18
+
+When precision is no greater than 18, decimal values can be fully
+represented by 64-bit signed integers, which are stored in the DATA stream
+using signed integer RLE.
+
+Encoding   | Stream Kind | Optional | Contents
+:--------- | :---------- | :------- | :-------
+DECIMAL    | PRESENT     | Yes      | Boolean RLE
+           | DATA        | No       | Signed Integer RLE v1
+DECIMAL_V2 | PRESENT     | Yes      | Boolean RLE
+           | DATA        | No       | Signed Integer RLE v2
--- End diff -- @xndai Vertica is interested in getting RLE v2 for C++ as well. Do you think we can collaborate on getting this in quickly? ---
[GitHub] orc pull request #245: ORC-161: Proposal for new decimal encodings and stati...
Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/245#discussion_r181096194 --- Diff: site/_docs/file-tail.md --- @@ -249,12 +249,25 @@ For booleans, the statistics include the count of false and true values. } ``` -For decimals, the minimum, maximum, and sum are stored. +For decimals, the minimum, maximum, and sum are stored. In ORC 2.0, +string representation is deprecated and DecimalStatistics uses integers +which have better performance. ```message DecimalStatistics { optional string minimum = 1; optional string maximum = 2; optional string sum = 3; + message Int128 { + repeated sint64 highBits = 1; + repeated uint64 lowBits = 2; --- End diff -- shouldn't this be sint64 as well since we are using uint64 for the SECONDARY stream? ---
[GitHub] orc pull request #243: Update the site with more information about developin...
Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/243#discussion_r180608621 --- Diff: site/develop/committers.md --- @@ -0,0 +1,62 @@
+---
+layout: page
+title: Project Members
+---
+
+## Project Members
+
+{% comment %}
+please sort by Apache Id
+{% endcomment %}
+Name | Apache Id | Role
+:--- | :-------- | :---
+Aliaksei Sandryhaila | asandryh | PMC
+Chris Douglas | cdouglas | PMC
+Chinna Rao Lalam | chinnaraol | Committer
+Chaoyu Tang | ctang | Committer
+Carl Steinbach | cws | Committer
+Daniel Dai | daijy | Committer
+Deepak Majeti | mdeepak | PMC
+Eugene Koifman | ekoifman | PMC
+Gang Wu | gangwu | Committer
+Alan Gates | gates | PMC
+Gopal Vijayaraghavan | gopalv | PMC
+Gunther Hagleitner | gunther | Committer
+Ashutosh Chauhan | hashutosh | Committer
+Jesus Camacho Rodriguez | jcamacho | Committer
+Jason Dere | jdere | Committer
+Jimmy Xiang | jxiang | Committer
+Kevin Wilfong | kevinwilfong | Committer
+Lars Francke | larsfrancke | Committer
+Lefty Leverenz | leftyl | PMC
+Rui Li | lirui | Committer
+Mithun Radhakrishnan | mithun | Committer
+Matthew McCline | mmccline | Committer
+Naveen Gangam | ngangam | Committer
+Owen O'Malley | omalley | PMC
+Prasanth Jayachandran | prasanthj | PMC
+Pengcheng Xiong | pxiong | Committer
+Rajesh Balamohan | rbalamohan | Committer
+Sergey Shelukhin | sershe | Committer
+Sergio Pena | spena | Committer
+Siddharth Seth | sseth | Committer
+Stephen Walkauskas | swalkaus | Committer
+Vaibhav Gumashta | vgumashta | Committer
+Wei Zheng | weiz | Committer
+Xiening Dai | xndai | Committer
+Xuefu Zhang | xuefu | Committer
+Ferdinand Xu | xuf | Committer
+Yongzhi Chen | ychena | Committer
+Aihua Xu | zihuaxu | Committer
+
+Companies with employees that are committers:
+
+* Alibaba
+* Cloudera
+* Facebook
+* Hewlett Packard Enterprise
--- End diff -- Can we add Vertica instead of HPE? ---
[GitHub] orc issue #233: ORC-322: [C++] Fix writing & reading timestamp
Github user majetideepak commented on the issue: https://github.com/apache/orc/pull/233 You are right about the last commit. It is redundant! I will remove that commit and merge this. Thanks! ---
[GitHub] orc issue #233: ORC-322: [C++] Fix writing & reading timestamp
Github user majetideepak commented on the issue: https://github.com/apache/orc/pull/233 Can we extend this by adding a `WriterOption` to provide a timezone name (default to "GMT")? ---
[GitHub] orc issue #233: ORC-322: [C++] Fix writing & reading timestamp
Github user majetideepak commented on the issue: https://github.com/apache/orc/pull/233 I also think that we should adjust the value being read from the C++ reader w.r.t. the reader and writer timezones (if they are different), like the Java reader implementation does. The current C++ behavior is definitely inconsistent with Java (which does things the right way). ---
[GitHub] orc issue #233: ORC-322: [C++] Fix writing & reading timestamp
Github user majetideepak commented on the issue: https://github.com/apache/orc/pull/233 @wgtmac and @stiga-huang you are right that the C++ and Java writers must write the same value to a file for a given input timestamp value. It looks like the Java side writes the timestamp values provided as-is in local time (no conversion) and records the writer timezone in the footer (however, stats are in UTC). We must do the same for the C++ writer, if we do not already. ORC-10 adds the GMT offset when reading the values back; therefore, the C++ reader always returns values in UTC. The current behavior of the ORC reader for timestamp values is the same as SQL `TimestampTz`. To get the same values back (aka SQL `Timestamp`), you need to convert the values read back to local time. If you read a timestamp column from an ORC file and plan to write it back immediately, you must first convert the values to local time before writing. ---
[GitHub] orc issue #225: ORC-313,ORC-317: Check types in footer
Github user majetideepak commented on the issue: https://github.com/apache/orc/pull/225 @stiga-huang I will look into these this week. Can you fix the title of this PR? Thanks! ---
[GitHub] orc issue #225: ORC-313,ORC-317: Check types in footer
Github user majetideepak commented on the issue: https://github.com/apache/orc/pull/225 @stiga-huang can you open a separate PR for each JIRA? Thanks. ---
[GitHub] orc issue #216: ORC-284: [C++] add missing tests for C++ tools
Github user majetideepak commented on the issue: https://github.com/apache/orc/pull/216 +1 LGTM. Thanks @wgtmac ! ---
[GitHub] orc issue #199: ORC-276: [C++] Create a simple tool to import CSV files
Github user majetideepak commented on the issue: https://github.com/apache/orc/pull/199 Looks good! Thank you! ---
[GitHub] orc issue #199: ORC-276: [C++] Create a simple tool to import CSV files
Github user majetideepak commented on the issue: https://github.com/apache/orc/pull/199 I just realized we need to add tests for each tool as well. `/tools/test` has some examples. Some of the tools are missing tests as well. I will file a JIRA to cover those. Sorry for not noticing this earlier. ---
[GitHub] orc pull request #199: ORC-276: [C++] Create a simple tool to import CSV fil...
Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/199#discussion_r158422184

--- Diff: tools/src/CSVFileImport.cc ---
@@ -0,0 +1,476 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "orc/Exceptions.hh"
+#include "orc/OrcFile.hh"
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+static char gDelimiter = ',';
+
+// extract one column raw text from one line
+std::string extractColumn(std::string s, uint64_t colIndex) {
+  uint64_t col = 0;
+  size_t start = 0;
+  size_t end = s.find(gDelimiter);
+  while (col < colIndex && end != std::string::npos) {
+    start = end + 1;
+    end = s.find(gDelimiter, start);
+    ++col;
+  }
+  return col == colIndex ? s.substr(start, end - start) : "";
+}
+
+static const char* GetDate(void) {
+  static char buf[200];
+  time_t t = time(NULL);
+  struct tm* p = localtime(&t);
+  strftime(buf, sizeof(buf), "[%Y-%m-%d %H:%M:%S]", p);
+  return buf;
+}
+
+void fillLongValues(const std::vector<std::string>& data,
+                    orc::ColumnVectorBatch* batch,
+                    uint64_t numValues,
+                    uint64_t colIndex) {
+  orc::LongVectorBatch* longBatch =
+    dynamic_cast<orc::LongVectorBatch*>(batch);
+  bool hasNull = false;
+  for (uint64_t i = 0; i < numValues; ++i) {
+    std::string col = extractColumn(data[i], colIndex);
+    if (col.empty()) {
+      batch->notNull[i] = 0;
+      hasNull = true;
+    } else {
+      batch->notNull[i] = 1;
+      longBatch->data[i] = atoll(col.c_str());
+    }
+  }
+  longBatch->hasNulls = hasNull;
+  longBatch->numElements = numValues;
+}
+
+void fillStringValues(const std::vector<std::string>& data,
+                      orc::ColumnVectorBatch* batch,
+                      uint64_t numValues,
+                      uint64_t colIndex,
+                      orc::DataBuffer<char>& buffer,
+                      uint64_t& offset) {
+  orc::StringVectorBatch* stringBatch =
+    dynamic_cast<orc::StringVectorBatch*>(batch);
+  bool hasNull = false;
+  for (uint64_t i = 0; i < numValues; ++i) {
+    std::string col = extractColumn(data[i], colIndex);
+    if (col.empty()) {
+      batch->notNull[i] = 0;
+      hasNull = true;
+    } else {
+      batch->notNull[i] = 1;
+      if (buffer.size() - offset < col.size()) {
+        buffer.reserve(buffer.size() * 2);
+      }
+      memcpy(buffer.data() + offset,
+             col.c_str(),
+             col.size());
+      stringBatch->data[i] = buffer.data() + offset;
+      stringBatch->length[i] = static_cast<int64_t>(col.size());
+      offset += col.size();
+    }
+  }
+  stringBatch->hasNulls = hasNull;
+  stringBatch->numElements = numValues;
+}
+
+void fillDoubleValues(const std::vector<std::string>& data,
+                      orc::ColumnVectorBatch* batch,
+                      uint64_t numValues,
+                      uint64_t colIndex) {
+  orc::DoubleVectorBatch* dblBatch =
+    dynamic_cast<orc::DoubleVectorBatch*>(batch);
+  bool hasNull = false;
+  for (uint64_t i = 0; i < numValues; ++i) {
+    std::string col = extractColumn(data[i], colIndex);
+    if (col.empty()) {
+      batch->notNull[i] = 0;
+      hasNull = true;
+    } else {
+      batch->notNull[i] = 1;
+      dblBatch->data[i] = atof(col.c_str());
+    }
+  }
+  dblBatch->hasNulls = hasNull;
+  dblBatch->numElements = numValues;
+}
+
+// parse fixed point decimal numbers
+void fillDecimalValues(const std::vector<std::string>& data,
+                       orc::ColumnVectorBatch* batch,
+                       uint64_t numValues,
[GitHub] orc pull request #204: ORC-283: Enable the cmake build to pick specified pat...
Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/204#discussion_r158269151 --- Diff: cmake_modules/FindGTest.cmake --- @@ -28,7 +28,7 @@ find_path (GTEST_INCLUDE_DIR gmock/gmock.h HINTS NO_DEFAULT_PATH PATH_SUFFIXES "include") -find_library (GTEST_LIBRARIES NAMES gmock PATHS +find_library (GTEST_LIBRARIES NAMES gmock HINTS --- End diff -- `HINTS` is apt here. `PATHS` must only be used for hardcoded guesses. https://cmake.org/cmake/help/v3.0/command/find_library.html ---
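For context, CMake searches `HINTS` before the standard system locations and `PATHS` only after them, so a user-supplied install root belongs in `HINTS`. A hypothetical sketch of the call under review (variable names assumed to mirror the find module):

```cmake
# GTEST_HOME is a user-supplied install root, so it goes in HINTS, which is
# searched *before* the standard system locations; PATHS is searched last
# and is meant for hard-coded fallback guesses only.
find_library (GTEST_LIBRARIES NAMES gmock
              HINTS "${GTEST_HOME}"
              PATH_SUFFIXES "lib" "lib64")
```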
[GitHub] orc pull request #204: ORC-283: Enable the cmake build to pick specified lib...
GitHub user majetideepak opened a pull request: https://github.com/apache/orc/pull/204 ORC-283: Enable the cmake build to pick specified libraries over the default libraries You can merge this pull request into a Git repository by running: $ git pull https://github.com/majetideepak/orc ORC-283 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/orc/pull/204.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #204 commit 1a34380e7eb6323e3f0b92b377943f1ba5562d5e Author: Deepak Majeti <mdeepak@...> Date: 2017-12-21T12:17:12Z ORC-283: Enable the cmake build to pick specified libraries over the default libraries ---
[GitHub] orc pull request #199: ORC-276: [C++] Create a simple tool to import CSV fil...
Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/199#discussion_r155629237 --- Diff: tools/src/CSVFileImport.cc --- (same CSVFileImport.cc diff as quoted above) ---
[GitHub] orc pull request #199: ORC-276: [C++] Create a simple tool to import CSV fil...
Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/199#discussion_r155995744 --- Diff: tools/src/CSVFileImport.cc --- (same CSVFileImport.cc diff as quoted above) ---
[GitHub] orc pull request #199: ORC-276: [C++] Create a simple tool to import CSV fil...
Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/199#discussion_r155629518 --- Diff: tools/src/CSVFileImport.cc --- (same CSVFileImport.cc diff as quoted above) ---
[GitHub] orc pull request #199: ORC-276: [C++] Create a simple tool to import CSV fil...
Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/199#discussion_r155629814 --- Diff: tools/src/CSVFileImport.cc --- (same CSVFileImport.cc diff as quoted above) ---
[GitHub] orc issue #196: ORC-270: fix target_link_libraries for tool-test
Github user majetideepak commented on the issue: https://github.com/apache/orc/pull/196 Thanks for the patch! ---
[GitHub] orc pull request #195: ORC-269: cmake fails when PROTOBUF_HOME set and libhd...
GitHub user majetideepak opened a pull request: https://github.com/apache/orc/pull/195 ORC-269: cmake fails when PROTOBUF_HOME set and libhdfs is built You can merge this pull request into a Git repository by running: $ git pull https://github.com/majetideepak/orc ORC-269 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/orc/pull/195.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #195 commit b4fecf9bdd84cd926d6af8e65903f876ebb3b11f Author: Deepak Majeti <deepak.maj...@hpe.com> Date: 2017-11-30T01:46:36Z ORC-269: cmake fails when PROTOBUF_HOME set and libhdfs is built ---
[GitHub] orc pull request #191: ORC-265: [C++] Add documentation for C++ build suppor...
Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/191#discussion_r153647593

--- Diff: site/_docs/building.md ---
@@ -70,3 +70,25 @@ To build:
 % mvn package
 ~~~
+## Building just C++
+
+~~~ shell
+% mkdir build
+% cd build
+% cmake .. -DBUILD_JAVA=OFF
+% make package test-out
+~~~
+
+## Specify third-party libraries for C++ build
+
+~~~ shell
+% mkdir build
+% cd build
+% GTEST_HOME= \
+  SNAPPY_HOME= \
+  ZLIB_HOME= \
+  LZ4_HOME= \
+  PROTOBUF_HOME= \
+  cmake .. -DBUILD_JAVA=OFF
+% make package test-out

--- End diff --

That is correct. The recent changes moved that dependency onto CMake variables. Fixed the comments. ---
[GitHub] orc pull request #191: ORC-265: [C++] Add documentation for C++ build suppor...
Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/191#discussion_r153607648 --- Diff: site/_docs/building.md --- (same building.md diff as quoted above) --- End diff -- @wgtmac That is correct. We manually set it in travis-ci testing as well. ``cmake -DOPENSSL_ROOT_DIR=`brew --prefix openssl` ..`` I will check with Anatoli Shein if we can fix this or add documentation here. Thanks! ---
[GitHub] orc pull request #192: Cleanup cmake scripts
Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/192#discussion_r152657567 --- Diff: cmake_modules/ThirdpartyToolchain.cmake --- @@ -41,32 +68,35 @@ if (NOT SNAPPY_FOUND) LOG_BUILD 1 LOG_INSTALL 1 BUILD_BYPRODUCTS "${SNAPPY_STATIC_LIB}") + + set (SNAPPY_VENDORED TRUE) endif () -include_directories (SYSTEM ${SNAPPY_INCLUDE_DIRS}) + +include_directories (SYSTEM ${SNAPPY_INCLUDE_DIR}) add_library (snappy STATIC IMPORTED) set_target_properties (snappy PROPERTIES IMPORTED_LOCATION ${SNAPPY_STATIC_LIB}) -set (SNAPPY_LIBRARIES snappy) -add_dependencies (snappy snappy_ep) -install(DIRECTORY ${SNAPPY_PREFIX}/lib DESTINATION . --- End diff -- AFAIK `cpack` depends on `install` as well to make a package. So I would vote for your other option of using `INSTALL_THIRDPARTY_LIBS=on`. ---
[GitHub] orc pull request #192: Cleanup cmake scripts
Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/192#discussion_r152654881 --- Diff: cmake_modules/ThirdpartyToolchain.cmake --- @@ -10,19 +10,46 @@ # See the License for the specific language governing permissions and # limitations under the License. +set (LZ4_VERSION "1.7.5") +set (SNAPPY_VERSION "1.1.4") +set (ZLIB_VERSION "1.2.11") +set (GTEST_VERSION "1.8.0") +set (PROTOBUF_VERSION "2.6.0") + set (THIRDPARTY_DIR "${CMAKE_BINARY_DIR}/c++/libs/thirdparty") string(TOUPPER ${CMAKE_BUILD_TYPE} UPPERCASE_BUILD_TYPE) +if (DEFINED ENV{SNAPPY_HOME}) + set (SNAPPY_HOME "$ENV{SNAPPY_HOME}") +endif () + +if (DEFINED ENV{ZLIB_HOME}) + set (ZLIB_HOME "$ENV{ZLIB_HOME}") +endif () + +if (DEFINED ENV{LZ4_HOME}) + set (LZ4_HOME "$ENV{LZ4_HOME}") +endif () + +if (DEFINED ENV{PROTOBUF_HOME}) + set (PROTOBUF_HOME "$ENV{PROTOBUF_HOME}") +endif () + +if (DEFINED ENV{GTEST_HOME}) + set (GTEST_HOME "$ENV{GTEST_HOME}") +endif () + # -- # Snappy -set (SNAPPY_HOME "$ENV{SNAPPY_HOME}") -find_package (Snappy) -if (NOT SNAPPY_FOUND) +if (NOT "${SNAPPY_HOME}" STREQUAL "") + find_package (Snappy REQUIRED) + set(SNAPPY_VENDORED FALSE) +else () --- End diff -- I would like the library to be vendored in the case where `SNAPPY_HOME` is set, but `SNAPPY_FOUND` is inferred as `false` ---
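The fallback being asked for here could be sketched roughly as follows (a hypothetical rearrangement of the hunk above, not the actual patch): drop `REQUIRED` so that a `SNAPPY_HOME` that is set but unusable still falls through to the vendored build.

```cmake
if (DEFINED ENV{SNAPPY_HOME})
  set (SNAPPY_HOME "$ENV{SNAPPY_HOME}")
endif ()

if (NOT "${SNAPPY_HOME}" STREQUAL "")
  # Not REQUIRED: if the user-supplied location is unusable,
  # SNAPPY_FOUND stays false and we fall through to vendoring.
  find_package (Snappy)
endif ()

if (NOT SNAPPY_FOUND)
  # ExternalProject_Add (snappy_ep ...) elided
  set (SNAPPY_VENDORED TRUE)
else ()
  set (SNAPPY_VENDORED FALSE)
endif ()
```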
[GitHub] orc pull request #191: ORC-265: [C++] Add documentation for C++ build suppor...
GitHub user majetideepak opened a pull request: https://github.com/apache/orc/pull/191 ORC-265: [C++] Add documentation for C++ build support You can merge this pull request into a Git repository by running: $ git pull https://github.com/majetideepak/orc ORC-265 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/orc/pull/191.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #191 commit f13da19acdbf27a9310578b97ec133d11f0acc28 Author: Deepak Majeti <deepak.maj...@hpe.com> Date: 2017-11-15T22:56:56Z ORC-265: [C++] Add documentation for C++ build support ---
[GitHub] orc issue #170: ORC-207: [C++] Enable users the ability to provide their own...
Github user majetideepak commented on the issue: https://github.com/apache/orc/pull/170 @jcrist Will review that PR. Thanks! ---
[GitHub] orc issue #149: ORC-224: Implement column writers of primitive types
Github user majetideepak commented on the issue: https://github.com/apache/orc/pull/149 @wgtmac Sorry for the delay. I was tied up last week. I will definitely work on this today/tomorrow. ---
[GitHub] orc issue #183: ORC-258: [C++] Incorrect Decimal constructor
Github user majetideepak commented on the issue: https://github.com/apache/orc/pull/183 @wgtmac will do. I will make another pass today and merge it. Thanks! ---
[GitHub] orc issue #170: ORC-207: [C++] Enable users the ability to provide their own...
Github user majetideepak commented on the issue: https://github.com/apache/orc/pull/170 In this PR, `FindProtobuf.cmake` checks whether the specified path `PROTOBUF_HOME` contains the required headers and sets `PROTOBUF_INCLUDE_DIRS`, checks for the library and sets `PROTOBUF_LIBRARIES`, and checks for the executable and sets `PROTOBUF_EXECUTABLE`. If even one of them is missing from the user-specified location, the protobuf library gets downloaded. In a future patch, we can extend the `FindX.cmake` modules to check default system paths (except for protobuf). But by default, we should only download. ---
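The lookup described above could be outlined like this (a hypothetical sketch; the real `FindProtobuf.cmake` in the PR is the authority):

```cmake
find_path (PROTOBUF_INCLUDE_DIRS google/protobuf/message.h
           HINTS "${PROTOBUF_HOME}" PATH_SUFFIXES "include" NO_DEFAULT_PATH)
find_library (PROTOBUF_LIBRARIES NAMES protobuf
              HINTS "${PROTOBUF_HOME}" PATH_SUFFIXES "lib" NO_DEFAULT_PATH)
find_program (PROTOBUF_EXECUTABLE protoc
              HINTS "${PROTOBUF_HOME}" PATH_SUFFIXES "bin" NO_DEFAULT_PATH)

# If any one of the three pieces is missing, fall back to downloading
# and building protobuf ourselves.
if (NOT (PROTOBUF_INCLUDE_DIRS AND PROTOBUF_LIBRARIES AND PROTOBUF_EXECUTABLE))
  # ExternalProject_Add (protobuf_ep ...) elided
endif ()
```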
[GitHub] orc issue #170: ORC-207: [C++] Enable users the ability to provide their own...
Github user majetideepak commented on the issue: https://github.com/apache/orc/pull/170 I did not understand the first point. `PROTOBUF_HOME` must be explicitly set to use the user specified version. Otherwise, the library gets downloaded. Will remove the travis-ci test. Thanks! ---
[GitHub] orc pull request #170: ORC-207: [C++] Enable users the ability to provide th...
GitHub user majetideepak opened a pull request: https://github.com/apache/orc/pull/170 ORC-207: [C++] Enable users the ability to provide their own thirdparty libraries You can merge this pull request into a Git repository by running: $ git pull https://github.com/majetideepak/orc ORC-207 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/orc/pull/170.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #170 commit aba65fc12641cf30098c850bf29ca7032023e66a Author: Deepak Majeti <deepak.maj...@hpe.com> Date: 2017-09-19T18:37:29Z ORC-207: Enable users the ability to provide third-party libraries ---
[GitHub] orc issue #168: Install missing Statistics.hh header file
Github user majetideepak commented on the issue: https://github.com/apache/orc/pull/168 Can you file a JIRA and append the JIRA number to the PR? Thanks. ---
[GitHub] orc issue #151: ORC-226 Support getWriterId in c++ reader interface
Github user majetideepak commented on the issue: https://github.com/apache/orc/pull/151 @xndai can you please squash your commits? ---
[GitHub] orc pull request #149: ORC-224: Implement column writers of primitive types
Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/149#discussion_r134319378 --- Diff: c++/test/TestWriter.cc --- @@ -209,5 +209,612 @@ namespace orc { } EXPECT_FALSE(rowReader->next(*batch)); } -} + TEST(Writer, writeStringAndBinaryColumn) { --- End diff -- See google typed tests. Those should help. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] orc pull request #149: ORC-224: Implement column writers of primitive types
Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/149#discussion_r134308560

--- Diff: c++/src/ColumnWriter.cc ---
@@ -468,25 +472,1099 @@ namespace orc {
     rleEncoder->recordPosition(rowIndexPosition.get());
   }

-  std::unique_ptr<ColumnWriter> buildWriter(
-    const Type& type,
-    const StreamsFactory& factory,
-    const WriterOptions& options) {
-    switch (static_cast<int64_t>(type.getKind())) {
-      case STRUCT:
-        return std::unique_ptr<ColumnWriter>(
-          new StructColumnWriter(
-            type,
-            factory,
-            options));
-      case INT:
-      case LONG:
-      case SHORT:
-        return std::unique_ptr<ColumnWriter>(
-          new IntegerColumnWriter(
-            type,
-            factory,
-            options));
+  class ByteColumnWriter : public ColumnWriter {
+  public:
+    ByteColumnWriter(const Type& type,
+                     const StreamsFactory& factory,
+                     const WriterOptions& options);
+
+    virtual void add(ColumnVectorBatch& rowBatch,
+                     uint64_t offset,
+                     uint64_t numValues) override;
+
+    virtual void flush(std::vector<proto::Stream>& streams) override;
+
+    virtual uint64_t getEstimatedSize() const override;
+
+    virtual void getColumnEncoding(
+      std::vector<proto::ColumnEncoding>& encodings) const override;
+
+    virtual void recordPosition() const override;
+
+  private:
+    std::unique_ptr<ByteRleEncoder> byteRleEncoder;
+  };
+
+  ByteColumnWriter::ByteColumnWriter(
+    const Type& type,
+    const StreamsFactory& factory,
+    const WriterOptions& options) :
+      ColumnWriter(type, factory, options) {
+    std::unique_ptr<BufferedOutputStream> dataStream =
+      factory.createStream(proto::Stream_Kind_DATA);
+    byteRleEncoder = createByteRleEncoder(std::move(dataStream));
+
+    if (enableIndex) {
+      recordPosition();
+    }
+  }
+
+  void ByteColumnWriter::add(ColumnVectorBatch& rowBatch,
+                             uint64_t offset,
+                             uint64_t numValues) {
+    ColumnWriter::add(rowBatch, offset, numValues);
+
+    LongVectorBatch& byteBatch =
+      dynamic_cast<LongVectorBatch&>(rowBatch);
+
+    int64_t* data = byteBatch.data.data() + offset;
+    const char* notNull = byteBatch.hasNulls ?
+      byteBatch.notNull.data() + offset : nullptr;
+
+    char* byteData = reinterpret_cast<char*>(data);
+    for (uint64_t i = 0; i < numValues; ++i) {
+      byteData[i] = static_cast<char>(data[i]);
+    }
+    byteRleEncoder->add(byteData, numValues, notNull);
+
+    IntegerColumnStatisticsImpl* intStats =
+      dynamic_cast<IntegerColumnStatisticsImpl*>(colIndexStatistics.get());
+    bool hasNull = false;
+    for (uint64_t i = 0; i < numValues; ++i) {
+      if (notNull == nullptr || notNull[i]) {
+        intStats->increase(1);
+        intStats->update(static_cast<int64_t>(byteData[i]), 1);
+      } else if (!hasNull) {
+        hasNull = true;
+      }
+    }
+    intStats->setHasNull(hasNull);
+  }
+
+  void ByteColumnWriter::flush(std::vector<proto::Stream>& streams) {
+    ColumnWriter::flush(streams);
+
+    proto::Stream stream;
+    stream.set_kind(proto::Stream_Kind_DATA);
+    stream.set_column(static_cast<uint32_t>(columnId));
+    stream.set_length(byteRleEncoder->flush());
+    streams.push_back(stream);
+  }
+
+  uint64_t ByteColumnWriter::getEstimatedSize() const {
+    uint64_t size = ColumnWriter::getEstimatedSize();
+    size += byteRleEncoder->getBufferSize();
+    return size;
+  }
+
+  void ByteColumnWriter::getColumnEncoding(
+    std::vector<proto::ColumnEncoding>& encodings) const {
+    proto::ColumnEncoding encoding;
+    encoding.set_kind(proto::ColumnEncoding_Kind_DIRECT);
+    encoding.set_dictionarysize(0);
+    encodings.push_back(encoding);
+  }
+
+  void ByteColumnWriter::recordPosition() const {
+    ColumnWriter::recordPosition();
+    byteRleEncoder->recordPosition(rowIndexPosition.get());
+  }
+
+  class BooleanColumnWriter : public ColumnWriter {
+  public:
+    BooleanColumnWriter(const Type
[GitHub] orc pull request #149: ORC-224: Implement column writers of primitive types
Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/149#discussion_r134308863 --- Diff: c++/src/ColumnWriter.cc --- (quotes the same ColumnWriter.cc hunk shown above) ---
[GitHub] orc pull request #149: ORC-224: Implement column writers of primitive types
Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/149#discussion_r134304957 --- Diff: c++/src/ColumnWriter.cc --- (quotes the same ColumnWriter.cc hunk shown above) ---
[GitHub] orc pull request #149: ORC-224: Implement column writers of primitive types
Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/149#discussion_r134309488 --- Diff: c++/src/ColumnWriter.cc --- (quotes the same ColumnWriter.cc hunk shown above) ---
[GitHub] orc pull request #149: ORC-224: Implement column writers of primitive types
Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/149#discussion_r134301225 --- Diff: c++/src/ColumnWriter.cc --- (quotes the same ColumnWriter.cc hunk shown above) ---
[GitHub] orc pull request #149: ORC-224: Implement column writers of primitive types
Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/149#discussion_r134311129 --- Diff: c++/test/TestWriter.cc --- @@ -209,5 +209,612 @@ namespace orc { } EXPECT_FALSE(rowReader->next(*batch)); } -} + TEST(Writer, writeStringAndBinaryColumn) { --- End diff -- These tests can definitely be improved by using a templated test class --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
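The "templated test class" suggested above can be sketched without bringing in gtest machinery: a single function template exercises the same round-trip logic for every integer width, instead of a near-identical test per type. In the actual TestWriter.cc this would become a gtest `TYPED_TEST` over a type list; the names below are illustrative, not from the ORC test suite.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative sketch of the templated-test idea: the C++ writer carries all
// integer-like columns in a LongVectorBatch (int64_t), so one template can
// check the widen-then-narrow round trip for every element type rather than
// duplicating the test body per type.
template <typename T>
bool writerRoundTrip(const std::vector<T>& values) {
  // "Write" side: widen each value into the int64_t batch representation.
  std::vector<int64_t> batch(values.begin(), values.end());
  // "Read" side: narrow back to the column's element type.
  std::vector<T> decoded(batch.begin(), batch.end());
  return decoded == values;
}
```

Instantiating `writerRoundTrip<int8_t>`, `writerRoundTrip<int16_t>`, and so on replaces several copy-pasted test bodies with one parameterized check.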
[GitHub] orc pull request #149: ORC-224: Implement column writers of primitive types
Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/149#discussion_r134310213 --- Diff: c++/src/ColumnWriter.cc --- (quotes the same ColumnWriter.cc hunk shown above) ---
[GitHub] orc issue #134: Orc 17
Github user majetideepak commented on the issue: https://github.com/apache/orc/pull/134 Can you check whether `thread_local` is supported by the platform, and disable libhdfspp on the platforms where it is not? ---
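For context on what the feature gate above would protect: `thread_local` (C++11) gives each thread its own independent copy of a variable, which is what libhdfspp depends on. A minimal illustration of that guarantee (the function name is hypothetical, chosen just for this sketch):

```cpp
#include <cassert>
#include <thread>

// Each thread sees and mutates its own copy of a thread_local variable;
// changes made in one thread are invisible to others.
thread_local int tlsCounter = 0;

int bumpInNewThread() {
  int seen = -1;
  std::thread t([&seen] {
    tlsCounter += 5;     // touches only the new thread's copy
    seen = tlsCounter;   // observes 5 in that thread
  });
  t.join();
  return seen;           // the caller's tlsCounter is untouched
}
```

On toolchains that lack `thread_local` (e.g. older Apple compilers), this program would not compile, which is exactly the condition a CMake compile check could detect before enabling libhdfspp.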
[GitHub] orc issue #134: Orc 17
Github user majetideepak commented on the issue: https://github.com/apache/orc/pull/134 I am adding a Travis CI test for the OS X platform. It should make catching OS X issues easier. ---
[GitHub] orc issue #149: ORC-224: Implement column writers of primitive types
Github user majetideepak commented on the issue: https://github.com/apache/orc/pull/149 @wgtmac Sorry for taking longer than expected. I will try to complete this as soon as possible. ---
[GitHub] orc pull request #155: ORC-227: Fix ExternalProject_Add
GitHub user majetideepak opened a pull request: https://github.com/apache/orc/pull/155 ORC-227: Fix ExternalProject_Add @omalley This commit got left out from the previous PR. You can merge this pull request into a Git repository by running: $ git pull https://github.com/majetideepak/orc ORC-227 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/orc/pull/155.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #155 commit bceaecd8d0195e9ef27adbdf41515f1900730652 Author: Deepak Majeti <deepak.maj...@hpe.com> Date: 2017-08-10T14:37:32Z Fix ExternalProject_Add ---
[GitHub] orc pull request #152: ORC-227: [C++] Fix docker failure due to ExternalProj...
Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/152#discussion_r132529457 --- Diff: c++/src/ByteRLE.cc --- @@ -26,9 +26,9 @@ namespace orc { - const size_t MINIMUM_REPEAT = 3; - const size_t MAXIMUM_REPEAT = 127 + MINIMUM_REPEAT; - const size_t MAX_LITERAL_SIZE = 128; + const int MINIMUM_REPEAT = 3; + const int MAXIMUM_REPEAT = 127 + MINIMUM_REPEAT; + const int MAX_LITERAL_SIZE = 128; --- End diff -- I tried the CentOS 6 docker image. The warnings seem to be correct.
```
cc1plus: warnings being treated as errors
/root/orc/c++/src/ByteRLE.cc: In member function 'void orc::ByteRleEncoderImpl::write(char)':
/root/orc/c++/src/ByteRLE.cc:152: error: comparison between signed and unsigned integer expressions
/root/orc/c++/src/ByteRLE.cc:166: error: comparison between signed and unsigned integer expressions
/root/orc/c++/src/ByteRLE.cc:167: error: comparison between signed and unsigned integer expressions
/root/orc/c++/src/ByteRLE.cc:179: error: comparison between signed and unsigned integer expressions
make[2]: *** [c++/src/CMakeFiles/orc.dir/ByteRLE.cc.o] Error 1
make[1]: *** [c++/src/CMakeFiles/orc.dir/all] Error 2
```
---
[GitHub] orc pull request #152: ORC-227: [C++] Fix docker failure due to ExternalProj...
Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/152#discussion_r132477368 --- Diff: c++/src/ByteRLE.cc --- @@ -26,9 +26,9 @@ namespace orc { - const size_t MINIMUM_REPEAT = 3; - const size_t MAXIMUM_REPEAT = 127 + MINIMUM_REPEAT; - const size_t MAX_LITERAL_SIZE = 128; + const int MINIMUM_REPEAT = 3; + const int MAXIMUM_REPEAT = 127 + MINIMUM_REPEAT; + const int MAX_LITERAL_SIZE = 128; --- End diff -- @xndai @wgtmac These changes are required to avoid signed/unsigned mismatch warnings. ---
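The hazard behind the `-Wsign-compare` warnings discussed above: when a signed value is compared against an unsigned one, the signed operand is converted to unsigned, so a negative value silently becomes a huge positive number. A small self-contained illustration (the helper names are made up for this sketch; the real fix in ByteRLE.cc was simply to make the constants `int`):

```cpp
#include <cassert>
#include <cstddef>

// With an unsigned constant, a negative counter wraps around on conversion
// and the comparison answers "yes" when it should answer "no". The explicit
// cast below makes visible the conversion the compiler would do implicitly.
bool reachedLimitUnsigned(int numLiterals) {
  const std::size_t MAX_LITERAL_SIZE = 128;  // unsigned, pre-patch style
  return static_cast<std::size_t>(numLiterals) >= MAX_LITERAL_SIZE;
}

// With a signed constant (as in the ByteRLE.cc change), the comparison
// behaves as written for negative inputs and triggers no warning.
bool reachedLimitSigned(int numLiterals) {
  const int MAX_LITERAL_SIZE = 128;
  return numLiterals >= MAX_LITERAL_SIZE;
}
```

For `-1`, the unsigned variant converts the argument to `SIZE_MAX` and wrongly reports the limit as reached, which is exactly the class of bug the warning exists to flag.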
[GitHub] orc issue #149: ORC-224: Implement column writers of primitive types
Github user majetideepak commented on the issue: https://github.com/apache/orc/pull/149 @wgtmac I will take a look at this today/tomorrow. Thanks! ---
[GitHub] orc pull request #143: [C++] Remove gmock and protobuf libraries from source...
GitHub user majetideepak opened a pull request: https://github.com/apache/orc/pull/143 [C++] Remove gmock and protobuf libraries from source and use ExternalProject instead Similar to ORC-204. Files modified
```
c++/CMakeLists.txt
c++/src/CMakeLists.txt
c++/test/CMakeLists.txt
cmake_modules/ThirdpartyToolchain.cmake
tools/test/CMakeLists.txt
CMakeLists.txt
```
You can merge this pull request into a Git repository by running: $ git pull https://github.com/majetideepak/orc ORC-215 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/orc/pull/143.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #143 commit e92ea33aa33e9a2906941d3165476b836c09af92 Author: Deepak Majeti <deepak.maj...@hpe.com> Date: 2017-07-27T16:55:40Z ORC-215: Use ExternalProject_Add for gmock commit 350e8648e4025498a0cb411825c9dcf2ace89994 Author: Deepak Majeti <deepak.maj...@hpe.com> Date: 2017-07-27T18:17:26Z use ExternalProject_Add for protobuf ---
[GitHub] orc issue #142: [ORC-218] Cache timezone information in the library.
Github user majetideepak commented on the issue: https://github.com/apache/orc/pull/142 Please add another travis-ci test with EMBEDDED_TZ_DB=ON to ensure this change gets tested. ---
[GitHub] orc issue #135: ORC-204: Update and use CMake ExternalProject_Add to build c...
Github user majetideepak commented on the issue: https://github.com/apache/orc/pull/135 Fixed packaging of the static compression libraries. Only the source is available for Snappy 1.1.6, and I could not `install` this release (the build works fine): CMake support was introduced, but only shared libraries are built. Snappy 1.1.5 and 1.1.6 are otherwise the same, and there is no performance difference between the current 1.1.4 and 1.1.6. https://github.com/google/snappy/releases ---
[GitHub] orc pull request #135: Update and use CMake External Project to build compre...
GitHub user majetideepak opened a pull request: https://github.com/apache/orc/pull/135 Update and use CMake External Project to build compression libraries Including the whole source of external libraries adds bloat. It is also not useful when clients prefer to use their own third-party libraries. In this PR: 1) the libraries are updated to the most recent releases, and 2) CMake `ExternalProject_Add` is used to build them. Files to review
```
cmake_modules/ThirdpartyToolchain.cmake
c++/libs/CMakeLists.txt
CMakeLists.txt
```
The rest of the changes are just deletions of the libraries. I will file a JIRA to give users the ability to provide their own libraries. You can merge this pull request into a Git repository by running: $ git pull https://github.com/majetideepak/orc ORC-204 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/orc/pull/135.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #135 commit 0bc85be0a7ba60dc53ab402a5a278503cf98f97d Author: Deepak Majeti <deepak.maj...@hpe.com> Date: 2017-07-10T16:43:47Z Update and use CMake External Project to build compression libraries ---
[GitHub] orc issue #134: Orc 17
Github user majetideepak commented on the issue: https://github.com/apache/orc/pull/134 Can we add libhdfspp as a tarball and add a line to untar it? I am going to make these changes for other libraries in the lib folder as well. ---
[GitHub] orc issue #128: ORC-178 Implement Basic C++ Writer and Writer Option
Github user majetideepak commented on the issue: https://github.com/apache/orc/pull/128 @xndai Unfortunately, I don't have access to merge yet. ---
[GitHub] orc pull request #128: ORC-178 Implement Basic C++ Writer and Writer Option
Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/128#discussion_r120745645 --- Diff: c++/src/Writer.cc ---
@@ -0,0 +1,659 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "orc/Common.hh"
+#include "orc/OrcFile.hh"
+
+#include "ColumnWriter.hh"
+#include "Timezone.hh"
+
+#include
+
+namespace orc {
+
+  struct WriterOptionsPrivate {
+    uint64_t stripeSize;
+    uint64_t blockSize;
+    uint64_t rowIndexStride;
+    uint64_t bufferSize;
+    bool blockPadding;
+    CompressionKind compression;
+    EncodingStrategy encodingStrategy;
+    CompressionStrategy compressionStrategy;
+    MemoryPool* memoryPool;
+    WriterVersion version;
+    double paddingTolerance;
+    std::ostream* errorStream;
+    RleVersion rleVersion;
+    double dictionaryKeySizeThreshold;
+    bool enableStats;
+    bool enableStrStatsCmp;
+    bool enableIndex;
+    const Timezone* timezone;
+
+    WriterOptionsPrivate() {
+      stripeSize = 64 * 1024 * 1024; // 64M
+      blockSize = 256 * 1024; // 256K
+      rowIndexStride = 1;
+      bufferSize = 4 * 1024 * 1024; // 4M
+      blockPadding = false;
+      compression = CompressionKind_ZLIB;
+      encodingStrategy = EncodingStrategy_SPEED;
+      compressionStrategy = CompressionStrategy_SPEED;
+      memoryPool = getDefaultPool();
+      version = WriterVersion_ORC_135;
+      paddingTolerance = 0.0;
+      errorStream = &std::cerr;
+      rleVersion = RleVersion_1;
+      dictionaryKeySizeThreshold = 0.0;
+      enableStats = true;
+      enableStrStatsCmp = false;
+      enableIndex = true;
+      timezone = ();
+    }
+  };
+
+  WriterOptions::WriterOptions():
+    privateBits(std::unique_ptr<WriterOptionsPrivate>
+      (new WriterOptionsPrivate())) {
+    // PASS
+  }
+
+  WriterOptions::WriterOptions(const WriterOptions& rhs):
+    privateBits(std::unique_ptr<WriterOptionsPrivate>
+      (new WriterOptionsPrivate(*(rhs.privateBits.get())))) {
+    // PASS
+  }
+
+  WriterOptions::WriterOptions(WriterOptions& rhs) {
+    // swap privateBits with rhs
+    WriterOptionsPrivate* l = privateBits.release();
+    privateBits.reset(rhs.privateBits.release());
+    rhs.privateBits.reset(l);
+  }
+
+  WriterOptions& WriterOptions::operator=(const WriterOptions& rhs) {
+    if (this != &rhs) {
+      privateBits.reset(new WriterOptionsPrivate(*(rhs.privateBits.get())));
+    }
+    return *this;
+  }
+
+  WriterOptions::~WriterOptions() {
+    // PASS
+  }
+
+  WriterOptions& WriterOptions::setStripeSize(uint64_t size) {
+    privateBits->stripeSize = size;
+    return *this;
+  }
+
+  uint64_t WriterOptions::getStripeSize() const {
+    return privateBits->stripeSize;
+  }
+
+  WriterOptions& WriterOptions::setBlockSize(uint64_t size) {
+    privateBits->blockSize = size;
+    return *this;
+  }
+
+  uint64_t WriterOptions::getBlockSize() const {
+    return privateBits->blockSize;
+  }
+
+  WriterOptions& WriterOptions::setRowIndexStride(uint64_t stride) {
+    privateBits->rowIndexStride = stride;
+    return *this;
+  }
+
+  uint64_t WriterOptions::getRowIndexStride() const {
+    return privateBits->rowIndexStride;
+  }
+
+  WriterOptions& WriterOptions::setBufferSize(uint64_t size) {
+    privateBits->bufferSize = size;
+    return *this;
+  }
+
+  uint64_t WriterOptions::getBufferSize() const {
+    return privateBits->bufferSize;
+  }
+
+  WriterOptions& WriterOptions::setDictionaryKeySizeThreshold(double val) {
+    privateBits->dictionaryKeySizeThreshold = val;
+    re
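The `WriterOptions` code in the diff above follows a pimpl-with-value-semantics pattern: the public object owns a `unique_ptr` to a private struct, the copy constructor and copy assignment deep-copy that struct, and the setters are fluent (they return `*this`). A minimal standalone sketch of the same pattern, with hypothetical names rather than the real ORC API:

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// Sketch of the WriterOptions pimpl pattern: public object owns a
// unique_ptr to a private struct; copying deep-copies the struct, so two
// option objects never share state. Names are illustrative only.
struct OptionsImpl {
  uint64_t stripeSize = 64 * 1024 * 1024;  // 64M default, as in the diff
};

class Options {
 public:
  Options() : impl_(new OptionsImpl()) {}
  // Deep copy: each Options gets its own OptionsImpl.
  Options(const Options& rhs) : impl_(new OptionsImpl(*rhs.impl_)) {}
  Options& operator=(const Options& rhs) {
    if (this != &rhs) {
      impl_.reset(new OptionsImpl(*rhs.impl_));
    }
    return *this;
  }
  // Fluent setter returning *this, as in the diff's setStripeSize().
  Options& setStripeSize(uint64_t size) {
    impl_->stripeSize = size;
    return *this;
  }
  uint64_t getStripeSize() const { return impl_->stripeSize; }

 private:
  std::unique_ptr<OptionsImpl> impl_;
};
```

The design keeps the option fields out of the public header (ABI stability) while preserving ordinary value semantics for callers.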
[GitHub] orc pull request #128: ORC-178 Implement Basic C++ Writer and Writer Option
Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/128#discussion_r120728928 --- Diff: c++/src/Writer.cc --- @@ -0,0 +1,659 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#include "orc/Common.hh" +#include "orc/OrcFile.hh" + +#include "ColumnWriter.hh" +#include "Timezone.hh" + +#include <memory> + +namespace orc { + + struct WriterOptionsPrivate { +uint64_t stripeSize; +uint64_t blockSize; +uint64_t rowIndexStride; +uint64_t bufferSize; +bool blockPadding; +CompressionKind compression; +EncodingStrategy encodingStrategy; +CompressionStrategy compressionStrategy; +MemoryPool* memoryPool; +WriterVersion version; +double paddingTolerance; +std::ostream* errorStream; +RleVersion rleVersion; +double dictionaryKeySizeThreshold; +bool enableStats; +bool enableStrStatsCmp; +bool enableIndex; +const Timezone* timezone; + +WriterOptionsPrivate() { + stripeSize = 64 * 1024 * 1024; // 64M + blockSize = 256 * 1024; // 256K + rowIndexStride = 1; + bufferSize = 4 * 1024 * 1024; // 4M + blockPadding = false; + compression = CompressionKind_ZLIB; + encodingStrategy = EncodingStrategy_SPEED; + compressionStrategy = CompressionStrategy_SPEED; + memoryPool = getDefaultPool(); + 
version = WriterVersion_ORC_135; + paddingTolerance = 0.0; + errorStream = &std::cerr; + rleVersion = RleVersion_1; + dictionaryKeySizeThreshold = 0.0; + enableStats = true; + enableStrStatsCmp = false; + enableIndex = true; + timezone = &getLocalTimezone(); +} + }; + + WriterOptions::WriterOptions(): +privateBits(std::unique_ptr<WriterOptionsPrivate> +(new WriterOptionsPrivate())) { +// PASS + } + + WriterOptions::WriterOptions(const WriterOptions& rhs): +privateBits(std::unique_ptr<WriterOptionsPrivate> +(new WriterOptionsPrivate(*(rhs.privateBits.get())))) { +// PASS + } + + WriterOptions::WriterOptions(WriterOptions& rhs) { +// swap privateBits with rhs +WriterOptionsPrivate* l = privateBits.release(); +privateBits.reset(rhs.privateBits.release()); +rhs.privateBits.reset(l); + } + + WriterOptions& WriterOptions::operator=(const WriterOptions& rhs) { +if (this != &rhs) { + privateBits.reset(new WriterOptionsPrivate(*(rhs.privateBits.get()))); +} +return *this; + } + + WriterOptions::~WriterOptions() { +// PASS + } + + WriterOptions& WriterOptions::setStripeSize(uint64_t size) { +privateBits->stripeSize = size; +return *this; + } + + uint64_t WriterOptions::getStripeSize() const { +return privateBits->stripeSize; + } + + WriterOptions& WriterOptions::setBlockSize(uint64_t size) { +privateBits->blockSize = size; +return *this; + } + + uint64_t WriterOptions::getBlockSize() const { +return privateBits->blockSize; + } + + WriterOptions& WriterOptions::setRowIndexStride(uint64_t stride) { +privateBits->rowIndexStride = stride; +return *this; + } + + uint64_t WriterOptions::getRowIndexStride() const { +return privateBits->rowIndexStride; + } + + WriterOptions& WriterOptions::setBufferSize(uint64_t size) { +privateBits->bufferSize = size; +return *this; + } + + uint64_t WriterOptions::getBufferSize() const { +return privateBits->bufferSize; + } + + WriterOptions& WriterOptions::setDictionaryKeySizeThreshold(double val) { +privateBits->dictionaryKeySizeThreshold = val; +re
[GitHub] orc pull request #128: ORC-178 Implement Basic C++ Writer and Writer Option
Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/128#discussion_r120729334 --- Diff: c++/include/orc/Writer.hh --- @@ -0,0 +1,294 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#ifndef ORC_WRITER_HH +#define ORC_WRITER_HH + +#include "orc/Common.hh" +#include "orc/orc-config.hh" +#include "orc/Type.hh" +#include "orc/Vector.hh" + +#include <memory> +#include <string> +#include <vector> + +namespace orc { + + // classes that hold data members so we can maintain binary compatibility + struct WriterOptionsPrivate; + + enum EncodingStrategy { +EncodingStrategy_SPEED = 0, +EncodingStrategy_COMPRESSION + }; + + enum CompressionStrategy { +CompressionStrategy_SPEED = 0, +CompressionStrategy_COMPRESSION + }; + + enum RleVersion { +RleVersion_1, +RleVersion_2 + }; + + class Timezone; + + /** + * Options for creating a Writer. + */ + class WriterOptions { + private: +ORC_UNIQUE_PTR<WriterOptionsPrivate> privateBits; + + public: +WriterOptions(); +WriterOptions(const WriterOptions&); +WriterOptions(WriterOptions&); +WriterOptions& operator=(const WriterOptions&); +virtual ~WriterOptions(); + +/** + * Set the stripe size. + */ +WriterOptions& setStripeSize(uint64_t size); + +/** + * Get the stripe size. 
+ * @return if not set, return default value. + */ +uint64_t getStripeSize() const; + +/** + * Set the block size. + */ +WriterOptions& setBlockSize(uint64_t size); + +/** + * Get the block size. + * @return if not set, return default value. + */ +uint64_t getBlockSize() const; + +/** + * Set row index stride. + */ +WriterOptions& setRowIndexStride(uint64_t stride); + +/** + * Get the index stride size. + * @return if not set, return default value. + */ +uint64_t getRowIndexStride() const; + +/** + * Set the buffer size. + */ +WriterOptions& setBufferSize(uint64_t size); + +/** + * Get the buffer size. + * @return if not set, return default value. + */ +uint64_t getBufferSize() const; + +/** + * Set the dictionary key size threshold. + * 0 to disable dictionary encoding. + * 1 to always enable dictionary encoding. + */ +WriterOptions& setDictionaryKeySizeThreshold(double val); + +/** + * Get the dictionary key size threshold. + */ +double getDictionaryKeySizeThreshold() const; + +/** + * Set whether or not to have block padding. + */ +WriterOptions& setBlockPadding(bool padding); + +/** + * Get whether or not to have block padding. + * @return if not set, return default value which is false. + */ +bool getBlockPadding() const; + +/** + * Set Run length encoding version + */ +WriterOptions& setRleVersion(RleVersion version); + +/** + * Get Run Length Encoding version + */ +RleVersion getRleVersion() const; + +/** + * Set compression kind. + */ +WriterOptions& setCompression(CompressionKind comp); + +/** + * Get the compression kind. + * @return if not set, return default value which is ZLIB. + */ +CompressionKind getCompression() const; + +/** + * Set the encoding strategy. + */ +WriterOptions& setEncodingStrategy(EncodingStrategy strategy); + +/** + * Get the encoding strategy. + * @return if
[GitHub] orc pull request #128: ORC-178 Implement Basic C++ Writer and Writer Option
Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/128#discussion_r120483146 --- Diff: c++/include/orc/Writer.hh --- @@ -0,0 +1,294 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#ifndef ORC_WRITER_HH +#define ORC_WRITER_HH + +#include "orc/Common.hh" +#include "orc/orc-config.hh" +#include "orc/Type.hh" +#include "orc/Vector.hh" + +#include <memory> +#include <string> +#include <vector> + +namespace orc { + + // classes that hold data members so we can maintain binary compatibility + struct WriterOptionsPrivate; + + enum EncodingStrategy { +EncodingStrategy_SPEED = 0, +EncodingStrategy_COMPRESSION + }; + + enum CompressionStrategy { +CompressionStrategy_SPEED = 0, +CompressionStrategy_COMPRESSION + }; + + enum RleVersion { +RleVersion_1, +RleVersion_2 + }; + + class Timezone; + + /** + * Options for creating a Writer. + */ + class WriterOptions { + private: +ORC_UNIQUE_PTR<WriterOptionsPrivate> privateBits; + + public: +WriterOptions(); +WriterOptions(const WriterOptions&); +WriterOptions(WriterOptions&); +WriterOptions& operator=(const WriterOptions&); +virtual ~WriterOptions(); + +/** + * Set the stripe size. + */ +WriterOptions& setStripeSize(uint64_t size); + +/** + * Get the stripe size. 
+ * @return if not set, return default value. + */ +uint64_t getStripeSize() const; + +/** + * Set the block size. + */ +WriterOptions& setBlockSize(uint64_t size); + +/** + * Get the block size. + * @return if not set, return default value. + */ +uint64_t getBlockSize() const; + +/** + * Set row index stride. + */ +WriterOptions& setRowIndexStride(uint64_t stride); + +/** + * Get the index stride size. + * @return if not set, return default value. + */ +uint64_t getRowIndexStride() const; + +/** + * Set the buffer size. + */ +WriterOptions& setBufferSize(uint64_t size); + +/** + * Get the buffer size. + * @return if not set, return default value. + */ +uint64_t getBufferSize() const; + +/** + * Set the dictionary key size threshold. + * 0 to disable dictionary encoding. + * 1 to always enable dictionary encoding. + */ +WriterOptions& setDictionaryKeySizeThreshold(double val); + +/** + * Get the dictionary key size threshold. + */ +double getDictionaryKeySizeThreshold() const; + +/** + * Set whether or not to have block padding. + */ +WriterOptions& setBlockPadding(bool padding); + +/** + * Get whether or not to have block padding. + * @return if not set, return default value which is false. + */ +bool getBlockPadding() const; + +/** + * Set Run length encoding version + */ +WriterOptions& setRleVersion(RleVersion version); + +/** + * Get Run Length Encoding version + */ +RleVersion getRleVersion() const; + +/** + * Set compression kind. + */ +WriterOptions& setCompression(CompressionKind comp); + +/** + * Get the compression kind. + * @return if not set, return default value which is ZLIB. + */ +CompressionKind getCompression() const; + +/** + * Set the encoding strategy. + */ +WriterOptions& setEncodingStrategy(EncodingStrategy strategy); + +/** + * Get the encoding strategy. + * @return if