Re: What did ORC writer v3 fix?
The writer version was bumped because it was major change to the writer. We were concerned that if there was a bug, that we would need to detect it. .. Owen > On Jun 6, 2017, at 19:24, Dain Sundstrom wrote: > > On reading the HIVE-12055 referenced from the docs for writer version 3, I’m > not sure what was fixed. Does any one remember? > > -dain
What did ORC writer v3 fix?
On reading the HIVE-12055 referenced from the docs for writer version 3, I’m not sure what was fixed. Does any one remember? -dain
Re: String stats requirements?
> I agree that we want to be able to trim the values. I've seen cases where > the String is huge (~100k) and makes the StringStatistics huge. I'd propose > that we do something like: The only concrete consumer of this data outside of ORC readers is probably partial scan computation of statistics from the footers. In some cases, I find it better to avoid computing min-max ranges, when the strings exceed a useful length as keeping that updated involves a comparison for every new row. Long json strings or URLs usually are slower to write simply from this comparison. So this is a great idea, with the appropriate indication to the partial scan reader not to update stats for those columns. Cheers, Gopal
[GitHub] orc pull request #129: Document writer version switch to UTF-8 stats
GitHub user dain opened a pull request: https://github.com/apache/orc/pull/129 Document writer version switch to UTF-8 stats Note that ORC writer version 1 switched from UTF-16be string stats to UTF-8 stats. You can merge this pull request into a Git repository by running: $ git pull https://github.com/dain/orc patch-2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/orc/pull/129.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #129 commit 76905f1c37b91c61fe38c5ed47d8a0536d25eb36 Author: Dain Sundstrom Date: 2017-06-07T01:24:22Z Document writer version switch to UTF-8 stats Note that ORC writer version 1 switched from UTF-16be string stats to UTF-8 stats. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
Re: String stats requirements?
Yes, HIVE-7144 went in before HIVE-8732 so any file with WriterVersion >= 1 should be UTF-8 in the statistics. .. Owen On Tue, Jun 6, 2017 at 4:05 PM, Dain Sundstrom wrote: > Ah I see. I can’t believe I missed this fix :) > > Our reader was originally written in the 0.13 days, and which used Strings > for stats. This is the commit that changed everything to text and I > believe it went out with Hive 0.14: > > https://github.com/apache/hive/commit/6072e3aed88d9246e1130abadf3c15 > a88e975b4e#diff-340d190f994d92658b24aae1edf610b3 > > Is writer version "1 = HIVE-8732 fixed” after 0.14? If so I can update my > reader to detect this. > > -dain > > > On Jun 6, 2017, at 3:36 PM, Owen O'Malley > wrote: > > > > On Tue, Jun 6, 2017 at 3:02 PM, Dain Sundstrom wrote: > > > >> Is it required that the StringStatistics min and max be the actual min > and > >> max value for the column? I ask for two reasons, I’d like to be able to > >> “trim” values if the min or max is very large. Also, as a work around > of > >> for the UTF-16be sorting problem (bug?), I’d like to trim values at the > >> first surrogate pair, so the value is slightly smaller than the min or > >> larger than the max, and still a valid UTF-8 sequence. > >> > > > > I agree that we want to be able to trim the values. I've seen cases where > > the String is huge (~100k) and makes the StringStatistics huge. I'd > propose > > that we do something like: > > > > message StringStatistics { > > optional string minimum = 1; > > optional string maximum = 2; > > // sum will store the total length of all strings in a stripe > > optional sint64 sum = 3; > > // if set, the minimum will not be set and the lowerBound <= all values > > optional string lowerBound = 4; > > // if set, the maximum will not be set and the upperBound >= all values > > optional string upperBound = 5; > > } > > > > We shouldn't have any UTF16 in ORC. Is there a case where we compare > > strings that way? In particular, the StringStatistics uses Text, which > uses > > UTF-8 as its encoding. > > > > .. Owen > > > > > >> Thoughts? > >> > >> -dain > >> > >> > >
Re: String stats requirements?
Ah I see. I can’t believe I missed this fix :) Our reader was originally written in the 0.13 days, and which used Strings for stats. This is the commit that changed everything to text and I believe it went out with Hive 0.14: https://github.com/apache/hive/commit/6072e3aed88d9246e1130abadf3c15a88e975b4e#diff-340d190f994d92658b24aae1edf610b3 Is writer version "1 = HIVE-8732 fixed” after 0.14? If so I can update my reader to detect this. -dain > On Jun 6, 2017, at 3:36 PM, Owen O'Malley wrote: > > On Tue, Jun 6, 2017 at 3:02 PM, Dain Sundstrom wrote: > >> Is it required that the StringStatistics min and max be the actual min and >> max value for the column? I ask for two reasons, I’d like to be able to >> “trim” values if the min or max is very large. Also, as a work around of >> for the UTF-16be sorting problem (bug?), I’d like to trim values at the >> first surrogate pair, so the value is slightly smaller than the min or >> larger than the max, and still a valid UTF-8 sequence. >> > > I agree that we want to be able to trim the values. I've seen cases where > the String is huge (~100k) and makes the StringStatistics huge. I'd propose > that we do something like: > > message StringStatistics { > optional string minimum = 1; > optional string maximum = 2; > // sum will store the total length of all strings in a stripe > optional sint64 sum = 3; > // if set, the minimum will not be set and the lowerBound <= all values > optional string lowerBound = 4; > // if set, the maximum will not be set and the upperBound >= all values > optional string upperBound = 5; > } > > We shouldn't have any UTF16 in ORC. Is there a case where we compare > strings that way? In particular, the StringStatistics uses Text, which uses > UTF-8 as its encoding. > > .. Owen > > >> Thoughts? >> >> -dain >> >>
Re: "For dictionary encodings the dictionary is sorted"
Sorry, I should have given a full example. I mean we could change StringStatistics as follows: message StringStatistics { optional string minimum = 1; optional string maximum = 2; optional sint64 sum = 3; optional string minimumUtf8 = 4; optional string maximumUtf8 = 5; } New writers would fill in both in the transition period, and then just the UTF8 ones. Then we update readers to the favor the new ones. We can also make this backwards compatible in the transition period by truncating values to at the first surrogate pair. This is how we do that in Presto: public static Slice getMaxSlice(String maximum) { if (maximum == null) { return null; } int index = firstSurrogateCharacter(maximum); if (index == -1) { return Slices.utf8Slice(maximum); } // Append 0xFF so that it is larger than maximum return concatSlices(Slices.utf8Slice(maximum.substring(0, index)), MAX_BYTE); } public static Slice getMinSlice(String minimum) { if (minimum == null) { return null; } int index = firstSurrogateCharacter(minimum); if (index == -1) { return Slices.utf8Slice(minimum); } // truncate the string at the first surrogate character return Slices.utf8Slice(minimum.substring(0, index)); } // returns index of first surrogateCharacter in the string -1 if no surrogate character is found static int firstSurrogateCharacter(String value) { char[] chars = value.toCharArray(); for (int i = 0; i < chars.length; i++) { if (chars[i] >= MIN_SURROGATE) { return i; } } return -1; } The getMaxSliceCode might need to be adjusted depending on how it it used. We are using UTF-8 binary so we can use the illegal 0xFF as the stop byte, but if you want to transition back to Java Strings, we would need to be a bit smarter. -dain > On Jun 6, 2017, at 3:39 PM, Owen O'Malley wrote: > > I'm confused. TimestampStatistics uses integers not strings. > > .. Owen > > On Mon, Jun 5, 2017 at 9:53 PM, Dain Sundstrom wrote: > >> >>> On Dec 12, 2016, at 4:48 PM, Dain Sundstrom wrote: >>> On Dec 12, 2016, at 4:36 PM, Owen O'Malley wrote: > I think this should also be documented in the statistics section which also uses UTF-16 BE, which is at least consistent, but still annoying >> for everything other than Java. Yes, it should be documented and we should replace it with UTF-8. >> (Although changes to the serialized form are always painful.) >>> >>> I think we can do something similar to the bloom filter code, where we >> add a StringUtf8Stats object and have a transition period where we can >> produce both. >> >> I was looking at the change proto changes to TimestampStatistics, and I >> think the same thing could work here. We add: >> >>optional string minimumUtf8 = 4; >>optional string maximumUtf8 = 5; >> >> and the update the writer write just the UTF-8 version (or both during a >> transition). >> >> -dain
Re: "For dictionary encodings the dictionary is sorted"
I'm confused. TimestampStatistics uses integers not strings. .. Owen On Mon, Jun 5, 2017 at 9:53 PM, Dain Sundstrom wrote: > > > On Dec 12, 2016, at 4:48 PM, Dain Sundstrom wrote: > > On Dec 12, 2016, at 4:36 PM, Owen O'Malley wrote: > >>> I think this should also be documented in the statistics section which > >> also uses UTF-16 BE, which is at least consistent, but still annoying > for > >> everything other than Java. > >> > >> Yes, it should be documented and we should replace it with UTF-8. > (Although > >> changes to the serialized form are always painful.) > > > > I think we can do something similar to the bloom filter code, where we > add a StringUtf8Stats object and have a transition period where we can > produce both. > > I was looking at the change proto changes to TimestampStatistics, and I > think the same thing could work here. We add: > > optional string minimumUtf8 = 4; > optional string maximumUtf8 = 5; > > and the update the writer write just the UTF-8 version (or both during a > transition). > > -dain
Re: String stats requirements?
On Tue, Jun 6, 2017 at 3:02 PM, Dain Sundstrom wrote: > Is it required that the StringStatistics min and max be the actual min and > max value for the column? I ask for two reasons, I’d like to be able to > “trim” values if the min or max is very large. Also, as a work around of > for the UTF-16be sorting problem (bug?), I’d like to trim values at the > first surrogate pair, so the value is slightly smaller than the min or > larger than the max, and still a valid UTF-8 sequence. > I agree that we want to be able to trim the values. I've seen cases where the String is huge (~100k) and makes the StringStatistics huge. I'd propose that we do something like: message StringStatistics { optional string minimum = 1; optional string maximum = 2; // sum will store the total length of all strings in a stripe optional sint64 sum = 3; // if set, the minimum will not be set and the lowerBound <= all values optional string lowerBound = 4; // if set, the maximum will not be set and the upperBound >= all values optional string upperBound = 5; } We shouldn't have any UTF16 in ORC. Is there a case where we compare strings that way? In particular, the StringStatistics uses Text, which uses UTF-8 as its encoding. .. Owen > Thoughts? > > -dain > >
String stats requirements?
Is it required that the StringStatistics min and max be the actual min and max value for the column? I ask for two reasons, I’d like to be able to “trim” values if the min or max is very large. Also, as a work around of for the UTF-16be sorting problem (bug?), I’d like to trim values at the first surrogate pair, so the value is slightly smaller than the min or larger than the max, and still a valid UTF-8 sequence. Thoughts? -dain
[GitHub] orc pull request #128: ORC-178 Implement Basic C++ Writer and Writer Option
Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/128#discussion_r120483146 --- Diff: c++/include/orc/Writer.hh --- @@ -0,0 +1,294 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#ifndef ORC_WRITER_HH +#define ORC_WRITER_HH + +#include "orc/Common.hh" +#include "orc/orc-config.hh" +#include "orc/Type.hh" +#include "orc/Vector.hh" + +#include +#include +#include + +namespace orc { + + // classes that hold data members so we can maintain binary compatibility + struct WriterOptionsPrivate; + + enum EncodingStrategy { +EncodingStrategy_SPEED = 0, +EncodingStrategy_COMPRESSION + }; + + enum CompressionStrategy { +CompressionStrategy_SPEED = 0, +CompressionStrategy_COMPRESSION + }; + + enum RleVersion { +RleVersion_1, +RleVersion_2 + }; + + class Timezone; + + /** + * Options for creating a Writer. + */ + class WriterOptions { + private: +ORC_UNIQUE_PTR privateBits; + + public: +WriterOptions(); +WriterOptions(const WriterOptions&); +WriterOptions(WriterOptions&); +WriterOptions& operator=(const WriterOptions&); +virtual ~WriterOptions(); + +/** + * Set the strip size. + */ +WriterOptions& setStripeSize(uint64_t size); + +/** + * Get the strip size. + * @return if not set, return default value. + */ +uint64_t getStripeSize() const; + +/** + * Set the block size. + */ +WriterOptions& setBlockSize(uint64_t size); + +/** + * Get the block size. + * @return if not set, return default value. + */ +uint64_t getBlockSize() const; + +/** + * Set row index stride. + */ +WriterOptions& setRowIndexStride(uint64_t stride); + +/** + * Get the index stride size. + * @return if not set, return default value. + */ +uint64_t getRowIndexStride() const; + +/** + * Set the buffer size. + */ +WriterOptions& setBufferSize(uint64_t size); + +/** + * Get the buffer size. + * @return if not set, return default value. + */ +uint64_t getBufferSize() const; + +/** + * Set the dictionary key size threshold. + * 0 to disable dictionary encoding. + * 1 to always enable dictionary encoding. + */ +WriterOptions& setDictionaryKeySizeThreshold(double val); + +/** + * Get the dictionary key size threshold. + */ +double getDictionaryKeySizeThreshold() const; + +/** + * Set whether or not to have block padding. + */ +WriterOptions& setBlockPadding(bool padding); + +/** + * Get whether or not to have block padding. + * @return if not set, return default value which is false. + */ +bool getBlockPadding() const; + +/** + * Set Run length encoding version + */ +WriterOptions& setRleVersion(RleVersion version); + +/** + * Get Run Length Encoding version + */ +RleVersion getRleVersion() const; + +/** + * Set compression kind. + */ +WriterOptions& setCompression(CompressionKind comp); + +/** + * Get the compression kind. + * @return if not set, return default value which is ZLIB. + */ +CompressionKind getCompression() const; + +/** + * Set the encoding strategy. + */ +WriterOptions& setEncodingStrategy(EncodingStrategy strategy); + +/** + * Get the encoding strategy. + * @return if not set, return default value which is SPEED. + */ +EncodingStrategy getEncodingStrategy() const; + +/**
[GitHub] orc pull request #128: ORC-178 Implement Basic C++ Writer and Writer Option
Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/128#discussion_r120476831 --- Diff: c++/src/Writer.cc --- @@ -0,0 +1,659 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#include "orc/Common.hh" +#include "orc/OrcFile.hh" + +#include "ColumnWriter.hh" +#include "Timezone.hh" + +#include + +namespace orc { + + struct WriterOptionsPrivate { +uint64_t stripeSize; +uint64_t blockSize; +uint64_t rowIndexStride; +uint64_t bufferSize; +bool blockPadding; +CompressionKind compression; +EncodingStrategy encodingStrategy; +CompressionStrategy compressionStrategy; +MemoryPool *memoryPool; +WriterVersion version; +double paddingTolerance; +std::ostream* errorStream; +RleVersion rleVersion; +double dictionaryKeySizeThreshold; +bool enableStats; +bool enableStrStatsCmp; +bool enableIndex; +const Timezone* timezone; + +WriterOptionsPrivate() { + stripeSize = 64 * 1024 * 1024; // 64M + blockSize = 256 * 1024; // 256K + rowIndexStride = 1; + bufferSize = 4 * 1024 * 1024; // 4M + blockPadding = false; + compression = CompressionKind_ZLIB; + encodingStrategy = EncodingStrategy_SPEED; + compressionStrategy = CompressionStrategy_SPEED; + memoryPool = getDefaultPool(); + version = WriterVersion_ORC_135; + paddingTolerance = 0.0; + errorStream = &std::cerr; + rleVersion = RleVersion_1; + dictionaryKeySizeThreshold = 0.0; + enableStats = true; + enableStrStatsCmp = false; + enableIndex = true; + timezone = &getLocalTimezone(); +} + }; + + WriterOptions::WriterOptions(): +privateBits(std::unique_ptr +(new WriterOptionsPrivate())) { +// PASS + } + + WriterOptions::WriterOptions(const WriterOptions& rhs): +privateBits(std::unique_ptr +(new WriterOptionsPrivate(*(rhs.privateBits.get() { +// PASS + } + + WriterOptions::WriterOptions(WriterOptions& rhs) { +// swap privateBits with rhs +WriterOptionsPrivate* l = privateBits.release(); +privateBits.reset(rhs.privateBits.release()); +rhs.privateBits.reset(l); + } + + WriterOptions& WriterOptions::operator=(const WriterOptions& rhs) { +if (this != &rhs) { + privateBits.reset(new WriterOptionsPrivate(*(rhs.privateBits.get(; +} +return *this; + } + + WriterOptions::~WriterOptions() { +// PASS + } + + WriterOptions& WriterOptions::setStripeSize(uint64_t size) { +privateBits->stripeSize = size; +return *this; + } + + uint64_t WriterOptions::getStripeSize() const { +return privateBits->stripeSize; + } + + WriterOptions& WriterOptions::setBlockSize(uint64_t size) { +privateBits->blockSize = size; +return *this; + } + + uint64_t WriterOptions::getBlockSize() const { +return privateBits->blockSize; + } + + WriterOptions& WriterOptions::setRowIndexStride(uint64_t stride) { +privateBits->rowIndexStride = stride; +return *this; + } + + uint64_t WriterOptions::getRowIndexStride() const { +return privateBits->rowIndexStride; + } + + WriterOptions& WriterOptions::setBufferSize(uint64_t size) { +privateBits->bufferSize = size; +return *this; + } + + uint64_t WriterOptions::getBufferSize() const { +return privateBits->bufferSize; + } + + WriterOptions& WriterOptions::setDictionaryKeySizeThreshold(double val) { +privateBits->dictionaryKeySizeThreshold = val; +return *this; + } + + double WriterOptions::getDictionaryKeySizeThreshold() const { +return privateBi
[GitHub] orc pull request #128: ORC-178 Implement Basic C++ Writer and Writer Option
Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/128#discussion_r119640657 --- Diff: c++/include/orc/OrcFile.hh --- @@ -119,6 +120,17 @@ namespace orc { * @param path the name of the file in the local file system */ ORC_UNIQUE_PTR writeLocalFile(const std::string& path); + + /** + * Create a writer to write the ORC file. + * @param type the type of data to be written + * @param stream the stream to write to + * @param options the options for writing the file + */ + ORC_UNIQUE_PTR createWriter( + const Type& type, + OutputStream * stream, --- End diff -- remove space `OutputStream*` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] orc pull request #128: ORC-178 Implement Basic C++ Writer and Writer Option
Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/128#discussion_r120458444 --- Diff: c++/src/ColumnWriter.cc --- @@ -0,0 +1,507 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#include "orc/Int128.hh" +#include "orc/Writer.hh" + +#include "ByteRLE.hh" +#include "ColumnWriter.hh" +#include "RLE.hh" +#include "Statistics.hh" +#include "Timezone.hh" + +namespace orc { + StreamsFactory::~StreamsFactory() { +//PASS + } + + class StreamsFactoryImpl : public StreamsFactory { + public: +StreamsFactoryImpl( + const WriterOptions& writerOptions, + OutputStream * outputStream) : + options(writerOptions), + outStream(outputStream) { + } + +virtual std::unique_ptr +createStream(proto::Stream_Kind kind) override; + private: +const WriterOptions& options; +OutputStream * outStream; + }; + + std::unique_ptr StreamsFactoryImpl::createStream( + proto::Stream_Kind) { +// In the future, we can decide compression strategy and modifier +// based on stream kind. But for now we just use the setting from +// WriterOption +return createCompressor( +options.getCompression(), +outStream, +options.getCompressionStrategy(), +options.getBufferSize(), +options.getBlockSize(), +*options.getMemoryPool()); + } + + std::unique_ptr createStreamsFactory( +const WriterOptions& options, +OutputStream * outStream) { +return std::unique_ptr( + new StreamsFactoryImpl(options, outStream)); + } + + RowIndexPositionRecorder::~RowIndexPositionRecorder() { +// PASS + } + + ColumnWriter::ColumnWriter( + const Type& type, + StreamsFactory& factory, + const WriterOptions& options) : +columnId(type.getColumnId()), +streamsFactory(factory), +colIndexStatistics(), +colStripeStatistics(), +colFileStatistics(), +enableIndex(options.getEnableIndex()), +enableStats(options.getEnableStats()), +rowIndex(), +rowIndexEntry(), +rowIndexPosition(), +memPool(*options.getMemoryPool()), +indexStream() { + +std::unique_ptr presentStream = +factory.createStream(proto::Stream_Kind_PRESENT); +notNullEncoder = createBooleanRleEncoder(std::move(presentStream)); + +if (enableIndex || enableStats) { + bool enableStrCmp = options.getEnableStrStatsCmp(); + colIndexStatistics = createColumnStatistics(type, enableStrCmp); + if (enableStats) { +colStripeStatistics = createColumnStatistics(type, enableStrCmp); +colFileStatistics = createColumnStatistics(type, enableStrCmp); + } +} + +if (enableIndex) { + rowIndex = std::unique_ptr(new proto::RowIndex()); + rowIndexEntry = +std::unique_ptr(new proto::RowIndexEntry()); + rowIndexPosition = std::unique_ptr( + new RowIndexPositionRecorder(*rowIndexEntry)); + indexStream = +factory.
[GitHub] orc pull request #128: ORC-178 Implement Basic C++ Writer and Writer Option
Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/128#discussion_r119641608 --- Diff: c++/src/ColumnWriter.cc --- @@ -0,0 +1,507 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#include "orc/Int128.hh" +#include "orc/Writer.hh" + +#include "ByteRLE.hh" +#include "ColumnWriter.hh" +#include "RLE.hh" +#include "Statistics.hh" +#include "Timezone.hh" + +namespace orc { + StreamsFactory::~StreamsFactory() { +//PASS + } + + class StreamsFactoryImpl : public StreamsFactory { + public: +StreamsFactoryImpl( + const WriterOptions& writerOptions, + OutputStream * outputStream) : --- End diff -- Remove space `OutputStream*` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] orc pull request #128: ORC-178 Implement Basic C++ Writer and Writer Option
Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/128#discussion_r120458268 --- Diff: c++/src/ColumnWriter.cc --- @@ -0,0 +1,507 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#include "orc/Int128.hh" +#include "orc/Writer.hh" + +#include "ByteRLE.hh" +#include "ColumnWriter.hh" +#include "RLE.hh" +#include "Statistics.hh" +#include "Timezone.hh" + +namespace orc { + StreamsFactory::~StreamsFactory() { +//PASS + } + + class StreamsFactoryImpl : public StreamsFactory { + public: +StreamsFactoryImpl( + const WriterOptions& writerOptions, + OutputStream * outputStream) : + options(writerOptions), + outStream(outputStream) { + } + +virtual std::unique_ptr +createStream(proto::Stream_Kind kind) override; + private: +const WriterOptions& options; +OutputStream * outStream; + }; + + std::unique_ptr StreamsFactoryImpl::createStream( + proto::Stream_Kind) { +// In the future, we can decide compression strategy and modifier +// based on stream kind. But for now we just use the setting from +// WriterOption +return createCompressor( +options.getCompression(), +outStream, +options.getCompressionStrategy(), +options.getBufferSize(), +options.getBlockSize(), +*options.getMemoryPool()); + } + + std::unique_ptr createStreamsFactory( +const WriterOptions& options, +OutputStream * outStream) { +return std::unique_ptr( + new StreamsFactoryImpl(options, outStream)); + } + + RowIndexPositionRecorder::~RowIndexPositionRecorder() { +// PASS + } + + ColumnWriter::ColumnWriter( + const Type& type, + StreamsFactory& factory, + const WriterOptions& options) : +columnId(type.getColumnId()), +streamsFactory(factory), +colIndexStatistics(), +colStripeStatistics(), +colFileStatistics(), +enableIndex(options.getEnableIndex()), +enableStats(options.getEnableStats()), +rowIndex(), +rowIndexEntry(), +rowIndexPosition(), +memPool(*options.getMemoryPool()), +indexStream() { + +std::unique_ptr presentStream = +factory.createStream(proto::Stream_Kind_PRESENT); +notNullEncoder = createBooleanRleEncoder(std::move(presentStream)); + +if (enableIndex || enableStats) { + bool enableStrCmp = options.getEnableStrStatsCmp(); + colIndexStatistics = createColumnStatistics(type, enableStrCmp); + if (enableStats) { +colStripeStatistics = createColumnStatistics(type, enableStrCmp); +colFileStatistics = createColumnStatistics(type, enableStrCmp); + } +} + +if (enableIndex) { + rowIndex = std::unique_ptr(new proto::RowIndex()); + rowIndexEntry = +std::unique_ptr(new proto::RowIndexEntry()); + rowIndexPosition = std::unique_ptr( + new RowIndexPositionRecorder(*rowIndexEntry)); + indexStream = +factory.
[GitHub] orc pull request #128: ORC-178 Implement Basic C++ Writer and Writer Option
Github user majetideepak commented on a diff in the pull request: https://github.com/apache/orc/pull/128#discussion_r120483772 --- Diff: c++/src/ColumnWriter.cc --- @@ -0,0 +1,507 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#include "orc/Int128.hh" +#include "orc/Writer.hh" + +#include "ByteRLE.hh" +#include "ColumnWriter.hh" +#include "RLE.hh" +#include "Statistics.hh" +#include "Timezone.hh" + +namespace orc { + StreamsFactory::~StreamsFactory() { +//PASS + } + + class StreamsFactoryImpl : public StreamsFactory { + public: +StreamsFactoryImpl( + const WriterOptions& writerOptions, + OutputStream * outputStream) : + options(writerOptions), + outStream(outputStream) { + } + +virtual std::unique_ptr +createStream(proto::Stream_Kind kind) override; + private: +const WriterOptions& options; +OutputStream * outStream; + }; + + std::unique_ptr StreamsFactoryImpl::createStream( + proto::Stream_Kind) { +// In the future, we can decide compression strategy and modifier +// based on stream kind. But for now we just use the setting from +// WriterOption +return createCompressor( +options.getCompression(), +outStream, +options.getCompressionStrategy(), +options.getBufferSize(), +options.getBlockSize(), +*options.getMemoryPool()); + } + + std::unique_ptr createStreamsFactory( +const WriterOptions& options, +OutputStream * outStream) { +return std::unique_ptr( + new StreamsFactoryImpl(options, outStream)); + } + + RowIndexPositionRecorder::~RowIndexPositionRecorder() { +// PASS + } + + ColumnWriter::ColumnWriter( + const Type& type, + StreamsFactory& factory, + const WriterOptions& options) : +columnId(type.getColumnId()), +streamsFactory(factory), +colIndexStatistics(), +colStripeStatistics(), +colFileStatistics(), +enableIndex(options.getEnableIndex()), +enableStats(options.getEnableStats()), +rowIndex(), +rowIndexEntry(), +rowIndexPosition(), +memPool(*options.getMemoryPool()), +indexStream() { + +std::unique_ptr presentStream = +factory.createStream(proto::Stream_Kind_PRESENT); +notNullEncoder = createBooleanRleEncoder(std::move(presentStream)); + +if (enableIndex || enableStats) { + bool enableStrCmp = options.getEnableStrStatsCmp(); + colIndexStatistics = createColumnStatistics(type, enableStrCmp); + if (enableStats) { +colStripeStatistics = createColumnStatistics(type, enableStrCmp); +colFileStatistics = createColumnStatistics(type, enableStrCmp); + } +} + +if (enableIndex) { + rowIndex = std::unique_ptr(new proto::RowIndex()); + rowIndexEntry = +std::unique_ptr(new proto::RowIndexEntry()); + rowIndexPosition = std::unique_ptr( + new RowIndexPositionRecorder(*rowIndexEntry)); + indexStream = +factory.