Re: What did ORC writer v3 fix?

2017-06-06 Thread Owen O'Malley
The writer version was bumped because it was major change to the writer. We 
were concerned that if there was a bug, that we would need to detect it.

.. Owen

> On Jun 6, 2017, at 19:24, Dain Sundstrom  wrote:
> 
> On reading the HIVE-12055 referenced from the docs for writer version 3, I’m 
> not sure what was fixed.  Does any one remember?
> 
> -dain


What did ORC writer v3 fix?

2017-06-06 Thread Dain Sundstrom
On reading the HIVE-12055 referenced from the docs for writer version 3, I’m 
not sure what was fixed.  Does any one remember?

-dain

Re: String stats requirements?

2017-06-06 Thread Gopal Vijayaraghavan
> I agree that we want to be able to trim the values. I've seen cases where
>  the String is huge (~100k) and makes the StringStatistics huge. I'd propose
>  that we do something like:

The only concrete consumer of this data outside of ORC readers is probably
partial scan computation of statistics from the footers.

In some cases, I find it better to avoid computing min-max ranges, when the 
strings 
exceed a useful length as keeping that updated involves a comparison for every
new row.

Long json strings or URLs usually are slower to write simply from this 
comparison.

So this is a great idea, with the appropriate indication to the partial scan 
reader 
not to update stats for those columns.

Cheers,
Gopal





[GitHub] orc pull request #129: Document writer version switch to UTF-8 stats

2017-06-06 Thread dain
GitHub user dain opened a pull request:

https://github.com/apache/orc/pull/129

Document writer version switch to UTF-8 stats

Note that ORC writer version 1 switched from UTF-16be string stats to UTF-8 
stats.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dain/orc patch-2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/orc/pull/129.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #129


commit 76905f1c37b91c61fe38c5ed47d8a0536d25eb36
Author: Dain Sundstrom 
Date:   2017-06-07T01:24:22Z

Document writer version switch to UTF-8 stats

Note that ORC writer version 1 switched from UTF-16be string stats to UTF-8 
stats.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


Re: String stats requirements?

2017-06-06 Thread Owen O'Malley
Yes, HIVE-7144 went in before HIVE-8732 so any file with WriterVersion >= 1
should be UTF-8 in the statistics.

.. Owen

On Tue, Jun 6, 2017 at 4:05 PM, Dain Sundstrom  wrote:

> Ah I see.  I can’t believe I missed this fix :)
>
> Our reader was originally written in the 0.13 days, and which used Strings
> for stats.  This is the commit that changed everything to text and I
> believe it went out with Hive 0.14:
>
>   https://github.com/apache/hive/commit/6072e3aed88d9246e1130abadf3c15
> a88e975b4e#diff-340d190f994d92658b24aae1edf610b3
>
> Is writer version "1 = HIVE-8732 fixed” after 0.14?  If so I can update my
> reader to detect this.
>
> -dain
>
> > On Jun 6, 2017, at 3:36 PM, Owen O'Malley 
> wrote:
> >
> > On Tue, Jun 6, 2017 at 3:02 PM, Dain Sundstrom  wrote:
> >
> >> Is it required that the StringStatistics min and max be the actual min
> and
> >> max value for the column?  I ask for two reasons, I’d like to be able to
> >> “trim” values if the min or max is very large.  Also, as a work around
> of
> >> for the UTF-16be sorting problem (bug?), I’d like to trim values at the
> >> first surrogate pair, so the value is slightly smaller than the min or
> >> larger than the max, and still a valid UTF-8 sequence.
> >>
> >
> > I agree that we want to be able to trim the values. I've seen cases where
> > the String is huge (~100k) and makes the StringStatistics huge. I'd
> propose
> > that we do something like:
> >
> > message StringStatistics {
> >  optional string minimum = 1;
> >  optional string maximum = 2;
> >  // sum will store the total length of all strings in a stripe
> >  optional sint64 sum = 3;
> >  // if set, the minimum will not be set and the lowerBound <= all values
> >  optional string lowerBound = 4;
> >  // if set, the maximum will not be set and the upperBound >= all values
> >  optional string upperBound = 5;
> > }
> >
> > We shouldn't have any UTF16 in ORC. Is there a case where we compare
> > strings that way? In particular, the StringStatistics uses Text, which
> uses
> > UTF-8 as its encoding.
> >
> > .. Owen
> >
> >
> >> Thoughts?
> >>
> >> -dain
> >>
> >>
>
>


Re: String stats requirements?

2017-06-06 Thread Dain Sundstrom
Ah I see.  I can’t believe I missed this fix :)  

Our reader was originally written in the 0.13 days, and which used Strings for 
stats.  This is the commit that changed everything to text and I believe it 
went out with Hive 0.14:

  
https://github.com/apache/hive/commit/6072e3aed88d9246e1130abadf3c15a88e975b4e#diff-340d190f994d92658b24aae1edf610b3

Is writer version "1 = HIVE-8732 fixed” after 0.14?  If so I can update my 
reader to detect this.

-dain

> On Jun 6, 2017, at 3:36 PM, Owen O'Malley  wrote:
> 
> On Tue, Jun 6, 2017 at 3:02 PM, Dain Sundstrom  wrote:
> 
>> Is it required that the StringStatistics min and max be the actual min and
>> max value for the column?  I ask for two reasons, I’d like to be able to
>> “trim” values if the min or max is very large.  Also, as a work around of
>> for the UTF-16be sorting problem (bug?), I’d like to trim values at the
>> first surrogate pair, so the value is slightly smaller than the min or
>> larger than the max, and still a valid UTF-8 sequence.
>> 
> 
> I agree that we want to be able to trim the values. I've seen cases where
> the String is huge (~100k) and makes the StringStatistics huge. I'd propose
> that we do something like:
> 
> message StringStatistics {
>  optional string minimum = 1;
>  optional string maximum = 2;
>  // sum will store the total length of all strings in a stripe
>  optional sint64 sum = 3;
>  // if set, the minimum will not be set and the lowerBound <= all values
>  optional string lowerBound = 4;
>  // if set, the maximum will not be set and the upperBound >= all values
>  optional string upperBound = 5;
> }
> 
> We shouldn't have any UTF16 in ORC. Is there a case where we compare
> strings that way? In particular, the StringStatistics uses Text, which uses
> UTF-8 as its encoding.
> 
> .. Owen
> 
> 
>> Thoughts?
>> 
>> -dain
>> 
>> 



Re: "For dictionary encodings the dictionary is sorted"

2017-06-06 Thread Dain Sundstrom
Sorry, I should have given a full example.  I mean we could change 
StringStatistics as follows:

message StringStatistics {
  optional string minimum = 1;
  optional string maximum = 2;
  optional sint64 sum = 3;
  optional string minimumUtf8 = 4;
  optional string maximumUtf8 = 5;
}

New writers would fill in both in the transition period, and then just the UTF8 
ones.  Then we update readers to the favor the new ones.

We can also make this backwards compatible in the transition period by 
truncating values to at the first surrogate pair.  This is how we do that in 
Presto:

public static Slice getMaxSlice(String maximum)
{
if (maximum == null) {
return null;
}

int index = firstSurrogateCharacter(maximum);
if (index == -1) {
return Slices.utf8Slice(maximum);
}
// Append 0xFF so that it is larger than maximum
return concatSlices(Slices.utf8Slice(maximum.substring(0, index)), 
MAX_BYTE);
}

public static Slice getMinSlice(String minimum)
{
if (minimum == null) {
return null;
}

int index = firstSurrogateCharacter(minimum);
if (index == -1) {
return Slices.utf8Slice(minimum);
}
// truncate the string at the first surrogate character
return Slices.utf8Slice(minimum.substring(0, index));
}

// returns index of first surrogateCharacter in the string -1 if no surrogate 
character is found
static int firstSurrogateCharacter(String value)
{
char[] chars = value.toCharArray();
for (int i = 0; i < chars.length; i++) {
if (chars[i] >= MIN_SURROGATE) {
return i;
}
}
return -1;
}

The getMaxSliceCode might need to be adjusted depending on how it it used.  We 
are using UTF-8 binary so we can use the illegal 0xFF as the stop byte, but if 
you want to transition back to Java Strings, we would need to be a bit smarter.

-dain

> On Jun 6, 2017, at 3:39 PM, Owen O'Malley  wrote:
> 
> I'm confused. TimestampStatistics uses integers not strings.
> 
> .. Owen
> 
> On Mon, Jun 5, 2017 at 9:53 PM, Dain Sundstrom  wrote:
> 
>> 
>>> On Dec 12, 2016, at 4:48 PM, Dain Sundstrom  wrote:
>>> On Dec 12, 2016, at 4:36 PM, Owen O'Malley  wrote:
> I think this should also be documented in the statistics section which
 also uses UTF-16 BE, which is at least consistent, but still annoying
>> for
 everything other than Java.
 
 Yes, it should be documented and we should replace it with UTF-8.
>> (Although
 changes to the serialized form are always painful.)
>>> 
>>> I think we can do something similar to the bloom filter code, where we
>> add a StringUtf8Stats object and have a transition period where we can
>> produce both.
>> 
>> I was looking at the change proto changes to TimestampStatistics, and I
>> think the same thing could work here.  We add:
>> 
>>optional string minimumUtf8 = 4;
>>optional string maximumUtf8 = 5;
>> 
>> and the update the writer write just the UTF-8 version (or both during a
>> transition).
>> 
>> -dain



Re: "For dictionary encodings the dictionary is sorted"

2017-06-06 Thread Owen O'Malley
I'm confused. TimestampStatistics uses integers not strings.

.. Owen

On Mon, Jun 5, 2017 at 9:53 PM, Dain Sundstrom  wrote:

>
> > On Dec 12, 2016, at 4:48 PM, Dain Sundstrom  wrote:
> > On Dec 12, 2016, at 4:36 PM, Owen O'Malley  wrote:
> >>> I think this should also be documented in the statistics section which
> >> also uses UTF-16 BE, which is at least consistent, but still annoying
> for
> >> everything other than Java.
> >>
> >> Yes, it should be documented and we should replace it with UTF-8.
> (Although
> >> changes to the serialized form are always painful.)
> >
> > I think we can do something similar to the bloom filter code, where we
> add a StringUtf8Stats object and have a transition period where we can
> produce both.
>
> I was looking at the change proto changes to TimestampStatistics, and I
> think the same thing could work here.  We add:
>
> optional string minimumUtf8 = 4;
> optional string maximumUtf8 = 5;
>
> and the update the writer write just the UTF-8 version (or both during a
> transition).
>
> -dain


Re: String stats requirements?

2017-06-06 Thread Owen O'Malley
On Tue, Jun 6, 2017 at 3:02 PM, Dain Sundstrom  wrote:

> Is it required that the StringStatistics min and max be the actual min and
> max value for the column?  I ask for two reasons, I’d like to be able to
> “trim” values if the min or max is very large.  Also, as a work around of
> for the UTF-16be sorting problem (bug?), I’d like to trim values at the
> first surrogate pair, so the value is slightly smaller than the min or
> larger than the max, and still a valid UTF-8 sequence.
>

I agree that we want to be able to trim the values. I've seen cases where
the String is huge (~100k) and makes the StringStatistics huge. I'd propose
that we do something like:

message StringStatistics {
  optional string minimum = 1;
  optional string maximum = 2;
  // sum will store the total length of all strings in a stripe
  optional sint64 sum = 3;
  // if set, the minimum will not be set and the lowerBound <= all values
  optional string lowerBound = 4;
  // if set, the maximum will not be set and the upperBound >= all values
  optional string upperBound = 5;
}

We shouldn't have any UTF16 in ORC. Is there a case where we compare
strings that way? In particular, the StringStatistics uses Text, which uses
UTF-8 as its encoding.

.. Owen


> Thoughts?
>
> -dain
>
>


String stats requirements?

2017-06-06 Thread Dain Sundstrom
Is it required that the StringStatistics min and max be the actual min and max 
value for the column?  I ask for two reasons, I’d like to be able to “trim” 
values if the min or max is very large.  Also, as a work around of for the 
UTF-16be sorting problem (bug?), I’d like to trim values at the first surrogate 
pair, so the value is slightly smaller than the min or larger than the max, and 
still a valid UTF-8 sequence.

Thoughts?

-dain



[GitHub] orc pull request #128: ORC-178 Implement Basic C++ Writer and Writer Option

2017-06-06 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/128#discussion_r120483146
  
--- Diff: c++/include/orc/Writer.hh ---
@@ -0,0 +1,294 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef ORC_WRITER_HH
+#define ORC_WRITER_HH
+
+#include "orc/Common.hh"
+#include "orc/orc-config.hh"
+#include "orc/Type.hh"
+#include "orc/Vector.hh"
+
+#include 
+#include 
+#include 
+
+namespace orc {
+
+  // classes that hold data members so we can maintain binary compatibility
+  struct WriterOptionsPrivate;
+
+  enum EncodingStrategy {
+EncodingStrategy_SPEED = 0,
+EncodingStrategy_COMPRESSION
+  };
+
+  enum CompressionStrategy {
+CompressionStrategy_SPEED = 0,
+CompressionStrategy_COMPRESSION
+  };
+
+  enum RleVersion {
+RleVersion_1,
+RleVersion_2
+  };
+
+  class Timezone;
+
+  /**
+   * Options for creating a Writer.
+   */
+  class WriterOptions {
+  private:
+ORC_UNIQUE_PTR privateBits;
+
+  public:
+WriterOptions();
+WriterOptions(const WriterOptions&);
+WriterOptions(WriterOptions&);
+WriterOptions& operator=(const WriterOptions&);
+virtual ~WriterOptions();
+
+/**
+ * Set the strip size.
+ */
+WriterOptions& setStripeSize(uint64_t size);
+
+/**
+ * Get the strip size.
+ * @return if not set, return default value.
+ */
+uint64_t getStripeSize() const;
+
+/**
+ * Set the block size.
+ */
+WriterOptions& setBlockSize(uint64_t size);
+
+/**
+ * Get the block size.
+ * @return if not set, return default value.
+ */
+uint64_t getBlockSize() const;
+
+/**
+ * Set row index stride.
+ */
+WriterOptions& setRowIndexStride(uint64_t stride);
+
+/**
+ * Get the index stride size.
+ * @return if not set, return default value.
+ */
+uint64_t getRowIndexStride() const;
+
+/**
+ * Set the buffer size.
+ */
+WriterOptions& setBufferSize(uint64_t size);
+
+/**
+ * Get the buffer size.
+ * @return if not set, return default value.
+ */
+uint64_t getBufferSize() const;
+
+/**
+ * Set the dictionary key size threshold.
+ * 0 to disable dictionary encoding.
+ * 1 to always enable dictionary encoding.
+ */
+WriterOptions& setDictionaryKeySizeThreshold(double val);
+
+/**
+ * Get the dictionary key size threshold.
+ */
+double getDictionaryKeySizeThreshold() const;
+
+/**
+ * Set whether or not to have block padding.
+ */
+WriterOptions& setBlockPadding(bool padding);
+
+/**
+ * Get whether or not to have block padding.
+ * @return if not set, return default value which is false.
+ */
+bool getBlockPadding() const;
+
+/**
+ * Set Run length encoding version
+ */
+WriterOptions& setRleVersion(RleVersion version);
+
+/**
+ * Get Run Length Encoding version
+ */
+RleVersion getRleVersion() const;
+
+/**
+ * Set compression kind.
+ */
+WriterOptions& setCompression(CompressionKind comp);
+
+/**
+ * Get the compression kind.
+ * @return if not set, return default value which is ZLIB.
+ */
+CompressionKind getCompression() const;
+
+/**
+ * Set the encoding strategy.
+ */
+WriterOptions& setEncodingStrategy(EncodingStrategy strategy);
+
+/**
+ * Get the encoding strategy.
+ * @return if not set, return default value which is SPEED.
+ */
+EncodingStrategy getEncodingStrategy() const;
+
+/**

[GitHub] orc pull request #128: ORC-178 Implement Basic C++ Writer and Writer Option

2017-06-06 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/128#discussion_r120476831
  
--- Diff: c++/src/Writer.cc ---
@@ -0,0 +1,659 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "orc/Common.hh"
+#include "orc/OrcFile.hh"
+
+#include "ColumnWriter.hh"
+#include "Timezone.hh"
+
+#include 
+
+namespace orc {
+
+  struct WriterOptionsPrivate {
+uint64_t stripeSize;
+uint64_t blockSize;
+uint64_t rowIndexStride;
+uint64_t bufferSize;
+bool blockPadding;
+CompressionKind compression;
+EncodingStrategy encodingStrategy;
+CompressionStrategy compressionStrategy;
+MemoryPool *memoryPool;
+WriterVersion version;
+double paddingTolerance;
+std::ostream* errorStream;
+RleVersion rleVersion;
+double dictionaryKeySizeThreshold;
+bool enableStats;
+bool enableStrStatsCmp;
+bool enableIndex;
+const Timezone* timezone;
+
+WriterOptionsPrivate() {
+  stripeSize = 64 * 1024 * 1024; // 64M
+  blockSize = 256 * 1024; // 256K
+  rowIndexStride = 1;
+  bufferSize = 4 * 1024 * 1024; // 4M
+  blockPadding = false;
+  compression = CompressionKind_ZLIB;
+  encodingStrategy = EncodingStrategy_SPEED;
+  compressionStrategy = CompressionStrategy_SPEED;
+  memoryPool = getDefaultPool();
+  version = WriterVersion_ORC_135;
+  paddingTolerance = 0.0;
+  errorStream = &std::cerr;
+  rleVersion = RleVersion_1;
+  dictionaryKeySizeThreshold = 0.0;
+  enableStats = true;
+  enableStrStatsCmp = false;
+  enableIndex = true;
+  timezone = &getLocalTimezone();
+}
+  };
+
+  WriterOptions::WriterOptions():
+privateBits(std::unique_ptr
+(new WriterOptionsPrivate())) {
+// PASS
+  }
+
+  WriterOptions::WriterOptions(const WriterOptions& rhs):
+privateBits(std::unique_ptr
+(new WriterOptionsPrivate(*(rhs.privateBits.get() {
+// PASS
+  }
+
+  WriterOptions::WriterOptions(WriterOptions& rhs) {
+// swap privateBits with rhs
+WriterOptionsPrivate* l = privateBits.release();
+privateBits.reset(rhs.privateBits.release());
+rhs.privateBits.reset(l);
+  }
+
+  WriterOptions& WriterOptions::operator=(const WriterOptions& rhs) {
+if (this != &rhs) {
+  privateBits.reset(new 
WriterOptionsPrivate(*(rhs.privateBits.get(;
+}
+return *this;
+  }
+
+  WriterOptions::~WriterOptions() {
+// PASS
+  }
+
+  WriterOptions& WriterOptions::setStripeSize(uint64_t size) {
+privateBits->stripeSize = size;
+return *this;
+  }
+
+  uint64_t WriterOptions::getStripeSize() const {
+return privateBits->stripeSize;
+  }
+
+  WriterOptions& WriterOptions::setBlockSize(uint64_t size) {
+privateBits->blockSize = size;
+return *this;
+  }
+
+  uint64_t WriterOptions::getBlockSize() const {
+return privateBits->blockSize;
+  }
+
+  WriterOptions& WriterOptions::setRowIndexStride(uint64_t stride) {
+privateBits->rowIndexStride = stride;
+return *this;
+  }
+
+  uint64_t WriterOptions::getRowIndexStride() const {
+return privateBits->rowIndexStride;
+  }
+
+  WriterOptions& WriterOptions::setBufferSize(uint64_t size) {
+privateBits->bufferSize = size;
+return *this;
+  }
+
+  uint64_t WriterOptions::getBufferSize() const {
+return privateBits->bufferSize;
+  }
+
+  WriterOptions& WriterOptions::setDictionaryKeySizeThreshold(double val) {
+privateBits->dictionaryKeySizeThreshold = val;
+return *this;
+  }
+
+  double WriterOptions::getDictionaryKeySizeThreshold() const {
+return privateBi

[GitHub] orc pull request #128: ORC-178 Implement Basic C++ Writer and Writer Option

2017-06-06 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/128#discussion_r119640657
  
--- Diff: c++/include/orc/OrcFile.hh ---
@@ -119,6 +120,17 @@ namespace orc {
* @param path the name of the file in the local file system
*/
   ORC_UNIQUE_PTR writeLocalFile(const std::string& path);
+
+  /**
+   * Create a writer to write the ORC file.
+   * @param type the type of data to be written
+   * @param stream the stream to write to
+   * @param options the options for writing the file
+   */
+  ORC_UNIQUE_PTR createWriter(
+  const Type& type,
+  OutputStream * stream,
--- End diff --

remove space `OutputStream*`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] orc pull request #128: ORC-178 Implement Basic C++ Writer and Writer Option

2017-06-06 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/128#discussion_r120458444
  
--- Diff: c++/src/ColumnWriter.cc ---
@@ -0,0 +1,507 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "orc/Int128.hh"
+#include "orc/Writer.hh"
+
+#include "ByteRLE.hh"
+#include "ColumnWriter.hh"
+#include "RLE.hh"
+#include "Statistics.hh"
+#include "Timezone.hh"
+
+namespace orc {
+  StreamsFactory::~StreamsFactory() {
+//PASS
+  }
+
+  class StreamsFactoryImpl : public StreamsFactory {
+  public:
+StreamsFactoryImpl(
+   const WriterOptions& writerOptions,
+   OutputStream * outputStream) :
+   options(writerOptions),
+   outStream(outputStream) {
+   }
+
+virtual std::unique_ptr
+createStream(proto::Stream_Kind kind) override;
+  private:
+const WriterOptions& options;
+OutputStream * outStream;
+  };
+
+  std::unique_ptr StreamsFactoryImpl::createStream(
+  proto::Stream_Kind) {
+// In the future, we can decide compression strategy and modifier
+// based on stream kind. But for now we just use the setting from
+// WriterOption
+return createCompressor(
+options.getCompression(),
+outStream,
+options.getCompressionStrategy(),
+options.getBufferSize(),
+options.getBlockSize(),
+*options.getMemoryPool());
+  }
+
+  std::unique_ptr createStreamsFactory(
+const WriterOptions& options,
+OutputStream * outStream) {
+return std::unique_ptr(
+   new StreamsFactoryImpl(options, 
outStream));
+  }
+
+  RowIndexPositionRecorder::~RowIndexPositionRecorder() {
+// PASS
+  }
+
+  ColumnWriter::ColumnWriter(
+ const Type& type,
+ StreamsFactory& factory,
+ const WriterOptions& options) :
+columnId(type.getColumnId()),
+streamsFactory(factory),
+colIndexStatistics(),
+colStripeStatistics(),
+colFileStatistics(),
+enableIndex(options.getEnableIndex()),
+enableStats(options.getEnableStats()),
+rowIndex(),
+rowIndexEntry(),
+rowIndexPosition(),
+memPool(*options.getMemoryPool()),
+indexStream() {
+
+std::unique_ptr presentStream =
+factory.createStream(proto::Stream_Kind_PRESENT);
+notNullEncoder = createBooleanRleEncoder(std::move(presentStream));
+
+if (enableIndex || enableStats) {
+  bool enableStrCmp = options.getEnableStrStatsCmp();
+  colIndexStatistics = createColumnStatistics(type, enableStrCmp);
+  if (enableStats) {
+colStripeStatistics = createColumnStatistics(type, enableStrCmp);
+colFileStatistics = createColumnStatistics(type, enableStrCmp);
+  }
+}
+
+if (enableIndex) {
+  rowIndex = std::unique_ptr(new proto::RowIndex());
+  rowIndexEntry =
+std::unique_ptr(new proto::RowIndexEntry());
+  rowIndexPosition = std::unique_ptr(
+ new RowIndexPositionRecorder(*rowIndexEntry));
+  indexStream =
+factory.

[GitHub] orc pull request #128: ORC-178 Implement Basic C++ Writer and Writer Option

2017-06-06 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/128#discussion_r119641608
  
--- Diff: c++/src/ColumnWriter.cc ---
@@ -0,0 +1,507 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "orc/Int128.hh"
+#include "orc/Writer.hh"
+
+#include "ByteRLE.hh"
+#include "ColumnWriter.hh"
+#include "RLE.hh"
+#include "Statistics.hh"
+#include "Timezone.hh"
+
+namespace orc {
+  StreamsFactory::~StreamsFactory() {
+//PASS
+  }
+
+  class StreamsFactoryImpl : public StreamsFactory {
+  public:
+StreamsFactoryImpl(
+   const WriterOptions& writerOptions,
+   OutputStream * outputStream) :
--- End diff --

Remove space `OutputStream*`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] orc pull request #128: ORC-178 Implement Basic C++ Writer and Writer Option

2017-06-06 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/128#discussion_r120458268
  
--- Diff: c++/src/ColumnWriter.cc ---
@@ -0,0 +1,507 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "orc/Int128.hh"
+#include "orc/Writer.hh"
+
+#include "ByteRLE.hh"
+#include "ColumnWriter.hh"
+#include "RLE.hh"
+#include "Statistics.hh"
+#include "Timezone.hh"
+
+namespace orc {
+  StreamsFactory::~StreamsFactory() {
+//PASS
+  }
+
+  class StreamsFactoryImpl : public StreamsFactory {
+  public:
+StreamsFactoryImpl(
+   const WriterOptions& writerOptions,
+   OutputStream * outputStream) :
+   options(writerOptions),
+   outStream(outputStream) {
+   }
+
+virtual std::unique_ptr
+createStream(proto::Stream_Kind kind) override;
+  private:
+const WriterOptions& options;
+OutputStream * outStream;
+  };
+
+  std::unique_ptr StreamsFactoryImpl::createStream(
+  proto::Stream_Kind) {
+// In the future, we can decide compression strategy and modifier
+// based on stream kind. But for now we just use the setting from
+// WriterOption
+return createCompressor(
+options.getCompression(),
+outStream,
+options.getCompressionStrategy(),
+options.getBufferSize(),
+options.getBlockSize(),
+*options.getMemoryPool());
+  }
+
+  std::unique_ptr createStreamsFactory(
+const WriterOptions& options,
+OutputStream * outStream) {
+return std::unique_ptr(
+   new StreamsFactoryImpl(options, 
outStream));
+  }
+
+  RowIndexPositionRecorder::~RowIndexPositionRecorder() {
+// PASS
+  }
+
+  ColumnWriter::ColumnWriter(
+ const Type& type,
+ StreamsFactory& factory,
+ const WriterOptions& options) :
+columnId(type.getColumnId()),
+streamsFactory(factory),
+colIndexStatistics(),
+colStripeStatistics(),
+colFileStatistics(),
+enableIndex(options.getEnableIndex()),
+enableStats(options.getEnableStats()),
+rowIndex(),
+rowIndexEntry(),
+rowIndexPosition(),
+memPool(*options.getMemoryPool()),
+indexStream() {
+
+std::unique_ptr presentStream =
+factory.createStream(proto::Stream_Kind_PRESENT);
+notNullEncoder = createBooleanRleEncoder(std::move(presentStream));
+
+if (enableIndex || enableStats) {
+  bool enableStrCmp = options.getEnableStrStatsCmp();
+  colIndexStatistics = createColumnStatistics(type, enableStrCmp);
+  if (enableStats) {
+colStripeStatistics = createColumnStatistics(type, enableStrCmp);
+colFileStatistics = createColumnStatistics(type, enableStrCmp);
+  }
+}
+
+if (enableIndex) {
+  rowIndex = std::unique_ptr(new proto::RowIndex());
+  rowIndexEntry =
+std::unique_ptr(new proto::RowIndexEntry());
+  rowIndexPosition = std::unique_ptr(
+ new RowIndexPositionRecorder(*rowIndexEntry));
+  indexStream =
+factory.

[GitHub] orc pull request #128: ORC-178 Implement Basic C++ Writer and Writer Option

2017-06-06 Thread majetideepak
Github user majetideepak commented on a diff in the pull request:

https://github.com/apache/orc/pull/128#discussion_r120483772
  
--- Diff: c++/src/ColumnWriter.cc ---
@@ -0,0 +1,507 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "orc/Int128.hh"
+#include "orc/Writer.hh"
+
+#include "ByteRLE.hh"
+#include "ColumnWriter.hh"
+#include "RLE.hh"
+#include "Statistics.hh"
+#include "Timezone.hh"
+
+namespace orc {
+  StreamsFactory::~StreamsFactory() {
+//PASS
+  }
+
+  class StreamsFactoryImpl : public StreamsFactory {
+  public:
+StreamsFactoryImpl(
+   const WriterOptions& writerOptions,
+   OutputStream * outputStream) :
+   options(writerOptions),
+   outStream(outputStream) {
+   }
+
+virtual std::unique_ptr
+createStream(proto::Stream_Kind kind) override;
+  private:
+const WriterOptions& options;
+OutputStream * outStream;
+  };
+
+  std::unique_ptr StreamsFactoryImpl::createStream(
+  proto::Stream_Kind) {
+// In the future, we can decide compression strategy and modifier
+// based on stream kind. But for now we just use the setting from
+// WriterOption
+return createCompressor(
+options.getCompression(),
+outStream,
+options.getCompressionStrategy(),
+options.getBufferSize(),
+options.getBlockSize(),
+*options.getMemoryPool());
+  }
+
+  std::unique_ptr createStreamsFactory(
+const WriterOptions& options,
+OutputStream * outStream) {
+return std::unique_ptr(
+   new StreamsFactoryImpl(options, 
outStream));
+  }
+
+  RowIndexPositionRecorder::~RowIndexPositionRecorder() {
+// PASS
+  }
+
+  ColumnWriter::ColumnWriter(
+ const Type& type,
+ StreamsFactory& factory,
+ const WriterOptions& options) :
+columnId(type.getColumnId()),
+streamsFactory(factory),
+colIndexStatistics(),
+colStripeStatistics(),
+colFileStatistics(),
+enableIndex(options.getEnableIndex()),
+enableStats(options.getEnableStats()),
+rowIndex(),
+rowIndexEntry(),
+rowIndexPosition(),
+memPool(*options.getMemoryPool()),
+indexStream() {
+
+std::unique_ptr presentStream =
+factory.createStream(proto::Stream_Kind_PRESENT);
+notNullEncoder = createBooleanRleEncoder(std::move(presentStream));
+
+if (enableIndex || enableStats) {
+  bool enableStrCmp = options.getEnableStrStatsCmp();
+  colIndexStatistics = createColumnStatistics(type, enableStrCmp);
+  if (enableStats) {
+colStripeStatistics = createColumnStatistics(type, enableStrCmp);
+colFileStatistics = createColumnStatistics(type, enableStrCmp);
+  }
+}
+
+if (enableIndex) {
+  rowIndex = std::unique_ptr(new proto::RowIndex());
+  rowIndexEntry =
+std::unique_ptr(new proto::RowIndexEntry());
+  rowIndexPosition = std::unique_ptr(
+ new RowIndexPositionRecorder(*rowIndexEntry));
+  indexStream =
+factory.