[GitHub] orc issue #275: ORC-371: [C++] Disable Libhdfspp build when Cyrus SASL is no...

2018-06-18 Thread jamesclampffer
Github user jamesclampffer commented on the issue:

https://github.com/apache/orc/pull/275
  
@majetideepak it looks like the build still fails with this in place.  
OrcHdfsFile.cc includes hdfspp.h, which won't exist if libhdfs++ isn't 
extracted.

```
/apache_orc/build$ cmake .. -DBUILD_JAVA=OFF
-- No build type selected, default to ReleaseWithDebugInfo
-- compiler GNU version 4.9.4
-- Could NOT find CYRUS_SASL (missing: CYRUS_SASL_SHARED_LIB CYRUS_SASL_INCLUDE_DIR)
-- WARNING: Libhdfs++ library was not built because the required CyrusSASL library was not found
-- Configuring done
-- Generating done
-- Build files have been written to: /apache_orc/build
/apache_orc/build$ make package
[  7%] Built target googletest_ep
[ 14%] Built target snappy_ep
[ 21%] Built target zlib_ep
[ 28%] Built target lz4_ep
[ 35%] Built target protobuf_ep
[ 36%] Building CXX object c++/src/CMakeFiles/orc.dir/OrcHdfsFile.cc.o
/apache_orc/c++/src/OrcHdfsFile.cc:32:27: fatal error: hdfspp/hdfspp.h: No such file or directory
 #include "hdfspp/hdfspp.h"
                           ^
compilation terminated.
c++/src/CMakeFiles/orc.dir/build.make:693: recipe for target 'c++/src/CMakeFiles/orc.dir/OrcHdfsFile.cc.o' failed

```
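
For context, a minimal sketch of the include-guard idea. BUILD_LIBHDFSPP is a 
hypothetical macro, not one the ORC build defines; the likelier fix is to 
exclude OrcHdfsFile.cc from the orc target in CMake whenever libhdfs++ is 
disabled, but the guard shows the dependency that has to be cut either way:
```
// Hypothetical sketch: compile the HDFS reader only when libhdfs++ was
// actually extracted and built. BUILD_LIBHDFSPP is an assumed flag, not
// part of the current ORC build.
#ifdef BUILD_LIBHDFSPP
  #include "hdfspp/hdfspp.h"
  // ... HdfsFileInputStream implementation ...
#endif
```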


---


[GitHub] orc issue #281: ORC-375: Update libhdfs to compile with gcc7.

2018-06-06 Thread jamesclampffer
Github user jamesclampffer commented on the issue:

https://github.com/apache/orc/pull/281
  
@omalley I have a patch posted at 
https://issues.apache.org/jira/browse/HDFS-13534, but I need another Hadoop 
committer to review it before I can push it.  Do you want to take a look?


---


[GitHub] orc issue #275: ORC-371: [C++] Disable Libhdfspp build when Cyrus SASL is no...

2018-05-31 Thread jamesclampffer
Github user jamesclampffer commented on the issue:

https://github.com/apache/orc/pull/275
  
Rather than skipping the entire libhdfs++ build, you could set the NO_SASL 
parameter when OpenSSL or Cyrus SASL is missing, to build libhdfs++ without 
support for strong authentication.

At one point someone added a partial implementation of SASL DIGEST-MD5 
directly in libhdfs++, and I think that may try to use SSL's RNG to generate 
nonces (and it looks like NO_SASL might not prevent that).  If that breaks the 
build with NO_SASL set, we can get rid of that code in libhdfs++.  Even if that 
implementation worked, it'd prevent the library from being FIPS 140-2 compliant, 
which is a deal breaker for my work.


---


[GitHub] orc pull request #273: ORC-343 Enable C++ writer to support RleV2

2018-05-30 Thread jamesclampffer
Github user jamesclampffer commented on a diff in the pull request:

https://github.com/apache/orc/pull/273#discussion_r191935614
  
--- Diff: c++/src/RLE.hh ---
@@ -68,7 +76,24 @@ namespace orc {
     * record current position
     * @param recorder use the recorder to record current positions
     */
-    virtual void recordPosition(PositionRecorder* recorder) const = 0;
+    virtual void recordPosition(PositionRecorder* recorder) const;
+
+  protected:
+    std::unique_ptr<BufferedOutputStream> outputStream;
+    size_t bufferPosition;
+    size_t bufferLength;
+    size_t numLiterals;
+    int64_t* literals;
+    bool isSigned;
+    char* buffer;
--- End diff --

Is this initialized anywhere?
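
A minimal sketch of the kind of initialization being asked about: every raw 
member set in the constructor's initializer list so `buffer` and `literals` 
can never be read uninitialized. The class shape mirrors the diff above, but 
the constructor signature and the BufferedOutputStream stand-in are 
assumptions, not the PR's exact code.
```
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>

class BufferedOutputStream {};  // stand-in for the real ORC stream type

class RleEncoder {
public:
  RleEncoder(std::unique_ptr<BufferedOutputStream> stream, bool hasSigned)
      : outputStream(std::move(stream)),
        bufferPosition(0),
        bufferLength(0),
        numLiterals(0),
        literals(nullptr),   // no storage yet; attached on first use
        isSigned(hasSigned),
        buffer(nullptr) {}   // starting from nullptr fails loudly if misused

protected:
  std::unique_ptr<BufferedOutputStream> outputStream;
  size_t bufferPosition;
  size_t bufferLength;
  size_t numLiterals;
  int64_t* literals;
  bool isSigned;
  char* buffer;
};
```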


---


[GitHub] orc pull request #273: ORC-343 Enable C++ writer to support RleV2

2018-05-30 Thread jamesclampffer
Github user jamesclampffer commented on a diff in the pull request:

https://github.com/apache/orc/pull/273#discussion_r191932328
  
--- Diff: c++/include/orc/Writer.hh ---
@@ -38,6 +38,11 @@ namespace orc {
     CompressionStrategy_COMPRESSION
   };
 
+  enum RleVersion {
+    RleVersion_1,
--- End diff --

You may want to explicitly assign values here so that adding new encodings or 
future refactoring can't unintentionally change the underlying integer values.
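
A hedged sketch of the suggestion (RleVersion_2 is assumed to follow in the 
full diff; the quoted hunk is truncated at the comment anchor):
```
// Pinning each enumerator to an explicit value means inserting a new
// encoding or reordering the list can't silently change the integers
// that existing code or serialized settings depend on.
enum RleVersion {
  RleVersion_1 = 0,
  RleVersion_2 = 1
};
```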


---


[GitHub] orc pull request #149: ORC-224: Implement column writers of primitive types

2017-09-20 Thread jamesclampffer
Github user jamesclampffer commented on a diff in the pull request:

https://github.com/apache/orc/pull/149#discussion_r140077825
  
--- Diff: c++/src/ColumnWriter.cc ---
@@ -468,25 +472,1099 @@ namespace orc {
     rleEncoder->recordPosition(rowIndexPosition.get());
   }
 
-  std::unique_ptr<ColumnWriter> buildWriter(
-                                    const Type& type,
-                                    const StreamsFactory& factory,
-                                    const WriterOptions& options) {
-    switch (static_cast<int64_t>(type.getKind())) {
-      case STRUCT:
-        return std::unique_ptr<ColumnWriter>(
-          new StructColumnWriter(
-                                 type,
-                                 factory,
-                                 options));
-      case INT:
-      case LONG:
-      case SHORT:
-        return std::unique_ptr<ColumnWriter>(
-          new IntegerColumnWriter(
-                                  type,
-                                  factory,
-                                  options));
+  class ByteColumnWriter : public ColumnWriter {
+  public:
+    ByteColumnWriter(const Type& type,
+                     const StreamsFactory& factory,
+                     const WriterOptions& options);
+
+    virtual void add(ColumnVectorBatch& rowBatch,
+                     uint64_t offset,
+                     uint64_t numValues) override;
+
+    virtual void flush(std::vector<proto::Stream>& streams) override;
+
+    virtual uint64_t getEstimatedSize() const override;
+
+    virtual void getColumnEncoding(
+        std::vector<proto::ColumnEncoding>& encodings) const override;
+
+    virtual void recordPosition() const override;
+
+  private:
+    std::unique_ptr<ByteRleEncoder> byteRleEncoder;
+  };
+
+  ByteColumnWriter::ByteColumnWriter(
+      const Type& type,
+      const StreamsFactory& factory,
+      const WriterOptions& options) :
+          ColumnWriter(type, factory, options) {
+    std::unique_ptr<BufferedOutputStream> dataStream =
+        factory.createStream(proto::Stream_Kind_DATA);
+    byteRleEncoder = createByteRleEncoder(std::move(dataStream));
+
+    if (enableIndex) {
+      recordPosition();
+    }
+  }
+
+  void ByteColumnWriter::add(ColumnVectorBatch& rowBatch,
+                             uint64_t offset,
+                             uint64_t numValues) {
+    ColumnWriter::add(rowBatch, offset, numValues);
+
+    LongVectorBatch& byteBatch =
+        dynamic_cast<LongVectorBatch&>(rowBatch);
+
+    int64_t* data = byteBatch.data.data() + offset;
+    const char* notNull = byteBatch.hasNulls ?
+        byteBatch.notNull.data() + offset : nullptr;
+
+    char* byteData = reinterpret_cast<char*>(data);
+    for (uint64_t i = 0; i < numValues; ++i) {
+      byteData[i] = static_cast<char>(data[i]);
+    }
+    byteRleEncoder->add(byteData, numValues, notNull);
+
+    IntegerColumnStatisticsImpl* intStats =
+        dynamic_cast<IntegerColumnStatisticsImpl*>(colIndexStatistics.get());
+    bool hasNull = false;
+    for (uint64_t i = 0; i < numValues; ++i) {
+      if (notNull == nullptr || notNull[i]) {
+        intStats->increase(1);
+        intStats->update(static_cast<int64_t>(byteData[i]), 1);
+      } else if (!hasNull) {
+        hasNull = true;
+      }
+    }
+    intStats->setHasNull(hasNull);
+  }
+
+  void ByteColumnWriter::flush(std::vector<proto::Stream>& streams) {
+    ColumnWriter::flush(streams);
+
+    proto::Stream stream;
+    stream.set_kind(proto::Stream_Kind_DATA);
+    stream.set_column(static_cast<uint32_t>(columnId));
+    stream.set_length(byteRleEncoder->flush());
+    streams.push_back(stream);
+  }
+
+  uint64_t ByteColumnWriter::getEstimatedSize() const {
+    uint64_t size = ColumnWriter::getEstimatedSize();
+    size += byteRleEncoder->getBufferSize();
+    return size;
+  }
+
+  void ByteColumnWriter::getColumnEncoding(
+      std::vector<proto::ColumnEncoding>& encodings) const {
+    proto::ColumnEncoding encoding;
+    encoding.set_kind(proto::ColumnEncoding_Kind_DIRECT);
+    encoding.set_dictionarysize(0);
+    encodings.push_back(encoding);
+  }
+
+  void ByteColumnWriter::recordPosition() const {
+    ColumnWriter::recordPosition();
+    byteRleEncoder->recordPosition(rowIndexPosition.get());
+  }
+
+  class BooleanColumnWriter : public ColumnWriter {
+  public:
+    BooleanColumnWriter(const Type


---

[GitHub] orc pull request #149: ORC-224: Implement column writers of primitive types

2017-09-20 Thread jamesclampffer
Github user jamesclampffer commented on a diff in the pull request:

https://github.com/apache/orc/pull/149#discussion_r140067969
  


---

[GitHub] orc issue #134: Orc 17

2017-08-18 Thread jamesclampffer
Github user jamesclampffer commented on the issue:

https://github.com/apache/orc/pull/134
  
@majetideepak Is there a way to get the OSX CI to use a generic LLVM/Clang 
install rather than the one that gets shipped with OSX?

Apparently they apply a patch to Clang that breaks thread_local; my 
understanding is they don't want to risk breaking the ABI if they ever get 
around to making built-in thread-local storage specialized for their OS.
```
.Case("cxx_thread_local",
- LangOpts.CPlusPlus11 && 
PP.getTargetInfo().isTLSSupported() &&
- !PP.getTargetInfo().getTriple().isOSDarwin())
+ LangOpts.CPlusPlus11 && 
PP.getTargetInfo().isTLSSupported())
```

Otherwise this won't get coverage on OSX CI builds.
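
For reference, a minimal self-contained illustration of the feature in 
question (not from the PR). With cxx_thread_local working, each thread gets 
its own copy of the variable; under the patched Apple Clang of that era the 
declaration below would fail to compile on Darwin:
```
#include <iostream>
#include <thread>

thread_local int perThreadCounter = 0;  // one instance per thread

int main() {
  auto worker = [] {
    ++perThreadCounter;                     // touches this thread's copy only
    std::cout << perThreadCounter << "\n";  // prints 1 in each thread
  };
  std::thread t1(worker), t2(worker);
  t1.join();
  t2.join();
  return 0;
}
```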


---


[GitHub] orc issue #134: Orc 17

2017-08-15 Thread jamesclampffer
Github user jamesclampffer commented on the issue:

https://github.com/apache/orc/pull/134
  
Those errors should go away after HDFS-10787 is finished (or applied to the 
PR temporarily), since it exposes an API to get at the configuration without 
using std::tr2::optional.  We had to suppress a bunch of errors coming from 
optional when it was added to libhdfs++; eventually I'd like to get rid of the 
remaining uses of it in the implementation as well.
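
A rough sketch of the pattern HDFS-10787 describes (resolve the optional 
inside the library and hand callers plain values), with all names here 
illustrative rather than the actual libhdfs++ API:
```
#include <map>
#include <string>

// Hypothetical configuration wrapper: the optional-like type stays an
// internal detail, so consumers such as ORC never include or instantiate
// std::tr2::optional themselves.
class Configuration {
public:
  // Found/not-found flag plus an out-parameter instead of returning an
  // optional across the API boundary.
  bool get(const std::string& key, std::string* valueOut) const {
    auto it = values_.find(key);
    if (it == values_.end()) return false;
    *valueOut = it->second;
    return true;
  }

private:
  std::map<std::string, std::string> values_;  // real code may hold optionals
};
```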


---


[GitHub] orc pull request #134: Orc 17

2017-07-06 Thread jamesclampffer
Github user jamesclampffer commented on a diff in the pull request:

https://github.com/apache/orc/pull/134#discussion_r126017188
  
--- Diff: c++/libs/libhdfspp/CMakeLists.txt ---
@@ -0,0 +1,248 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# If cmake variable HDFSPP_LIBRARY_ONLY is set, then tests, examples, and
+# tools will not be built. This allows for faster builds of the libhdfspp
+# library alone, avoids looking for a JDK, valgrind, and gmock, and
+# prevents the generation of multiple binaries that might not be relevant
+# to other projects during normal use.
+# Example of cmake invocation with HDFSPP_LIBRARY_ONLY enabled:
+# cmake -DHDFSPP_LIBRARY_ONLY=1
+
+project (libhdfspp)
+
+cmake_minimum_required(VERSION 2.8)
+
+enable_testing()
+include (CTest)
+SET(CMAKE_MODULE_PATH "${CMAKE_CURRENT_SOURCE_DIR}/CMake" ${CMAKE_MODULE_PATH})
+
+# If there's a better way to inform FindCyrusSASL.cmake, let's make this cleaner:
+SET(CMAKE_PREFIX_PATH "${CMAKE_PREFIX_PATH};${CYRUS_SASL_DIR};${GSASL_DIR}")
+
+find_package(Doxygen)
+find_package(OpenSSL REQUIRED)
+find_package(Protobuf REQUIRED)
+find_package(CyrusSASL)
+find_package(GSasl)
--- End diff --

I think you're going to need a flag to make sure it doesn't attempt to find 
and link against GSasl, since the Apache and GPL licenses aren't compatible.  
Eventually GSasl will be removed from libhdfs++ for the same reason.


---


[GitHub] orc pull request #134: Orc 17

2017-07-06 Thread jamesclampffer
Github user jamesclampffer commented on a diff in the pull request:

https://github.com/apache/orc/pull/134#discussion_r125994365
  
--- Diff: c++/src/OrcHdfsFile.cc ---
@@ -34,15 +34,13 @@
 #include "common/hdfs_configuration.h"
 #include "common/configuration_loader.h"
 
-#define BUF_SIZE 1048576 //1 MB
-
 namespace orc {
 
   class HdfsFileInputStream : public InputStream {
--- End diff --

For a mock that does real network stuff, libhdfs++ uses the C++/JVM bindings 
over the minidfscluster (java namenode and datanode in a single process) for 
its unit tests.  That seems like a very large dependency to bring into this 
project.  It'd be possible to write a mock that goes over TCP in C/C++; I can 
think of two ways:
1) Capture the network traffic of a read with tcpdump/wireshark for a given 
request and send the response back (a sketch of this follows below).  This 
would be really brittle if IO request sizes changed, and storing a bunch of 
binary data is pretty ugly.
2) Implement at least some of the server-side HDFS RPC protocol and 
DataTransferProtocol to make a mock namenode and datanode.  This is possible 
but would be a considerable amount of work, and it wouldn't be as useful for 
libhdfs++ since it already has the minidfscluster.
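
A minimal sketch of what option 1 could look like (POSIX sockets, single 
connection, error handling omitted; the port number and capture file name 
are made up for illustration):
```
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <fstream>
#include <iterator>
#include <vector>

int main() {
  // Bytes previously captured with tcpdump/wireshark for one known request.
  std::ifstream capture("captured_response.bin", std::ios::binary);
  std::vector<char> response((std::istreambuf_iterator<char>(capture)),
                             std::istreambuf_iterator<char>());

  int listenFd = socket(AF_INET, SOCK_STREAM, 0);
  sockaddr_in addr = {};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  addr.sin_port = htons(9870);  // arbitrary test port
  bind(listenFd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
  listen(listenFd, 1);

  // Accept one client, drain whatever it sends, replay the canned bytes.
  int clientFd = accept(listenFd, nullptr, nullptr);
  char request[4096];
  ssize_t n = read(clientFd, request, sizeof(request));
  (void)n;  // a real mock would validate the request here
  (void)write(clientFd, response.data(), response.size());
  close(clientFd);
  close(listenFd);
  return 0;
}
```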


---