[GitHub] orc issue #275: ORC-371: [C++] Disable Libhdfspp build when Cyrus SASL is no...
Github user jamesclampffer commented on the issue: https://github.com/apache/orc/pull/275 @majetideepak it looks like the build still fails with this in place. OrcHdfsFile.cc includes hdfspp.h, which won't exist if libhdfs++ isn't extracted.

```
/apache_orc/build$ cmake .. -DBUILD_JAVA=OFF
-- No build type selected, default to ReleaseWithDebugInfo
-- compiler GNU version 4.9.4
-- Could NOT find CYRUS_SASL (missing: CYRUS_SASL_SHARED_LIB CYRUS_SASL_INCLUDE_DIR)
-- WARNING: Libhdfs++ library was not built because the required CyrusSASL library was not found
-- Configuring done
-- Generating done
-- Build files have been written to: /apache_orc/build

:~/apache_orc/build$ make package
[  7%] Built target googletest_ep
[ 14%] Built target snappy_ep
[ 21%] Built target zlib_ep
[ 28%] Built target lz4_ep
[ 35%] Built target protobuf_ep
[ 36%] Building CXX object c++/src/CMakeFiles/orc.dir/OrcHdfsFile.cc.o
/apache_orc/c++/src/OrcHdfsFile.cc:32:27: fatal error: hdfspp/hdfspp.h: No such file or directory
 #include "hdfspp/hdfspp.h"
                           ^
compilation terminated.
c++/src/CMakeFiles/orc.dir/build.make:693: recipe for target 'c++/src/CMakeFiles/orc.dir/OrcHdfsFile.cc.o' failed
```
---
[GitHub] orc issue #281: ORC-375: Update libhdfs to compile with gcc7.
Github user jamesclampffer commented on the issue: https://github.com/apache/orc/pull/281 @omalley I have a patch posted at https://issues.apache.org/jira/browse/HDFS-13534 but need another hadoop committer to review before I can push it. Do you want to take a look? ---
[GitHub] orc issue #275: ORC-371: [C++] Disable Libhdfspp build when Cyrus SASL is no...
Github user jamesclampffer commented on the issue: https://github.com/apache/orc/pull/275 Rather than skipping the entire libhdfs++ build, you could set the NO_SASL parameter when OpenSSL or Cyrus is missing to build libhdfs++ without support for strong authentication. At one point someone added a partial implementation of SASL DIGEST-MD5 directly in libhdfs++, and I think that may try to use SSL's RNG to generate nonces (and it looks like NO_SASL might not prevent that). If that breaks the build with NO_SASL set, we can get rid of that code in libhdfs++. Even if that implementation worked, it'd prevent the library from being FIPS 140-2 compliant, which is a deal breaker for my work. ---
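The fallback suggested above could take roughly the following shape in CMake. This is a hedged sketch: the `HDFSPP_NO_SASL` variable and the exact wiring are assumptions for illustration, not taken from the actual ORC or libhdfs++ build files; only `NO_SASL` itself is named in the comment.

```cmake
# Hypothetical sketch: instead of skipping libhdfs++ entirely when the SASL
# libraries are missing, fall back to a NO_SASL build without strong
# authentication. Variable names here (HDFSPP_NO_SASL, HDFSPP_CMAKE_ARGS)
# are assumptions for illustration.
find_package(CyrusSASL)
find_package(OpenSSL)

if(CYRUS_SASL_FOUND AND OPENSSL_FOUND)
  set(HDFSPP_NO_SASL OFF)
else()
  message(WARNING
    "Cyrus SASL or OpenSSL not found; building libhdfs++ without "
    "strong authentication (NO_SASL).")
  set(HDFSPP_NO_SASL ON)
endif()

# Pass the choice through to the libhdfs++ external-project sub-build.
list(APPEND HDFSPP_CMAKE_ARGS "-DNO_SASL=${HDFSPP_NO_SASL}")
```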
[GitHub] orc pull request #273: ORC-343 Enable C++ writer to support RleV2
Github user jamesclampffer commented on a diff in the pull request: https://github.com/apache/orc/pull/273#discussion_r191935614

```
--- Diff: c++/src/RLE.hh ---
@@ -68,7 +76,24 @@ namespace orc {
    * record current position
    * @param recorder use the recorder to record current positions
    */
-  virtual void recordPosition(PositionRecorder* recorder) const = 0;
+  virtual void recordPosition(PositionRecorder* recorder) const;
+
+ protected:
+  std::unique_ptr<BufferedOutputStream> outputStream;
+  size_t bufferPosition;
+  size_t bufferLength;
+  size_t numLiterals;
+  int64_t* literals;
+  bool isSigned;
+  char* buffer;
```

--- End diff --

Is this initialized anywhere? ---
[GitHub] orc pull request #273: ORC-343 Enable C++ writer to support RleV2
Github user jamesclampffer commented on a diff in the pull request: https://github.com/apache/orc/pull/273#discussion_r191932328

```
--- Diff: c++/include/orc/Writer.hh ---
@@ -38,6 +38,11 @@ namespace orc {
     CompressionStrategy_COMPRESSION
   };

+  enum RleVersion {
+    RleVersion_1,
```

--- End diff --

You may want to explicitly assign values here to ensure the addition of new encodings or refactoring work can't unintentionally change the underlying integer values. ---
[GitHub] orc pull request #149: ORC-224: Implement column writers of primitive types
Github user jamesclampffer commented on a diff in the pull request: https://github.com/apache/orc/pull/149#discussion_r140077825

```
--- Diff: c++/src/ColumnWriter.cc ---
@@ -468,25 +472,1099 @@ namespace orc {
     rleEncoder->recordPosition(rowIndexPosition.get());
   }

-  std::unique_ptr<ColumnWriter> buildWriter(
-    const Type& type,
-    const StreamsFactory& factory,
-    const WriterOptions& options) {
-    switch (static_cast<int64_t>(type.getKind())) {
-      case STRUCT:
-        return std::unique_ptr<ColumnWriter>(
-          new StructColumnWriter(
-            type,
-            factory,
-            options));
-      case INT:
-      case LONG:
-      case SHORT:
-        return std::unique_ptr<ColumnWriter>(
-          new IntegerColumnWriter(
-            type,
-            factory,
-            options));
+  class ByteColumnWriter : public ColumnWriter {
+  public:
+    ByteColumnWriter(const Type& type,
+                     const StreamsFactory& factory,
+                     const WriterOptions& options);
+
+    virtual void add(ColumnVectorBatch& rowBatch,
+                     uint64_t offset,
+                     uint64_t numValues) override;
+
+    virtual void flush(std::vector<proto::Stream>& streams) override;
+
+    virtual uint64_t getEstimatedSize() const override;
+
+    virtual void getColumnEncoding(
+      std::vector<proto::ColumnEncoding>& encodings) const override;
+
+    virtual void recordPosition() const override;
+
+  private:
+    std::unique_ptr<ByteRleEncoder> byteRleEncoder;
+  };
+
+  ByteColumnWriter::ByteColumnWriter(
+    const Type& type,
+    const StreamsFactory& factory,
+    const WriterOptions& options) :
+      ColumnWriter(type, factory, options) {
+    std::unique_ptr<BufferedOutputStream> dataStream =
+      factory.createStream(proto::Stream_Kind_DATA);
+    byteRleEncoder = createByteRleEncoder(std::move(dataStream));
+
+    if (enableIndex) {
+      recordPosition();
+    }
+  }
+
+  void ByteColumnWriter::add(ColumnVectorBatch& rowBatch,
+                             uint64_t offset,
+                             uint64_t numValues) {
+    ColumnWriter::add(rowBatch, offset, numValues);
+
+    LongVectorBatch& byteBatch =
+      dynamic_cast<LongVectorBatch&>(rowBatch);
+
+    int64_t* data = byteBatch.data.data() + offset;
+    const char* notNull = byteBatch.hasNulls ?
+      byteBatch.notNull.data() + offset : nullptr;
+
+    char* byteData = reinterpret_cast<char*>(data);
+    for (uint64_t i = 0; i < numValues; ++i) {
+      byteData[i] = static_cast<char>(data[i]);
+    }
+    byteRleEncoder->add(byteData, numValues, notNull);
+
+    IntegerColumnStatisticsImpl* intStats =
+      dynamic_cast<IntegerColumnStatisticsImpl*>(colIndexStatistics.get());
+    bool hasNull = false;
+    for (uint64_t i = 0; i < numValues; ++i) {
+      if (notNull == nullptr || notNull[i]) {
+        intStats->increase(1);
+        intStats->update(static_cast<int64_t>(byteData[i]), 1);
+      } else if (!hasNull) {
+        hasNull = true;
+      }
+    }
+    intStats->setHasNull(hasNull);
+  }
+
+  void ByteColumnWriter::flush(std::vector<proto::Stream>& streams) {
+    ColumnWriter::flush(streams);
+
+    proto::Stream stream;
+    stream.set_kind(proto::Stream_Kind_DATA);
+    stream.set_column(static_cast<uint32_t>(columnId));
+    stream.set_length(byteRleEncoder->flush());
+    streams.push_back(stream);
+  }
+
+  uint64_t ByteColumnWriter::getEstimatedSize() const {
+    uint64_t size = ColumnWriter::getEstimatedSize();
+    size += byteRleEncoder->getBufferSize();
+    return size;
+  }
+
+  void ByteColumnWriter::getColumnEncoding(
+    std::vector<proto::ColumnEncoding>& encodings) const {
+    proto::ColumnEncoding encoding;
+    encoding.set_kind(proto::ColumnEncoding_Kind_DIRECT);
+    encoding.set_dictionarysize(0);
+    encodings.push_back(encoding);
+  }
+
+  void ByteColumnWriter::recordPosition() const {
+    ColumnWriter::recordPosition();
+    byteRleEncoder->recordPosition(rowIndexPosition.get());
+  }
+
+  class BooleanColumnWriter : public ColumnWriter {
+  public:
+    BooleanColumnWriter(const Type
```
[GitHub] orc pull request #149: ORC-224: Implement column writers of primitive types
Github user jamesclampffer commented on a diff in the pull request: https://github.com/apache/orc/pull/149#discussion_r140067969 --- Diff: c++/src/ColumnWriter.cc --- (the same ColumnWriter.cc hunk as quoted in the previous comment)
[GitHub] orc issue #134: Orc 17
Github user jamesclampffer commented on the issue: https://github.com/apache/orc/pull/134 @majetideepak Is there a way to get the OSX CI to use a generic LLVM/Clang install rather than the one that ships with OSX? Apparently Apple applies a patch to Clang that breaks thread_local; my understanding is they don't want to break the ABI if they ever get around to making builtin thread-local storage specialized for their OS.

```
 .Case("cxx_thread_local",
-      LangOpts.CPlusPlus11 && PP.getTargetInfo().isTLSSupported() &&
-      !PP.getTargetInfo().getTriple().isOSDarwin())
+      LangOpts.CPlusPlus11 && PP.getTargetInfo().isTLSSupported())
```

Otherwise this won't get coverage on OSX CI builds. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] orc issue #134: Orc 17
Github user jamesclampffer commented on the issue: https://github.com/apache/orc/pull/134 Those errors should go away after HDFS-10787 is finished (or applied to the PR temporarily) since it exposes an API to get at the configuration without using std::tr2::optional. We had to suppress a bunch of errors coming from optional when it was added to libhdfs++; eventually I'd like to get rid of the remaining uses of it in the implementation as well. ---
[GitHub] orc pull request #134: Orc 17
Github user jamesclampffer commented on a diff in the pull request: https://github.com/apache/orc/pull/134#discussion_r126017188

```
--- Diff: c++/libs/libhdfspp/CMakeLists.txt ---
@@ -0,0 +1,248 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# If cmake variable HDFSPP_LIBRARY_ONLY is set, then tests, examples, and
+# tools will not be built. This allows for faster builds of the libhdfspp
+# library alone, avoids looking for a JDK, valgrind, and gmock, and
+# prevents the generation of multiple binaries that might not be relevant
+# to other projects during normal use.
+# Example of cmake invocation with HDFSPP_LIBRARY_ONLY enabled:
+#   cmake -DHDFSPP_LIBRARY_ONLY=1
+
+project (libhdfspp)
+
+cmake_minimum_required(VERSION 2.8)
+
+enable_testing()
+include (CTest)
+SET(CMAKE_MODULE_PATH "${CMAKE_CURRENT_SOURCE_DIR}/CMake" ${CMAKE_MODULE_PATH})
+
+# If there's a better way to inform FindCyrusSASL.cmake, let's make this cleaner:
+SET(CMAKE_PREFIX_PATH "${CMAKE_PREFIX_PATH};${CYRUS_SASL_DIR};${GSASL_DIR}")
+
+find_package(Doxygen)
+find_package(OpenSSL REQUIRED)
+find_package(Protobuf REQUIRED)
+find_package(CyrusSASL)
+find_package(GSasl)
```

--- End diff --

I think you're going to need a flag to make sure it doesn't attempt to find and link against GSasl, since the Apache and GPL licenses aren't compatible. Eventually GSasl will be removed from libhdfs++ for the same reason. ---
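One possible shape for the flag suggested in the comment above. This is a hedged sketch: the option name `HDFSPP_ALLOW_GSASL` and the wiring are assumptions for illustration, not taken from the actual libhdfs++ build.

```cmake
# Hypothetical sketch: gate the GSasl lookup behind an explicit opt-in so an
# Apache-licensed build never finds or links the GPL'd library by accident.
option(HDFSPP_ALLOW_GSASL
       "Allow libhdfs++ to find and link GSasl (GPL; keep OFF for ASF releases)"
       OFF)

find_package(CyrusSASL)
if(HDFSPP_ALLOW_GSASL)
  find_package(GSasl)
else()
  message(STATUS "GSasl lookup skipped (HDFSPP_ALLOW_GSASL=OFF)")
endif()
```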
[GitHub] orc pull request #134: Orc 17
Github user jamesclampffer commented on a diff in the pull request: https://github.com/apache/orc/pull/134#discussion_r125994365

```
--- Diff: c++/src/OrcHdfsFile.cc ---
@@ -34,15 +34,13 @@
 #include "common/hdfs_configuration.h"
 #include "common/configuration_loader.h"

-#define BUF_SIZE 1048576 //1 MB
-
 namespace orc {

   class HdfsFileInputStream : public InputStream {
```

--- End diff --

For a mock that does real network stuff, libhdfs++ uses the C++/JVM bindings over the minidfscluster (a Java namenode and datanode in a single process) for its unit tests. That seems like a very large dependency to bring into this project. It'd be possible to write a mock that goes over TCP in C/C++; I can think of two ways:

1) Capture network traffic of a read with tcpdump/wireshark for a given request and send the response back. This would be really brittle if IO request sizes changed, and pretty ugly to store a bunch of binary data.

2) Implement at least some of the server-side HDFS RPC protocol and DataTransferProtocol to make a mock namenode and datanode. This is possible but would be a considerable amount of work that wouldn't be as useful for libhdfs++ since it already has the minidfscluster. ---