[GitHub] [arrow] cyb70289 opened a new pull request #7476: ARROW-9168: [C++][Flight] configure TCP connection sharing in benchmark
cyb70289 opened a new pull request #7476: URL: https://github.com/apache/arrow/pull/7476 The Flight benchmark performs worse as the number of worker threads increases. It can be improved by setting the gRPC option GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL so that each client creates its own TCP connection to the server rather than sharing one connection. This patch adds a command-line parameter to configure TCP connection sharing. The default behaviour is to use separate connections, which performs better as worker threads increase. This patch also changes the parameter "records_per_stream" from int32 to int64, to ease profiling by setting it to a very large value and keeping the benchmark running. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
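For illustration, a minimal Python sketch of how such a channel option could be assembled. The helper name is hypothetical; the string key is gRPC's channel-argument name believed to correspond to the GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL define, and the benchmark's actual flag plumbing (in C++) may differ:

```python
# Hypothetical helper: builds gRPC channel options for a connection-sharing
# flag. With the local subchannel pool enabled, each channel gets its own
# TCP connection instead of sharing a process-global subchannel.
def make_channel_options(share_connection: bool) -> list:
    options = []
    if not share_connection:
        # "grpc.use_local_subchannel_pool" is assumed to be the string form
        # of GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL from grpc_types.h.
        options.append(("grpc.use_local_subchannel_pool", 1))
    return options

# Default per the patch: do not share, one connection per client.
print(make_channel_options(False))  # [('grpc.use_local_subchannel_pool', 1)]
```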
[GitHub] [arrow] liyafan82 commented on pull request #7287: ARROW-8771: [C++] Add boost/process library to build support
liyafan82 commented on pull request #7287: URL: https://github.com/apache/arrow/pull/7287#issuecomment-645762806 > That's odd. I wonder what's different about your build setup from the jobs we run on CI because I haven't seen that before. Do you think you could add a crossbow job that captures this build setup (is it just bundled boost and thrift, plus ARROW_FLIGHT=ON and tests on?) > > FWIW this boost include has been removed in Thrift 0.13: [apache/thrift@1f34504#diff-73a92ed6f7bb65d6e8f29f74ae6c94bf](https://github.com/apache/thrift/commit/1f34504f43a7a409364d4114f180762bf2679e57#diff-73a92ed6f7bb65d6e8f29f74ae6c94bf) > > So if/whenever we can [upgrade to 0.13](https://issues.apache.org/jira/browse/ARROW-8049), this particular header won't ever be invoked. I agree with you that a crossbow job for this build setup is extremely useful, as even if this problem is fixed, there are other problems. I will open an issue to track it. I was using the following flags: -DCMAKE_BUILD_TYPE=Debug -DARROW_BUILD_TESTS=ON -DARROW_BUILD_BENCHMARKS=ON -DARROW_FLIGHT=ON -DARROW_DATASET=ON -DBOOST_SOURCE=BUNDLED -DARROW_PARQUET=ON -DARROW_JNI=ON -DARROW_BUILD_EXAMPLES=ON -DCMAKE_CXX_COMPILER=.../gcc-7.2.0/bin/c++ -DCMAKE_C_COMPILER=.../gcc-7.2.0/bin/gcc
[GitHub] [arrow] liyafan82 commented on a change in pull request #7347: ARROW-8230: [Java] Remove netty dependency from arrow-memory
liyafan82 commented on a change in pull request #7347: URL: https://github.com/apache/arrow/pull/7347#discussion_r441931433 ## File path: java/memory/src/main/java/org/apache/arrow/memory/rounding/DefaultRoundingPolicy.java ## @@ -31,19 +28,18 @@ public final long chunkSize; + /** + * default chunk size from Netty. + */ + private static final long DEFAULT_CHUNK_SIZE = 16777216L; Review comment: Maybe it would be beneficial to add the above discussion to the JavaDoc.
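As an aside on the constant above: 16777216 bytes is 16 MiB, which matches Netty's documented default of pageSize (8 KiB) shifted left by maxOrder (11). A quick sanity check (the derivation is an assumption about Netty's defaults, not something stated in this PR):

```python
# Netty's default allocator parameters (io.netty.allocator.pageSize and
# io.netty.allocator.maxOrder), assumed here: chunkSize = pageSize << maxOrder.
PAGE_SIZE = 8192
MAX_ORDER = 11
DEFAULT_CHUNK_SIZE = PAGE_SIZE << MAX_ORDER

print(DEFAULT_CHUNK_SIZE)  # 16777216, i.e. 16 MiB
```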
[GitHub] [arrow] liyafan82 commented on a change in pull request #7347: ARROW-8230: [Java] Remove netty dependency from arrow-memory
liyafan82 commented on a change in pull request #7347: URL: https://github.com/apache/arrow/pull/7347#discussion_r441930746 ## File path: java/memory/src/main/java/org/apache/arrow/memory/util/MemoryUtil.java ## @@ -78,6 +77,55 @@ public Object run() { Field addressField = java.nio.Buffer.class.getDeclaredField("address"); addressField.setAccessible(true); BYTE_BUFFER_ADDRESS_OFFSET = UNSAFE.objectFieldOffset(addressField); + + Constructor directBufferConstructor; + long address = -1; + final ByteBuffer direct = ByteBuffer.allocateDirect(1); + try { + +final Object maybeDirectBufferConstructor = +AccessController.doPrivileged(new PrivilegedAction() { + @Override + public Object run() { +try { + final Constructor constructor = + direct.getClass().getDeclaredConstructor(long.class, int.class); + constructor.setAccessible(true); + return constructor; +} catch (NoSuchMethodException e) { + return e; +} catch (SecurityException e) { + return e; +} + } +}); + +if (maybeDirectBufferConstructor instanceof Constructor) { + address = UNSAFE.allocateMemory(1); + // try to use the constructor now + try { +((Constructor) maybeDirectBufferConstructor).newInstance(address, 1); +directBufferConstructor = (Constructor) maybeDirectBufferConstructor; +logger.debug("direct buffer constructor: available"); + } catch (InstantiationException e) { +directBufferConstructor = null; Review comment: Looks good, thanks. Since the exception handling logic is identical, would it make the code more concise to use the following semantics?
```java
catch (InstantiationException | IllegalAccessException | InvocationTargetException e) {
  ...
}
```
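The suggested multi-catch groups handlers whose bodies are identical. The same probe-once pattern can be sketched in Python for comparison (hypothetical helper; Python groups exception types in a tuple where Java uses `|`):

```python
def probe_constructor(ctor, *args):
    """Probe a constructor-like callable once at startup, as the Java code
    probes the reflective DirectByteBuffer constructor. Distinct exception
    types with identical handling are grouped in one handler, mirroring
    Java's multi-catch `catch (A | B | C e)`."""
    try:
        ctor(*args)
        return ctor          # probe succeeded: the constructor is usable
    except (TypeError, ValueError):
        return None          # probe failed: fall back to "unavailable"

print(probe_constructor(int, "abc"))  # None (ValueError during probe)
```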
[GitHub] [arrow] liyafan82 commented on a change in pull request #7347: ARROW-8230: [Java] Remove netty dependency from arrow-memory
liyafan82 commented on a change in pull request #7347: URL: https://github.com/apache/arrow/pull/7347#discussion_r441929412 ## File path: java/memory/src/main/java/io/netty/buffer/NettyArrowBuf.java ## @@ -404,7 +407,7 @@ protected int _getUnsignedMediumLE(int index) { this.chk(index, 3); long addr = this.addr(index); return PlatformDependent.getByte(addr) & 255 | -(Short.reverseBytes(PlatformDependent.getShort(addr + 1L)) & '\u') << 8; +(Short.reverseBytes(PlatformDependent.getShort(addr + 1L)) & '\u') << 8; Review comment: Thanks for your effort.
[GitHub] [arrow] kou commented on pull request #7396: ARROW-9092: [C++][TRIAGE] Do not enable TestRoundFunctions when using LLVM 9 until gandiva-decimal-test is fixed
kou commented on pull request #7396: URL: https://github.com/apache/arrow/pull/7396#issuecomment-645726115 Sorry, I couldn't reproduce this on my local environment, including `docker-compose run conda-cpp` on my local machine: https://github.com/apache/arrow/pull/7280#issuecomment-635634955 But it is reproduced on GitHub Actions, both with conda https://github.com/apache/arrow/runs/721758545 and without conda https://github.com/apache/arrow/runs/721758560 . I looked into this a bit on GitHub Actions with https://github.com/mxschmitt/action-tmate but I couldn't find what the problem is.
[GitHub] [arrow] kou commented on a change in pull request #7459: ARROW-6800: [C++] Support building libraries targeting C++14 or higher, disable GNU CXX extensions
kou commented on a change in pull request #7459: URL: https://github.com/apache/arrow/pull/7459#discussion_r441221973 ## File path: cpp/cmake_modules/SetupCxxFlags.cmake

```
@@ -62,12 +62,15 @@ endif()
 # Support C11
 set(CMAKE_C_STANDARD 11)

-# This ensures that things like gnu++11 get passed correctly
-set(CMAKE_CXX_STANDARD 11)
+# This ensures that things like c++11 get passed correctly
+set(CMAKE_CXX_STANDARD ${ARROW_CXX_STANDARD})
```

Review comment: How about using `CMAKE_CXX_STANDARD` directly instead of introducing `ARROW_CXX_STANDARD`?

```diff
diff --git a/cpp/cmake_modules/DefineOptions.cmake b/cpp/cmake_modules/DefineOptions.cmake
index 660caf91c..f916781d7 100644
--- a/cpp/cmake_modules/DefineOptions.cmake
+++ b/cpp/cmake_modules/DefineOptions.cmake
@@ -82,8 +82,6 @@ if("${CMAKE_SOURCE_DIR}" STREQUAL "${CMAKE_CURRENT_SOURCE_DIR}")
   define_option_string(ARROW_CXXFLAGS "Compiler flags to append when compiling Arrow" "")

-  define_option(ARROW_CXX_STANDARD "C++ standard to target (default C++11)" 11)
-
   define_option(ARROW_BUILD_STATIC "Build static libraries" ON)
   define_option(ARROW_BUILD_SHARED "Build shared libraries" ON)
diff --git a/cpp/cmake_modules/SetupCxxFlags.cmake b/cpp/cmake_modules/SetupCxxFlags.cmake
index bf7a60138..ac7518eba 100644
--- a/cpp/cmake_modules/SetupCxxFlags.cmake
+++ b/cpp/cmake_modules/SetupCxxFlags.cmake
@@ -63,7 +63,9 @@ endif()
 set(CMAKE_C_STANDARD 11)

 # This ensures that things like c++11 get passed correctly
-set(CMAKE_CXX_STANDARD ${ARROW_CXX_STANDARD})
+if(NOT DEFINED CMAKE_CXX_STANDARD)
+  set(CMAKE_CXX_STANDARD 11)
+endif()

 # We require a C++11 compliant compiler
 set(CMAKE_CXX_STANDARD_REQUIRED ON)
```

The https://github.com/apache/arrow/pull/7459#issuecomment-645047309 error will be fixed.
[GitHub] [arrow] kou commented on pull request #7459: ARROW-6800: [C++] Support building libraries targeting C++14 or higher, disable GNU CXX extensions
kou commented on pull request #7459: URL: https://github.com/apache/arrow/pull/7459#issuecomment-645720173 Should we link to ARROW-6848 instead of ARROW-6800?
[GitHub] [arrow] ursabot commented on pull request #7475: ARROW-8500: [C++] Add benchmark for using Filter on RecordBatch
ursabot commented on pull request #7475: URL: https://github.com/apache/arrow/pull/7475#issuecomment-645714003 [AMD64 Ubuntu 18.04 C++ Benchmark (#113134)](https://ci.ursalabs.org/#builders/73/builds/84) builder failed with an exception. Revision: 999865b042c3131920b52b40a2387535168f3a08 Archery: `'archery benchmark ...'` step's traceback:

```pycon
Traceback (most recent call last):
  File "/home/ursabot/.conda/envs/ursabot/lib/python3.7/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/ursabot/.conda/envs/ursabot/lib/python3.7/site-packages/twisted/internet/defer.py", line 1475, in gotResult
    _inlineCallbacks(r, g, status)
  File "/home/ursabot/.conda/envs/ursabot/lib/python3.7/site-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/home/ursabot/.conda/envs/ursabot/lib/python3.7/site-packages/twisted/python/failure.py", line 512, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
--- <exception caught here> ---
  File "/home/ursabot/.conda/envs/ursabot/lib/python3.7/site-packages/buildbot/process/buildstep.py", line 566, in startStep
    self.results = yield self.run()
  File "/home/ursabot/.conda/envs/ursabot/lib/python3.7/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/home/ursabot/ursabot/ursabot/steps.py", line 67, in run
    await log.addContent(content)
  File "/home/ursabot/.conda/envs/ursabot/lib/python3.7/site-packages/buildbot/process/log.py", line 130, in addContent
    return self.lbf.append(text)
  File "/home/ursabot/.conda/envs/ursabot/lib/python3.7/site-packages/buildbot/util/lineboundaries.py", line 62, in append
    text = self.newline_re.sub('\n', text)
builtins.TypeError: expected string or bytes-like object
```
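The `TypeError` at the bottom of the traceback is what `re.sub` raises when handed a non-string; a minimal reproduction (the pattern below is a stand-in, not buildbot's actual `newline_re`):

```python
import re

newline_re = re.compile(r'\r\n|\r')  # stand-in for buildbot's newline_re

def append(text):
    # lineboundaries.append assumes `text` is str/bytes; anything else makes
    # re.sub raise the TypeError seen in the traceback above.
    return newline_re.sub('\n', text)

try:
    append(None)
except TypeError as e:
    print(e)  # "expected string or bytes-like object" (wording varies by Python version)
```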
[GitHub] [arrow] github-actions[bot] commented on pull request #7475: ARROW-8500: [C++] Add benchmark for using Filter on RecordBatch
github-actions[bot] commented on pull request #7475: URL: https://github.com/apache/arrow/pull/7475#issuecomment-645710548 https://issues.apache.org/jira/browse/ARROW-8500
[GitHub] [arrow] wesm closed pull request #7461: ARROW-8969: [C++] Reduce binary size of kernels/scalar_compare.cc.o by reusing more kernels between types, operators
wesm closed pull request #7461: URL: https://github.com/apache/arrow/pull/7461
[GitHub] [arrow] wesm commented on pull request #7461: ARROW-8969: [C++] Reduce binary size of kernels/scalar_compare.cc.o by reusing more kernels between types, operators
wesm commented on pull request #7461: URL: https://github.com/apache/arrow/pull/7461#issuecomment-645710004 I gave up on trying to have e.g. a common "64-bit" kernel for Equals/NotEquals Int64/UInt64/Timestamp/etc. The sticking point is scalar unboxing. We might need to fashion a common internal base class for primitive types and give them a virtual method that returns their data as `const uint8_t*`, which you can cast to whatever primitive type you want.
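A toy Python analogue of that idea (not Arrow's actual design, names are hypothetical): subclasses of a common base expose their value as raw bytes, so one "Equals" kernel can serve several 64-bit types by comparing payloads bitwise:

```python
import struct

class PrimitiveScalar:
    """Toy analogue of a common base class whose subclasses expose their
    value as raw bytes (cf. a virtual method returning const uint8_t*)."""
    fmt = None  # struct format, set by subclasses

    def __init__(self, value):
        self.data = struct.pack(self.fmt, value)  # 8 little-endian bytes

class Int64Scalar(PrimitiveScalar):
    fmt = "<q"

class UInt64Scalar(PrimitiveScalar):
    fmt = "<Q"

def equal64(left: PrimitiveScalar, right: PrimitiveScalar) -> bool:
    # One shared "64-bit" Equals kernel: bitwise comparison of raw payloads,
    # so Int64/UInt64/Timestamp need no per-type kernel instantiation.
    # (Note this is purely bitwise: Int64(-1) and UInt64(2**64 - 1) compare
    # equal, which is fine for Equals but not for ordered comparisons.)
    return left.data == right.data

print(equal64(Int64Scalar(5), UInt64Scalar(5)))  # True
```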
[GitHub] [arrow] wesm opened a new pull request #7475: ARROW-8500: [C++] Add benchmark for using Filter on RecordBatch
wesm opened a new pull request #7475: URL: https://github.com/apache/arrow/pull/7475 Since I changed Filter on RecordBatch to transform the filter to indices and use Take, I wanted a benchmark to compare the before/after performance, so this can also be monitored over time. These benchmarks could use some refactoring, but this is at least a starting point.
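The Filter-to-Take transformation being benchmarked can be sketched in plain Python (a conceptual model only, not pyarrow's implementation): the boolean filter is converted to selection indices once, then every column of the batch is gathered with those same indices:

```python
def filter_via_take(columns, mask):
    """Filter a record batch (modeled as a dict of column lists) by first
    materializing the boolean mask as indices, then Take-ing each column."""
    indices = [i for i, keep in enumerate(mask) if keep]  # filter -> indices
    return {name: [values[i] for i in indices]            # take per column
            for name, values in columns.items()}

batch = {"a": [1, 2, 3, 4], "b": ["w", "x", "y", "z"]}
print(filter_via_take(batch, [True, False, True, False]))
# {'a': [1, 3], 'b': ['w', 'y']}
```

The design point is that the mask is scanned once, and each column then uses a cheap gather instead of re-evaluating the mask.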
[GitHub] [arrow] wesm commented on pull request #7475: ARROW-8500: [C++] Add benchmark for using Filter on RecordBatch
wesm commented on pull request #7475: URL: https://github.com/apache/arrow/pull/7475#issuecomment-645709018 @ursabot benchmark --benchmark-filter=FilterRecordBatch 22f374102
[GitHub] [arrow] wesm commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower
wesm commented on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-645698551 I just merged my changes for the ASCII kernels, making those work on sliced arrays.
[GitHub] [arrow] wesm closed pull request #7465: PARQUET-1877: [C++] Reconcile thrift limits
wesm closed pull request #7465: URL: https://github.com/apache/arrow/pull/7465
[GitHub] [arrow] wesm closed pull request #7458: ARROW-9122: [C++] Properly handle sliced arrays in ascii_lower, ascii_upper kernels
wesm closed pull request #7458: URL: https://github.com/apache/arrow/pull/7458
[GitHub] [arrow] wesm commented on pull request #7458: ARROW-9122: [C++] Properly handle sliced arrays in ascii_lower, ascii_upper kernels
wesm commented on pull request #7458: URL: https://github.com/apache/arrow/pull/7458#issuecomment-645697268 The test failure is an s390x flake.
[GitHub] [arrow] wesm closed pull request #7463: ARROW-9145: [C++] Implement BooleanArray::true_count and false_count, add Python bindings
wesm closed pull request #7463: URL: https://github.com/apache/arrow/pull/7463
[GitHub] [arrow] wesm commented on pull request #7463: ARROW-9145: [C++] Implement BooleanArray::true_count and false_count, add Python bindings
wesm commented on pull request #7463: URL: https://github.com/apache/arrow/pull/7463#issuecomment-645696772 +1
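The semantics of `true_count`/`false_count` in the merged PR can be sketched in plain Python (a conceptual model; the C++ implementation presumably counts bits in the validity and data bitmaps): null slots are excluded from both counts:

```python
def true_count(values, validity):
    """Number of valid slots that are true, cf. BooleanArray::true_count."""
    return sum(1 for v, ok in zip(values, validity) if ok and v)

def false_count(values, validity):
    """Number of valid slots that are false."""
    return sum(1 for v, ok in zip(values, validity) if ok and not v)

vals  = [True, False, True, True]
valid = [True, True,  False, True]   # the third slot is null
print(true_count(vals, valid), false_count(vals, valid))  # 2 1
```

Note that `true_count + false_count` equals the number of non-null slots, not the array length.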
[GitHub] [arrow] wesm commented on pull request #7461: ARROW-8969: [C++] Reduce binary size of kernels/scalar_compare.cc.o by reusing more kernels between types, operators
wesm commented on pull request #7461: URL: https://github.com/apache/arrow/pull/7461#issuecomment-645696688 +1. I'm going to merge this to help avoid conflicts caused by the stuff I just renamed. I welcome further comments and will work to address them in follow-ups.
[GitHub] [arrow] wesm commented on a change in pull request #7461: ARROW-8969: [C++] Reduce binary size of kernels/scalar_compare.cc.o by reusing more kernels between types, operators
wesm commented on a change in pull request #7461: URL: https://github.com/apache/arrow/pull/7461#discussion_r441902495 ## File path: cpp/src/arrow/compute/kernels/codegen_internal.h ## @@ -485,18 +502,44 @@ struct ScalarUnaryNotNullStateful { struct ArrayExec> { static void Exec(const ThisType& functor, KernelContext* ctx, const ExecBatch& batch, Datum* out) { - typename TypeTraits::BuilderType builder; - VisitArrayValuesInline(*batch[0].array(), [&](util::optional v) { -if (v.has_value()) { - KERNEL_RETURN_IF_ERROR(ctx, builder.Append(functor.op.Call(ctx, *v))); -} else { - KERNEL_RETURN_IF_ERROR(ctx, builder.AppendNull()); + // TODO: This code path is currently inadequately tested. + + using offset_type = typename Type::offset_type; Review comment: I'm reverting this change since no kernel uses it and adding a comment explaining that it is suboptimal.
[GitHub] [arrow] bkietz commented on a change in pull request #7156: ARROW-8074: [C++][Dataset][Python] FileFragments from buffers and NativeFiles
bkietz commented on a change in pull request #7156: URL: https://github.com/apache/arrow/pull/7156#discussion_r441836943 ## File path: python/pyarrow/_dataset.pyx ## @@ -43,6 +44,51 @@ def _forbid_instantiation(klass, subclasses_instead=True): raise TypeError(msg) +cdef class FileSource: + +cdef: +CFileSource wrapped + +def __cinit__(self, file, FileSystem filesystem=None): +cdef: +shared_ptr[CFileSystem] c_filesystem +c_string c_path +shared_ptr[CRandomAccessFile] c_file +shared_ptr[CBuffer] c_buffer + +if isinstance(file, FileSource): Review comment: Since we've decided not to export `FileSource` I think we can probably delete this class in a follow-up (in favor of `cdef CFileSource _make_file_source(...)` or so)
[GitHub] [arrow] wesm commented on pull request #7461: ARROW-8969: [C++] Reduce binary size of kernels/scalar_compare.cc.o by reusing more kernels between types, operators
wesm commented on pull request #7461: URL: https://github.com/apache/arrow/pull/7461#issuecomment-645623392 I'm revamping the documentation about these codegen functions, which I'm dubbing "Generator-Dispatchers" (GDs) for short. I'll add "Generate" to their names. Stay tuned.
[GitHub] [arrow] wesm commented on pull request #7473: ARROW-9162: [Python] Expose Add/Subtract/Multiply arithmetic kernels
wesm commented on pull request #7473: URL: https://github.com/apache/arrow/pull/7473#issuecomment-645615268 https://issues.apache.org/jira/browse/ARROW-9164
[GitHub] [arrow] kszucs edited a comment on pull request #7471: ARROW-9109: [Python][Packaging] Enable S3 support in manylinux wheels
kszucs edited a comment on pull request #7471: URL: https://github.com/apache/arrow/pull/7471#issuecomment-645614037 Quickly went through it and it LGTM. I can give it a more thorough review tomorrow.
[GitHub] [arrow] kszucs commented on pull request #7471: ARROW-9109: [Python][Packaging] Enable S3 support in manylinux wheels
kszucs commented on pull request #7471: URL: https://github.com/apache/arrow/pull/7471#issuecomment-645614037 Quickly went through it and LGTM. I can give it a more thorough review tomorrow.
[GitHub] [arrow] wesm commented on pull request #7471: ARROW-9109: [Python][Packaging] Enable S3 support in manylinux wheels
wesm commented on pull request #7471: URL: https://github.com/apache/arrow/pull/7471#issuecomment-645612814 Very reasonable. Thanks @pitrou!
[GitHub] [arrow] kszucs commented on pull request #7473: ARROW-9162: [Python] Expose Add/Subtract/Multiply arithmetic kernels
kszucs commented on pull request #7473: URL: https://github.com/apache/arrow/pull/7473#issuecomment-645612711 I was thinking the same, so I eventually deferred the docstrings, which I can add in my other PR, where we need to add a `check_overflow` flag to these functions.
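A hedged sketch of what a `check_overflow` variant of these arithmetic kernels would mean for int64 inputs (hypothetical helper; the eventual pyarrow API and flag name may differ): raise on overflow instead of silently wrapping:

```python
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1

def add_checked(a: int, b: int) -> int:
    """Add two int64 values, raising if the exact result does not fit in
    int64 (the checked behavior, as opposed to wraparound)."""
    result = a + b  # Python ints are unbounded, so compute exactly first
    if not INT64_MIN <= result <= INT64_MAX:
        raise OverflowError("int64 overflow in add")
    return result

print(add_checked(2**62, 2**62 - 1))  # 9223372036854775807, i.e. INT64_MAX
```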
[GitHub] [arrow] wesm closed pull request #7473: ARROW-9162: [Python] Expose Add/Subtract/Multiply arithmetic kernels
wesm closed pull request #7473: URL: https://github.com/apache/arrow/pull/7473
[GitHub] [arrow] wesm commented on pull request #7442: ARROW-9075: [C++] Optimized Filter implementation: faster performance + compilation, smaller code size
wesm commented on pull request #7442: URL: https://github.com/apache/arrow/pull/7442#issuecomment-645600060 +1. Thanks all for the comments.
[GitHub] [arrow] wesm closed pull request #7442: ARROW-9075: [C++] Optimized Filter implementation: faster performance + compilation, smaller code size
wesm closed pull request #7442: URL: https://github.com/apache/arrow/pull/7442
[GitHub] [arrow] fsaintjacques commented on pull request #7474: ARROW-8802: [C++][Dataset] Preserve dataset schema's metadata on column projection
fsaintjacques commented on pull request #7474: URL: https://github.com/apache/arrow/pull/7474#issuecomment-645595500 @jorisvandenbossche I'm puzzled by the error. Could it be an upstream pandas change?
[GitHub] [arrow] kszucs edited a comment on pull request #7439: ARROW-4309: [Documentation] Add a docker-compose entry which builds the documentation with CUDA enabled
kszucs edited a comment on pull request #7439: URL: https://github.com/apache/arrow/pull/7439#issuecomment-645593482 Please verify it with the following commands:

```bash
$ archery docker run --no-pull ubuntu-cuda-cpp
$ archery docker run --no-pull ubuntu-cuda-python
$ archery docker run --no-pull ubuntu-cuda-docs
```

Note that it requires Docker version >=19.03. I successfully tested all three images. @wesm @kou perhaps you have the necessary devices.
[GitHub] [arrow] kszucs commented on pull request #7439: ARROW-4309: [Documentation] Add a docker-compose entry which builds the documentation with CUDA enabled
kszucs commented on pull request #7439: URL: https://github.com/apache/arrow/pull/7439#issuecomment-645593482 Please verify it with the following commands:

```bash
$ archery docker run --no-pull ubuntu-cuda-cpp
$ archery docker run --no-pull ubuntu-cuda-python
$ archery docker run --no-pull ubuntu-cuda-docs
```

Note that it requires Docker version >=19.03. I successfully tested all three images.
[GitHub] [arrow] ursabot commented on pull request #7442: ARROW-9075: [C++] Optimized Filter implementation: faster performance + compilation, smaller code size
ursabot commented on pull request #7442: URL: https://github.com/apache/arrow/pull/7442#issuecomment-645588035 [AMD64 Ubuntu 18.04 C++ Benchmark (#113048)](https://ci.ursalabs.org/#builders/73/builds/83) builder has succeeded. Revision: 54bb83848d391477c5ded222fd4401acbe08c6c7

```diff
  benchmark                                  baseline         contender        change
  FilterStringFilterWithNulls/262144/9       395.928 MiB/sec  397.664 MiB/sec    0.439%
  FilterInt64FilterWithNulls/262144/0        621.828 MiB/sec  613.884 MiB/sec   -1.277%
  FilterStringFilterWithNulls/262144/10      578.179 MiB/sec  577.449 MiB/sec   -0.126%
  FilterFSLInt64FilterWithNulls/262144/14    4.068 GiB/sec    4.018 GiB/sec     -1.247%
  FilterInt64FilterWithNulls/262144/13       604.515 MiB/sec  575.481 MiB/sec   -4.803%
  FilterFSLInt64FilterNoNulls/262144/13      350.875 MiB/sec  355.061 MiB/sec    1.193%
  FilterStringFilterWithNulls/262144/0       441.188 MiB/sec  442.379 MiB/sec    0.270%
  FilterInt64FilterWithNulls/262144/7        623.569 MiB/sec  594.423 MiB/sec   -4.674%
  FilterStringFilterWithNulls/262144/12      73.925 MiB/sec   73.930 MiB/sec     0.007%
  FilterStringFilterNoNulls/262144/3         548.889 MiB/sec  548.269 MiB/sec   -0.113%
  FilterInt64FilterNoNulls/262144/0          7.942 GiB/sec    8.079 GiB/sec      1.727%
  FilterInt64FilterNoNulls/262144/6          3.827 GiB/sec    3.725 GiB/sec     -2.665%
  FilterStringFilterWithNulls/262144/2       9.138 GiB/sec    9.205 GiB/sec      0.726%
  FilterFSLInt64FilterWithNulls/262144/13    385.938 MiB/sec  370.599 MiB/sec   -3.975%
  FilterInt64FilterWithNulls/262144/9        549.281 MiB/sec  542.112 MiB/sec   -1.305%
  FilterInt64FilterWithNulls/262144/2        5.253 GiB/sec    5.047 GiB/sec     -3.918%
  FilterFSLInt64FilterNoNulls/262144/5       5.778 GiB/sec    5.676 GiB/sec     -1.761%
  FilterStringFilterNoNulls/262144/1         711.705 MiB/sec  697.941 MiB/sec   -1.934%
  FilterStringFilterNoNulls/262144/0         560.111 MiB/sec  560.315 MiB/sec    0.036%
  FilterStringFilterWithNulls/262144/5       8.773 GiB/sec    8.976 GiB/sec      2.318%
  FilterInt64FilterWithNulls/262144/11       4.863 GiB/sec    4.942 GiB/sec      1.631%
  FilterFSLInt64FilterWithNulls/262144/11    4.145 GiB/sec    4.089 GiB/sec     -1.362%
  FilterInt64FilterNoNulls/262144/2          7.854 GiB/sec    7.609 GiB/sec     -3.117%
  FilterStringFilterNoNulls/262144/11        9.751 GiB/sec    9.565 GiB/sec     -1.904%
  FilterStringFilterNoNulls/262144/7         641.570 MiB/sec  650.710 MiB/sec    1.425%
  FilterStringFilterWithNulls/262144/3       435.185 MiB/sec  436.932 MiB/sec    0.401%
  FilterFSLInt64FilterNoNulls/262144/14      5.202 GiB/sec    5.302 GiB/sec      1.915%
  FilterInt64FilterNoNulls/262144/4          674.907 MiB/sec  654.585 MiB/sec   -3.011%
  FilterInt64FilterNoNulls/262144/5          7.023 GiB/sec    6.971 GiB/sec     -0.741%
  FilterInt64FilterWithNulls/262144/12       548.203 MiB/sec  542.909 MiB/sec   -0.966%
  FilterFSLInt64FilterNoNulls/262144/10      387.772 MiB/sec  390.564 MiB/sec    0.720%
  FilterInt64FilterWithNulls/262144/8        4.951 GiB/sec    5.094 GiB/sec      2.880%
  FilterStringFilterNoNulls/262144/13        90.750 MiB/sec   91.694 MiB/sec     1.040%
  FilterFSLInt64FilterWithNulls/262144/12    230.292 MiB/sec  263.113 MiB/sec   14.252%
  FilterStringFilterNoNulls/262144/12        70.772 MiB/sec   70.740 MiB/sec    -0.044%
  FilterStringFilterWithNulls/262144/14      927.254 MiB/sec  925.791 MiB/sec   -0.158%
  FilterStringFilterNoNulls/262144/5         10.587 GiB/sec   10.322 GiB/sec    -2.509%
  FilterFSLInt64FilterNoNulls/262144/3       551.473 MiB/sec  556.816 MiB/sec    0.969%
  FilterInt64FilterNoNulls/262144/14         6.302 GiB/sec    6.848 GiB/sec      8.656%
  FilterInt64FilterWithNulls/262144/14       4.804 GiB/sec    4.945 GiB/sec      2.933%
  FilterStringFilterNoNulls/262144/14        1.257 GiB/sec    1.247 GiB/sec     -0.814%
  FilterFSLInt64FilterNoNulls/262144/6       399.266 MiB/sec  402.455 MiB/sec    0.799%
  FilterInt64FilterWithNulls/262144/5        5.037 GiB/sec    4.954 GiB/sec     -1.645%
  FilterFSLInt64FilterNoNulls/262144/8       5.576 GiB/sec    5.576 GiB/sec     -0.004%
  FilterFSLInt64FilterNoNulls/262144/7       462.231 MiB/sec  456.668 MiB/sec   -1.203%
  FilterFSLInt64FilterNoNulls/262144/11      5.377 GiB/sec    5.381 GiB/sec      0.082%
  FilterStringFilterNoNulls/262144/6         487.645 MiB/sec  487.464 MiB/sec   -0.037%
  FilterStringFilterNoNulls/262144/4         687.214 MiB/sec  678.019 MiB/sec   -1.338%
  FilterFSLInt64FilterWithNulls/262144/9     287.916 MiB/sec  285.805 MiB/sec   -0.733%
  FilterInt64FilterNoNulls/262144/9          3.245 GiB/sec    3.126
```
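The `change` column in these ursabot reports is the relative throughput delta of the contender against the baseline. A minimal sketch of that computation (plain Python; the helper name is mine, not ursabot's):

```python
def change_pct(baseline, contender):
    """Relative throughput change in percent, as shown in the benchmark diff."""
    return (contender - baseline) / baseline * 100.0

# FilterInt64FilterNoNulls/262144/0 above: 7.942 GiB/sec -> 8.079 GiB/sec.
# The reported 1.727% agrees with this up to rounding of the two throughputs.
delta = change_pct(7.942, 8.079)
assert abs(delta - 1.727) < 0.01
```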
[GitHub] [arrow] github-actions[bot] commented on pull request #7474: ARROW-8802: [C++][Dataset] Preserve dataset schema's metadata on column projection
github-actions[bot] commented on pull request #7474: URL: https://github.com/apache/arrow/pull/7474#issuecomment-645570471 https://issues.apache.org/jira/browse/ARROW-8802 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] wesm commented on pull request #7442: ARROW-9075: [C++] Optimized Filter implementation: faster performance + compilation, smaller code size
wesm commented on pull request #7442: URL: https://github.com/apache/arrow/pull/7442#issuecomment-645569875 @ursabot benchmark --benchmark-filter=Filter 04006ff
[GitHub] [arrow] fsaintjacques opened a new pull request #7474: ARROW-8802: [C++][Dataset] Preserve Dataset's schema metadata on column projection
fsaintjacques opened a new pull request #7474: URL: https://github.com/apache/arrow/pull/7474 Scanner does not preserve the original schema metadata when columns are projected.
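The bug being fixed is that projecting a subset of columns dropped the schema-level metadata. The intended behavior can be illustrated with a toy, pyarrow-free model (the `Schema` class and `project` helper here are illustrative, not the actual Arrow implementation):

```python
class Schema:
    """Toy schema: a list of field names plus schema-level key/value metadata."""
    def __init__(self, names, metadata=None):
        self.names = list(names)
        self.metadata = dict(metadata or {})

def project(schema, columns):
    # Keep only the requested columns, but carry the original schema's
    # metadata over to the projected schema (the behavior this PR restores).
    return Schema([n for n in schema.names if n in columns], schema.metadata)

full = Schema(["a", "b", "c"], {"pandas": "..."})
projected = project(full, ["a", "c"])
assert projected.names == ["a", "c"]
assert projected.metadata == {"pandas": "..."}  # metadata survives projection
```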
[GitHub] [arrow] pitrou commented on pull request #7471: ARROW-9109: [Python][Packaging] Enable S3 support in manylinux wheels
pitrou commented on pull request #7471: URL: https://github.com/apache/arrow/pull/7471#issuecomment-645562851 This grows the manylinux wheel sizes by about 2-3 MB (from ~13MB to ~15-16MB). I think it's reasonable.
[GitHub] [arrow] pitrou commented on pull request #7471: ARROW-9109: [Python][Packaging] Enable S3 support in manylinux wheels
pitrou commented on pull request #7471: URL: https://github.com/apache/arrow/pull/7471#issuecomment-645560128 It seems like everything is working fine here.
[GitHub] [arrow] wesm commented on pull request #7442: ARROW-9075: [C++] Optimized Filter implementation: faster performance + compilation, smaller code size
wesm commented on pull request #7442: URL: https://github.com/apache/arrow/pull/7442#issuecomment-645556193 These "readability" improvements made performance worse, so I'll revert them.
[GitHub] [arrow] ursabot commented on pull request #7442: ARROW-9075: [C++] Optimized Filter implementation: faster performance + compilation, smaller code size
ursabot commented on pull request #7442: URL: https://github.com/apache/arrow/pull/7442#issuecomment-645545062 [AMD64 Ubuntu 18.04 C++ Benchmark (#112989)](https://ci.ursalabs.org/#builders/73/builds/82) builder has succeeded. Revision: 21227cc7530e59a481f7e3c0aae8d351b4226e9d

```diff
  benchmark                                  baseline         contender         change
- FilterStringFilterNoNulls/262144/7         637.909 MiB/sec  572.355 MiB/sec  -10.276%
- FilterStringFilterNoNulls/262144/8         10.897 GiB/sec   8.711 GiB/sec    -20.057%
  FilterStringFilterNoNulls/262144/6         485.775 MiB/sec  476.410 MiB/sec   -1.928%
  FilterStringFilterWithNulls/262144/4       649.558 MiB/sec  677.796 MiB/sec    4.347%
  FilterInt64FilterNoNulls/262144/9          3.212 GiB/sec    3.264 GiB/sec      1.612%
- FilterFSLInt64FilterNoNulls/262144/13      351.877 MiB/sec  308.073 MiB/sec  -12.449%
- FilterFSLInt64FilterNoNulls/262144/10      389.471 MiB/sec  333.418 MiB/sec  -14.392%
- FilterInt64FilterNoNulls/262144/4          668.729 MiB/sec  625.199 MiB/sec   -6.509%
  FilterFSLInt64FilterWithNulls/262144/9     287.988 MiB/sec  276.495 MiB/sec   -3.991%
- FilterStringFilterWithNulls/262144/2       9.441 GiB/sec    8.793 GiB/sec     -6.865%
  FilterStringFilterWithNulls/262144/12      73.855 MiB/sec   82.463 MiB/sec    11.656%
- FilterFSLInt64FilterNoNulls/262144/5       6.091 GiB/sec    4.403 GiB/sec    -27.714%
- FilterFSLInt64FilterNoNulls/262144/3       550.519 MiB/sec  463.959 MiB/sec  -15.723%
  FilterInt64FilterNoNulls/262144/2          7.988 GiB/sec    7.976 GiB/sec     -0.147%
- FilterStringFilterNoNulls/262144/4         700.795 MiB/sec  605.189 MiB/sec  -13.643%
- FilterFSLInt64FilterWithNulls/262144/1     516.544 MiB/sec  460.521 MiB/sec  -10.846%
- FilterStringFilterWithNulls/262144/8       8.877 GiB/sec    8.364 GiB/sec     -5.779%
- FilterFSLInt64FilterWithNulls/262144/3     350.123 MiB/sec  329.103 MiB/sec   -6.004%
  FilterStringFilterWithNulls/262144/3       435.836 MiB/sec  494.167 MiB/sec   13.384%
  FilterInt64FilterNoNulls/262144/10         630.544 MiB/sec  628.104 MiB/sec   -0.387%
- FilterStringFilterNoNulls/262144/5         11.014 GiB/sec   8.788 GiB/sec    -20.216%
  FilterInt64FilterNoNulls/262144/3          4.263 GiB/sec    4.181 GiB/sec     -1.936%
  FilterInt64FilterWithNulls/262144/1        635.637 MiB/sec  615.015 MiB/sec   -3.244%
  FilterStringFilterWithNulls/262144/7       638.645 MiB/sec  678.465 MiB/sec    6.235%
- FilterFSLInt64FilterNoNulls/262144/2       6.506 GiB/sec    5.012 GiB/sec    -22.975%
- FilterFSLInt64FilterNoNulls/262144/0       729.854 MiB/sec  569.623 MiB/sec  -21.954%
  FilterInt64FilterNoNulls/262144/5          6.946 GiB/sec    6.899 GiB/sec     -0.674%
  FilterInt64FilterWithNulls/262144/12       545.763 MiB/sec  547.657 MiB/sec    0.347%
  FilterStringFilterNoNulls/262144/9         383.858 MiB/sec  377.178 MiB/sec   -1.740%
- FilterFSLInt64FilterNoNulls/262144/8       5.825 GiB/sec    4.702 GiB/sec    -19.289%
  FilterInt64FilterNoNulls/262144/13         632.053 MiB/sec  633.157 MiB/sec    0.175%
  FilterInt64FilterNoNulls/262144/1          1.020 GiB/sec    1.022 GiB/sec      0.239%
- FilterFSLInt64FilterNoNulls/262144/12      242.197 MiB/sec  228.152 MiB/sec   -5.799%
  FilterInt64FilterWithNulls/262144/4        640.980 MiB/sec  614.192 MiB/sec   -4.179%
  FilterInt64FilterWithNulls/262144/8        4.967 GiB/sec    5.071 GiB/sec      2.102%
- FilterFSLInt64FilterWithNulls/262144/0     396.373 MiB/sec  374.388 MiB/sec   -5.546%
  FilterInt64FilterWithNulls/262144/11       4.934 GiB/sec    4.997 GiB/sec      1.282%
- FilterFSLInt64FilterNoNulls/262144/14      5.435 GiB/sec    4.459 GiB/sec    -17.946%
  FilterInt64FilterNoNulls/262144/12         3.255 GiB/sec    3.185 GiB/sec     -2.144%
  FilterStringFilterWithNulls/262144/1       638.704 MiB/sec  690.413 MiB/sec    8.096%
- FilterStringFilterNoNulls/262144/2         11.411 GiB/sec   9.040 GiB/sec    -20.778%
  FilterInt64FilterWithNulls/262144/6        582.753 MiB/sec  554.462 MiB/sec   -4.855%
  FilterStringFilterWithNulls/262144/10      586.149 MiB/sec  616.404 MiB/sec    5.162%
  FilterInt64FilterNoNulls/262144/0          7.653 GiB/sec    7.971 GiB/sec      4.146%
  FilterInt64FilterWithNulls/262144/13       590.396 MiB/sec  607.816 MiB/sec    2.951%
- FilterStringFilterNoNulls/262144/14        1.254 GiB/sec    1011.778 MiB/sec -21.233%
- FilterFSLInt64FilterWithNulls/262144/4     474.573 MiB/sec  428.073 MiB/sec   -9.798%
  FilterInt64FilterWithNulls/262144/2        5.245 GiB/sec    5.072 GiB/sec     -3.310%
- FilterStringFilterWithNulls/262144/11      8.381 GiB/sec    7.793 GiB/sec     -7.006%
```
[GitHub] [arrow] fsaintjacques commented on a change in pull request #7156: ARROW-8074: [C++][Dataset][Python] FileFragments from buffers and NativeFiles
fsaintjacques commented on a change in pull request #7156: URL: https://github.com/apache/arrow/pull/7156#discussion_r441740202

## File path: python/pyarrow/_dataset.pyx ##

```diff
@@ -43,6 +44,51 @@ def _forbid_instantiation(klass, subclasses_instead=True):
     raise TypeError(msg)
 
+cdef class FileSource:
+
+    cdef:
+        CFileSource wrapped
+
+    def __cinit__(self, file, FileSystem filesystem=None):
+        cdef:
+            shared_ptr[CFileSystem] c_filesystem
+            c_string c_path
+            shared_ptr[CRandomAccessFile] c_file
+            shared_ptr[CBuffer] c_buffer
+
+        if isinstance(file, FileSource):
```

Review comment: This is odd. Should you have `_ensure_file_source` like other methods?
[GitHub] [arrow] kszucs commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower
kszucs commented on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-645539612 Added the `libutf8proc` dependency to the ursabot builders; the same could be done for the docker-compose images. The tests are failing though.
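The semantics of an element-wise string kernel like `utf8_upper` can be mimicked in plain Python, including Arrow's convention that a null input slot produces a null output slot. This sketch uses Python's `str.upper`, which is not byte-for-byte identical to utf8proc's case mapping, purely to show the shape of the kernel:

```python
def utf8_upper_like(values):
    # Element-wise uppercase; None (null) propagates, as in Arrow kernels.
    return [v.upper() if v is not None else None for v in values]

assert utf8_upper_like(["árvíz", None, "ok"]) == ["ÁRVÍZ", None, "OK"]
```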
[GitHub] [arrow] kszucs commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower
kszucs commented on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-645538795 @ursabot build
[GitHub] [arrow] wesm commented on pull request #7442: ARROW-9075: [C++] Optimized Filter implementation: faster performance + compilation, smaller code size
wesm commented on pull request #7442: URL: https://github.com/apache/arrow/pull/7442#issuecomment-645526968 @ursabot benchmark --benchmark-filter=Filter 04006ff
[GitHub] [arrow] github-actions[bot] commented on pull request #7471: ARROW-9109: [Python][Packaging] Enable S3 support in manylinux wheels
github-actions[bot] commented on pull request #7471: URL: https://github.com/apache/arrow/pull/7471#issuecomment-645526238 Revision: f28701736f39939dc7eafb3db2e093612ed3c10c Submitted crossbow builds: [ursa-labs/crossbow @ actions-328](https://github.com/ursa-labs/crossbow/branches/all?query=actions-328)

|Task|Status|
|---|---|
|wheel-manylinux1-cp35m|Azure|
|wheel-manylinux1-cp36m|Azure|
|wheel-manylinux1-cp37m|Azure|
|wheel-manylinux1-cp38|Azure|
|wheel-manylinux2010-cp35m|Azure|
|wheel-manylinux2010-cp36m|Azure|
|wheel-manylinux2010-cp37m|Azure|
|wheel-manylinux2010-cp38|Azure|
|wheel-manylinux2014-cp35m|Azure|
|wheel-manylinux2014-cp36m|Azure|
|wheel-manylinux2014-cp37m|Azure|
|wheel-manylinux2014-cp38|Azure|
|wheel-osx-cp35m|TravisCI|
|wheel-osx-cp36m|TravisCI|
|wheel-osx-cp37m|TravisCI|
|wheel-osx-cp38|TravisCI|
|wheel-win-cp35m|Appveyor|
|wheel-win-cp36m|Appveyor|
[GitHub] [arrow] pitrou commented on pull request #7471: ARROW-9109: [Python][Packaging] Enable S3 support in manylinux wheels
pitrou commented on pull request #7471: URL: https://github.com/apache/arrow/pull/7471#issuecomment-645525436 @github-actions crossbow submit -g wheel
[GitHub] [arrow] wesm commented on pull request #7442: ARROW-9075: [C++] Optimized Filter implementation: faster performance + compilation, smaller code size
wesm commented on pull request #7442: URL: https://github.com/apache/arrow/pull/7442#issuecomment-645521577 Something is weird with the commit history; I'm not sure those benchmarks are right. I'll rebase things again and rerun.
[GitHub] [arrow] ursabot commented on pull request #7442: ARROW-9075: [C++] Optimized Filter implementation: faster performance + compilation, smaller code size
ursabot commented on pull request #7442: URL: https://github.com/apache/arrow/pull/7442#issuecomment-645518289 [AMD64 Ubuntu 18.04 C++ Benchmark (#112952)](https://ci.ursalabs.org/#builders/73/builds/81) builder has succeeded. Revision: f50b39e54c50e8a53606eda486c88e6ec51d7006

```diff
  benchmark                                  baseline         contender         change
- FilterFSLInt64FilterNoNulls/262144/14      5.457 GiB/sec    4.398 GiB/sec    -19.404%
  FilterStringFilterWithNulls/262144/4       642.405 MiB/sec  677.920 MiB/sec    5.528%
- FilterFSLInt64FilterNoNulls/262144/7       463.992 MiB/sec  378.391 MiB/sec  -18.449%
  FilterFSLInt64FilterWithNulls/262144/6     333.996 MiB/sec  320.327 MiB/sec   -4.093%
- FilterFSLInt64FilterWithNulls/262144/1     516.189 MiB/sec  459.926 MiB/sec  -10.900%
- FilterStringFilterNoNulls/262144/4         681.504 MiB/sec  595.788 MiB/sec  -12.577%
- FilterFSLInt64FilterNoNulls/262144/8       5.889 GiB/sec    4.675 GiB/sec    -20.610%
- FilterInt64FilterWithNulls/262144/10       606.960 MiB/sec  547.973 MiB/sec   -9.718%
- FilterInt64FilterNoNulls/262144/7          638.264 MiB/sec  568.923 MiB/sec  -10.864%
  FilterStringFilterWithNulls/262144/6       431.474 MiB/sec  484.077 MiB/sec   12.191%
- FilterStringFilterNoNulls/262144/14        1.245 GiB/sec    1008.386 MiB/sec -20.893%
  FilterFSLInt64FilterWithNulls/262144/11    4.239 GiB/sec    4.029 GiB/sec     -4.954%
- FilterStringFilterNoNulls/262144/8         10.899 GiB/sec   8.494 GiB/sec    -22.064%
- FilterFSLInt64FilterNoNulls/262144/4       515.626 MiB/sec  406.426 MiB/sec  -21.178%
  FilterInt64FilterNoNulls/262144/6          3.697 GiB/sec    3.525 GiB/sec     -4.664%
  FilterInt64FilterNoNulls/262144/8          6.829 GiB/sec    6.809 GiB/sec     -0.301%
- FilterFSLInt64FilterNoNulls/262144/2       6.453 GiB/sec    4.950 GiB/sec    -23.289%
- FilterInt64FilterWithNulls/262144/13       606.984 MiB/sec  548.948 MiB/sec   -9.561%
- FilterStringFilterNoNulls/262144/1         707.132 MiB/sec  609.027 MiB/sec  -13.874%
  FilterStringFilterWithNulls/262144/3       436.301 MiB/sec  488.825 MiB/sec   12.038%
  FilterStringFilterWithNulls/262144/1       616.105 MiB/sec  675.493 MiB/sec    9.639%
  FilterStringFilterNoNulls/262144/3         548.660 MiB/sec  533.539 MiB/sec   -2.756%
- FilterFSLInt64FilterNoNulls/262144/9       268.363 MiB/sec  250.359 MiB/sec   -6.709%
- FilterStringFilterNoNulls/262144/13        89.995 MiB/sec   76.326 MiB/sec   -15.189%
  FilterStringFilterWithNulls/262144/12      71.366 MiB/sec   82.415 MiB/sec    15.483%
  FilterInt64FilterNoNulls/262144/9          3.209 GiB/sec    3.114 GiB/sec     -2.971%
  FilterFSLInt64FilterWithNulls/262144/9     288.819 MiB/sec  276.679 MiB/sec   -4.203%
  FilterStringFilterNoNulls/262144/12        66.141 MiB/sec   65.509 MiB/sec    -0.956%
- FilterFSLInt64FilterWithNulls/262144/4     474.907 MiB/sec  429.013 MiB/sec   -9.664%
- FilterInt64FilterWithNulls/262144/1        651.659 MiB/sec  556.258 MiB/sec  -14.640%
  FilterStringFilterWithNulls/262144/14      911.019 MiB/sec  871.756 MiB/sec   -4.310%
- FilterInt64FilterNoNulls/262144/4          675.941 MiB/sec  569.448 MiB/sec  -15.755%
- FilterFSLInt64FilterNoNulls/262144/13      352.227 MiB/sec  307.638 MiB/sec  -12.659%
  FilterInt64FilterWithNulls/262144/5        5.129 GiB/sec    4.921 GiB/sec     -4.068%
- FilterFSLInt64FilterWithNulls/262144/14    4.168 GiB/sec    3.909 GiB/sec     -6.200%
  FilterStringFilterWithNulls/262144/9       396.156 MiB/sec  442.591 MiB/sec   11.721%
- FilterFSLInt64FilterNoNulls/262144/3       554.664 MiB/sec  464.787 MiB/sec  -16.204%
- FilterStringFilterNoNulls/262144/2         11.394 GiB/sec   8.924 GiB/sec    -21.683%
- FilterStringFilterWithNulls/262144/8       8.856 GiB/sec    8.075 GiB/sec     -8.825%
- FilterFSLInt64FilterNoNulls/262144/10      389.368 MiB/sec  333.033 MiB/sec  -14.468%
- FilterFSLInt64FilterNoNulls/262144/11      5.587 GiB/sec    4.507 GiB/sec    -19.338%
  FilterStringFilterWithNulls/262144/10      580.314 MiB/sec  612.106 MiB/sec    5.478%
- FilterFSLInt64FilterNoNulls/262144/5       6.032 GiB/sec    4.717 GiB/sec    -21.802%
- FilterFSLInt64FilterNoNulls/262144/0       725.211 MiB/sec  565.535 MiB/sec  -22.018%
- FilterInt64FilterNoNulls/262144/3          4.266 GiB/sec    3.855 GiB/sec     -9.641%
- FilterInt64FilterWithNulls/262144/12       549.159 MiB/sec  499.761 MiB/sec   -8.995%
- FilterInt64FilterWithNulls/262144/0        622.810 MiB/sec  497.075 MiB/sec  -20.188%
- FilterInt64FilterNoNulls/262144/1          1.021 GiB/sec    980.686 MiB/sec   -6.230%
- FilterFSLInt64FilterWithNulls/262144/0     399.890 MiB/sec  375.677
```
[GitHub] [arrow] github-actions[bot] commented on pull request #7471: ARROW-9109: [Python][Packaging] Enable S3 support in manylinux wheels
github-actions[bot] commented on pull request #7471: URL: https://github.com/apache/arrow/pull/7471#issuecomment-645507606 Revision: 282de25836ec86c707c34340c5a0f4907392e178 Submitted crossbow builds: [ursa-labs/crossbow @ actions-327](https://github.com/ursa-labs/crossbow/branches/all?query=actions-327)

|Task|Status|
|---|---|
|wheel-manylinux1-cp35m|Azure|
|wheel-manylinux1-cp36m|Azure|
|wheel-manylinux1-cp37m|Azure|
|wheel-manylinux1-cp38|Azure|
|wheel-manylinux2010-cp35m|Azure|
|wheel-manylinux2010-cp36m|Azure|
|wheel-manylinux2010-cp37m|Azure|
|wheel-manylinux2010-cp38|Azure|
|wheel-manylinux2014-cp35m|Azure|
|wheel-manylinux2014-cp36m|Azure|
|wheel-manylinux2014-cp37m|Azure|
|wheel-manylinux2014-cp38|Azure|
|wheel-osx-cp35m|TravisCI|
|wheel-osx-cp36m|TravisCI|
|wheel-osx-cp37m|TravisCI|
|wheel-osx-cp38|TravisCI|
|wheel-win-cp35m|Appveyor|
|wheel-win-cp36m|Appveyor|
[GitHub] [arrow] pitrou commented on pull request #7471: ARROW-9109: [Python][Packaging] Enable S3 support in manylinux wheels
pitrou commented on pull request #7471: URL: https://github.com/apache/arrow/pull/7471#issuecomment-645502945 @github-actions crossbow submit -g wheel
[GitHub] [arrow] pitrou commented on a change in pull request #7469: ARROW-8832: [Python] Provide better error message when S3/HDFS is not enabled in installation
pitrou commented on a change in pull request #7469: URL: https://github.com/apache/arrow/pull/7469#discussion_r441596426

## File path: python/pyarrow/fs.py ##

```diff
@@ -35,14 +36,31 @@
 # For backward compatibility.
 FileStats = FileInfo
 
+_not_imported = []
+
 try:
     from pyarrow._hdfs import HadoopFileSystem  # noqa
 except ImportError:
-    pass
+    _not_imported.append("HadoopFileSystem")
 
 try:
     from pyarrow._s3fs import S3FileSystem, initialize_s3, finalize_s3  # noqa
 except ImportError:
-    pass
+    _not_imported.append("S3FileSystem")
 else:
     initialize_s3()
+
+
+if sys.version_info >= (3, 7):
```

Review comment: I don't think this is required. At worst `__getattr__` is a no-op :-)

## File path: python/pyarrow/fs.py ##

```diff
+if sys.version_info >= (3, 7):
+
+    def __getattr__(name):
+
```

Review comment: Spurious empty line?
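The pattern under review relies on PEP 562 (Python 3.7+): a module-level `__getattr__` fires only for names not found in the module's namespace, so an optional filesystem whose import failed can raise a helpful ImportError instead of a bare AttributeError. A self-contained sketch of the same idea (a throwaway module is built explicitly here so the snippet runs in one file; `pyarrow.fs` does this at real module scope):

```python
import types

# Simulate pyarrow.fs where the S3 extension failed to import.
mod = types.ModuleType("fake_pyarrow_fs")
mod._not_imported = ["S3FileSystem"]

def _module_getattr(name):
    # Invoked only when `name` is not already a real module attribute (PEP 562).
    if name in mod._not_imported:
        raise ImportError(
            f"The pyarrow installation is not built with support for '{name}'")
    raise AttributeError(f"module {mod.__name__!r} has no attribute {name!r}")

mod.__getattr__ = _module_getattr  # the PEP 562 hook

try:
    mod.S3FileSystem
except ImportError as exc:
    message = str(exc)

assert "S3FileSystem" in message
```

On older interpreters the hook is simply never consulted, which is why the `sys.version_info` guard in the diff is harmless either way.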
[GitHub] [arrow] nealrichardson closed pull request #7415: ARROW-7028: [R] Date roundtrip results in different R storage mode
nealrichardson closed pull request #7415: URL: https://github.com/apache/arrow/pull/7415
[GitHub] [arrow] wesm commented on pull request #7442: ARROW-9075: [C++] Optimized Filter implementation: faster performance + compilation, smaller code size
wesm commented on pull request #7442: URL: https://github.com/apache/arrow/pull/7442#issuecomment-645498297 I think I improved some of the readability problems and addressed the other comments. I'd like to merge this soon once CI is green.
[GitHub] [arrow] wesm edited a comment on pull request #7442: ARROW-9075: [C++] Optimized Filter implementation: faster performance + compilation, smaller code size
wesm edited a comment on pull request #7442: URL: https://github.com/apache/arrow/pull/7442#issuecomment-645498297 I think I improved some of the readability problems and addressed the other comments. I'd like to merge this soon once CI is green.
[GitHub] [arrow] wesm commented on pull request #7442: ARROW-9075: [C++] Optimized Filter implementation: faster performance + compilation, smaller code size
wesm commented on pull request #7442: URL: https://github.com/apache/arrow/pull/7442#issuecomment-645497918 @ursabot benchmark --benchmark-filter=Filter c4f425768
[GitHub] [arrow] nevi-me closed pull request #7399: ARROW-9095: [Rust] Spec-compliant NullArray
nevi-me closed pull request #7399: URL: https://github.com/apache/arrow/pull/7399
[GitHub] [arrow] github-actions[bot] commented on pull request #7473: ARROW-9162: [Python] Expose Add/Subtract/Multiply arithmetic kernels
github-actions[bot] commented on pull request #7473: URL: https://github.com/apache/arrow/pull/7473#issuecomment-645490171 https://issues.apache.org/jira/browse/ARROW-9162
[GitHub] [arrow] kszucs opened a new pull request #7473: ARROW-9162: [Python] Expose Add/Subtract/Multiply arithmetic kernels
kszucs opened a new pull request #7473: URL: https://github.com/apache/arrow/pull/7473
[GitHub] [arrow] wesm commented on a change in pull request #7442: ARROW-9075: [C++] Optimized Filter implementation: faster performance + compilation, smaller code size
wesm commented on a change in pull request #7442: URL: https://github.com/apache/arrow/pull/7442#discussion_r441674808

## File path: cpp/src/arrow/compute/kernels/vector_selection.cc ##

@@ -0,0 +1,1816 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include
+#include
+#include
+
+#include "arrow/array/array_base.h"
+#include "arrow/array/array_binary.h"
+#include "arrow/array/array_dict.h"
+#include "arrow/array/array_nested.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/array/concatenate.h"
+#include "arrow/buffer_builder.h"
+#include "arrow/compute/api_vector.h"
+#include "arrow/compute/kernels/common.h"
+#include "arrow/compute/kernels/util_internal.h"
+#include "arrow/extension_type.h"
+#include "arrow/record_batch.h"
+#include "arrow/result.h"
+#include "arrow/util/bit_block_counter.h"
+#include "arrow/util/bit_util.h"
+#include "arrow/util/bitmap_ops.h"
+#include "arrow/util/bitmap_reader.h"
+#include "arrow/util/int_util.h"
+
+namespace arrow {
+
+using internal::BinaryBitBlockCounter;
+using internal::BitBlockCount;
+using internal::BitBlockCounter;
+using internal::BitmapReader;
+using internal::CopyBitmap;
+using internal::GetArrayView;
+using internal::IndexBoundsCheck;
+using internal::OptionalBitBlockCounter;
+using internal::OptionalBitIndexer;
+
+namespace compute {
+namespace internal {
+
+int64_t GetFilterOutputSize(const ArrayData& filter,
+                            FilterOptions::NullSelectionBehavior null_selection) {
+  int64_t output_size = 0;
+  int64_t position = 0;
+  if (filter.GetNullCount() > 0) {
+    const uint8_t* filter_is_valid = filter.buffers[0]->data();
+    BinaryBitBlockCounter bit_counter(filter.buffers[1]->data(), filter.offset,
+                                      filter_is_valid, filter.offset, filter.length);
+    if (null_selection == FilterOptions::EMIT_NULL) {
+      while (position < filter.length) {
+        BitBlockCount block = bit_counter.NextOrNotWord();
+        output_size += block.popcount;
+        position += block.length;
+      }
+    } else {
+      while (position < filter.length) {
+        BitBlockCount block = bit_counter.NextAndWord();
+        output_size += block.popcount;
+        position += block.length;
+      }
+    }
+  } else {
+    // The filter has no nulls, so we plow through its data as fast as
+    // possible.
+    BitBlockCounter bit_counter(filter.buffers[1]->data(), filter.offset, filter.length);
+    while (position < filter.length) {
+      BitBlockCount block = bit_counter.NextFourWords();
+      output_size += block.popcount;
+      position += block.length;
+    }
+  }
+  return output_size;
+}
+
+template <typename IndexType>
+Result<std::shared_ptr<ArrayData>> GetTakeIndicesImpl(
+    const ArrayData& filter, FilterOptions::NullSelectionBehavior null_selection,
+    MemoryPool* memory_pool) {
+  using T = typename IndexType::c_type;
+  typename TypeTraits<IndexType>::BuilderType builder(memory_pool);
+
+  const uint8_t* filter_data = filter.buffers[1]->data();
+  BitBlockCounter data_counter(filter_data, filter.offset, filter.length);
+
+  // The position relative to the start of the filter
+  T position = 0;
+
+  // The current position taking the filter offset into account
+  int64_t position_with_offset = filter.offset;
+  if (filter.GetNullCount() > 0) {
+    // The filter has nulls, so we scan the validity bitmap and the filter data
+    // bitmap together, branching on the null selection type.
+    const uint8_t* filter_is_valid = filter.buffers[0]->data();
+
+    // To count blocks whether filter_data[i] || !filter_is_valid[i]
+    BinaryBitBlockCounter filter_counter(filter_data, filter.offset, filter_is_valid,
+                                         filter.offset, filter.length);
+    if (null_selection == FilterOptions::DROP) {
+      while (position < filter.length) {
+        BitBlockCount and_block = filter_counter.NextAndWord();
+        RETURN_NOT_OK(builder.Reserve(and_block.popcount));
+        if (and_block.AllSet()) {
+          // All the values are selected and non-null
+          for (int64_t i = 0; i < and_block.length; ++i) {
+            builder.UnsafeAppend(position++);
+          }
+          position_with_offset += and_block.length;
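The null-selection logic in GetFilterOutputSize boils down to two popcounts: EMIT_NULL counts slots where the filter bit is set OR the slot is null (data | ~valid), while DROP counts only set, valid slots (data & valid). A stdlib-only Python sketch of the same semantics (an illustration of the idea, not Arrow's API; the function name is hypothetical):

```python
# Stdlib-only model of GetFilterOutputSize's semantics (not Arrow's API).
# A filter is a list of True/False/None; None marks a null slot.

def filter_output_size(filter_values, null_selection="DROP"):
    """Number of slots a filter selects.

    EMIT_NULL counts slots where the filter is true OR null
    (data | ~valid); DROP counts only slots that are true AND
    valid (data & valid).
    """
    valid = [v is not None for v in filter_values]
    data = [bool(v) for v in filter_values]
    if null_selection == "EMIT_NULL":
        return sum(d or not v for d, v in zip(data, valid))
    return sum(d and v for d, v in zip(data, valid))

f = [True, False, None, True]
print(filter_output_size(f, "DROP"))       # -> 2 (the two True slots)
print(filter_output_size(f, "EMIT_NULL"))  # -> 3 (True slots plus the null)
```

The C++ code computes the same quantities word-at-a-time with BinaryBitBlockCounter rather than slot-by-slot, which is what makes it fast.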
[GitHub] [arrow] wesm commented on a change in pull request #7442: ARROW-9075: [C++] Optimized Filter implementation: faster performance + compilation, smaller code size
wesm commented on a change in pull request #7442: URL: https://github.com/apache/arrow/pull/7442#discussion_r441668163

## File path: cpp/src/arrow/compute/api_vector.h ##

@@ -64,6 +67,24 @@
 Result<Datum> Filter(const Datum& values, const Datum& filter,
                      const FilterOptions& options = FilterOptions::Defaults(),
                      ExecContext* ctx = NULLPTR);
+
+namespace internal {
+
+// These internal functions are implemented in kernels/vector_selection.cc
+
+/// \brief Return the number of selected indices in the boolean filter
+ARROW_EXPORT
+int64_t GetFilterOutputSize(const ArrayData& filter,

Review comment: OK
[GitHub] [arrow] wesm commented on a change in pull request #7442: ARROW-9075: [C++] Optimized Filter implementation: faster performance + compilation, smaller code size
wesm commented on a change in pull request #7442: URL: https://github.com/apache/arrow/pull/7442#discussion_r441662165

## File path: cpp/src/arrow/compute/kernels/vector_selection.cc ##

(quoted diff context identical to the vector_selection.cc excerpt quoted earlier in this thread)
[GitHub] [arrow] wesm commented on a change in pull request #7442: ARROW-9075: [C++] Optimized Filter implementation: faster performance + compilation, smaller code size
wesm commented on a change in pull request #7442: URL: https://github.com/apache/arrow/pull/7442#discussion_r441661392

## File path: cpp/src/arrow/compute/kernels/vector_selection.cc ##

(quoted diff context identical to the vector_selection.cc excerpt quoted earlier in this thread)
[GitHub] [arrow] wesm commented on a change in pull request #7442: ARROW-9075: [C++] Optimized Filter implementation: faster performance + compilation, smaller code size
wesm commented on a change in pull request #7442: URL: https://github.com/apache/arrow/pull/7442#discussion_r441660992

## File path: cpp/src/arrow/compute/kernels/vector_selection.cc ##

(quoted diff context identical to the vector_selection.cc excerpt quoted earlier in this thread)
[GitHub] [arrow] wesm commented on a change in pull request #7442: ARROW-9075: [C++] Optimized Filter implementation: faster performance + compilation, smaller code size
wesm commented on a change in pull request #7442: URL: https://github.com/apache/arrow/pull/7442#discussion_r441660523

## File path: cpp/src/arrow/compute/kernels/vector_selection.cc ##

(quoted diff context identical to the vector_selection.cc excerpt quoted earlier in this thread, anchored at the BinaryBitBlockCounter construction in GetFilterOutputSize)

+    BinaryBitBlockCounter bit_counter(filter.buffers[1]->data(), filter.offset,

Review comment: Maybe. Let's leave this for follow up because I would need some benchmarks to be comfortable doing this
[GitHub] [arrow] wesm commented on a change in pull request #7442: ARROW-9075: [C++] Optimized Filter implementation: faster performance + compilation, smaller code size
wesm commented on a change in pull request #7442: URL: https://github.com/apache/arrow/pull/7442#discussion_r441659681

## File path: cpp/src/arrow/compute/kernels/vector_selection.cc ##

(quoted diff context identical to the vector_selection.cc excerpt quoted earlier in this thread)
+    BitBlockCounter bit_counter(filter.buffers[1]->data(), filter.offset, filter.length);

Review comment: Changing to use CountSetBits, which should be faster
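The CountSetBits approach referred to here counts 1-bits directly over the bitmap instead of iterating word-sized blocks. A stdlib-only Python sketch of the idea (an illustration, not Arrow's implementation; the function name is hypothetical): handle the ragged bits at either end bit-by-bit and popcount whole bytes in the middle.

```python
# Illustration of counting set bits in a bitmap with a bit offset
# (the idea behind CountSetBits; not Arrow's implementation).

def count_set_bits(bitmap: bytes, offset: int, length: int) -> int:
    """Count 1-bits in bit positions [offset, offset + length), LSB-first."""
    count = 0
    i = offset
    end = offset + length
    # Ragged leading bits, up to the next byte boundary
    while i < end and i % 8 != 0:
        count += (bitmap[i // 8] >> (i % 8)) & 1
        i += 1
    # Whole bytes: popcount a byte at a time
    while i + 8 <= end:
        count += bin(bitmap[i // 8]).count("1")
        i += 8
    # Ragged trailing bits
    while i < end:
        count += (bitmap[i // 8] >> (i % 8)) & 1
        i += 1
    return count

bm = bytes([0b10110010, 0b00001111])
print(count_set_bits(bm, 0, 16))  # -> 8
print(count_set_bits(bm, 4, 8))   # -> 7
```

Production implementations would popcount 64-bit words rather than single bytes, but the edge handling is the same.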
[GitHub] [arrow] wesm commented on a change in pull request #7442: ARROW-9075: [C++] Optimized Filter implementation: faster performance + compilation, smaller code size
wesm commented on a change in pull request #7442: URL: https://github.com/apache/arrow/pull/7442#discussion_r441658427

## File path: cpp/src/arrow/testing/random.cc ##

@@ -84,7 +84,7 @@ std::shared_ptr RandomArrayGenerator::Boolean(int64_t size, double probab
   BufferVector buffers{2};
   // Need 2 distinct generators such that probabilities are not shared.
-  GenOpt value_gen(seed(), 0, 1, probability);
+  GenOpt value_gen(seed(), 0, 1, 1 - probability);

Review comment: Done, and renamed the parameters to be clear that it's the "true" probability
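Why the `1 - probability` flip matters can be shown with a toy generator (hypothetical names; the real GenOpt internals may differ): if the underlying distribution is parameterized by the probability of emitting the *other* value, the public API must invert its parameter so that `probability` means P(true).

```python
import random

# Toy model (hypothetical; not GenOpt's real internals): the underlying
# generator is parameterized by the probability of emitting False.
def gen_booleans(n, false_probability, seed=0):
    rng = random.Random(seed)
    return [rng.random() >= false_probability for _ in range(n)]

# Public API: `probability` should mean P(True), so invert before passing down.
def random_booleans(n, probability, seed=0):
    return gen_booleans(n, 1 - probability, seed)

vals = random_booleans(100_000, 0.25, seed=42)
frac_true = sum(vals) / len(vals)
print(frac_true)  # close to 0.25 -- the public parameter now means P(True)
```

Without the inversion, asking for 25% true values would yield roughly 75%, which is the bug the diff fixes.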
[GitHub] [arrow] wesm commented on a change in pull request #7442: ARROW-9075: [C++] Optimized Filter implementation: faster performance + compilation, smaller code size
wesm commented on a change in pull request #7442: URL: https://github.com/apache/arrow/pull/7442#discussion_r441656051

## File path: cpp/src/arrow/compute/kernels/util_internal.h ##

@@ -0,0 +1,50 @@
+// (Apache License 2.0 header, identical to the one quoted earlier in this thread)
+
+#pragma once
+
+#include
+
+#include "arrow/buffer.h"
+
+namespace arrow {
+namespace compute {
+namespace internal {
+
+// An internal data structure for unpacking a primitive argument to pass to a
+// kernel implementation
+struct PrimitiveArg {
+  const uint8_t* is_valid;
+  const uint8_t* data;

Review comment: done
[GitHub] [arrow] github-actions[bot] commented on pull request #7471: ARROW-9109: [Python][Packaging] Enable S3 support in manylinux wheels
github-actions[bot] commented on pull request #7471: URL: https://github.com/apache/arrow/pull/7471#issuecomment-645462522

Revision: 0517d82791270284f4004ffe7812f08f2bb35235

Submitted crossbow builds: [ursa-labs/crossbow @ actions-326](https://github.com/ursa-labs/crossbow/branches/all?query=actions-326)

|Task|Status|
|----|------|
|wheel-manylinux1-cp35m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-326-azure-wheel-manylinux1-cp35m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-326-azure-wheel-manylinux1-cp35m)|
|wheel-manylinux1-cp36m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-326-azure-wheel-manylinux1-cp36m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-326-azure-wheel-manylinux1-cp36m)|
|wheel-manylinux1-cp37m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-326-azure-wheel-manylinux1-cp37m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-326-azure-wheel-manylinux1-cp37m)|
|wheel-manylinux1-cp38|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-326-azure-wheel-manylinux1-cp38)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-326-azure-wheel-manylinux1-cp38)|
|wheel-manylinux2010-cp35m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-326-azure-wheel-manylinux2010-cp35m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-326-azure-wheel-manylinux2010-cp35m)|
|wheel-manylinux2010-cp36m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-326-azure-wheel-manylinux2010-cp36m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-326-azure-wheel-manylinux2010-cp36m)|
|wheel-manylinux2010-cp37m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-326-azure-wheel-manylinux2010-cp37m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-326-azure-wheel-manylinux2010-cp37m)|
|wheel-manylinux2010-cp38|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-326-azure-wheel-manylinux2010-cp38)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-326-azure-wheel-manylinux2010-cp38)|
|wheel-manylinux2014-cp35m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-326-azure-wheel-manylinux2014-cp35m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-326-azure-wheel-manylinux2014-cp35m)|
|wheel-manylinux2014-cp36m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-326-azure-wheel-manylinux2014-cp36m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-326-azure-wheel-manylinux2014-cp36m)|
|wheel-manylinux2014-cp37m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-326-azure-wheel-manylinux2014-cp37m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-326-azure-wheel-manylinux2014-cp37m)|
|wheel-manylinux2014-cp38|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-326-azure-wheel-manylinux2014-cp38)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-326-azure-wheel-manylinux2014-cp38)|
|wheel-osx-cp35m|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-326-travis-wheel-osx-cp35m.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
|wheel-osx-cp36m|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-326-travis-wheel-osx-cp36m.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
|wheel-osx-cp37m|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-326-travis-wheel-osx-cp37m.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
|wheel-osx-cp38|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-326-travis-wheel-osx-cp38.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|
|wheel-win-cp35m|[![Appveyor](https://img.shields.io/appveyor/ci/ursa-labs/crossbow/actions-326-appveyor-wheel-win-cp35m.svg)](https://ci.appveyor.com/project/ursa-labs/crossbow/history)|
|wheel-win-cp36m|[![Appveyor](https://img.shields.io/appveyor/ci/ursa-labs/crossbow/actions-326-appveyor-wheel-win-cp36m.svg)](https://ci.appveyor.com/project/ursa-labs/crossbow/history)|
[GitHub] [arrow] kszucs closed pull request #6512: ARROW-8430: [CI] Configure self-hosted runners for Github Actions
kszucs closed pull request #6512: URL: https://github.com/apache/arrow/pull/6512 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] pitrou commented on pull request #7471: ARROW-9109: [Python][Packaging] Enable S3 support in manylinux wheels
pitrou commented on pull request #7471: URL: https://github.com/apache/arrow/pull/7471#issuecomment-645461657 @github-actions crossbow submit -g wheel
[GitHub] [arrow] pitrou opened a new pull request #7471: ARROW-9109: [Python][Packaging] Enable S3 support in manylinux wheels
pitrou opened a new pull request #7471: URL: https://github.com/apache/arrow/pull/7471
[GitHub] [arrow] wesm commented on a change in pull request #7442: ARROW-9075: [C++] Optimized Filter implementation: faster performance + compilation, smaller code size
wesm commented on a change in pull request #7442: URL: https://github.com/apache/arrow/pull/7442#discussion_r441643717 ## File path: cpp/src/arrow/compute/kernels/vector_selection.cc ## @@ -0,0 +1,1816 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
+ +#include +#include +#include + +#include "arrow/array/array_base.h" +#include "arrow/array/array_binary.h" +#include "arrow/array/array_dict.h" +#include "arrow/array/array_nested.h" +#include "arrow/array/builder_primitive.h" +#include "arrow/array/concatenate.h" +#include "arrow/buffer_builder.h" +#include "arrow/compute/api_vector.h" +#include "arrow/compute/kernels/common.h" +#include "arrow/compute/kernels/util_internal.h" +#include "arrow/extension_type.h" +#include "arrow/record_batch.h" +#include "arrow/result.h" +#include "arrow/util/bit_block_counter.h" +#include "arrow/util/bit_util.h" +#include "arrow/util/bitmap_ops.h" +#include "arrow/util/bitmap_reader.h" +#include "arrow/util/int_util.h" + +namespace arrow { + +using internal::BinaryBitBlockCounter; +using internal::BitBlockCount; +using internal::BitBlockCounter; +using internal::BitmapReader; +using internal::CopyBitmap; +using internal::GetArrayView; +using internal::IndexBoundsCheck; +using internal::OptionalBitBlockCounter; +using internal::OptionalBitIndexer; + +namespace compute { +namespace internal { + +int64_t GetFilterOutputSize(const ArrayData& filter, +FilterOptions::NullSelectionBehavior null_selection) { + int64_t output_size = 0; + int64_t position = 0; + if (filter.GetNullCount() > 0) { +const uint8_t* filter_is_valid = filter.buffers[0]->data(); +BinaryBitBlockCounter bit_counter(filter.buffers[1]->data(), filter.offset, + filter_is_valid, filter.offset, filter.length); +if (null_selection == FilterOptions::EMIT_NULL) { + while (position < filter.length) { +BitBlockCount block = bit_counter.NextOrNotWord(); +output_size += block.popcount; +position += block.length; + } +} else { + while (position < filter.length) { +BitBlockCount block = bit_counter.NextAndWord(); +output_size += block.popcount; +position += block.length; + } +} + } else { +// The filter has no nulls, so we plow through its data as fast as +// possible. 
+BitBlockCounter bit_counter(filter.buffers[1]->data(), filter.offset, filter.length); +while (position < filter.length) { + BitBlockCount block = bit_counter.NextFourWords(); + output_size += block.popcount; + position += block.length; +} + } + return output_size; +} + +template +Result> GetTakeIndicesImpl( +const ArrayData& filter, FilterOptions::NullSelectionBehavior null_selection, +MemoryPool* memory_pool) { + using T = typename IndexType::c_type; + typename TypeTraits::BuilderType builder(memory_pool); + + const uint8_t* filter_data = filter.buffers[1]->data(); + BitBlockCounter data_counter(filter_data, filter.offset, filter.length); + + // The position relative to the start of the filter + T position = 0; + + // The current position taking the filter offset into account + int64_t position_with_offset = filter.offset; + if (filter.GetNullCount() > 0) { Review comment: Someone can do it as follow up (and keeping an eye on the benchmarks to avoid perf regressions), it doesn't feel like a good use of my time when there are so many things to do for 1.0.0. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
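The counting logic in `GetFilterOutputSize` above can be sketched in pure Python (a hedged illustration, not Arrow's implementation: the packed validity/data bitmaps and word-at-a-time block popcounting are collapsed here into plain lists of 0/1 flags):

```python
def filter_output_size(filter_bits, valid_bits, null_selection):
    """Count how many rows a boolean filter selects.

    filter_bits: list of 0/1 filter values
    valid_bits: list of 0/1 validity flags, or None if the filter has no nulls
    null_selection: "DROP" or "EMIT_NULL"
    """
    if valid_bits is None:
        # No nulls: the output size is just the popcount of the filter data.
        return sum(filter_bits)
    if null_selection == "EMIT_NULL":
        # A null filter slot still emits one (null) output row, so count
        # slots where filter_bits[i] OR NOT valid_bits[i] (the "OrNot" word).
        return sum(1 for f, v in zip(filter_bits, valid_bits) if f or not v)
    # DROP: only valid, set slots produce output rows, so count slots
    # where filter_bits[i] AND valid_bits[i] (the "And" word).
    return sum(1 for f, v in zip(filter_bits, valid_bits) if f and v)
```

For the filter `[null, false, true, true]` (null encoded as valid=0), this yields 2 under DROP and 3 under EMIT_NULL, consistent with the null-selection semantics described in the diff.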
[GitHub] [arrow] fsaintjacques commented on a change in pull request #7468: ARROW-8283: [Python] Limit FileSystemDataset constructor from fragments/paths, no filesystem interaction
fsaintjacques commented on a change in pull request #7468: URL: https://github.com/apache/arrow/pull/7468#discussion_r441638604 ## File path: python/pyarrow/_dataset.pyx ## @@ -407,42 +407,82 @@ cdef class UnionDataset(Dataset): cdef class FileSystemDataset(Dataset): -"""A Dataset created from a set of files on a particular filesystem. +"""A Dataset of file fragments. + +A FileSystemDataset is composed of one or more FileFragment. Parameters -- -paths_or_selector : Union[FileSelector, List[FileInfo]] -List of files/directories to consume. +fragments : list[Fragments] +List of fragments to consume. schema : Schema -The top-level schema of the DataDataset. +The top-level schema of the Dataset. format : FileFormat -File format to create fragments from, currently only -ParquetFileFormat, IpcFileFormat, and CsvFileFormat are supported. -filesystem : FileSystem -The filesystem which files are from. -partitions : List[Expression], optional -Attach additional partition information for the file paths. +File format of the fragments, currently only ParquetFileFormat, +IpcFileFormat, and CsvFileFormat are supported. root_partition : Expression, optional The top-level partition of the DataDataset. 
""" cdef: CFileSystemDataset* filesystem_dataset -def __init__(self, paths_or_selector, schema=None, format=None, - filesystem=None, partitions=None, root_partition=None): +def __init__(self, fragments, Schema schema, FileFormat format, + root_partition=None): cdef: -FileInfo info -Expression expr FileFragment fragment -vector[CFileInfo] c_file_infos -vector[shared_ptr[CExpression]] c_partitions -shared_ptr[CFileFragment] c_fragment vector[shared_ptr[CFileFragment]] c_fragments CResult[shared_ptr[CDataset]] result root_partition = root_partition or _true +if not isinstance(root_partition, Expression): +raise TypeError( +"Argument 'root_partition' has incorrect type (expected " +"Epression, got {0})".format(type(root_partition)) +) + +for fragment in fragments: +c_fragments.push_back( +static_pointer_cast[CFileFragment, CFragment]( +fragment.unwrap())) + +result = CFileSystemDataset.Make( +pyarrow_unwrap_schema(schema), +( root_partition).unwrap(), +( format).unwrap(), +c_fragments +) +self.init(GetResultValue(result)) + +cdef void init(self, const shared_ptr[CDataset]& sp): +Dataset.init(self, sp) +self.filesystem_dataset = sp.get() + +@classmethod +def from_paths(cls, paths, schema=None, format=None, Review comment: Schema shouldn't be optional, add a test if the user doesn't pass a schema to see the result. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] fsaintjacques commented on a change in pull request #7437: ARROW-8943: [C++][Python][Dataset] Add partitioning support to ParquetDatasetFactory
fsaintjacques commented on a change in pull request #7437: URL: https://github.com/apache/arrow/pull/7437#discussion_r441633483 ## File path: python/pyarrow/tests/test_dataset.py ## @@ -1522,19 +1522,20 @@ def _create_parquet_dataset_partitioned(root_path): @pytest.mark.parquet @pytest.mark.pandas def test_parquet_dataset_factory_partitioned(tempdir): -# TODO support for specifying partitioning scheme - root_path = tempdir / "test_parquet_dataset_factory_partitioned" metadata_path, table = _create_parquet_dataset_partitioned(root_path) -dataset = ds.parquet_dataset(metadata_path) -# TODO partition column not yet included -# assert dataset.schema.equals(table.schema) +partitioning = ds.partitioning(flavor="hive") +dataset = ds.parquet_dataset(metadata_path, + partitioning=partitioning, + partition_base_dir=str(root_path)) Review comment: It is not needed anymore and fixed :)
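The hive flavor used in the test above infers partition fields from `key=value` directory segments in each file's path. A minimal pure-Python sketch of that inference (an illustration only; pyarrow's actual partitioning also infers or applies types for the values):

```python
def parse_hive_partition(relative_path):
    """Parse hive-style 'key=value' directory segments into a dict.

    relative_path: a file path relative to the dataset's base directory,
    e.g. "year=2020/month=6/part-0.parquet".
    """
    parts = {}
    # The last segment is the file name, never a partition directory.
    for segment in relative_path.split("/")[:-1]:
        if "=" in segment:
            key, _, value = segment.partition("=")
            parts[key] = value
    return parts
```

For example, `parse_hive_partition("year=2020/month=6/part-0.parquet")` returns `{"year": "2020", "month": "6"}`, and a bare file name with no partition directories yields an empty dict.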
[GitHub] [arrow] fsaintjacques closed pull request #7438: ARROW-9105: [C++][Dataset][Python] Pass an explicit schema to split_by_row_groups
fsaintjacques closed pull request #7438: URL: https://github.com/apache/arrow/pull/7438
[GitHub] [arrow] bkietz commented on a change in pull request #7442: ARROW-9075: [C++] Optimized Filter implementation: faster performance + compilation, smaller code size
bkietz commented on a change in pull request #7442: URL: https://github.com/apache/arrow/pull/7442#discussion_r440923999 ## File path: cpp/src/arrow/compute/api_vector.h ## @@ -64,6 +67,24 @@ Result Filter(const Datum& values, const Datum& filter, const FilterOptions& options = FilterOptions::Defaults(), ExecContext* ctx = NULLPTR); +namespace internal { + +// These internal functions are implemented in kernels/vector_selection.cc + +/// \brief Return the number of selected indices in the boolean filter +ARROW_EXPORT +int64_t GetFilterOutputSize(const ArrayData& filter, Review comment: This should probably be extracted as a ScalarFunction named popcount or so (follow up) ## File path: cpp/src/arrow/compute/kernels/vector_selection.cc ## @@ -0,0 +1,1816 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
+ +#include +#include +#include + +#include "arrow/array/array_base.h" +#include "arrow/array/array_binary.h" +#include "arrow/array/array_dict.h" +#include "arrow/array/array_nested.h" +#include "arrow/array/builder_primitive.h" +#include "arrow/array/concatenate.h" +#include "arrow/buffer_builder.h" +#include "arrow/compute/api_vector.h" +#include "arrow/compute/kernels/common.h" +#include "arrow/compute/kernels/util_internal.h" +#include "arrow/extension_type.h" +#include "arrow/record_batch.h" +#include "arrow/result.h" +#include "arrow/util/bit_block_counter.h" +#include "arrow/util/bit_util.h" +#include "arrow/util/bitmap_ops.h" +#include "arrow/util/bitmap_reader.h" +#include "arrow/util/int_util.h" + +namespace arrow { + +using internal::BinaryBitBlockCounter; +using internal::BitBlockCount; +using internal::BitBlockCounter; +using internal::BitmapReader; +using internal::CopyBitmap; +using internal::GetArrayView; +using internal::IndexBoundsCheck; +using internal::OptionalBitBlockCounter; +using internal::OptionalBitIndexer; + +namespace compute { +namespace internal { + +int64_t GetFilterOutputSize(const ArrayData& filter, +FilterOptions::NullSelectionBehavior null_selection) { + int64_t output_size = 0; + int64_t position = 0; + if (filter.GetNullCount() > 0) { +const uint8_t* filter_is_valid = filter.buffers[0]->data(); +BinaryBitBlockCounter bit_counter(filter.buffers[1]->data(), filter.offset, + filter_is_valid, filter.offset, filter.length); +if (null_selection == FilterOptions::EMIT_NULL) { + while (position < filter.length) { +BitBlockCount block = bit_counter.NextOrNotWord(); +output_size += block.popcount; +position += block.length; + } +} else { + while (position < filter.length) { +BitBlockCount block = bit_counter.NextAndWord(); +output_size += block.popcount; +position += block.length; + } +} + } else { +// The filter has no nulls, so we plow through its data as fast as +// possible. 
+BitBlockCounter bit_counter(filter.buffers[1]->data(), filter.offset, filter.length); Review comment: If this is faster than `CountSetBits`, the latter should probably be rewritten with `BitBlockCounter` ## File path: cpp/src/arrow/compute/kernels/vector_selection.cc ## @@ -0,0 +1,1816 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include +#include +#include + +#include
[GitHub] [arrow] wesm commented on pull request #7396: ARROW-9092: [C++][TRIAGE] Do not enable TestRoundFunctions when using LLVM 9 until gandiva-decimal-test is fixed
wesm commented on pull request #7396: URL: https://github.com/apache/arrow/pull/7396#issuecomment-645434797 It occurs for me every time. My toolchain environment is somewhat complex (because I'm using some conda packages), so I'll see if I can provide a reproduction for you (it could take me a while to get back to you). FWIW, I believe @kou encountered this problem also, probably in a more vanilla toolchain environment.
[GitHub] [arrow] fsaintjacques commented on a change in pull request #7437: ARROW-8943: [C++][Python][Dataset] Add partitioning support to ParquetDatasetFactory
fsaintjacques commented on a change in pull request #7437: URL: https://github.com/apache/arrow/pull/7437#discussion_r441621088 ## File path: cpp/src/arrow/dataset/file_parquet.h ## @@ -215,6 +215,34 @@ class ARROW_DS_EXPORT ParquetFileFragment : public FileFragment { friend class ParquetFileFormat; }; +struct ParquetFactoryOptions { + // Either an explicit Partitioning or a PartitioningFactory to discover one. + // + // If a factory is provided, it will be used to infer a schema for partition fields + // based on file and directory paths then construct a Partitioning. The default + // is a Partitioning which will yield no partition information. + // + // The (explicit or discovered) partitioning will be applied to discovered files + // and the resulting partition information embedded in the Dataset. + PartitioningOrFactory partitioning{Partitioning::Default()}; + + // For the purposes of applying the partitioning, paths will be stripped + // of the partition_base_dir. Files not matching the partition_base_dir + // prefix will be skipped for partition discovery. The ignored files will still + // be part of the Dataset, but will not have partition information. + // + // Example: + // partition_base_dir = "/dataset"; + // + // - "/dataset/US/sales.csv" -> "US/sales.csv" will be given to the partitioning + // + // - "/home/john/late_sales.csv" -> Will be ignored for partition discovery. + // + // This is useful for partitioning which parses directory when ordering + // is important, e.g. DirectoryPartitioning. + std::string partition_base_dir; Review comment: That's a good point. I'll follow what FileSystemFactory does with the selector base's path. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
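The `partition_base_dir` behavior documented in the struct above — strip the prefix before applying the partitioning, skip files outside it — can be sketched in a few lines of Python (an illustration of the documented contract, not Arrow's code):

```python
def strip_partition_base_dir(path, partition_base_dir):
    """Return path relative to partition_base_dir, or None if the path
    falls outside it.

    Files outside the base dir are still part of the Dataset, but are
    skipped for partition discovery (no partition information attached).
    """
    prefix = partition_base_dir.rstrip("/") + "/"
    if not path.startswith(prefix):
        return None
    return path[len(prefix):]
```

This reproduces the example in the comment: with `partition_base_dir = "/dataset"`, `"/dataset/US/sales.csv"` becomes `"US/sales.csv"`, while `"/home/john/late_sales.csv"` is ignored for discovery.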
[GitHub] [arrow] nealrichardson commented on pull request #7287: ARROW-8771: [C++] Add boost/process library to build support
nealrichardson commented on pull request #7287: URL: https://github.com/apache/arrow/pull/7287#issuecomment-645432605 That's odd. I wonder what's different about your build setup from the jobs we run on CI because I haven't seen that before. Do you think you could add a crossbow job that captures this build setup (is it just bundled boost and thrift, plus ARROW_FLIGHT=ON and tests on?) FWIW this boost include has been removed in Thrift 0.13: https://github.com/apache/thrift/commit/1f34504f43a7a409364d4114f180762bf2679e57#diff-73a92ed6f7bb65d6e8f29f74ae6c94bf So if/whenever we can [upgrade to 0.13](https://issues.apache.org/jira/browse/ARROW-8049), this particular header won't ever be invoked.
[GitHub] [arrow] wesm commented on a change in pull request #7442: ARROW-9075: [C++] Optimized Filter implementation: faster performance + compilation, smaller code size
wesm commented on a change in pull request #7442: URL: https://github.com/apache/arrow/pull/7442#discussion_r441613340 ## File path: cpp/src/arrow/compute/kernels/util_internal.cc ## @@ -0,0 +1,61 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
+ +#include "arrow/compute/kernels/util_internal.h" + +#include + +#include "arrow/array/data.h" +#include "arrow/type.h" +#include "arrow/util/checked_cast.h" + +namespace arrow { + +using internal::checked_cast; + +namespace compute { +namespace internal { + +const uint8_t* GetValidityBitmap(const ArrayData& data) { + const uint8_t* bitmap = nullptr; + if (data.buffers[0]) { +bitmap = data.buffers[0]->data(); + } + return bitmap; +} + +int GetBitWidth(const DataType& type) { + return checked_cast(type).bit_width(); +} + +PrimitiveArg GetPrimitiveArg(const ArrayData& arr) { + PrimitiveArg arg; + arg.is_valid = GetValidityBitmap(arr); + arg.data = arr.buffers[1]->data(); + arg.bit_width = GetBitWidth(*arr.type); + arg.offset = arr.offset; + arg.length = arr.length; + if (arg.bit_width > 1) { +arg.data += arr.offset * arg.bit_width / 8; + } + arg.null_count = arr.GetNullCount(); Review comment: I think now that I've fixed the problem in https://github.com/apache/arrow/commit/37c9804784325502bf47b651252c39bdcf3e03a9 I don't need to compute it here This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] wesm commented on a change in pull request #7442: ARROW-9075: [C++] Optimized Filter implementation: faster performance + compilation, smaller code size
wesm commented on a change in pull request #7442: URL: https://github.com/apache/arrow/pull/7442#discussion_r441613579 ## File path: cpp/src/arrow/compute/kernels/vector_selection_test.cc ## @@ -0,0 +1,1637 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
+ +#include +#include +#include +#include +#include +#include +#include + +#include "arrow/compute/api.h" +#include "arrow/compute/kernels/test_util.h" +#include "arrow/table.h" +#include "arrow/testing/gtest_common.h" +#include "arrow/testing/gtest_util.h" +#include "arrow/testing/random.h" +#include "arrow/testing/util.h" + +namespace arrow { + +using internal::checked_cast; +using internal::checked_pointer_cast; +using util::string_view; + +namespace compute { + +// -- +// Some random data generation helpers + +template +std::shared_ptr RandomNumeric(int64_t length, double null_probability, + random::RandomArrayGenerator* rng) { + return rng->Numeric(length, 0, 127, null_probability); +} + +std::shared_ptr RandomBoolean(int64_t length, double null_probability, + random::RandomArrayGenerator* rng) { + return rng->Boolean(length, 0.5, null_probability); +} + +std::shared_ptr RandomString(int64_t length, double null_probability, +random::RandomArrayGenerator* rng) { + return rng->String(length, 0, 32, null_probability); +} + +std::shared_ptr RandomLargeString(int64_t length, double null_probability, + random::RandomArrayGenerator* rng) { + return rng->LargeString(length, 0, 32, null_probability); +} + +std::shared_ptr RandomFixedSizeBinary(int64_t length, double null_probability, + random::RandomArrayGenerator* rng) { + const int32_t value_size = 16; + int64_t data_nbytes = length * value_size; + std::shared_ptr data = *AllocateBuffer(data_nbytes); + random_bytes(data_nbytes, /*seed=*/0, data->mutable_data()); + auto validity = rng->Boolean(length, 1 - null_probability); + + // Assemble the data for a FixedSizeBinaryArray + auto values_data = std::make_shared(fixed_size_binary(value_size), length); + values_data->buffers = {validity->data()->buffers[1], data}; + return MakeArray(values_data); +} + +// -- + +TEST(GetTakeIndices, Basics) { + auto CheckCase = [&](const std::string& filter_json, const std::string& indices_json, + FilterOptions::NullSelectionBehavior 
null_selection, + const std::shared_ptr& indices_type = uint16()) { +auto filter = ArrayFromJSON(boolean(), filter_json); +auto expected_indices = ArrayFromJSON(indices_type, indices_json); +ASSERT_OK_AND_ASSIGN(auto indices, + internal::GetTakeIndices(*filter->data(), null_selection)); +AssertArraysEqual(*expected_indices, *MakeArray(indices), /*verbose=*/true); + }; + + // Drop null cases + CheckCase("[]", "[]", FilterOptions::DROP); + CheckCase("[null]", "[]", FilterOptions::DROP); + CheckCase("[null, false, true, true, false, true]", "[2, 3, 5]", FilterOptions::DROP); + + // Emit null cases + CheckCase("[]", "[]", FilterOptions::EMIT_NULL); + CheckCase("[null]", "[null]", FilterOptions::EMIT_NULL); + CheckCase("[null, false, true, true]", "[null, 2, 3]", FilterOptions::EMIT_NULL); +} + +// TODO: Add slicing + +template +void CheckGetTakeIndicesCase(const Array& untyped_filter) { + const auto& filter = checked_cast(untyped_filter); + ASSERT_OK_AND_ASSIGN(std::shared_ptr drop_indices, + internal::GetTakeIndices(*filter.data(), FilterOptions::DROP)); + // Verify DROP indices + { +IndexArrayType indices(drop_indices); +int64_t out_position = 0; +for (int64_t i = 0; i < filter.length(); ++i) { + if (filter.IsValid(i)) { +if (filter.Value(i)) { + ASSERT_EQ(indices.Value(out_position), i); + ++out_position; +} + } +} +// Check that the end length agrees with the output of
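The `GetTakeIndices` cases exercised in the test above — converting a boolean filter into indices for a subsequent Take — can be mirrored in pure Python (a sketch of the observable behavior, ignoring the builder, index-type, and bit-block machinery):

```python
def get_take_indices(filter_values, null_selection):
    """Convert a boolean filter (None encodes null) into take indices.

    null_selection: "DROP" skips null filter slots entirely;
    "EMIT_NULL" emits a null index for them, so the taken output
    carries a null in that position.
    """
    indices = []
    for i, value in enumerate(filter_values):
        if value is None:
            if null_selection == "EMIT_NULL":
                indices.append(None)
        elif value:
            indices.append(i)
    return indices
```

This reproduces the quoted cases: `[null, false, true, true, false, true]` yields `[2, 3, 5]` under DROP, and `[null, false, true, true]` yields `[null, 2, 3]` under EMIT_NULL.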
[GitHub] [arrow] wesm commented on a change in pull request #7442: ARROW-9075: [C++] Optimized Filter implementation: faster performance + compilation, smaller code size
wesm commented on a change in pull request #7442: URL: https://github.com/apache/arrow/pull/7442#discussion_r441612459 ## File path: cpp/src/arrow/compute/kernels/vector_selection.cc ## @@ -0,0 +1,1816 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
+ +#include +#include +#include + +#include "arrow/array/array_base.h" +#include "arrow/array/array_binary.h" +#include "arrow/array/array_dict.h" +#include "arrow/array/array_nested.h" +#include "arrow/array/builder_primitive.h" +#include "arrow/array/concatenate.h" +#include "arrow/buffer_builder.h" +#include "arrow/compute/api_vector.h" +#include "arrow/compute/kernels/common.h" +#include "arrow/compute/kernels/util_internal.h" +#include "arrow/extension_type.h" +#include "arrow/record_batch.h" +#include "arrow/result.h" +#include "arrow/util/bit_block_counter.h" +#include "arrow/util/bit_util.h" +#include "arrow/util/bitmap_ops.h" +#include "arrow/util/bitmap_reader.h" +#include "arrow/util/int_util.h" + +namespace arrow { + +using internal::BinaryBitBlockCounter; +using internal::BitBlockCount; +using internal::BitBlockCounter; +using internal::BitmapReader; +using internal::CopyBitmap; +using internal::GetArrayView; +using internal::IndexBoundsCheck; +using internal::OptionalBitBlockCounter; +using internal::OptionalBitIndexer; + +namespace compute { +namespace internal { + +int64_t GetFilterOutputSize(const ArrayData& filter, +FilterOptions::NullSelectionBehavior null_selection) { + int64_t output_size = 0; + int64_t position = 0; + if (filter.GetNullCount() > 0) { +const uint8_t* filter_is_valid = filter.buffers[0]->data(); +BinaryBitBlockCounter bit_counter(filter.buffers[1]->data(), filter.offset, + filter_is_valid, filter.offset, filter.length); +if (null_selection == FilterOptions::EMIT_NULL) { + while (position < filter.length) { +BitBlockCount block = bit_counter.NextOrNotWord(); +output_size += block.popcount; +position += block.length; + } +} else { + while (position < filter.length) { +BitBlockCount block = bit_counter.NextAndWord(); +output_size += block.popcount; +position += block.length; + } +} + } else { +// The filter has no nulls, so we plow through its data as fast as +// possible. 
+BitBlockCounter bit_counter(filter.buffers[1]->data(), filter.offset, filter.length); +while (position < filter.length) { + BitBlockCount block = bit_counter.NextFourWords(); + output_size += block.popcount; + position += block.length; +} + } + return output_size; +} + +template +Result> GetTakeIndicesImpl( +const ArrayData& filter, FilterOptions::NullSelectionBehavior null_selection, +MemoryPool* memory_pool) { + using T = typename IndexType::c_type; + typename TypeTraits::BuilderType builder(memory_pool); + + const uint8_t* filter_data = filter.buffers[1]->data(); + BitBlockCounter data_counter(filter_data, filter.offset, filter.length); + + // The position relative to the start of the filter + T position = 0; + + // The current position taking the filter offset into account + int64_t position_with_offset = filter.offset; + if (filter.GetNullCount() > 0) { +// The filter has nulls, so we scan the validity bitmap and the filter data +// bitmap together, branching on the null selection type. +const uint8_t* filter_is_valid = filter.buffers[0]->data(); + +// To count blocks whether filter_data[i] || !filter_is_valid[i] +BinaryBitBlockCounter filter_counter(filter_data, filter.offset, filter_is_valid, + filter.offset, filter.length); +if (null_selection == FilterOptions::DROP) { + while (position < filter.length) { +BitBlockCount and_block = filter_counter.NextAndWord(); +RETURN_NOT_OK(builder.Reserve(and_block.popcount)); +if (and_block.AllSet()) { + // All the values are selected and non-null + for (int64_t i = 0; i < and_block.length; ++i) { +builder.UnsafeAppend(position++); + } + position_with_offset += and_block.length; +
[GitHub] [arrow] wesm commented on a change in pull request #7461: ARROW-8969: [C++] Reduce binary size of kernels/scalar_compare.cc.o by reusing more kernels between types, operators
wesm commented on a change in pull request #7461: URL: https://github.com/apache/arrow/pull/7461#discussion_r441609892 ## File path: cpp/src/arrow/compute/kernels/codegen_internal.h ## @@ -807,11 +885,30 @@ ArrayKernelExec SignedInteger(detail::GetTypeId get_id) { } } -// Generate a kernel given a templated functor for base binary types +// Generate a kernel given a templated functor for base binary types. Generates +// a single kernel for binary/string and large binary / large string. If your +// kernel implementation needs access to the specific type at compile time, +// please use BaseBinarySpecific. // // See "Numeric" above for description of the generator functor template class Generator, typename Type0, typename... Args> ArrayKernelExec BaseBinary(detail::GetTypeId get_id) { Review comment: I'll move these "generator-dispatchers" to a `generate::` namespace for clarity. I thought that `codegen::` was clear enough but I guess not.
[GitHub] [arrow] fsaintjacques commented on a change in pull request #7442: ARROW-9075: [C++] Optimized Filter implementation: faster performance + compilation, smaller code size
fsaintjacques commented on a change in pull request #7442: URL: https://github.com/apache/arrow/pull/7442#discussion_r441606010 ## File path: cpp/src/arrow/compute/kernels/vector_selection_test.cc ## @@ -0,0 +1,1637 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
+ +#include +#include +#include +#include +#include +#include +#include + +#include "arrow/compute/api.h" +#include "arrow/compute/kernels/test_util.h" +#include "arrow/table.h" +#include "arrow/testing/gtest_common.h" +#include "arrow/testing/gtest_util.h" +#include "arrow/testing/random.h" +#include "arrow/testing/util.h" + +namespace arrow { + +using internal::checked_cast; +using internal::checked_pointer_cast; +using util::string_view; + +namespace compute { + +// -- +// Some random data generation helpers + +template <typename Type> +std::shared_ptr<Array> RandomNumeric(int64_t length, double null_probability, + random::RandomArrayGenerator* rng) { + return rng->Numeric<Type>(length, 0, 127, null_probability); +} + +std::shared_ptr<Array> RandomBoolean(int64_t length, double null_probability, + random::RandomArrayGenerator* rng) { + return rng->Boolean(length, 0.5, null_probability); +} + +std::shared_ptr<Array> RandomString(int64_t length, double null_probability, +random::RandomArrayGenerator* rng) { + return rng->String(length, 0, 32, null_probability); +} + +std::shared_ptr<Array> RandomLargeString(int64_t length, double null_probability, + random::RandomArrayGenerator* rng) { + return rng->LargeString(length, 0, 32, null_probability); +} + +std::shared_ptr<Array> RandomFixedSizeBinary(int64_t length, double null_probability, + random::RandomArrayGenerator* rng) { + const int32_t value_size = 16; + int64_t data_nbytes = length * value_size; + std::shared_ptr<Buffer> data = *AllocateBuffer(data_nbytes); + random_bytes(data_nbytes, /*seed=*/0, data->mutable_data()); + auto validity = rng->Boolean(length, 1 - null_probability); + + // Assemble the data for a FixedSizeBinaryArray + auto values_data = std::make_shared<ArrayData>(fixed_size_binary(value_size), length); + values_data->buffers = {validity->data()->buffers[1], data}; + return MakeArray(values_data); +} + +// -- + +TEST(GetTakeIndices, Basics) { + auto CheckCase = [&](const std::string& filter_json, const std::string& indices_json, + FilterOptions::NullSelectionBehavior
null_selection, + const std::shared_ptr<DataType>& indices_type = uint16()) { +auto filter = ArrayFromJSON(boolean(), filter_json); +auto expected_indices = ArrayFromJSON(indices_type, indices_json); +ASSERT_OK_AND_ASSIGN(auto indices, + internal::GetTakeIndices(*filter->data(), null_selection)); +AssertArraysEqual(*expected_indices, *MakeArray(indices), /*verbose=*/true); + }; + + // Drop null cases + CheckCase("[]", "[]", FilterOptions::DROP); + CheckCase("[null]", "[]", FilterOptions::DROP); + CheckCase("[null, false, true, true, false, true]", "[2, 3, 5]", FilterOptions::DROP); + + // Emit null cases + CheckCase("[]", "[]", FilterOptions::EMIT_NULL); + CheckCase("[null]", "[null]", FilterOptions::EMIT_NULL); + CheckCase("[null, false, true, true]", "[null, 2, 3]", FilterOptions::EMIT_NULL); +} + +// TODO: Add slicing + +template <typename IndexArrayType> +void CheckGetTakeIndicesCase(const Array& untyped_filter) { + const auto& filter = checked_cast<const BooleanArray&>(untyped_filter); + ASSERT_OK_AND_ASSIGN(std::shared_ptr<ArrayData> drop_indices, + internal::GetTakeIndices(*filter.data(), FilterOptions::DROP)); + // Verify DROP indices + { +IndexArrayType indices(drop_indices); +int64_t out_position = 0; +for (int64_t i = 0; i < filter.length(); ++i) { + if (filter.IsValid(i)) { +if (filter.Value(i)) { + ASSERT_EQ(indices.Value(out_position), i); + ++out_position; +} + } +} +// Check that the end length agrees with the output of
[GitHub] [arrow] wesm commented on a change in pull request #7461: ARROW-8969: [C++] Reduce binary size of kernels/scalar_compare.cc.o by reusing more kernels between types, operators
wesm commented on a change in pull request #7461: URL: https://github.com/apache/arrow/pull/7461#discussion_r441606788 ## File path: cpp/src/arrow/compute/kernels/codegen_internal.h ## @@ -485,18 +502,44 @@ struct ScalarUnaryNotNullStateful { struct ArrayExec> { static void Exec(const ThisType& functor, KernelContext* ctx, const ExecBatch& batch, Datum* out) { - typename TypeTraits<Type>::BuilderType builder; - VisitArrayValuesInline(*batch[0].array(), [&](util::optional v) { -if (v.has_value()) { - KERNEL_RETURN_IF_ERROR(ctx, builder.Append(functor.op.Call(ctx, *v))); -} else { - KERNEL_RETURN_IF_ERROR(ctx, builder.AppendNull()); + // TODO: This code path is currently inadequately tested. + + using offset_type = typename Type::offset_type; Review comment: The validity bitmap is already precomputed (zero-copied, in fact) by the executor so using the builder naively here is computationally wasteful
[GitHub] [arrow] wesm commented on a change in pull request #7461: ARROW-8969: [C++] Reduce binary size of kernels/scalar_compare.cc.o by reusing more kernels between types, operators
wesm commented on a change in pull request #7461: URL: https://github.com/apache/arrow/pull/7461#discussion_r441605587 ## File path: cpp/src/arrow/compute/kernels/scalar_compare.cc ## @@ -54,72 +56,106 @@ struct GreaterEqual { } }; -struct Less { - template <typename T> - static constexpr bool Call(KernelContext*, const T& left, const T& right) { -return left < right; - } -}; - -struct LessEqual { - template <typename T> - static constexpr bool Call(KernelContext*, const T& left, const T& right) { -return left <= right; - } -}; +// Implement Greater, GreaterEqual by flipping arguments to Less, LessEqual -template -void AddCompare(const std::shared_ptr<DataType>& ty, ScalarFunction* func) { - ArrayKernelExec exec = codegen::ScalarBinaryEqualTypes::Exec; - DCHECK_OK(func->AddKernel({ty, ty}, boolean(), exec)); +template Review comment: Generator and Op are not the same thing in codegen_internal.h -- I agree with giving consistent names to isomorphic concepts, though.
[GitHub] [arrow] wesm commented on a change in pull request #7461: ARROW-8969: [C++] Reduce binary size of kernels/scalar_compare.cc.o by reusing more kernels between types, operators
wesm commented on a change in pull request #7461: URL: https://github.com/apache/arrow/pull/7461#discussion_r441605115 ## File path: cpp/src/arrow/type.h ## @@ -1251,6 +1256,7 @@ class ARROW_EXPORT TimestampType : public TemporalType, public ParametricType { static constexpr Type::type type_id = Type::TIMESTAMP; using c_type = int64_t; + using StorageType = Int64Type; Review comment: OK
[GitHub] [arrow] wesm commented on a change in pull request #7461: ARROW-8969: [C++] Reduce binary size of kernels/scalar_compare.cc.o by reusing more kernels between types, operators
wesm commented on a change in pull request #7461: URL: https://github.com/apache/arrow/pull/7461#discussion_r441604605 ## File path: cpp/src/arrow/scalar.h ## @@ -237,19 +246,17 @@ struct ARROW_EXPORT FixedSizeBinaryScalar : public BinaryScalar { explicit FixedSizeBinaryScalar(std::shared_ptr<DataType> type) : BinaryScalar(type) {} }; -template <typename T> -struct ARROW_EXPORT TemporalScalar : public Scalar { - using Scalar::Scalar; +template +struct ARROW_EXPORT TemporalScalar : internal::PrimitiveScalar { Review comment: Int64Scalar and TimestampScalar need to have a common base type in order to both be unboxed by templated code that uses Int64Type in the kernel codegen. So `PrimitiveScalar` won't work.
[GitHub] [arrow] itamarst commented on pull request #7169: ARROW-5359: [Python] Support non-nanosecond out-of-range timestamps in conversion to pandas
itamarst commented on pull request #7169: URL: https://github.com/apache/arrow/pull/7169#issuecomment-645414175 Great, ready to merge then?
[GitHub] [arrow] bkietz commented on a change in pull request #7461: ARROW-8969: [C++] Reduce binary size of kernels/scalar_compare.cc.o by reusing more kernels between types, operators
bkietz commented on a change in pull request #7461: URL: https://github.com/apache/arrow/pull/7461#discussion_r441584777 ## File path: cpp/src/arrow/compute/kernels/codegen_internal.h ## @@ -807,11 +885,30 @@ ArrayKernelExec SignedInteger(detail::GetTypeId get_id) { } } -// Generate a kernel given a templated functor for base binary types +// Generate a kernel given a templated functor for base binary types. Generates +// a single kernel for binary/string and large binary/large string. If your +// kernel implementation needs access to the specific type at compile time, +// please use BaseBinarySpecific. // // See "Numeric" above for description of the generator functor template <template <typename...> class Generator, typename Type0, typename... Args> ArrayKernelExec BaseBinary(detail::GetTypeId get_id) { Review comment: ```suggestion ArrayKernelExec GenerateForBaseBinary(detail::GetTypeId get_id) { ``` ## File path: cpp/src/arrow/type.h ## @@ -1251,6 +1256,7 @@ class ARROW_EXPORT TimestampType : public TemporalType, public ParametricType { static constexpr Type::type type_id = Type::TIMESTAMP; using c_type = int64_t; + using StorageType = Int64Type; Review comment: ```suggestion using PhysicalType = Int64Type; ``` ?
## File path: cpp/src/arrow/compute/kernels/scalar_compare.cc ## @@ -54,72 +56,106 @@ struct GreaterEqual { } }; -struct Less { - template <typename T> - static constexpr bool Call(KernelContext*, const T& left, const T& right) { -return left < right; - } -}; - -struct LessEqual { - template <typename T> - static constexpr bool Call(KernelContext*, const T& left, const T& right) { -return left <= right; - } -}; +// Implement Greater, GreaterEqual by flipping arguments to Less, LessEqual Review comment: ```suggestion // Implement Less, LessEqual by flipping arguments to Greater, GreaterEqual ``` ## File path: cpp/src/arrow/compute/kernels/scalar_compare.cc ## @@ -54,72 +56,106 @@ struct GreaterEqual { } }; -struct Less { - template <typename T> - static constexpr bool Call(KernelContext*, const T& left, const T& right) { -return left < right; - } -}; - -struct LessEqual { - template <typename T> - static constexpr bool Call(KernelContext*, const T& left, const T& right) { -return left <= right; - } -}; +// Implement Greater, GreaterEqual by flipping arguments to Less, LessEqual -template -void AddCompare(const std::shared_ptr<DataType>& ty, ScalarFunction* func) { - ArrayKernelExec exec = codegen::ScalarBinaryEqualTypes::Exec; - DCHECK_OK(func->AddKernel({ty, ty}, boolean(), exec)); +template Review comment: It would improve readability if we formalized the `Op` concept and named them consistently. Even within this PR these are also called `Generator`s. It's not consistently made clear what functions they're expected to provide. https://issues.apache.org/jira/browse/ARROW-9161 ## File path: cpp/src/arrow/scalar.h ## @@ -237,19 +246,17 @@ struct ARROW_EXPORT FixedSizeBinaryScalar : public BinaryScalar { explicit FixedSizeBinaryScalar(std::shared_ptr<DataType> type) : BinaryScalar(type) {} }; -template <typename T> -struct ARROW_EXPORT TemporalScalar : public Scalar { - using Scalar::Scalar; +template +struct ARROW_EXPORT TemporalScalar : internal::PrimitiveScalar { Review comment: I'm not sure this use of inheritance is kosher.
This implies `TimestampScalar isa Int64Scalar`, which is a relationship not present in parallel hierarchies (Array, DataType, Builder) and is therefore likely to be confusing. Why not use `TemporalScalar : PrimitiveScalar`? ## File path: cpp/src/arrow/compute/kernels/scalar_compare.cc ## @@ -54,72 +56,106 @@ struct GreaterEqual { } }; -struct Less { - template <typename T> - static constexpr bool Call(KernelContext*, const T& left, const T& right) { -return left < right; - } -}; - -struct LessEqual { - template <typename T> - static constexpr bool Call(KernelContext*, const T& left, const T& right) { -return left <= right; - } -}; +// Implement Greater, GreaterEqual by flipping arguments to Less, LessEqual -template -void AddCompare(const std::shared_ptr<DataType>& ty, ScalarFunction* func) { - ArrayKernelExec exec = codegen::ScalarBinaryEqualTypes::Exec; - DCHECK_OK(func->AddKernel({ty, ty}, boolean(), exec)); +template +void AddIntegerCompare(const std::shared_ptr<DataType>& ty, ScalarFunction* func) { + auto exec = + codegen::IntegerBased(*ty); + DCHECK_OK(func->AddKernel({ty, ty}, boolean(), std::move(exec))); } -template -void AddTimestampComparisons(ScalarFunction* func) { - ArrayKernelExec exec = - codegen::ScalarBinaryEqualTypes::Exec; - for (auto unit : {TimeUnit::SECOND, TimeUnit::MILLI, TimeUnit::MICRO, TimeUnit::NANO}) { -InputType in_type(match::TimestampUnit(unit)); -DCHECK_OK(func->AddKernel({in_type, in_type}, boolean(), exec)); - } +template +void AddGenericCompare(const std::shared_ptr<DataType>& ty,
[GitHub] [arrow] itamarst commented on a change in pull request #7169: ARROW-5359: [Python] Support non-nanosecond out-of-range timestamps in conversion to pandas
itamarst commented on a change in pull request #7169: URL: https://github.com/apache/arrow/pull/7169#discussion_r441594967 ## File path: cpp/src/arrow/python/arrow_to_pandas.cc ## @@ -1688,8 +1698,12 @@ static Status GetPandasWriterType(const ChunkedArray& data, const PandasOptions& break; case Type::TIMESTAMP: { const auto& ts_type = checked_cast<const TimestampType&>(*data.type()); - // XXX: Hack here for ARROW-7723 - if (ts_type.timezone() != "" && !options.ignore_timezone) { + if (options.timestamp_as_object && ts_type.unit() != TimeUnit::NANO) { +// Nanoseconds are never out of bounds for pandas, so in that case Review comment: Oh I see what you mean. OK.
[GitHub] [arrow] wesm commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower
wesm commented on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-645413587 I also agree with inlining the utf8proc functions until utf8proc can be patched to have better performance. I doubt that these optimizations will meaningfully impact the macro-performance of applications
[GitHub] [arrow] wesm commented on a change in pull request #7410: ARROW-971: [C++][Compute] IsValid, IsNull kernels
wesm commented on a change in pull request #7410: URL: https://github.com/apache/arrow/pull/7410#discussion_r441588468 ## File path: cpp/src/arrow/compute/kernels/scalar_validity.cc ## @@ -0,0 +1,107 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "arrow/compute/kernels/common.h" + +#include "arrow/util/bit_util.h" +#include "arrow/util/bitmap_ops.h" + +namespace arrow { + +using internal::CopyBitmap; +using internal::InvertBitmap; + +namespace compute { +namespace { + +struct IsValidOperator { + static void Call(KernelContext* ctx, const Scalar& in, Scalar* out) { +checked_cast<BooleanScalar*>(out)->value = in.is_valid; + } + + static void Call(KernelContext* ctx, const ArrayData& arr, ArrayData* out) { +DCHECK_EQ(out->offset, 0); +DCHECK_LE(out->length, arr.length); +if (arr.buffers[0] != nullptr) { + out->buffers[1] = arr.offset == 0 +? arr.buffers[0] +: SliceBuffer(arr.buffers[0], arr.offset / 8, arr.length / 8); + out->offset = arr.offset % 8; Review comment: It's not worth thinking too hard right now because kernel pipelining (temporary elision) is not implemented yet so until that happens it would be just speculation.
[GitHub] [arrow] bkietz commented on a change in pull request #7461: ARROW-8969: [C++] Reduce binary size of kernels/scalar_compare.cc.o by reusing more kernels between types, operators
bkietz commented on a change in pull request #7461: URL: https://github.com/apache/arrow/pull/7461#discussion_r441579699 ## File path: cpp/src/arrow/compute/kernels/codegen_internal.h ## @@ -787,6 +830,41 @@ ArrayKernelExec Integer(detail::GetTypeId get_id) { } } +template <template <typename...> class Generator, typename Type0, typename... Args> +ArrayKernelExec IntegerBased(detail::GetTypeId get_id) { Review comment: FWIW, we have slight precedent for naming this "GenerateForPhysicalInteger": `type_traits.h::is_physical_integer_type`
[GitHub] [arrow] xhochy commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower
xhochy commented on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-645401816 > Would a lookup table in the order of 256kb (generated at runtime, not in the binary) per case mapping be acceptable for Arrow? I would find that acceptable if the mapping is only generated when needed (so you pay a one-off cost when first using a UTF8 kernel). I would prefer, though, that `utf8proc` implement it like this on their side. Can you open an issue there?