[GitHub] [arrow] emkornfield commented on a change in pull request #7535: ARROW-9222: [Format][DONOTMERGE] Columnar.rst changes for removing validity bitmap from union types
emkornfield commented on a change in pull request #7535: URL: https://github.com/apache/arrow/pull/7535#discussion_r447396655 ## File path: docs/source/format/Columnar.rst ## @@ -566,33 +572,28 @@ having the values: ``[{f=1.2}, null, {f=3.4}, {i=5}]`` :: * Length: 4, Null count: 1 -* Validity bitmap buffer: - |Byte 0 (validity bitmap) | Bytes 1-63| - |-|---| - |1101 | 0 (padding) | - * Types buffer: |Byte 0 | Byte 1 | Byte 2 | Byte 3 | Bytes 4-63 | |-|-|--|--|-| - | 0 | unspecified | 0| 1| unspecified | + | 0 | 0 | 0| 1| unspecified | * Offset buffer: |Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-63 | |--|-||-|-| - | 0| unspecified | 1 | 0 | unspecified | + | 0| 1 | 2 | 0 | unspecified | * Children arrays: * Field-0 array (f: float): * Length: 2, nulls: 0 Review comment: should this be nulls: 1 now? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
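The buffers in the diff above can be sanity-checked with a short pure-Python sketch (illustrative only, not Arrow code; the function and variable names are made up): in a dense union without a validity bitmap, each logical slot is read from the child selected by its type id, at the given offset, and a slot is null exactly when the selected child value is null. With the float child holding [1.2, null, 3.4] (which is why the reviewer asks whether its null count should now be 1), decoding reproduces [{f=1.2}, null, {f=3.4}, {i=5}]:

```python
# Pure-Python sketch of dense-union decoding (illustrative, not Arrow API).
# The union carries no validity bitmap of its own: a slot is null iff the
# selected child's value at the given offset is null.

def decode_dense_union(type_ids, offsets, children, field_names):
    values = []
    for tid, off in zip(type_ids, offsets):
        child_value = children[tid][off]
        if child_value is None:
            values.append(None)  # nullness comes from the child
        else:
            values.append({field_names[tid]: child_value})
    return values

# Buffers from the example above: types [0, 0, 0, 1], offsets [0, 1, 2, 0]
type_ids = [0, 0, 0, 1]
offsets = [0, 1, 2, 0]
children = [[1.2, None, 3.4], [5]]  # f: float (1 null), i: int
result = decode_dense_union(type_ids, offsets, children, ["f", "i"])
# result == [{"f": 1.2}, None, {"f": 3.4}, {"i": 5}]
```

Note that nothing in the decoding loop ever consults a parent validity bitmap, which is the point of the proposed format change.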
[GitHub] [arrow] emkornfield commented on a change in pull request #7535: ARROW-9222: [Format][DONOTMERGE] Columnar.rst changes for removing validity bitmap from union types
emkornfield commented on a change in pull request #7535: URL: https://github.com/apache/arrow/pull/7535#discussion_r447396488 ## File path: docs/source/format/Columnar.rst ## @@ -688,11 +687,10 @@ will have the following layout: :: ||---| | joemark| unspecified (padding) | -Similar to structs, a particular child array may have a non-null slot -even if the validity bitmap of the parent union array indicates the -slot is null. Additionally, a child array may have a non-null slot -even if the types array indicates that a slot contains a different -type at the index. +Note that a child array may have a non-null slot even if the types array +indicates that a slot contains a different type at the index, so the Review comment: I have a little trouble parsing this language. Could it be stated more as a positive? > Only the slot in the array corresponding to the type index is considered. All "unselected" values are ignored and could be any semantically correct array value.
[GitHub] [arrow] emkornfield commented on pull request #7290: ARROW-1692: [Java] UnionArray round trip not working
emkornfield commented on pull request #7290: URL: https://github.com/apache/arrow/pull/7290#issuecomment-651515749 @liyafan82 I think that is a good point. If it supports both modes, I think that is a reasonable compromise for now, as long as @jacques-n is OK with it. But we can discuss further if needed.
[GitHub] [arrow] emkornfield commented on pull request #6156: ARROW-7539: [Java] FieldVector getFieldBuffers API should not set reader/writer indices
emkornfield commented on pull request #6156: URL: https://github.com/apache/arrow/pull/6156#issuecomment-651509291 Can we leave the old method in place, mark it as deprecated, and remove it in a later release?
[GitHub] [arrow] github-actions[bot] commented on pull request #7586: Calculate page and column statistics
github-actions[bot] commented on pull request #7586: URL: https://github.com/apache/arrow/pull/7586#issuecomment-651495743 Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Then could you also rename pull request title in the following format? ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY} See also: * [Other pull requests](https://github.com/apache/arrow/pulls/) * [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
[GitHub] [arrow] liyafan82 edited a comment on pull request #7290: ARROW-1692: [Java] UnionArray round trip not working
liyafan82 edited a comment on pull request #7290: URL: https://github.com/apache/arrow/pull/7290#issuecomment-651492666 In addition to the problem of the top-level validity buffer, I think there is another problem to discuss: Java is using the ordinal of the minor type as the type id, and this is not aligned with the C++ implementation/specification. This PR works around the problem by supporting two modes. However, the C++ and Java implementations are still not really aligned, IMO. Do we have a plan to mitigate this?
[GitHub] [arrow] liyafan82 commented on pull request #7290: ARROW-1692: [Java] UnionArray round trip not working
liyafan82 commented on pull request #7290: URL: https://github.com/apache/arrow/pull/7290#issuecomment-651492666 In addition to the problem of the top-level validity buffer, I think there is another problem to discuss: Java is using the ordinal of the minor type as the type id, and this is not aligned with the C++ implementation/specification. This PR works around the problem by supporting two modes. However, the C++ and Java implementations are still not really aligned, IMO.
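The id-mapping mismatch described above can be made concrete with a tiny pure-Python sketch (illustrative names, not Arrow API): the specification lets a union field declare arbitrary type codes, so a reader must map each code in the types buffer back to a child index via the field's declared typeIds list, rather than assume the code equals a child position or a MinorType ordinal:

```python
# Illustrative sketch (not Arrow API): resolving a union type code to a child.
# Spec-compliant readers look the code up in the field's declared typeIds list;
# assuming code == child position only works for ordinal-style ids (0, 1, ...).

def child_index_for(type_code, declared_type_ids):
    # declared_type_ids[i] is the type code assigned to child i
    return declared_type_ids.index(type_code)

# C++ permits arbitrary codes, e.g. children [float, int] with codes [5, 10];
# Java historically used ordinal-style codes such as [0, 1].
spec_style = child_index_for(10, [5, 10])   # second child
ordinal_style = child_index_for(1, [0, 1])  # second child
```

Both lookups resolve to the same child only because the reader consults the declared list; a reader that used the raw code 10 as a child index would fail on the spec-style union.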
[GitHub] [arrow] zeevm opened a new pull request #7586: Calculate page and column statistics
zeevm opened a new pull request #7586: URL: https://github.com/apache/arrow/pull/7586 1. Calculate page and column statistics 2. Use pre-calculated statistics, when available, to speed up writing data from other formats like ORC.
[GitHub] [arrow] nealrichardson commented on pull request #7524: ARROW-8899 [R] Add R metadata like pandas metadata for round-trip fidelity
nealrichardson commented on pull request #7524: URL: https://github.com/apache/arrow/pull/7524#issuecomment-651484901 Ok, this isn't necessarily pretty but I think it's done, or done enough for here. I'll add some more tests, probably some docs for the format, and poke around a bit more while doing the tickets for `haven::labelled` and `POSIXlt` types. @romainfrancois if you can glance at the changes I made and confirm that I haven't butchered things too badly, that would be great, thanks.
[GitHub] [arrow] nealrichardson edited a comment on pull request #7524: ARROW-8899 [R] Add R metadata like pandas metadata for round-trip fidelity
nealrichardson edited a comment on pull request #7524: URL: https://github.com/apache/arrow/pull/7524#issuecomment-651263042 I'm taking this over. Outstanding TODOs: - [x] Add tests - [x] Support record batches - [x] Support nested types (requires adapting the data structure and adding recursion) - [x] Test the print method - [x] Test/handle bad data in metadata$r; allow users to edit it manually? - [x] Don't keep attributes (e.g. factor levels) that we explicitly translate to Arrow already Top-level dataset metadata is deferred to https://issues.apache.org/jira/browse/ARROW-9271.
[GitHub] [arrow] liyafan82 commented on pull request #7347: ARROW-8230: [Java] Remove netty dependency from arrow-memory
liyafan82 commented on pull request #7347: URL: https://github.com/apache/arrow/pull/7347#issuecomment-651474250 @rymurr Thanks for your work. Just a few typos; after those are fixed, I think it will be ready to merge.
[GitHub] [arrow] liyafan82 commented on a change in pull request #7347: ARROW-8230: [Java] Remove netty dependency from arrow-memory
liyafan82 commented on a change in pull request #7347: URL: https://github.com/apache/arrow/pull/7347#discussion_r447363600 ## File path: java/memory/src/main/java/org/apache/arrow/memory/ArrowBuf.java ## @@ -227,13 +207,25 @@ public ArrowBuf slice(long index, long length) { return newBuf; } + /** + * Make a nio byte buffer from this arrowbuf. + */ public ByteBuffer nioBuffer() { -return asNettyBuffer().nioBuffer(); +return nioBuffer(readerIndex, checkedCastToInt(readableBytes())); } + + /** + * Make an nio byte buffer from this ArrowBuf. Review comment: nit: an -> a
[GitHub] [arrow] liyafan82 commented on a change in pull request #7347: ARROW-8230: [Java] Remove netty dependency from arrow-memory
liyafan82 commented on a change in pull request #7347: URL: https://github.com/apache/arrow/pull/7347#discussion_r447363481 ## File path: java/memory/src/main/java/org/apache/arrow/memory/rounding/DefaultRoundingPolicy.java ## @@ -17,33 +17,107 @@ package org.apache.arrow.memory.rounding; -import java.lang.reflect.Field; - -import org.apache.arrow.memory.NettyAllocationManager; import org.apache.arrow.memory.util.CommonUtil; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import io.netty.util.internal.SystemPropertyUtil; /** * The default rounding policy. That is, if the requested size is within the chunk size, * the rounded size will be the next power of two. Otherwise, the rounded size * will be identical to the requested size. */ public class DefaultRoundingPolicy implements RoundingPolicy { - + private static final Logger logger = LoggerFactory.getLogger(DefaultRoundingPolicy.class); public final long chunkSize; /** - * The singleton instance. + * The variables here and the static block calculates teh DEFAULT_CHUNK_SIZE. + * + * + * It was copied from {@link io.netty.buffer.PooledByteBufAllocator}. 
+ * */ - public static final DefaultRoundingPolicy INSTANCE = new DefaultRoundingPolicy(); + private static final int MIN_PAGE_SIZE = 4096; + private static final int MAX_CHUNK_SIZE = (int) (((long) Integer.MAX_VALUE + 1) / 2); + private static final long DEFAULT_CHUNK_SIZE; + private static final int DEFAULT_PAGE_SIZE; + private static final int DEFAULT_MAX_ORDER; - private DefaultRoundingPolicy() { + + static { +int defaultPageSize = SystemPropertyUtil.getInt("org.apache.memory.allocator.pageSize", 8192); +Throwable pageSizeFallbackCause = null; try { - Field field = NettyAllocationManager.class.getDeclaredField("CHUNK_SIZE"); - field.setAccessible(true); - chunkSize = (Long) field.get(null); -} catch (Exception e) { - throw new RuntimeException("Failed to get chunk size from allocation manager"); + validateAndCalculatePageShifts(defaultPageSize); +} catch (Throwable t) { + pageSizeFallbackCause = t; + defaultPageSize = 8192; +} +DEFAULT_PAGE_SIZE = defaultPageSize; + +int defaultMaxOrder = SystemPropertyUtil.getInt("org.apache.memory.allocator.maxOrder", 11); +Throwable maxOrderFallbackCause = null; +try { + validateAndCalculateChunkSize(DEFAULT_PAGE_SIZE, defaultMaxOrder); +} catch (Throwable t) { + maxOrderFallbackCause = t; + defaultMaxOrder = 11; +} +DEFAULT_MAX_ORDER = defaultMaxOrder; +DEFAULT_CHUNK_SIZE = validateAndCalculateChunkSize(DEFAULT_PAGE_SIZE, DEFAULT_MAX_ORDER); +if (logger.isDebugEnabled()) { + if (pageSizeFallbackCause == null) { +logger.debug("-Dorg.apache.memory.allocator.pageSize: {}", DEFAULT_PAGE_SIZE); + } else { +logger.debug("-Dorg.apache.memory.allocator.pageSize: {}", DEFAULT_PAGE_SIZE, pageSizeFallbackCause); + } + if (maxOrderFallbackCause == null) { +logger.debug("-Dorg.apache.memory.allocator.maxOrder: {}", DEFAULT_MAX_ORDER); + } else { +logger.debug("-Dorg.apache.memory.allocator.maxOrder: {}", DEFAULT_MAX_ORDER, maxOrderFallbackCause); + } Review comment: Also here we need only one parameter in the "else" statement. 
[GitHub] [arrow] liyafan82 commented on a change in pull request #7347: ARROW-8230: [Java] Remove netty dependency from arrow-memory
liyafan82 commented on a change in pull request #7347: URL: https://github.com/apache/arrow/pull/7347#discussion_r447363293 ## File path: java/memory/src/main/java/org/apache/arrow/memory/rounding/DefaultRoundingPolicy.java ## @@ -17,33 +17,103 @@ package org.apache.arrow.memory.rounding; -import java.lang.reflect.Field; - -import org.apache.arrow.memory.NettyAllocationManager; import org.apache.arrow.memory.util.CommonUtil; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import io.netty.util.internal.SystemPropertyUtil; /** * The default rounding policy. That is, if the requested size is within the chunk size, * the rounded size will be the next power of two. Otherwise, the rounded size * will be identical to the requested size. */ public class DefaultRoundingPolicy implements RoundingPolicy { - + private static final Logger logger = LoggerFactory.getLogger(DefaultRoundingPolicy.class); public final long chunkSize; /** - * The singleton instance. + * The variables here and the static block calculates the DEFAULT_CHUNK_SIZE. + * + * + * It was copied from {@link io.netty.buffer.PooledByteBufAllocator}. 
+ * */ - public static final DefaultRoundingPolicy INSTANCE = new DefaultRoundingPolicy(); + private static final int MIN_PAGE_SIZE = 4096; + private static final int MAX_CHUNK_SIZE = (int) (((long) Integer.MAX_VALUE + 1) / 2); + private static final long DEFAULT_CHUNK_SIZE; - private DefaultRoundingPolicy() { + + static { +int defaultPageSize = SystemPropertyUtil.getInt("org.apache.memory.allocator.pageSize", 8192); +Throwable pageSizeFallbackCause = null; try { - Field field = NettyAllocationManager.class.getDeclaredField("CHUNK_SIZE"); - field.setAccessible(true); - chunkSize = (Long) field.get(null); -} catch (Exception e) { - throw new RuntimeException("Failed to get chunk size from allocation manager"); + validateAndCalculatePageShifts(defaultPageSize); +} catch (Throwable t) { + pageSizeFallbackCause = t; + defaultPageSize = 8192; +} + +int defaultMaxOrder = SystemPropertyUtil.getInt("org.apache.memory.allocator.maxOrder", 11); +Throwable maxOrderFallbackCause = null; +try { + validateAndCalculateChunkSize(defaultPageSize, defaultMaxOrder); +} catch (Throwable t) { + maxOrderFallbackCause = t; + defaultMaxOrder = 11; +} +DEFAULT_CHUNK_SIZE = validateAndCalculateChunkSize(defaultPageSize, defaultMaxOrder); +if (logger.isDebugEnabled()) { + if (pageSizeFallbackCause == null) { +logger.debug("-Dorg.apache.memory.allocator.pageSize: {}", defaultPageSize); + } else { +logger.debug("-Dorg.apache.memory.allocator.pageSize: {}", defaultPageSize, pageSizeFallbackCause); + } Review comment: I think we only need one parameter here, so it should be logger.debug("-Dorg.apache.memory.allocator.pageSize: {}", pageSizeFallbackCause);
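For context, the rounding rule quoted at the top of this diff ("if the requested size is within the chunk size, the rounded size will be the next power of two; otherwise, the rounded size will be identical to the requested size") can be sketched in a few lines of Python. This is illustrative only: the real implementation is the Java class under review, and the chunk-size formula pageSize << maxOrder is an assumption based on the Netty PooledByteBufAllocator code the comment says was copied.

```python
# Sketch of the DefaultRoundingPolicy rule (illustrative, not the Java code):
# sizes up to the chunk size round up to the next power of two;
# larger sizes are returned unchanged.

def round_allocation(requested, chunk_size):
    if requested > chunk_size:
        return requested
    rounded = 1
    while rounded < requested:
        rounded <<= 1
    return rounded

# With the defaults quoted in the diff (pageSize 8192, maxOrder 11),
# the chunk size would be 8192 << 11 = 16 MiB (assumed formula).
CHUNK = 8192 << 11
small = round_allocation(5000, CHUNK)      # rounds up to 8192
large = round_allocation(CHUNK + 1, CHUNK) # unchanged, above chunk size
```

The power-of-two rounding below the chunk size is what lets the allocator reuse pooled chunks efficiently; oversized requests bypass the pool, so rounding them would only waste memory.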
[GitHub] [arrow] github-actions[bot] commented on pull request #7556: ARROW-9188: [C++] Use Brotli shared libraries if they are available
github-actions[bot] commented on pull request #7556: URL: https://github.com/apache/arrow/pull/7556#issuecomment-651467252 Revision: 821f30a834dab99cdc757100e51986384f0a391c Submitted crossbow builds: [ursa-labs/crossbow @ actions-367](https://github.com/ursa-labs/crossbow/branches/all?query=actions-367) |Task|Status| ||--| |centos-6-amd64|[![Github Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-367-github-centos-6-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-367-github-centos-6-amd64)| |centos-7-aarch64|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-367-travis-centos-7-aarch64.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)| |centos-7-amd64|[![Github Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-367-github-centos-7-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-367-github-centos-7-amd64)| |centos-8-aarch64|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-367-travis-centos-8-aarch64.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)| |centos-8-amd64|[![Github Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-367-github-centos-8-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-367-github-centos-8-amd64)| |conda-clean|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-367-azure-conda-clean)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-367-azure-conda-clean)| |conda-linux-gcc-py36-cpu|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-367-azure-conda-linux-gcc-py36-cpu)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-367-azure-conda-linux-gcc-py36-cpu)| 
|conda-linux-gcc-py36-cuda|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-367-azure-conda-linux-gcc-py36-cuda)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-367-azure-conda-linux-gcc-py36-cuda)| |conda-linux-gcc-py37-cpu|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-367-azure-conda-linux-gcc-py37-cpu)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-367-azure-conda-linux-gcc-py37-cpu)| |conda-linux-gcc-py37-cuda|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-367-azure-conda-linux-gcc-py37-cuda)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-367-azure-conda-linux-gcc-py37-cuda)| |conda-linux-gcc-py38-cpu|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-367-azure-conda-linux-gcc-py38-cpu)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-367-azure-conda-linux-gcc-py38-cpu)| |conda-linux-gcc-py38-cuda|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-367-azure-conda-linux-gcc-py38-cuda)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-367-azure-conda-linux-gcc-py38-cuda)| |conda-osx-clang-py36|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-367-azure-conda-osx-clang-py36)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-367-azure-conda-osx-clang-py36)| |conda-osx-clang-py37|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-367-azure-conda-osx-clang-py37)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-367-azure-conda-osx-clang-py37)| 
|conda-osx-clang-py38|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-367-azure-conda-osx-clang-py38)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-367-azure-conda-osx-clang-py38)| |conda-win-vs2015-py36|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-367-azure-conda-win-vs2015-py36)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-367-azure-conda-win-vs2015-py36)| |conda-win-vs2015-py37|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-367-azure-conda-win-vs2015-py37)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-367-azure-conda-win-vs2015-py37)|
[GitHub] [arrow] wesm commented on pull request #7556: ARROW-9188: [C++] Use Brotli shared libraries if they are available
wesm commented on pull request #7556: URL: https://github.com/apache/arrow/pull/7556#issuecomment-651466752 @github-actions crossbow submit -g linux -g wheel -g conda
[GitHub] [arrow] wesm commented on pull request #7556: ARROW-9188: [C++] Use Brotli shared libraries if they are available
wesm commented on pull request #7556: URL: https://github.com/apache/arrow/pull/7556#issuecomment-651466442 Actually, that's crazy. I'm taking the same approach as ZSTD and adding a CMake toggle between shared and static Brotli (with default being shared)
[GitHub] [arrow] github-actions[bot] commented on pull request #7585: ARROW-3520: [C++] Add "list_flatten" vector kernel wrapper for Flatten method of ListArray types
github-actions[bot] commented on pull request #7585: URL: https://github.com/apache/arrow/pull/7585#issuecomment-651460741 https://issues.apache.org/jira/browse/ARROW-3520
[GitHub] [arrow] wesm commented on pull request #7580: ARROW-9261: [Python] Fix CA certificate lookup with S3 filesystem on manylinux
wesm commented on pull request #7580: URL: https://github.com/apache/arrow/pull/7580#issuecomment-651460507 Hm not so fast. The macOS py35 failure seems legitimate https://travis-ci.org/github/ursa-labs/crossbow/builds/703242650#L10060
[GitHub] [arrow] wesm closed pull request #7560: ARROW-9252: [Integration] Factor out IPC integration tests into script, add back 0.14.1 "gold" files
wesm closed pull request #7560: URL: https://github.com/apache/arrow/pull/7560
[GitHub] [arrow] wesm commented on pull request #7560: ARROW-9252: [Integration] Factor out IPC integration tests into script, add back 0.14.1 "gold" files
wesm commented on pull request #7560: URL: https://github.com/apache/arrow/pull/7560#issuecomment-651458622 Yahtzee
[GitHub] [arrow] mrkn edited a comment on pull request #7539: ARROW-9156: [C++] Reducing the code size of the tensor module
mrkn edited a comment on pull request #7539: URL: https://github.com/apache/arrow/pull/7539#issuecomment-651458470 @wesm OK. I'll continue working on benchmarking in this pull request. If I need more time to tune, etc., I'll split the issue.
[GitHub] [arrow] mrkn commented on pull request #7539: ARROW-9156: [C++] Reducing the code size of the tensor module
mrkn commented on pull request #7539: URL: https://github.com/apache/arrow/pull/7539#issuecomment-651458470 OK. I'll continue working on benchmarking in this pull request. If I need more time to tune, etc., I'll split the issue.
[GitHub] [arrow] wesm closed pull request #7569: ARROW-9152: [C++] Specialized implementation of filtering Binary/LargeBinary-based types
wesm closed pull request #7569: URL: https://github.com/apache/arrow/pull/7569
[GitHub] [arrow] wesm commented on pull request #7569: ARROW-9152: [C++] Specialized implementation of filtering Binary/LargeBinary-based types
wesm commented on pull request #7569: URL: https://github.com/apache/arrow/pull/7569#issuecomment-651457980 +1, this is a bit dry so I'd rather reviewers reserve their time for other PRs
[GitHub] [arrow] wesm edited a comment on pull request #7539: ARROW-9156: [C++] Reducing the code size of the tensor module
wesm edited a comment on pull request #7539: URL: https://github.com/apache/arrow/pull/7539#issuecomment-651457062 @mrkn it's up to you, it's fine with me if you work on performance tuning (or at least measurement) in another PR or this one
[GitHub] [arrow] wesm commented on pull request #7539: ARROW-9156: [C++] Reducing the code size of the tensor module
wesm commented on pull request #7539: URL: https://github.com/apache/arrow/pull/7539#issuecomment-651457062 @mrkn it's up to you, it's fine with me if you work on performance tuning in another PR or this one
[GitHub] [arrow] mrkn commented on pull request #7539: ARROW-9156: [C++] Reducing the code size of the tensor module
mrkn commented on pull request #7539: URL: https://github.com/apache/arrow/pull/7539#issuecomment-651456329 @wesm Is it better to work on benchmarking in another pull request?
[GitHub] [arrow] nealrichardson edited a comment on pull request #7524: ARROW-8899 [R] Add R metadata like pandas metadata for round-trip fidelity
nealrichardson edited a comment on pull request #7524: URL: https://github.com/apache/arrow/pull/7524#issuecomment-651263042 I'm taking this over. Outstanding TODOs: - [x] Add tests - [x] Support record batches - [x] Support nested types (requires adapting the data structure and adding recursion) - [x] Test the print method - [ ] Test/handle bad data in metadata$r; allow users to edit it manually? - [x] Don't keep attributes (e.g. factor levels) that we explicitly translate to Arrow already Top-level dataset metadata is deferred to https://issues.apache.org/jira/browse/ARROW-9271.
[GitHub] [arrow] wesm closed pull request #7585: ARROW-3520: [C++] WIP Add vector kernel wrapper for Flatten method of ListArray types
wesm closed pull request #7585: URL: https://github.com/apache/arrow/pull/7585
[GitHub] [arrow] wesm opened a new pull request #7585: ARROW-3520: [C++] WIP Add vector kernel wrapper for Flatten method of ListArray types
wesm opened a new pull request #7585: URL: https://github.com/apache/arrow/pull/7585 I'm testing a JIRA webhook; I'll close this PR and then reopen it when the patch is done
[GitHub] [arrow] wesm commented on pull request #7560: ARROW-9252: [Integration] Factor out IPC integration tests into script, add back 0.14.1 "gold" files
wesm commented on pull request #7560: URL: https://github.com/apache/arrow/pull/7560#issuecomment-651423838 Will merge this if the build passes with the arrow-testing changes
[GitHub] [arrow] zhztheplayer commented on pull request #7030: ARROW-7808: [Java][Dataset] Implement Datasets Java API by JNI to C++
zhztheplayer commented on pull request #7030: URL: https://github.com/apache/arrow/pull/7030#issuecomment-651422684 Thanks for the comments! I've got some stuff to deal with these days. Will address as soon as possible.
[GitHub] [arrow] github-actions[bot] commented on pull request #7584: ARROW-9272: [C++][Python] Reduce complexity in python to arrow conversion
github-actions[bot] commented on pull request #7584: URL: https://github.com/apache/arrow/pull/7584#issuecomment-651404016 https://issues.apache.org/jira/browse/ARROW-9272
[GitHub] [arrow] kszucs opened a new pull request #7584: ARROW-9272: [C++][Python] Reduce complexity in python to arrow conversion
kszucs opened a new pull request #7584: URL: https://github.com/apache/arrow/pull/7584

The original motivation for this patch was to reuse the same conversion path for both scalars and arrays. In my recent patch the scalars are converted from a single-element list to a single-element array and then copied out of it. In the long term we should convert them directly, perhaps with a more generic converter API; until then, this patch aims to reduce code complexity without introducing any regressions.

I checked the produced binary size:

```console
# BEFORE
-rwxr-xr-x 1 kszucs staff 2926832 Jun 29 23:07 libarrow_python.100.0.0.dylib
# AFTER
-rwxr-xr-x 1 kszucs staff 2869136 Jun 29 23:06 libarrow_python.100.0.0.dylib
```

```
convert_builtins.ConvertArrayToPyList.time_convert
==========  ========  ========
type        before    after
----------  --------  --------
int32       13.2±0ms  13.2±0ms
uint32      13.4±0ms  14.8±0ms
int64       16.4±0ms  13.7±0ms
uint64      13.0±0ms  17.4±0ms
float32     13.5±0ms  16.0±0ms
float64     16.3±0ms  14.7±0ms
bool        12.0±0ms  12.0±0ms
decimal     61.1±0ms  65.0±0ms
binary      19.1±0ms  18.2±0ms
binary10    17.9±0ms  17.4±0ms
ascii       26.0±0ms  25.9±0ms
unicode     81.4±0ms  111±0ms
int64 list  145±0ms   152±0ms
struct      229±0ms   212±0ms
==========  ========  ========

convert_builtins.ConvertPyListToArray.time_convert
==================  ========  ========
type                before    after
------------------  --------  --------
int32               129±0ms   135±0ms
uint32              139±0ms   134±0ms
int64               132±0ms   139±0ms
uint64              135±0ms   132±0ms
float32             149±0ms   131±0ms
float64             143±0ms   141±0ms
bool                126±0ms   130±0ms
decimal             139±0ms   135±0ms
binary              145±0ms   130±0ms
binary10            136±0ms   130±0ms
ascii               145±0ms   136±0ms
unicode             148±0ms   149±0ms
int64 list          174±0ms   176±0ms
struct              164±0ms   177±0ms
struct from tuples  165±0ms   169±0ms
==================  ========  ========

convert_builtins.InferPyListToArray.time_infer
==========  ========  ========
type        before    after
----------  --------  --------
int64       157±0ms   158±0ms
float64     155±0ms   153±0ms
bool        155±0ms   154±0ms
decimal     249±0ms   233±0ms
binary      167±0ms   162±0ms
ascii       181±0ms   154±0ms
unicode     179±0ms   171±0ms
int64 list  188±0ms   192±0ms
struct      177±0ms   181±0ms
==========  ========  ========
```
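The `InferPyListToArray` benchmark above exercises scanning a Python list to deduce an element type before any array is built. The idea can be sketched in pure Python — this is an illustrative sketch, not Arrow's implementation, and the function name `infer_type` and the returned type-name strings are mine:

```python
def infer_type(values):
    """Crude sketch: map a homogeneous Python list to an Arrow-like type name."""
    type_names = {bool: "bool", int: "int64", float: "double",
                  str: "utf8", bytes: "binary"}
    inferred = None
    for v in values:
        if v is None:
            continue  # nulls don't constrain the inferred type
        if isinstance(v, list):
            name = "list<" + infer_type(v) + ">"  # recurse into nested lists
        else:
            name = type_names[type(v)]
        if inferred is None:
            inferred = name
        elif inferred != name:
            raise TypeError(f"mixed types: {inferred} vs {name}")
    return inferred or "null"

print(infer_type([1, 2, None]))      # scalars with nulls
print(infer_type([[1], [2, 3]]))     # nested lists
```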
[GitHub] [arrow] mrkn commented on a change in pull request #7539: ARROW-9156: [C++] Reducing the code size of the tensor module
mrkn commented on a change in pull request #7539: URL: https://github.com/apache/arrow/pull/7539#discussion_r447268862

## File path: cpp/src/arrow/tensor/csf_converter.cc

```diff
@@ -57,73 +57,86 @@ inline void IncrementIndex(std::vector<int64_t>& coord,
                            const std::vector<int64_t>& shape) {

-template <typename TYPE, typename SparseIndexType>
-class SparseCSFTensorConverter {
- public:
-  using NumericTensorType = NumericTensor<TYPE>;
-  using value_type = typename NumericTensorType::value_type;
+class SparseCSFTensorConverter : private SparseTensorConverterMixin {
+  using SparseTensorConverterMixin::AssignIndex;
+  using SparseTensorConverterMixin::IsNonZero;

-  SparseCSFTensorConverter(const NumericTensorType& tensor,
+ public:
+  SparseCSFTensorConverter(const Tensor& tensor,
                            const std::shared_ptr<DataType>& index_value_type,
                            MemoryPool* pool)
       : tensor_(tensor), index_value_type_(index_value_type), pool_(pool) {}

-  template <typename IndexValueType>
   Status Convert() {
-    using c_index_value_type = typename IndexValueType::c_type;
-    RETURN_NOT_OK(CheckMaximumValue(std::numeric_limits<c_index_value_type>::max()));
+    RETURN_NOT_OK(CheckSparseIndexMaximumValue(index_value_type_, tensor_.shape()));
+
+    const int index_elsize =
+        checked_cast<const FixedWidthType&>(*index_value_type_).bit_width() / CHAR_BIT;
+    const int value_elsize =
+        checked_cast<const FixedWidthType&>(*tensor_.type()).bit_width() / CHAR_BIT;
```

Review comment: Oops, I don't need to return `-1` because `checked_cast` is used in this function. Sorry.
[GitHub] [arrow] mrkn commented on a change in pull request #7539: ARROW-9156: [C++] Reducing the code size of the tensor module
mrkn commented on a change in pull request #7539: URL: https://github.com/apache/arrow/pull/7539#discussion_r447267538

## File path: cpp/src/arrow/tensor/csf_converter.cc

```diff
@@ -57,73 +57,86 @@ inline void IncrementIndex(std::vector<int64_t>& coord,
                            const std::vector<int64_t>& shape) {

-template <typename TYPE, typename SparseIndexType>
-class SparseCSFTensorConverter {
- public:
-  using NumericTensorType = NumericTensor<TYPE>;
-  using value_type = typename NumericTensorType::value_type;
+class SparseCSFTensorConverter : private SparseTensorConverterMixin {
+  using SparseTensorConverterMixin::AssignIndex;
+  using SparseTensorConverterMixin::IsNonZero;

-  SparseCSFTensorConverter(const NumericTensorType& tensor,
+ public:
+  SparseCSFTensorConverter(const Tensor& tensor,
                            const std::shared_ptr<DataType>& index_value_type,
                            MemoryPool* pool)
       : tensor_(tensor), index_value_type_(index_value_type), pool_(pool) {}

-  template <typename IndexValueType>
   Status Convert() {
-    using c_index_value_type = typename IndexValueType::c_type;
-    RETURN_NOT_OK(CheckMaximumValue(std::numeric_limits<c_index_value_type>::max()));
+    RETURN_NOT_OK(CheckSparseIndexMaximumValue(index_value_type_, tensor_.shape()));
+
+    const int index_elsize =
+        checked_cast<const FixedWidthType&>(*index_value_type_).bit_width() / CHAR_BIT;
+    const int value_elsize =
+        checked_cast<const FixedWidthType&>(*tensor_.type()).bit_width() / CHAR_BIT;
```

Review comment: OK. I'll make `internal::GetByteWidth`, and let it return `-1` for non-`FixedWidthType` values.
[GitHub] [arrow] wesm commented on a change in pull request #7539: ARROW-9156: [C++] Reducing the code size of the tensor module
wesm commented on a change in pull request #7539: URL: https://github.com/apache/arrow/pull/7539#discussion_r447264419

## File path: cpp/src/arrow/tensor/csf_converter.cc

```diff
@@ -57,73 +57,86 @@ inline void IncrementIndex(std::vector<int64_t>& coord,
                            const std::vector<int64_t>& shape) {

-template <typename TYPE, typename SparseIndexType>
-class SparseCSFTensorConverter {
- public:
-  using NumericTensorType = NumericTensor<TYPE>;
-  using value_type = typename NumericTensorType::value_type;
+class SparseCSFTensorConverter : private SparseTensorConverterMixin {
+  using SparseTensorConverterMixin::AssignIndex;
+  using SparseTensorConverterMixin::IsNonZero;

-  SparseCSFTensorConverter(const NumericTensorType& tensor,
+ public:
+  SparseCSFTensorConverter(const Tensor& tensor,
                            const std::shared_ptr<DataType>& index_value_type,
                            MemoryPool* pool)
       : tensor_(tensor), index_value_type_(index_value_type), pool_(pool) {}

-  template <typename IndexValueType>
   Status Convert() {
-    using c_index_value_type = typename IndexValueType::c_type;
-    RETURN_NOT_OK(CheckMaximumValue(std::numeric_limits<c_index_value_type>::max()));
+    RETURN_NOT_OK(CheckSparseIndexMaximumValue(index_value_type_, tensor_.shape()));
+
+    const int index_elsize =
+        checked_cast<const FixedWidthType&>(*index_value_type_).bit_width() / CHAR_BIT;
+    const int value_elsize =
+        checked_cast<const FixedWidthType&>(*tensor_.type()).bit_width() / CHAR_BIT;
```

Review comment: As long as the signature of the function is `int(const DataType&)` it sounds good to me. I don't think it should be a non-internal API because it does not need to do error-handling. If it's a non-static member of FixedWidthType, then a `checked_cast` is still necessary, which seems against the spirit of making the code simpler.
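The helper being discussed boils down to dividing a fixed-width type's bit width by CHAR_BIT. A tiny Python sketch of that arithmetic (the name `byte_width` is illustrative, not Arrow's API):

```python
CHAR_BIT = 8  # bits per byte on every platform Arrow supports

def byte_width(bit_width: int) -> int:
    """Byte width of a fixed-width type, given its bit width in bits."""
    return bit_width // CHAR_BIT

# e.g. int32 occupies 4 bytes, float64 occupies 8
print(byte_width(32), byte_width(64))
```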
[GitHub] [arrow] nealrichardson commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower
nealrichardson commented on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-651369436

> > The version installed is compiled with gcc 8. RTools 35 uses gcc 4.9
>
> What difference does it make? This is plain C.

:shrug: then I'll leave it to you to sort out as this is beyond my knowledge. In the past, undefined symbols error + only compiled for rtools-packages (gcc8) = you need to get it built with rtools-backports too. Maybe something's off with the lib that was built, IDK if anyone has verified that it works.
[GitHub] [arrow] wesm commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower
wesm commented on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-651368104 Indeed, toolchain incompatibilities only affect C++ code
[GitHub] [arrow] pitrou commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower
pitrou commented on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-651366993

> The version installed is compiled with gcc 8. RTools 35 uses gcc 4.9

What difference does it make? This is plain C.
[GitHub] [arrow] kou commented on a change in pull request #7581: ARROW-6521: [C++] Add an API to query runtime build info
kou commented on a change in pull request #7581: URL: https://github.com/apache/arrow/pull/7581#discussion_r447247927

## File path: cpp/src/arrow/config.h

```cpp
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements.  See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership.  The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License.  You may obtain a copy of the License at
//
//   http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied.  See the License for the
// specific language governing permissions and limitations
// under the License.

#pragma once

#include <string>

#include "arrow/util/config.h"  // IWYU pragma: export
#include "arrow/util/visibility.h"

namespace arrow {

struct BuildInfo {
  int major;
  int minor;
  int patch;
```

Review comment: Oh, sorry. One comment wasn't submitted... How about adding a `version_` prefix, because our `ARROW_VERSION_*` macros have a `VERSION_` prefix?

```suggestion
  int version_major;
  int version_minor;
  int version_patch;
```
[GitHub] [arrow] nealrichardson commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower
nealrichardson commented on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-651355763

> > This means there also needs to be a PKGBUILD
>
> Why? `libutf8proc` is installed.

The version installed is compiled with gcc 8. RTools 35 uses gcc 4.9. Most of our deps have to be compiled for both, and this is apparently one of those. That's what https://github.com/r-windows/rtools-backports is for.
[GitHub] [arrow] pitrou commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower
pitrou commented on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-651353338

> This means there also needs to be a PKGBUILD

Why? `libutf8proc` is installed.
[GitHub] [arrow] pitrou commented on pull request #7581: ARROW-6521: [C++] Add an API to query runtime build info
pitrou commented on pull request #7581: URL: https://github.com/apache/arrow/pull/7581#issuecomment-651352568

> It would be also nice to store the enabled features.

Agreed, but that can be done in a separate PR.

> How about adding int BuildInfo::version for ARROW_VERSION too?

Good idea.
[GitHub] [arrow] pitrou commented on pull request #7581: ARROW-6521: [C++] Add an API to query runtime build info
pitrou commented on pull request #7581: URL: https://github.com/apache/arrow/pull/7581#issuecomment-651352858 Also, I'll let others add `-DARROW_PACKAGE_KIND=...` in other places.
[GitHub] [arrow] wesm commented on pull request #7571: ARROW-8671: [C++] Use new BodyCompression Flatbuffers member for IPC compression metadata
wesm commented on pull request #7571: URL: https://github.com/apache/arrow/pull/7571#issuecomment-651351599 I'll close this for now. Please leave any review comments and I can address them later
[GitHub] [arrow] kou merged pull request #7583: [Doc][C++] Follow docker-compose service name change for lint
kou merged pull request #7583: URL: https://github.com/apache/arrow/pull/7583
[GitHub] [arrow] wesm closed pull request #7571: ARROW-8671: [C++] Use new BodyCompression Flatbuffers member for IPC compression metadata
wesm closed pull request #7571: URL: https://github.com/apache/arrow/pull/7571
[GitHub] [arrow] wesm closed pull request #7576: ARROW-9263: [C++] Promote compute aggregate benchmark size to 1M.
wesm closed pull request #7576: URL: https://github.com/apache/arrow/pull/7576
[GitHub] [arrow] wesm commented on pull request #7576: ARROW-9263: [C++] Promote compute aggregate benchmark size to 1M.
wesm commented on pull request #7576: URL: https://github.com/apache/arrow/pull/7576#issuecomment-651350872 +1
[GitHub] [arrow] wesm commented on pull request #7576: ARROW-9263: [C++] Promote compute aggregate benchmark size to 1M.
wesm commented on pull request #7576: URL: https://github.com/apache/arrow/pull/7576#issuecomment-651350708 I think both the 1/1000 and 1/1 cases have something interesting to show perf wise, but in any case using 1M as the length in this benchmark seems OK.
[GitHub] [arrow] nealrichardson edited a comment on pull request #7524: ARROW-8899 [R] Add R metadata like pandas metadata for round-trip fidelity
nealrichardson edited a comment on pull request #7524: URL: https://github.com/apache/arrow/pull/7524#issuecomment-651263042

I'm taking this over. Outstanding TODOs:

- [x] Add tests
- [x] Support record batches
- [x] Support nested types (requires adapting the data structure and adding recursion)
- [ ] Test the print method
- [ ] Test/handle bad data in metadata$r; allow users to edit it manually?
[GitHub] [arrow] nealrichardson commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower
nealrichardson commented on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-651342350

> @xhochy Could you help on the utf8proc issue on RTools 3.5?
> See here: https://github.com/apache/arrow/pull/7449/checks?check_run_id=819772618#step:10:169

This means there also needs to be a PKGBUILD submitted to `r-windows/rtools-backports` for the old toolchain.

> It seems that `UTF8PROC_STATIC` would need to be defined when building Arrow. But it's not set by `Findutf8proc.cmake`.
> Also, `libutf8proc.pc.in` added in [r-windows/rtools-packages#124](https://github.com/r-windows/rtools-packages/pull/124) doesn't set it either.

Just a reminder that nothing in the R bindings touches these new functions, so turning off utf8proc in the C++ build is also an option for now.
[GitHub] [arrow] kszucs commented on pull request #7581: ARROW-6521: [C++] Add an API to query runtime build info
kszucs commented on pull request #7581: URL: https://github.com/apache/arrow/pull/7581#issuecomment-651339349 It would be also nice to store the enabled features.
[GitHub] [arrow] pitrou commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower
pitrou commented on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-651338264

@xhochy Could you help on the utf8proc issue on RTools 3.5? See here: https://github.com/apache/arrow/pull/7449/checks?check_run_id=819772618#step:10:169

It seems that `UTF8PROC_STATIC` would need to be defined when building Arrow. But it's not set by `Findutf8proc.cmake`. Also, `libutf8proc.pc.in` added in https://github.com/r-windows/rtools-packages/pull/124 doesn't set it either.
[GitHub] [arrow] maartenbreddels commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower
maartenbreddels commented on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-651322087 I just concluded the same :)
[GitHub] [arrow] pitrou commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower
pitrou commented on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-651316656 I pushed a commit that raises an error on invalid UTF8. It does not seem to make the benchmarks slower.
[GitHub] [arrow] maartenbreddels commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower
maartenbreddels commented on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-651289874

@pitrou your size commit made the benchmark go from `52->60 M/s`

> Yes, too. The main point of this state-machine-based decoder is that it's branchless, and so it will perform roughly as well on non-Ascii data with unpredictable branching. On pure Ascii data, a branch-based decoder may be faster since the branches will always be predicted right.

Yes, it would be interesting to see how the two methods deal with a 25/25/25/25% mix of 1-, 2-, 3-, or 4-byte encoded codepoints, vs. say a few % non-ascii.
[GitHub] [arrow] pitrou commented on a change in pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower
pitrou commented on a change in pull request #7449: URL: https://github.com/apache/arrow/pull/7449#discussion_r447171303

## File path: cpp/src/arrow/compute/kernels/scalar_string.cc

```diff
@@ -133,23 +134,23 @@
-      // trim the codepoint buffer, since we allocated too much
+      // Trim the codepoint buffer, since we allocated too much
       KERNEL_RETURN_IF_ERROR(
-          ctx,
-          output->buffers[2]->CopySlice(0, output_ncodeunits).Value(&output->buffers[2]));
+          ctx, values_buffer->Resize(output_ncodeunits, /*shrink_to_fit=*/true));
```

Review comment: :-)
[GitHub] [arrow] maartenbreddels commented on a change in pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower
maartenbreddels commented on a change in pull request #7449: URL: https://github.com/apache/arrow/pull/7449#discussion_r447170380

## File path: cpp/src/arrow/compute/kernels/scalar_string.cc

```diff
@@ -133,23 +134,23 @@
-      // trim the codepoint buffer, since we allocated too much
+      // Trim the codepoint buffer, since we allocated too much
       KERNEL_RETURN_IF_ERROR(
-          ctx,
-          output->buffers[2]->CopySlice(0, output_ncodeunits).Value(&output->buffers[2]));
+          ctx, values_buffer->Resize(output_ncodeunits, /*shrink_to_fit=*/true));
```

Review comment: Nice way to make code more readable.
[GitHub] [arrow] pitrou edited a comment on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower
pitrou edited a comment on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-651282959 Main point remaining is whether we raise an error on invalid UTF8 input. I see no reason not to (an Arrow string array has to be valid UTF8 as per the spec, just like a Python unicode string cannot contain characters outside of the unicode code range).
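The strict-versus-tolerant distinction being debated mirrors Python's own codec error handlers, which is an easy way to see the two behaviors side by side (pure-stdlib sketch; the Arrow kernel itself is not involved):

```python
data = b"abc\x80def"  # 0x80 is a bare continuation byte: invalid UTF-8

# Strict decoding errors out, like the proposed default for the kernel.
try:
    data.decode("utf-8")
    raised = False
except UnicodeDecodeError:
    raised = True
print(raised)  # -> True

# A tolerant mode substitutes U+FFFD instead of failing.
print(data.decode("utf-8", errors="replace"))  # -> 'abc\ufffddef'
```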
[GitHub] [arrow] pitrou commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower
pitrou commented on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-651282415

> Having a benchmark run on non-ascii codepoints (I think we want to do this separate from this PR, but important point).

Yes, I think we can defer that to a separate PR.

> The existing decoder based on http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ was new to me. Very interesting work, but unfortunately led to a performance regression (~50->30 M/s), which I'm surprised about actually. Maybe worth comparing again when we have a benchmark with non-ascii codepoints.

Yes, too. The main point of this state-machine-based decoder is that it's branchless, and so it will perform roughly as well on non-Ascii data with unpredictable branching. On pure Ascii data, a branch-based decoder may be faster since the branches will always be predicted right.
[GitHub] [arrow] pitrou commented on a change in pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower
pitrou commented on a change in pull request #7449: URL: https://github.com/apache/arrow/pull/7449#discussion_r447161925

## File path: cpp/src/arrow/compute/kernels/scalar_string.cc

```cpp
template <typename Type, template <typename> class Derived>
struct Utf8Transform {
  using offset_type = typename Type::offset_type;
  using DerivedClass = Derived<Type>;
  using ArrayType = typename TypeTraits<Type>::ArrayType;

  static offset_type Transform(const uint8_t* input, offset_type input_string_ncodeunits,
                               uint8_t* output) {
    uint8_t* dest = output;
    utf8_transform(input, input + input_string_ncodeunits, dest,
                   DerivedClass::TransformCodepoint);
    return (offset_type)(dest - output);
  }

  static void Exec(KernelContext* ctx, const ExecBatch& batch, Datum* out) {
    if (batch[0].kind() == Datum::ARRAY) {
      std::call_once(flag_case_luts, []() {
        lut_upper_codepoint.reserve(MAX_CODEPOINT_LUT + 1);
        lut_lower_codepoint.reserve(MAX_CODEPOINT_LUT + 1);
        for (int i = 0; i <= MAX_CODEPOINT_LUT; i++) {
          lut_upper_codepoint.push_back(utf8proc_toupper(i));
          lut_lower_codepoint.push_back(utf8proc_tolower(i));
        }
      });
      const ArrayData& input = *batch[0].array();
      ArrayType input_boxed(batch[0].array());
      ArrayData* output = out->mutable_array();

      offset_type const* input_string_offsets = input.GetValues<offset_type>(1);
      utf8proc_uint8_t const* input_str =
          input.buffers[2]->data() + input_boxed.value_offset(0);
      offset_type input_ncodeunits = input_boxed.total_values_length();
      offset_type input_nstrings = (offset_type)input.length;

      // Section 5.18 of the Unicode spec claim that the number of codepoints for case
      // mapping can grow by a factor of 3. This means grow by a factor of 3 in bytes
      // However, since we don't support all casings (SpecialCasing.txt) the growth
      // is actually only at max 3/2 (as covered by the unittest).
      // Note that rounding down the 3/2 is ok, since only codepoints encoded by
      // two code units (even) can grow to 3 code units.

      int64_t output_ncodeunits_max = ((int64_t)input_ncodeunits) * 3 / 2;
      if (output_ncodeunits_max > std::numeric_limits<offset_type>::max()) {
        ctx->SetStatus(Status::CapacityError(
            "Result might not fit in a 32bit utf8 array, convert to large_utf8"));
        return;
      }

      KERNEL_RETURN_IF_ERROR(
          ctx, ctx->Allocate(output_ncodeunits_max).Value(&output->buffers[2]));
      // We could reuse the buffer if it is all ascii, benchmarking showed this not to
      // matter
      // output->buffers[1] = input.buffers[1];
      KERNEL_RETURN_IF_ERROR(ctx,
                             ctx->Allocate((input_nstrings + 1) * sizeof(offset_type))
                                 .Value(&output->buffers[1]));
      utf8proc_uint8_t* output_str = output->buffers[2]->mutable_data();
      offset_type* output_string_offsets = output->GetMutableValues<offset_type>(1);
      offset_type output_ncodeunits = 0;

      offset_type output_string_offset = 0;
      *output_string_offsets = output_string_offset;
      offset_type input_string_first_offset = input_string_offsets[0];
      for (int64_t i = 0; i < input_nstrings; i++) {
        offset_type input_string_offset =
            input_string_offsets[i] - input_string_first_offset;
        offset_type input_string_end =
            input_string_offsets[i + 1] - input_string_first_offset;
        offset_type input_string_ncodeunits = input_string_end - input_string_offset;
        offset_type encoded_nbytes = DerivedClass::Transform(
            input_str + input_string_offset, input_string_ncodeunits,
            output_str + output_ncodeunits);
        output_ncodeunits += encoded_nbytes;
        output_string_offsets[i + 1] = output_ncodeunits;
      }

      // trim the codepoint buffer, since we allocated too much
      KERNEL_RETURN_IF_ERROR(
          ctx,
          output->buffers[2]->CopySlice(0, output_ncodeunits).Value(&output->buffers[2]));
```

Review comment: Ok, I just pushed a change which resizes the buffer (hopefully in-place). This makes the benchmarks a bit faster.
[GitHub] [arrow] pitrou commented on a change in pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower
pitrou commented on a change in pull request #7449: URL: https://github.com/apache/arrow/pull/7449#discussion_r447161391 ## File path: cpp/src/arrow/compute/kernels/scalar_string.cc ## @@ -30,6 +31,124 @@ namespace internal { namespace { +// lookup tables +std::vector<uint32_t> lut_upper_codepoint; +std::vector<uint32_t> lut_lower_codepoint; +std::once_flag flag_case_luts; + +constexpr uint32_t REPLACEMENT_CHAR = +'?'; // the proper replacement char would be the 0xFFFD codepoint, but that can + // increase string length by a factor of 3 +constexpr int MAX_CODEPOINT_LUT = 0xffff; // up to this codepoint is in a lookup table + +static inline void utf8_encode(uint8_t*& str, uint32_t codepoint) { + if (codepoint < 0x80) { +*str++ = codepoint; + } else if (codepoint < 0x800) { +*str++ = 0xC0 + (codepoint >> 6); +*str++ = 0x80 + (codepoint & 0x3F); + } else if (codepoint < 0x10000) { +*str++ = 0xE0 + (codepoint >> 12); +*str++ = 0x80 + ((codepoint >> 6) & 0x3F); +*str++ = 0x80 + (codepoint & 0x3F); + } else if (codepoint < 0x200000) { +*str++ = 0xF0 + (codepoint >> 18); +*str++ = 0x80 + ((codepoint >> 12) & 0x3F); +*str++ = 0x80 + ((codepoint >> 6) & 0x3F); +*str++ = 0x80 + (codepoint & 0x3F); + } else { +*str++ = codepoint; + } +} + +static inline bool utf8_is_continuation(const uint8_t codeunit) { + return (codeunit & 0xC0) == 0x80; // upper two bits should be 10 +} + +static inline uint32_t utf8_decode(const uint8_t*& str, int64_t& length) { + if (*str < 0x80) { // ascii +length -= 1; +return *str++; + } else if (*str < 0xC0) { // invalid non-ascii char +length -= 1; +str++; +return REPLACEMENT_CHAR; Review comment: There's no reason a priori to be forgiving of invalid data. If there's a use case to be tolerant, we may add an option. But by default we should error out on invalid input, IMHO. cc @wesm for opinions This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
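The encoder in the hunk above follows the standard UTF-8 range boundaries (1 byte below 0x80, 2 below 0x800, 3 below 0x10000, 4 bytes up to 0x10FFFF). A minimal standalone version along the same lines, written from the UTF-8 spec rather than copied from the PR:

```cpp
#include <cstdint>
#include <vector>

// Minimal UTF-8 encoder illustrating the range boundaries used in the
// quoted utf8_encode. Returns the encoded byte sequence for a codepoint.
std::vector<uint8_t> Utf8Encode(uint32_t cp) {
  std::vector<uint8_t> out;
  if (cp < 0x80) {  // ascii: 1 byte
    out.push_back(static_cast<uint8_t>(cp));
  } else if (cp < 0x800) {  // 2 bytes: 110xxxxx 10xxxxxx
    out.push_back(static_cast<uint8_t>(0xC0 | (cp >> 6)));
    out.push_back(static_cast<uint8_t>(0x80 | (cp & 0x3F)));
  } else if (cp < 0x10000) {  // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    out.push_back(static_cast<uint8_t>(0xE0 | (cp >> 12)));
    out.push_back(static_cast<uint8_t>(0x80 | ((cp >> 6) & 0x3F)));
    out.push_back(static_cast<uint8_t>(0x80 | (cp & 0x3F)));
  } else {  // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    out.push_back(static_cast<uint8_t>(0xF0 | (cp >> 18)));
    out.push_back(static_cast<uint8_t>(0x80 | ((cp >> 12) & 0x3F)));
    out.push_back(static_cast<uint8_t>(0x80 | ((cp >> 6) & 0x3F)));
    out.push_back(static_cast<uint8_t>(0x80 | (cp & 0x3F)));
  }
  return out;
}

// A continuation byte has bit pattern 10xxxxxx, as the diff checks.
bool Utf8IsContinuation(uint8_t byte) { return (byte & 0xC0) == 0x80; }
```

For example, U+20AC (the euro sign) falls in the 3-byte range and encodes as `E2 82 AC`.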
[GitHub] [arrow] sbinet closed pull request #7483: ARROW-9174: [Go] Fix table panic on 386
sbinet closed pull request #7483: URL: https://github.com/apache/arrow/pull/7483 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] maartenbreddels commented on a change in pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower
maartenbreddels commented on a change in pull request #7449: URL: https://github.com/apache/arrow/pull/7449#discussion_r447154836 ## File path: cpp/src/arrow/compute/kernels/scalar_string.cc ## @@ -15,13 +15,15 @@ // specific language governing permissions and limitations // under the License. +#include Review comment: Np, I'd rather do it correct. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] maartenbreddels commented on a change in pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower
maartenbreddels commented on a change in pull request #7449: URL: https://github.com/apache/arrow/pull/7449#discussion_r447155149 ## File path: cpp/src/arrow/compute/kernels/scalar_string.cc ## @@ -39,6 +73,103 @@ struct AsciiLength { } }; +template <typename Type, typename Derived> +struct Utf8Transform { + using offset_type = typename Type::offset_type; + using ArrayType = typename TypeTraits<Type>::ArrayType; + + static offset_type Transform(const uint8_t* input, offset_type input_string_ncodeunits, + uint8_t* output) { +uint8_t* output_end = arrow::util::Utf8Transform( +input, input + input_string_ncodeunits, output, Derived::TransformCodepoint); +return static_cast<offset_type>(output_end - output); + } + + static void Exec(KernelContext* ctx, const ExecBatch& batch, Datum* out) { +if (batch[0].kind() == Datum::ARRAY) { + EnsureLookupTablesFilled(); + const ArrayData& input = *batch[0].array(); + ArrayType input_boxed(batch[0].array()); + ArrayData* output = out->mutable_array(); + + offset_type input_ncodeunits = input_boxed.total_values_length(); + offset_type input_nstrings = static_cast<offset_type>(input.length); + + // Section 5.18 of the Unicode spec claims that the number of codepoints for case + // mapping can grow by a factor of 3. This means growth by a factor of 3 in bytes. + // However, since we don't support all casings (SpecialCasing.txt) the growth + // is actually only at max 3/2 (as covered by the unittest). + // Note that rounding down the 3/2 is ok, since only codepoints encoded by + // two code units (even) can grow to 3 code units. + + int64_t output_ncodeunits_max = ((int64_t)input_ncodeunits) * 3 / 2; Review comment: Old habit, doing my best :) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] pitrou commented on a change in pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower
pitrou commented on a change in pull request #7449: URL: https://github.com/apache/arrow/pull/7449#discussion_r447143530 ## File path: cpp/src/arrow/compute/kernels/scalar_string_test.cc ## @@ -81,5 +147,40 @@ TYPED_TEST(TestStringKernels, StrptimeDoesNotProvideDefaultOptions) { ASSERT_RAISES(Invalid, CallFunction("strptime", {input})); } +TEST(TestStringKernels, UnicodeLibraryAssumptions) { + uint8_t output[4]; + for (utf8proc_int32_t codepoint = 0x100; codepoint < 0x110000; codepoint++) { +utf8proc_ssize_t encoded_nbytes = utf8proc_encode_char(codepoint, output); +utf8proc_int32_t codepoint_upper = utf8proc_toupper(codepoint); +utf8proc_ssize_t encoded_nbytes_upper = utf8proc_encode_char(codepoint_upper, output); +if (encoded_nbytes == 2) { + EXPECT_LE(encoded_nbytes_upper, 3) + << "Expected the upper case codepoint for a 2 byte encoded codepoint to be " + "encoded in maximum 3 bytes, not " + << encoded_nbytes_upper; +} +if (encoded_nbytes == 3) { Review comment: Ok, thank you. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
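The assumption this test verifies — that uppercasing a 2-byte codepoint produces at most 3 encoded bytes — rests on the fixed relation between codepoint ranges and UTF-8 byte lengths. That relation can be stated as a standalone helper (this is the UTF-8 spec's ranges, not utf8proc's API):

```cpp
#include <cstdint>

// Byte length of a codepoint's UTF-8 encoding, by range. The test above
// obtains this via utf8proc_encode_char; the ranges themselves are fixed
// by the UTF-8 encoding, so a standalone version can be written directly.
int Utf8EncodedLength(uint32_t codepoint) {
  if (codepoint < 0x80) return 1;
  if (codepoint < 0x800) return 2;
  if (codepoint < 0x10000) return 3;
  return 4;  // valid codepoints end at 0x10FFFF
}
```

So the 3/2 growth bound amounts to the empirical claim that no case mapping used by the kernel crosses more than one of these range boundaries upward.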
[GitHub] [arrow] nealrichardson commented on pull request #7524: ARROW-8899 [R] Add R metadata like pandas metadata for round-trip fidelity
nealrichardson commented on pull request #7524: URL: https://github.com/apache/arrow/pull/7524#issuecomment-651263042 I'm taking this over. Outstanding TODOs: - [x] Add tests - [x] Support record batches - [ ] Support nested types (requires adapting the data structure and adding recursion) - [ ] Test the print method - [ ] Test/handle bad data in metadata$r; allow users to edit it manually? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] pitrou closed pull request #7559: ARROW-9247: [Python] Expose total_values_length functions on BinaryArray, LargeBinaryArray
pitrou closed pull request #7559: URL: https://github.com/apache/arrow/pull/7559 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #7580: ARROW-9261: [Python] Fix CA certificate lookup with S3 filesystem on manylinux
github-actions[bot] commented on pull request #7580: URL: https://github.com/apache/arrow/pull/7580#issuecomment-651261802 Revision: 989cd4023a59159b44f69a6d5f530acc815a2407 Submitted crossbow builds: [ursa-labs/crossbow @ actions-366](https://github.com/ursa-labs/crossbow/branches/all?query=actions-366) |Task|Status| ||--| |wheel-manylinux1-cp35m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-366-azure-wheel-manylinux1-cp35m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-366-azure-wheel-manylinux1-cp35m)| |wheel-manylinux1-cp36m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-366-azure-wheel-manylinux1-cp36m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-366-azure-wheel-manylinux1-cp36m)| |wheel-manylinux1-cp37m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-366-azure-wheel-manylinux1-cp37m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-366-azure-wheel-manylinux1-cp37m)| |wheel-manylinux1-cp38|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-366-azure-wheel-manylinux1-cp38)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-366-azure-wheel-manylinux1-cp38)| |wheel-manylinux2010-cp35m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-366-azure-wheel-manylinux2010-cp35m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-366-azure-wheel-manylinux2010-cp35m)| 
|wheel-manylinux2010-cp36m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-366-azure-wheel-manylinux2010-cp36m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-366-azure-wheel-manylinux2010-cp36m)| |wheel-manylinux2010-cp37m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-366-azure-wheel-manylinux2010-cp37m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-366-azure-wheel-manylinux2010-cp37m)| |wheel-manylinux2010-cp38|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-366-azure-wheel-manylinux2010-cp38)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-366-azure-wheel-manylinux2010-cp38)| |wheel-manylinux2014-cp35m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-366-azure-wheel-manylinux2014-cp35m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-366-azure-wheel-manylinux2014-cp35m)| |wheel-manylinux2014-cp36m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-366-azure-wheel-manylinux2014-cp36m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-366-azure-wheel-manylinux2014-cp36m)| |wheel-manylinux2014-cp37m|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-366-azure-wheel-manylinux2014-cp37m)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-366-azure-wheel-manylinux2014-cp37m)| 
|wheel-manylinux2014-cp38|[![Azure](https://dev.azure.com/ursa-labs/crossbow/_apis/build/status/ursa-labs.crossbow?branchName=actions-366-azure-wheel-manylinux2014-cp38)](https://dev.azure.com/ursa-labs/crossbow/_build/latest?definitionId=1=actions-366-azure-wheel-manylinux2014-cp38)| |wheel-osx-cp35m|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-366-travis-wheel-osx-cp35m.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)| |wheel-osx-cp36m|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-366-travis-wheel-osx-cp36m.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)| |wheel-osx-cp37m|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-366-travis-wheel-osx-cp37m.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)| |wheel-osx-cp38|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-366-travis-wheel-osx-cp38.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)| |wheel-win-cp35m|[![Appveyor](https://img.shields.io/appveyor/ci/ursa-labs/crossbow/actions-366-appveyor-wheel-win-cp35m.svg)](https://ci.appveyor.com/project/ursa-labs/crossbow/history)| |wheel-win-cp36m|[![Appveyor](https://img.shields.io/appveyor/ci/ursa-labs/crossbow/actions-366-appveyor-wheel-win-cp36m.svg)](https://ci.appveyor.com/project/ursa-labs/crossbow/history)|
[GitHub] [arrow] github-actions[bot] commented on pull request #7583: [Doc][C++] docker compose lint -> ubuntu-link
github-actions[bot] commented on pull request #7583: URL: https://github.com/apache/arrow/pull/7583#issuecomment-651261350 Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Then could you also rename pull request title in the following format? ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY} See also: * [Other pull requests](https://github.com/apache/arrow/pulls/) * [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] nealrichardson commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower
nealrichardson commented on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-651261388 The R Windows builds will fail until either utf8proc is not required by default (https://issues.apache.org/jira/browse/ARROW-9220) or until libutf8proc is added as a dependency to https://github.com/apache/arrow/blob/master/ci/scripts/PKGBUILD (@xhochy has done the necessary to get it built for Rtools in https://github.com/r-windows/rtools-packages/pull/124). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] maartenbreddels opened a new pull request #7583: [Doc][C++] docker compose lint -> ubuntu-link
maartenbreddels opened a new pull request #7583: URL: https://github.com/apache/arrow/pull/7583 I guess the name changed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] maartenbreddels commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower
maartenbreddels commented on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-651257793 We still have 2 failures, one might need a restart (travis / no output), the other is still a linker error: ``` C:/rtools40/mingw32/bin/../lib/gcc/i686-w64-mingw32/8.3.0/../../../../i686-w64-mingw32/bin/ld.exe: ../windows/arrow-0.17.1.9000/lib/i386/libarrow.a(scalar_string.cc.obj):(.text+0x45df): undefined reference to `utf8proc_tolower' ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] pitrou commented on a change in pull request #7560: ARROW-9252: [Integration] Factor out IPC integration tests into script, add back 0.14.1 "gold" files
pitrou commented on a change in pull request #7560: URL: https://github.com/apache/arrow/pull/7560#discussion_r447134548 ## File path: ci/scripts/integration_arrow.sh ## @@ -0,0 +1,29 @@ +#!/usr/bin/env bash +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +set -ex + +arrow_dir=${1} +source_dir=${1}/cpp +build_dir=${2}/cpp +gold_dir_0_14_1=${1}/testing/data/arrow-ipc-stream/integration/0.14.1 + +pip install -e /arrow/dev/archery Review comment: Use `$source_dir` here? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] pitrou commented on pull request #7580: ARROW-9261: [Python] Fix CA certificate lookup with S3 filesystem on manylinux
pitrou commented on pull request #7580: URL: https://github.com/apache/arrow/pull/7580#issuecomment-651256468 @github-actions crossbow submit -g wheel This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] kylebrandt commented on pull request #7483: ARROW-9174: [Go] Fix table panic on 386
kylebrandt commented on pull request #7483: URL: https://github.com/apache/arrow/pull/7483#issuecomment-651255643 Hi @sbinet , new to contributing here (and see your name all over the Go code :-) ). Anything I need to do on my end for this to get merged? Thank you for all your work on this project. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] pitrou commented on pull request #7580: ARROW-9261: [Python] Fix CA certificate lookup with S3 filesystem on manylinux
pitrou commented on pull request #7580: URL: https://github.com/apache/arrow/pull/7580#issuecomment-651252764 Amusingly, even a minimal Debian or Ubuntu Docker image has `liblz4` and `liblzma`. I think we can simply change the script not to remove the zlib. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] pitrou removed a comment on pull request #7581: ARROW-6521: [C++] Add an API to query runtime build info
pitrou removed a comment on pull request #7581: URL: https://github.com/apache/arrow/pull/7581#issuecomment-651252311 Amusingly, even a minimal Debian or Ubuntu Docker image has `liblz4` and `liblzma`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] pitrou commented on pull request #7581: ARROW-6521: [C++] Add an API to query runtime build info
pitrou commented on pull request #7581: URL: https://github.com/apache/arrow/pull/7581#issuecomment-651252311 Amusingly, even a minimal Debian or Ubuntu Docker image has `liblz4` and `liblzma`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #7582: ARROW-8190: [FlightRPC][C++] Expose IPC options
github-actions[bot] commented on pull request #7582: URL: https://github.com/apache/arrow/pull/7582#issuecomment-651251968 https://issues.apache.org/jira/browse/ARROW-8190 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #7581: ARROW-6521: [C++] Add an API to query runtime build info
github-actions[bot] commented on pull request #7581: URL: https://github.com/apache/arrow/pull/7581#issuecomment-651248175 https://issues.apache.org/jira/browse/ARROW-6521 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] lidavidm opened a new pull request #7582: ARROW-8190: [FlightRPC][C++] Expose IPC options
lidavidm opened a new pull request #7582: URL: https://github.com/apache/arrow/pull/7582 - Python is not covered as I'm not sure how best to expose these structs to Python. - Java is not covered as it doesn't use IpcOption at all currently; I'd rather hold off and see how the metadata change is implemented before trying to bind it. (The legacy/modern metadata format doesn't come into play since Flight uses Protobuf and not an IPC message exactly per se.) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] pitrou commented on pull request #7576: ARROW-9263: [C++] Promote compute aggregate benchmark size to 1M.
pitrou commented on pull request #7576: URL: https://github.com/apache/arrow/pull/7576#issuecomment-651243430 Wouldn't it be more realistic to simply use 0.1% instead of 0.01%? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
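For concreteness, the two null proportions under discussion differ by a factor of ten at the proposed 1M benchmark size. A trivial illustration of the arithmetic (not benchmark code from the PR):

```cpp
#include <cstdint>

// Expected null count for a benchmark array of the given length.
// At length 1'000'000: 0.01% gives 100 nulls, 0.1% gives 1000 nulls.
int64_t ExpectedNulls(int64_t length, double null_proportion) {
  return static_cast<int64_t>(length * null_proportion);
}
```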
[GitHub] [arrow] pitrou opened a new pull request #7581: ARROW-6521: [C++] Add an API to query runtime build info
pitrou opened a new pull request #7581: URL: https://github.com/apache/arrow/pull/7581 Also add build options and preprocessor constants to represent git identification and package kind (e.g. "manylinux1"). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] pitrou commented on pull request #7581: ARROW-6521: [C++] Add an API to query runtime build info
pitrou commented on pull request #7581: URL: https://github.com/apache/arrow/pull/7581#issuecomment-651242409 @kou and @xhochy your advice would be welcome. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] nealrichardson commented on pull request #7556: ARROW-9188: [C++] Use Brotli shared libraries if they are available
nealrichardson commented on pull request #7556: URL: https://github.com/apache/arrow/pull/7556#issuecomment-651188772 IIUC we're ok: * Windows: no brotli: https://github.com/apache/arrow/blob/master/ci/scripts/PKGBUILD * macOS: no brotli: https://github.com/apache/arrow/blob/master/dev/tasks/homebrew-formulae/autobrew/apache-arrow.rb * Linux: maybe brotli, but it's all static build anyway: https://github.com/apache/arrow/blob/master/r/inst/build_arrow_static.sh This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] wesm commented on pull request #7559: ARROW-9247: [Python] Expose total_values_length functions on BinaryArray, LargeBinaryArray
wesm commented on pull request #7559: URL: https://github.com/apache/arrow/pull/7559#issuecomment-651187351 cc @pitrou or @jorisvandenbossche for review This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] wesm commented on pull request #7556: ARROW-9188: [C++] Use Brotli shared libraries if they are available
wesm commented on pull request #7556: URL: https://github.com/apache/arrow/pull/7556#issuecomment-651186700 If you're installing Brotli in any of the packaging setups, there may be a scenario where there is both the shared AND static library -- in that case there would be an issue. We need to see why the manylinux* builds have `libbrotli*.so` in them -- if for some reason Brotli's build is insubordinate and churlish we might have to add a flag to do a hard switch between shared/static like we do with ZSTD This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] nealrichardson commented on pull request #7556: ARROW-9188: [C++] Use Brotli shared libraries if they are available
nealrichardson commented on pull request #7556: URL: https://github.com/apache/arrow/pull/7556#issuecomment-651182567 @wesm how can I help/what should I look for? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] pitrou commented on pull request #7580: ARROW-9261: [Python] Fix CA certificate lookup with S3 filesystem on manylinux
pitrou commented on pull request #7580: URL: https://github.com/apache/arrow/pull/7580#issuecomment-651181336 @xhochy In https://github.com/apache/arrow/blob/master/dev/tasks/python-wheels/manylinux-test.sh#L33 . I'm sure there's a better way to do that (e.g. spawn a minimal Python docker image). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] wesm commented on a change in pull request #7539: ARROW-9156: [C++] Reducing the code size of the tensor module
wesm commented on a change in pull request #7539: URL: https://github.com/apache/arrow/pull/7539#discussion_r447030010

## File path: cpp/src/arrow/tensor/csf_converter.cc

@@ -57,73 +57,86 @@ inline void IncrementIndex(std::vector<int64_t>& coord, const std::vector<int64_t>& shape)

-class SparseCSFTensorConverter {
- public:
-  using NumericTensorType = NumericTensor<TYPE>;
-  using value_type = typename NumericTensorType::value_type;
+class SparseCSFTensorConverter : private SparseTensorConverterMixin {
+  using SparseTensorConverterMixin::AssignIndex;
+  using SparseTensorConverterMixin::IsNonZero;

-  SparseCSFTensorConverter(const NumericTensorType& tensor,
+ public:
+  SparseCSFTensorConverter(const Tensor& tensor,
                           const std::shared_ptr<DataType>& index_value_type,
                           MemoryPool* pool)
      : tensor_(tensor), index_value_type_(index_value_type), pool_(pool) {}

-  template <typename IndexValueType>
   Status Convert() {
-    using c_index_value_type = typename IndexValueType::c_type;
-    RETURN_NOT_OK(CheckMaximumValue(std::numeric_limits<c_index_value_type>::max()));
+    RETURN_NOT_OK(CheckSparseIndexMaximumValue(index_value_type_, tensor_.shape()));
+
+    const int index_elsize =
+        checked_cast<const FixedWidthType&>(*index_value_type_).bit_width() / CHAR_BIT;
+    const int value_elsize =
+        checked_cast<const FixedWidthType&>(*tensor_.type()).bit_width() / CHAR_BIT;

Review comment: Could be helper function, maybe we can put an `internal::GetByteWidth` function in `arrow/type.h` that does this?

## File path: cpp/src/arrow/tensor/csx_converter.cc

@@ -0,0 +1,245 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/tensor/converter.h"
+
+#include <cstdint>
+#include <limits>
+#include <memory>
+#include <vector>
+
+#include "arrow/buffer.h"
+#include "arrow/status.h"
+#include "arrow/type.h"
+#include "arrow/util/checked_cast.h"
+#include "arrow/visitor_inline.h"
+
+namespace arrow {
+
+class MemoryPool;
+
+namespace internal {
+namespace {
+
+// ----------------------------------------------------------------------
+// SparseTensorConverter for SparseCSRIndex
+
+class SparseCSXMatrixConverter : private SparseTensorConverterMixin {
+  using SparseTensorConverterMixin::AssignIndex;
+  using SparseTensorConverterMixin::IsNonZero;
+
+ public:
+  SparseCSXMatrixConverter(SparseMatrixCompressedAxis axis, const Tensor& tensor,
+                           const std::shared_ptr<DataType>& index_value_type,
+                           MemoryPool* pool)
+      : axis_(axis), tensor_(tensor), index_value_type_(index_value_type), pool_(pool) {}
+
+  Status Convert() {
+    RETURN_NOT_OK(CheckSparseIndexMaximumValue(index_value_type_, tensor_.shape()));
+
+    const int index_elsize =
+        checked_cast<const FixedWidthType&>(*index_value_type_).bit_width() / CHAR_BIT;

Review comment: Can you factor this "get byte width" operation into a helper function?

## File path: cpp/src/arrow/tensor/csf_converter.cc

@@ -148,84 +161,130 @@ class SparseCSFTensorConverter {
     return Status::OK();
   }

-#define CALL_TYPE_SPECIFIC_CONVERT(TYPE_CLASS) \
-  case TYPE_CLASS##Type::type_id:              \
-    return Convert<TYPE_CLASS##Type>();
-
-  Status Convert() {
-    switch (index_value_type_->id()) {
-      ARROW_GENERATE_FOR_ALL_INTEGER_TYPES(CALL_TYPE_SPECIFIC_CONVERT);
-      // LCOV_EXCL_START: The following invalid causes program failure.
-      default:
-        return Status::TypeError("Unsupported SparseTensor index value type");
-        // LCOV_EXCL_STOP
-    }
-  }
-
-#undef CALL_TYPE_SPECIFIC_CONVERT
-
   std::shared_ptr<SparseCSFIndex> sparse_index;
   std::shared_ptr<Buffer> data;

  private:
-  const NumericTensorType& tensor_;
+  const Tensor& tensor_;
   const std::shared_ptr<DataType>& index_value_type_;
   MemoryPool* pool_;
+};

-  template <typename c_value_type>
-  inline Status CheckMaximumValue(const c_value_type type_max) const {
-    auto max_dimension =
-        *std::max_element(tensor_.shape().begin(), tensor_.shape().end());
-    if (static_cast<c_value_type>(type_max) < max_dimension) {
-      // LCOV_EXCL_START: The following invalid causes program failure.
-      return Status::Invalid("The bit width of the index value type is too small");
-      // LCOV_EXCL_STOP
-    }
-    return Status::OK();
+class
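To make the suggestion concrete, here is a minimal sketch of what such an `internal::GetByteWidth` helper could look like. The `FixedWidthType` below is a stand-in for Arrow's real class (which exposes `bit_width()`), so the snippet is self-contained rather than an actual patch:

```cpp
#include <climits>  // for CHAR_BIT

// Stand-in for arrow::FixedWidthType, which exposes bit_width(); the real
// class lives in arrow/type.h. Defined here only so the sketch compiles
// on its own.
struct FixedWidthType {
  int width_bits;
  int bit_width() const { return width_bits; }
};

// Hypothetical shape of the helper proposed above: converts a fixed-width
// type's bit width into a byte width.
inline int GetByteWidth(const FixedWidthType& type) {
  return type.bit_width() / CHAR_BIT;
}
```

The real helper would presumably take Arrow's `const FixedWidthType&` (or cast from `const DataType&` internally); the exact signature here is an assumption.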
[GitHub] [arrow] rjzamora commented on pull request #7546: ARROW-8733: [C++][Dataset][Python] Expose RowGroupInfo statistics values
rjzamora commented on pull request #7546: URL: https://github.com/apache/arrow/pull/7546#issuecomment-651177576 Thanks for the great work here @bkietz ! This is wonderful - Dask uses the min/max statistics to calculate `divisions`, so this functionality is definitely necessary. *A note on other (less-critical, but useful) statistics*: Dask also uses the `"total_byte_size"` statistic (for the full row group, not each column) to aggregate partitions before reading in any data. There is also a plan to use the `"num_rows"` statistic when the user executes `len(ddf)` (to avoid loading any data). **How difficult would it be to add/expose these additional row-group statistics?** Again, this is much less of a "blocker" for initial integration with Dask, but these are things we will likely want to add eventually. cc @jorisvandenbossche This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
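To illustrate the kind of use rjzamora describes for `total_byte_size`: a reader can greedily coalesce adjacent row groups into partitions under a byte budget before loading any data. The sketch below is a hypothetical illustration of that idea, not Dask's actual partitioning code:

```cpp
#include <cstdint>
#include <vector>

// Greedily coalesce adjacent row groups into partitions whose combined
// total_byte_size stays within a budget. Returns, for each partition, the
// number of row groups it contains. A single row group larger than the
// budget still gets its own partition.
std::vector<int> CoalesceRowGroups(const std::vector<int64_t>& byte_sizes,
                                   int64_t budget) {
  std::vector<int> partitions;
  int count = 0;
  int64_t bytes = 0;
  for (int64_t size : byte_sizes) {
    if (count > 0 && bytes + size > budget) {
      partitions.push_back(count);  // close the current partition
      count = 0;
      bytes = 0;
    }
    ++count;
    bytes += size;
  }
  if (count > 0) partitions.push_back(count);  // flush the last partition
  return partitions;
}
```

For example, four row groups of 100 bytes each with a 250-byte budget coalesce into two partitions of two row groups apiece. The point is that only the metadata statistics are consulted, never the column data itself.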
[GitHub] [arrow] kszucs closed pull request #7376: ARROW-9043: [Go][FOLLOWUP] Move license file copy to correct location
kszucs closed pull request #7376: URL: https://github.com/apache/arrow/pull/7376
[GitHub] [arrow] kszucs commented on pull request #7376: ARROW-9043: [Go][FOLLOWUP] Move license file copy to correct location
kszucs commented on pull request #7376: URL: https://github.com/apache/arrow/pull/7376#issuecomment-651170232 Thanks @jba for the update. Merging it then.
[GitHub] [arrow] maartenbreddels commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower
maartenbreddels commented on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-651165626 @pitrou many thanks for the review. I've implemented all your suggestions except: * Raising an error on invalid utf8 data (see comment) * Having a benchmark run on non-ascii codepoints (I think we want to do this separately from this PR, but it's an important point). Btw, I wasn't aware of existing utf8 code already in Arrow. The existing decoder based on http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ was new to me. Very interesting work, but unfortunately it led to a performance regression (~50->30 M/s), which I'm surprised about actually. Maybe worth comparing again when we have a benchmark with non-ascii codepoints. @wesm I hope this is ready to go
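As background on why a non-ascii benchmark matters here: once input leaves the ASCII range, the kernel must decode multi-byte sequences before it can transform codepoints, and that decode step dominates the cost. The sketch below is a deliberately naive branchy decoder for illustration only; it is not the Hoehrmann DFA decoder mentioned above, and it skips validation of continuation bytes:

```cpp
#include <cstddef>
#include <cstdint>

// Decode one UTF-8 sequence starting at s, writing the codepoint and
// returning the number of bytes consumed (0 on a malformed lead byte).
// Simplified: trusts continuation bytes rather than validating them,
// so it must only be fed well-formed UTF-8.
inline size_t DecodeUtf8(const uint8_t* s, uint32_t* codepoint) {
  if (s[0] < 0x80) {                 // 1 byte: 0xxxxxxx (ASCII fast path)
    *codepoint = s[0];
    return 1;
  } else if ((s[0] >> 5) == 0x6) {   // 2 bytes: 110xxxxx 10xxxxxx
    *codepoint = ((s[0] & 0x1Fu) << 6) | (s[1] & 0x3Fu);
    return 2;
  } else if ((s[0] >> 4) == 0xE) {   // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    *codepoint = ((s[0] & 0x0Fu) << 12) | ((s[1] & 0x3Fu) << 6) | (s[2] & 0x3Fu);
    return 3;
  } else if ((s[0] >> 3) == 0x1E) {  // 4 bytes: 11110xxx 10xxxxxx x3
    *codepoint = ((s[0] & 0x07u) << 18) | ((s[1] & 0x3Fu) << 12) |
                 ((s[2] & 0x3Fu) << 6) | (s[3] & 0x3Fu);
    return 4;
  }
  return 0;  // malformed lead byte (e.g. a stray continuation byte)
}
```

A decoder like this takes the branchy path on every non-ASCII character, which is why a benchmark that only feeds ASCII input (where the first branch always wins) can badly overstate real-world throughput.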