[jira] [Created] (ARROW-8427) [C++][Dataset] Do not ignore file paths with underscore/dot when full path was specified
Joris Van den Bossche created ARROW-8427: Summary: [C++][Dataset] Do not ignore file paths with underscore/dot when full path was specified Key: ARROW-8427 URL: https://issues.apache.org/jira/browse/ARROW-8427 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Joris Van den Bossche Fix For: 0.17.0 Currently, when passing a list of file paths to FileSystemDatasetFactory, the files that have one of their parent directories starting with an underscore or dot are skipped. Since the file paths were passed as an explicit list, we should maybe not skip them. For example, when specifying a directory (Selector), it will only check child directories to skip, not parent directories. -- This message was sent by Atlassian Jira (v8.3.4#803005)
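For illustration, the skip rule the ticket describes can be sketched as a small predicate. This is a hypothetical Python helper, not the actual C++ logic in FileSystemDatasetFactory; the `explicit` flag models the proposed behavior of not filtering paths that were listed explicitly.

```python
def has_hidden_parent(path: str) -> bool:
    """True if any parent directory component starts with '_' or '.'."""
    parts = path.split("/")[:-1]  # drop the file name, keep parent dirs
    return any(p.startswith(("_", ".")) for p in parts if p)

def select_files(paths, explicit=False):
    """Proposed behavior: keep explicitly listed files even when a
    parent directory starts with '_' or '.'."""
    if explicit:
        return list(paths)  # full paths were given: do not skip any
    return [p for p in paths if not has_hidden_parent(p)]
```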
[jira] [Created] (ARROW-8426) [Rust] [Parquet] Add support for writing dictionary types
Andy Grove created ARROW-8426: Summary: [Rust] [Parquet] Add support for writing dictionary types Key: ARROW-8426 URL: https://issues.apache.org/jira/browse/ARROW-8426 Project: Apache Arrow Issue Type: Sub-task Reporter: Andy Grove
[jira] [Created] (ARROW-8425) [Rust] [Parquet] Add support for writing timestamp types
Andy Grove created ARROW-8425: Summary: [Rust] [Parquet] Add support for writing timestamp types Key: ARROW-8425 URL: https://issues.apache.org/jira/browse/ARROW-8425 Project: Apache Arrow Issue Type: Sub-task Reporter: Andy Grove
[jira] [Created] (ARROW-8423) [Rust] [Parquet] Add support for writing integer types
Andy Grove created ARROW-8423: Summary: [Rust] [Parquet] Add support for writing integer types Key: ARROW-8423 URL: https://issues.apache.org/jira/browse/ARROW-8423 Project: Apache Arrow Issue Type: Sub-task Reporter: Andy Grove
[jira] [Created] (ARROW-8424) [Rust] [Parquet] Add support for writing floating point types
Andy Grove created ARROW-8424: Summary: [Rust] [Parquet] Add support for writing floating point types Key: ARROW-8424 URL: https://issues.apache.org/jira/browse/ARROW-8424 Project: Apache Arrow Issue Type: Sub-task Reporter: Andy Grove
[jira] [Created] (ARROW-8422) [Rust] Implement function to convert Arrow schema to Parquet schema
Andy Grove created ARROW-8422: Summary: [Rust] Implement function to convert Arrow schema to Parquet schema Key: ARROW-8422 URL: https://issues.apache.org/jira/browse/ARROW-8422 Project: Apache Arrow Issue Type: Sub-task Reporter: Andy Grove Implement function to convert Arrow schema to Parquet schema
[jira] [Created] (ARROW-8421) [Rust] [Parquet] Implement parquet writer
Andy Grove created ARROW-8421: Summary: [Rust] [Parquet] Implement parquet writer Key: ARROW-8421 URL: https://issues.apache.org/jira/browse/ARROW-8421 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: Andy Grove Fix For: 1.0.0 This is the parent story. See subtasks for more information.
[jira] [Created] (ARROW-8420) [C++] CMake fails to configure on armv7l platform (e.g. Raspberry Pi 3)
Wes McKinney created ARROW-8420: Summary: [C++] CMake fails to configure on armv7l platform (e.g. Raspberry Pi 3) Key: ARROW-8420 URL: https://issues.apache.org/jira/browse/ARROW-8420 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Wes McKinney Fix For: 0.17.0 Related to ARROW-8410, but probably will resolve the ARMv7 issues in a separate PR

{code}
$ cmake .. -DARROW_BUILD_TESTS=ON -DARROW_ORC=ON -DARROW_PARQUET=ON -DARROW_DEPENDENCY_SOURCE=BUNDLED -GNinja
-- Building using CMake version: 3.13.4
-- The C compiler identification is GNU 8.3.0
-- The CXX compiler identification is GNU 8.3.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Arrow version: 1.0.0 (full: '1.0.0-SNAPSHOT')
-- Arrow SO version: 100 (full: 100.0.0)
-- Found PkgConfig: /usr/bin/pkg-config (found version "0.29")
-- clang-tidy not found
-- clang-format not found
-- Could NOT find ClangTools (missing: CLANG_FORMAT_BIN CLANG_TIDY_BIN)
-- infer not found
-- Found Python3: /usr/bin/python3.7 (found version "3.7.3") found components: Interpreter
-- Found cpplint executable at /home/pi/code/arrow/cpp/build-support/cpplint.py
-- Performing Test CXX_SUPPORTS_SSE4_2
-- Performing Test CXX_SUPPORTS_SSE4_2 - Failed
-- Performing Test CXX_SUPPORTS_AVX2
-- Performing Test CXX_SUPPORTS_AVX2 - Failed
-- Performing Test CXX_SUPPORTS_AVX512
-- Performing Test CXX_SUPPORTS_AVX512 - Failed
-- Arrow build warning level: PRODUCTION
CMake Error at cmake_modules/SetupCxxFlags.cmake:318 (message):
  SSE4.2 required but compiler doesn't support it.
Call Stack (most recent call first):
  CMakeLists.txt:399 (include)
-- Configuring incomplete, errors occurred!
See also "/home/pi/code/arrow/cpp/build/CMakeFiles/CMakeOutput.log".
See also "/home/pi/code/arrow/cpp/build/CMakeFiles/CMakeError.log".
{code}
[jira] [Created] (ARROW-8419) [C++] Default display for multi-choice define_option_string is misleading
Wes McKinney created ARROW-8419: Summary: [C++] Default display for multi-choice define_option_string is misleading Key: ARROW-8419 URL: https://issues.apache.org/jira/browse/ARROW-8419 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Wes McKinney While working on ARROW-8410, I noticed:

{code}
-- ARROW_SIMD_LEVEL=AVX2 [default=NONE|SSE4_2|AVX2|AVX512]
--     SIMD compiler optimization level
-- ARROW_ARMV8_ARCH=armv8-a+crc+crypto [default=armv8-a|armv8-a+crc+crypto]
--     Arm64 arch and extensions
{code}
[jira] [Created] (ARROW-8418) [Python] partition_filename_cb in write_to_dataset should be passed additional keyword arguments rather than just keys
Varun Patil created ARROW-8418: Summary: [Python] partition_filename_cb in write_to_dataset should be passed additional keyword arguments rather than just keys Key: ARROW-8418 URL: https://issues.apache.org/jira/browse/ARROW-8418 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Varun Patil I recently had a requirement where I would have liked to construct a filename based on additional context from Apache Airflow (specifically execution_date). It would be nice to pass the additional kwargs to *partition_filename_cb* so that the filename can be constructed using additional information rather than just the keys used for partitioning. I believe the fix should be as simple as passing kwargs to the *partition_filename_cb* inside *write_to_dataset*.
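The proposal can be sketched as follows. These are hypothetical stand-ins (pyarrow's real write_to_dataset is more involved); the point is only how extra kwargs would flow through to the callback.

```python
def write_to_dataset_sketch(keys, partition_filename_cb=None, **kwargs):
    """Hypothetical sketch: forward extra kwargs to the filename callback
    in addition to the partition keys."""
    if partition_filename_cb is not None:
        return partition_filename_cb(keys, **kwargs)
    # default naming when no callback is supplied
    return "-".join(str(k) for k in keys) + ".parquet"

# A callback that uses extra context, e.g. an Airflow execution_date:
def name_with_date(keys, execution_date=None, **_):
    return f"{'-'.join(map(str, keys))}-{execution_date}.parquet"
```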
[jira] [Created] (ARROW-8417) [Packaging] Move the manylinux crossbow wheel builds to GitHub Actions
Krisztian Szucs created ARROW-8417: Summary: [Packaging] Move the manylinux crossbow wheel builds to GitHub Actions Key: ARROW-8417 URL: https://issues.apache.org/jira/browse/ARROW-8417 Project: Apache Arrow Issue Type: Improvement Components: Packaging Reporter: Krisztian Szucs Assignee: Krisztian Szucs To free up some bandwidth on azure for the conda jobs.
[jira] [Created] (ARROW-8416) [Python] Provide a "feather" alias in the dataset API
Joris Van den Bossche created ARROW-8416: Summary: [Python] Provide a "feather" alias in the dataset API Key: ARROW-8416 URL: https://issues.apache.org/jira/browse/ARROW-8416 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche Fix For: 0.17.0 I don't know what the plans are on the C++ side (ARROW-7586), but for 0.17, I think it would be nice if we can at least support {{ds.dataset(..., format="feather")}} (instead of needing to tell people to do {{ds.dataset(..., format="ipc")}} to read feather files).
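The alias could be as simple as a lookup consulted before dispatching on the format name. This is a sketch of the idea, not the actual dataset() internals; the table name and helper are hypothetical.

```python
# Hypothetical alias table: "feather" maps to the underlying IPC format.
_FORMAT_ALIASES = {"feather": "ipc"}

def resolve_format(name: str) -> str:
    """Normalize a user-supplied format name, resolving known aliases."""
    name = name.lower()
    return _FORMAT_ALIASES.get(name, name)
```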
[jira] [Created] (ARROW-8415) [C++] fix compilation error with GCC 4.8
Prudhvi Porandla created ARROW-8415: Summary: [C++] fix compilation error with GCC 4.8 Key: ARROW-8415 URL: https://issues.apache.org/jira/browse/ARROW-8415 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Prudhvi Porandla Assignee: Prudhvi Porandla
[jira] [Created] (ARROW-8414) [Python] Non-deterministic row order failure in test_parquet.py
Joris Van den Bossche created ARROW-8414: Summary: [Python] Non-deterministic row order failure in test_parquet.py Key: ARROW-8414 URL: https://issues.apache.org/jira/browse/ARROW-8414 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche Assignee: Joris Van den Bossche Fix For: 0.17.0
[jira] [Created] (ARROW-8413) Refactor DefLevelsToBitmap
Micah Kornfield created ARROW-8413: Summary: Refactor DefLevelsToBitmap Key: ARROW-8413 URL: https://issues.apache.org/jira/browse/ARROW-8413 Project: Apache Arrow Issue Type: Sub-task Components: C++ Reporter: Micah Kornfield Assignee: Micah Kornfield The current code should be split apart and made more efficient, consolidating the logic needed to support all nesting combinations. We need to be able to pass in an arbitrary minimum definition level to prune away elements that aren't included in lists. The functionality is also somewhat replicated in the struct-reading code; the two paths should be consolidated.
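As a simplified model of what a DefLevelsToBitmap-style routine computes: under the standard Parquet rule, a value is defined when its definition level reaches the maximum, and the proposed minimum level prunes entries that don't belong to the list being read. This sketch uses Python lists of booleans rather than packed bitmaps, and is not the C++ implementation.

```python
def def_levels_to_validity(def_levels, max_def_level, min_def_level=0):
    """For each definition level that belongs to this list
    (level >= min_def_level), emit True when the value is defined
    (level >= max_def_level) and False when it is null."""
    return [lvl >= max_def_level
            for lvl in def_levels
            if lvl >= min_def_level]
```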
[jira] [Created] (ARROW-8412) [C++][Gandiva] Fix gandiva date_diff function definitions
Projjal Chanda created ARROW-8412: Summary: [C++][Gandiva] Fix gandiva date_diff function definitions Key: ARROW-8412 URL: https://issues.apache.org/jira/browse/ARROW-8412 Project: Apache Arrow Issue Type: Task Components: C++ - Gandiva Reporter: Projjal Chanda Assignee: Projjal Chanda The current gandiva date function definitions date_diff and date_sub take an integer as the first argument and a date as the second argument: date_diff(10, d) = d - 10, which seems unintuitive.
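The complaint is purely about argument order, which a sketch makes concrete. These are hypothetical Python stand-ins for the Gandiva definitions, with the semantics the ticket states (date_diff(10, d) = d - 10).

```python
from datetime import date, timedelta

def date_sub_current(num_days, d):
    """Current order per the ticket: integer first, date second."""
    return d - timedelta(days=num_days)

def date_sub_proposed(d, num_days):
    """More intuitive order: date first, integer second."""
    return d - timedelta(days=num_days)
```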
[jira] [Created] (ARROW-8411) [C++] gcc6 warning re: arrow::internal::ArgSort
Wes McKinney created ARROW-8411: Summary: [C++] gcc6 warning re: arrow::internal::ArgSort Key: ARROW-8411 URL: https://issues.apache.org/jira/browse/ARROW-8411 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Wes McKinney Observed on a Debian platform with gcc6 base

{code}
In file included from /usr/include/c++/6/algorithm:62:0,
                 from ../src/arrow/util/bit_util.h:55,
                 from ../src/arrow/type_traits.h:26,
                 from ../src/arrow/array.h:32,
                 from ../src/arrow/compute/kernel.h:24,
                 from ../src/arrow/dataset/filter.h:27,
                 from ../src/arrow/dataset/partition.h:27,
                 from /home/rock/code/arrow/cpp/src/arrow/dataset/partition.cc:18:
/usr/include/c++/6/bits/stl_algo.h: In function 'void std::__insertion_sort(_RandomAccessIterator, _RandomAccessIterator, _Compare) [with _RandomAccessIterator = __gnu_cxx::__normal_iterator >; _Compare = __gnu_cxx::__ops::_Iter_comp_iter&, Cmp&&) [with T = std::__cxx11::basic_string; Cmp = std::less >]:: >]':
/usr/include/c++/6/bits/stl_algo.h:1837:5: note: parameter passing for argument of type '__gnu_cxx::__normal_iterator >' will change in GCC 7.1
 __insertion_sort(_RandomAccessIterator __first,
 ^~~~
/usr/include/c++/6/bits/stl_algo.h:1837:5: note: parameter passing for argument of type '__gnu_cxx::__normal_iterator >' will change in GCC 7.1
In file included from /usr/include/c++/6/bits/stl_algo.h:61:0,
                 from /usr/include/c++/6/algorithm:62,
                 from ../src/arrow/util/bit_util.h:55,
                 from ../src/arrow/type_traits.h:26,
                 from ../src/arrow/array.h:32,
                 from ../src/arrow/compute/kernel.h:24,
                 from ../src/arrow/dataset/filter.h:27,
                 from ../src/arrow/dataset/partition.h:27,
                 from /home/rock/code/arrow/cpp/src/arrow/dataset/partition.cc:18:
/usr/include/c++/6/bits/stl_heap.h: In function 'void std::__adjust_heap(_RandomAccessIterator, _Distance, _Distance, _Tp, _Compare) [with _RandomAccessIterator = __gnu_cxx::__normal_iterator >; _Distance = int; _Tp = long long int; _Compare = __gnu_cxx::__ops::_Iter_comp_iter&, Cmp&&) [with T = std::__cxx11::basic_string; Cmp = std::less >]:: >]':
/usr/include/c++/6/bits/stl_heap.h:209:5: note: parameter passing for argument of type '__gnu_cxx::__normal_iterator >' will change in GCC 7.1
 __adjust_heap(_RandomAccessIterator __first, _Distance __holeIndex,
 ^
In file included from /usr/include/c++/6/algorithm:62:0,
                 from ../src/arrow/util/bit_util.h:55,
                 from ../src/arrow/type_traits.h:26,
                 from ../src/arrow/array.h:32,
                 from ../src/arrow/compute/kernel.h:24,
                 from ../src/arrow/dataset/filter.h:27,
                 from ../src/arrow/dataset/partition.h:27,
                 from /home/rock/code/arrow/cpp/src/arrow/dataset/partition.cc:18:
/usr/include/c++/6/bits/stl_algo.h: In function 'void std::__introsort_loop(_RandomAccessIterator, _RandomAccessIterator, _Size, _Compare) [with _RandomAccessIterator = __gnu_cxx::__normal_iterator >; _Size = int; _Compare = __gnu_cxx::__ops::_Iter_comp_iter&, Cmp&&) [with T = std::__cxx11::basic_string; Cmp = std::less >]:: >]':
/usr/include/c++/6/bits/stl_algo.h:1937:5: note: parameter passing for argument of type '__gnu_cxx::__normal_iterator >' will change in GCC 7.1
 __introsort_loop(_RandomAccessIterator __first,
 ^~~~
/usr/include/c++/6/bits/stl_algo.h:1937:5: note: parameter passing for argument of type '__gnu_cxx::__normal_iterator >' will change in GCC 7.1
/usr/include/c++/6/bits/stl_algo.h:1951:4: note: parameter passing for argument of type '__gnu_cxx::__normal_iterator >' will change in GCC 7.1
    std::__introsort_loop(__cut, __last, __depth_limit, __comp);
    ^~~
/usr/include/c++/6/bits/stl_algo.h: In function 'std::vector arrow::internal::ArgSort(const std::vector&, Cmp&&) [with T = std::__cxx11::basic_string; Cmp = std::less >]':
/usr/include/c++/6/bits/stl_algo.h:1882:4: note: parameter passing for argument of type '__gnu_cxx::__normal_iterator >' will change in GCC 7.1
    std::__insertion_sort(__first, __first + int(_S_threshold), __comp);
    ^~~
/usr/include/c++/6/bits/stl_algo.h:1887:2: note: parameter passing for argument of type '__gnu_cxx::__normal_iterator >' will change in GCC 7.1
  std::__insertion_sort(__first, __last, __comp);
  ^~~
/usr/include/c++/6/bits/stl_algo.h:1965:4: note: parameter passing for argument of type '__gnu_cxx::__normal_iterator >' will change in GCC 7.1
    std::__introsort_loop(__first, __last,
    ^~~
{code}
[jira] [Created] (ARROW-8410) [C++] CMake fails on aarch64 systems that do not support -march=armv8-a+crc+crypto
Wes McKinney created ARROW-8410: Summary: [C++] CMake fails on aarch64 systems that do not support -march=armv8-a+crc+crypto Key: ARROW-8410 URL: https://issues.apache.org/jira/browse/ARROW-8410 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Wes McKinney Fix For: 0.17.0 I was trying to build the project on a rockpro64 system to look into something else and ran into this

{code}
-- Arrow build warning level: PRODUCTION
CMake Error at cmake_modules/SetupCxxFlags.cmake:332 (message):
  Unsupported arch flag: -march=armv8-a+crc+crypto.
Call Stack (most recent call first):
  CMakeLists.txt:398 (include)
-- Configuring incomplete, errors occurred!
{code}
[jira] [Created] (ARROW-8409) [R] Add arrow::cpu_count, arrow::set_cpu_count wrapper functions a la Python
Wes McKinney created ARROW-8409: Summary: [R] Add arrow::cpu_count, arrow::set_cpu_count wrapper functions a la Python Key: ARROW-8409 URL: https://issues.apache.org/jira/browse/ARROW-8409 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Wes McKinney Fix For: 0.17.0 While some people will configure these with {{$OMP_NUM_THREADS}}, it is useful to be able to configure the global thread pool dynamically.
[jira] [Created] (ARROW-8408) [Python] Add memory_map= toggle to pyarrow.feather.read_feather
Wes McKinney created ARROW-8408: Summary: [Python] Add memory_map= toggle to pyarrow.feather.read_feather Key: ARROW-8408 URL: https://issues.apache.org/jira/browse/ARROW-8408 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Wes McKinney Assignee: Wes McKinney Fix For: 0.17.0 I missed this in my prior patch.
[jira] [Created] (ARROW-8407) [Rust] Add rustdoc for Dictionary type
Andy Grove created ARROW-8407: Summary: [Rust] Add rustdoc for Dictionary type Key: ARROW-8407 URL: https://issues.apache.org/jira/browse/ARROW-8407 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: Andy Grove Assignee: Andy Grove Fix For: 0.17.0 Add rustdoc for Dictionary type
[jira] [Created] (ARROW-8406) [Python] FileSystem.from_uri erases the drive on Windows
Krisztian Szucs created ARROW-8406: Summary: [Python] FileSystem.from_uri erases the drive on Windows Key: ARROW-8406 URL: https://issues.apache.org/jira/browse/ARROW-8406 Project: Apache Arrow Issue Type: Bug Components: C++, Python Reporter: Krisztian Szucs

{code:python}
path = "C:\Users\VssAdministrator\AppData\Local\Temp\pytest-of-VssAdministrator\pytest-0\test_construct_from_single_fil0\single-file"
_, path = FileSystem.from_uri(path)
path == "/Users/VssAdministrator/AppData/Local/Temp/pytest-of-VssAdministrator/pytest-0/test_construct_from_single_fil0/single-file"
{code}
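What the expected, drive-preserving behavior looks like can be sketched with pathlib. This is not Arrow's from_uri implementation, only the normalization the report implies should happen: forward slashes, with the drive kept rather than dropped.

```python
from pathlib import PureWindowsPath

def to_fs_path(local_path: str) -> str:
    """Normalize a local Windows path to forward slashes while
    keeping the drive (the reported bug drops 'C:')."""
    return PureWindowsPath(local_path).as_posix()
```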
[jira] [Created] (ARROW-8405) [Gandiva][UDF] Support complex datatype for UDF return type.
ZMZ91 created ARROW-8405: Summary: [Gandiva][UDF] Support complex datatype for UDF return type. Key: ARROW-8405 URL: https://issues.apache.org/jira/browse/ARROW-8405 Project: Apache Arrow Issue Type: New Feature Components: C++ - Gandiva Reporter: ZMZ91 Is it possible to return a complex datatype from a UDF, like vector or even dictionary? Checked [https://github.com/apache/arrow/blob/master/cpp/src/gandiva/precompiled/types.h] and found the types used there are all basic datatypes.
[jira] [Created] (ARROW-8404) Read and write dataset description in both R and Python
Vincent Nijs created ARROW-8404: Summary: Read and write dataset description in both R and Python Key: ARROW-8404 URL: https://issues.apache.org/jira/browse/ARROW-8404 Project: Apache Arrow Issue Type: New Feature Reporter: Vincent Nijs Below is a feature request for feather. Wes suggested opening an issue here. The idea is to add metadata to a data frame to store and display information about the data (e.g., variable descriptions, data source, main company contact about the data, changes, etc.). For a simple example in R (+ shiny) that uses a "description" attribute in markdown format and then renders it in HTML when loaded, see the link below, specifically the description for the diamonds data. [https://vnijs.shinyapps.io/radiant] Having a data format that works for both R and Python *and* maintains attributes like a data description would be great! [https://github.com/wesm/feather/issues/328]
[jira] [Created] (ARROW-8403) [C++] Add ToString() to ChunkedArray, Table and RecordBatch
Kouhei Sutou created ARROW-8403: Summary: [C++] Add ToString() to ChunkedArray, Table and RecordBatch Key: ARROW-8403 URL: https://issues.apache.org/jira/browse/ARROW-8403 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Kouhei Sutou Assignee: Kouhei Sutou
[jira] [Created] (ARROW-8402) [Java] Support ValidateFull methods in Java
Liya Fan created ARROW-8402: Summary: [Java] Support ValidateFull methods in Java Key: ARROW-8402 URL: https://issues.apache.org/jira/browse/ARROW-8402 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Liya Fan Assignee: Liya Fan We need to support ValidateFull methods in Java, just like we do in C++. This is required by ARROW-5926.
[jira] [Created] (ARROW-8401) [C++] Add AVX2/AVX512 version of ByteStreamSplitDecode/ByteStreamSplitEncode
Frank Du created ARROW-8401: Summary: [C++] Add AVX2/AVX512 version of ByteStreamSplitDecode/ByteStreamSplitEncode Key: ARROW-8401 URL: https://issues.apache.org/jira/browse/ARROW-8401 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Frank Du Assignee: Frank Du Add AVX2/AVX512 versions of ByteStreamSplitDecode/ByteStreamSplitEncode; they should be similar to the SSE implementation.
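For reference, the scalar form of the Parquet BYTE_STREAM_SPLIT transform that the SIMD versions would accelerate can be sketched as follows. This is a Python model of the encoding (byte j of every value gathered into stream j), not the C++ code being changed.

```python
def byte_stream_split_encode(data: bytes, byte_width: int) -> bytes:
    """Scatter byte j of each fixed-width value into contiguous stream j."""
    n = len(data) // byte_width
    out = bytearray(len(data))
    for i in range(n):
        for j in range(byte_width):
            out[j * n + i] = data[i * byte_width + j]
    return bytes(out)

def byte_stream_split_decode(data: bytes, byte_width: int) -> bytes:
    """Inverse transform: reassemble values from the per-byte streams."""
    n = len(data) // byte_width
    out = bytearray(len(data))
    for i in range(n):
        for j in range(byte_width):
            out[i * byte_width + j] = data[j * n + i]
    return bytes(out)
```

Grouping like-positioned bytes makes the buffer far more compressible for floating point data, which is the point of the encoding.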
[jira] [Created] (ARROW-8400) [Python][Dataset] Infer the filesystem from the first path if multiple paths are passed to dataset()
Krisztian Szucs created ARROW-8400: Summary: [Python][Dataset] Infer the filesystem from the first path if multiple paths are passed to dataset() Key: ARROW-8400 URL: https://issues.apache.org/jira/browse/ARROW-8400 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Krisztian Szucs See conversation https://github.com/apache/arrow/pull/6505#discussion_r406677317
[jira] [Created] (ARROW-8399) [Rust] Extend memory alignments to include other architectures
Mahmut Bulut created ARROW-8399: Summary: [Rust] Extend memory alignments to include other architectures Key: ARROW-8399 URL: https://issues.apache.org/jira/browse/ARROW-8399 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: Mahmut Bulut Assignee: Mahmut Bulut Currently, the allocation alignment is fixed at 64, which suits most architectures, but not all L1D prefetching systems are the same; some architectures, like x86_64, use double prefetching. Include a matrix of alignment values to extend the cache alignments.
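The proposed "matrix of alignment values" could be as simple as a per-architecture lookup. A sketch follows; the x86_64 entry doubles the 64-byte line per the double-prefetching point above, while the other values are illustrative assumptions, not measured figures.

```python
# Hypothetical per-architecture allocation alignment table (bytes).
ALIGNMENTS = {
    "x86_64": 128,   # assumed: 2 * 64-byte lines for double prefetching
    "aarch64": 64,
    "default": 64,   # fall back to the current fixed alignment
}

def alloc_alignment(arch: str) -> int:
    """Pick the allocation alignment for a target architecture."""
    return ALIGNMENTS.get(arch, ALIGNMENTS["default"])
```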
[jira] [Created] (ARROW-8398) [Python] Remove deprecation warnings originating from python tests
Krisztian Szucs created ARROW-8398: Summary: [Python] Remove deprecation warnings originating from python tests Key: ARROW-8398 URL: https://issues.apache.org/jira/browse/ARROW-8398 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Krisztian Szucs Assignee: Krisztian Szucs See build log https://travis-ci.org/github/ursa-labs/crossbow/builds/673385834#L6846
[jira] [Created] (ARROW-8397) [C++] Fail to compile aggregate_test.cc on Ubuntu 16.04
Krisztian Szucs created ARROW-8397: Summary: [C++] Fail to compile aggregate_test.cc on Ubuntu 16.04 Key: ARROW-8397 URL: https://issues.apache.org/jira/browse/ARROW-8397 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Krisztian Szucs Assignee: Krisztian Szucs See build log https://app.circleci.com/pipelines/github/ursa-labs/crossbow/31122/workflows/b250d378-52a8-4d15-9909-96474fa38482/jobs/10840
[jira] [Created] (ARROW-8396) [Rust] Remove libc from dependencies
Mahmut Bulut created ARROW-8396: Summary: [Rust] Remove libc from dependencies Key: ARROW-8396 URL: https://issues.apache.org/jira/browse/ARROW-8396 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: Mahmut Bulut The code that used libc calls has been removed, but the dependency is still there. We can remove it before the next release.
[jira] [Created] (ARROW-8395) [Python] conda install pyarrow defaults to 0.11.1 not 0.16.0
dwang created ARROW-8395: Summary: [Python] conda install pyarrow defaults to 0.11.1 not 0.16.0 Key: ARROW-8395 URL: https://issues.apache.org/jira/browse/ARROW-8395 Project: Apache Arrow Issue Type: Improvement Components: Python Environment: ubuntu 16, ubuntu 18, anaconda 2020.02 x64 Reporter: dwang When installing pyarrow in a clean Linux conda environment (2020.02):

{code:java}
conda install -c conda-forge pyarrow

The following packages will be downloaded:

    package             |            build
    --------------------|-----------------
    arrow-cpp-0.11.1    | py37h0e61e49_1004       6.3 MB  conda-forge
    boost-cpp-1.68.0    |     h11c811c_1000      20.5 MB  conda-forge
    conda-4.8.3         |     py37hc8dfbb8_1      3.0 MB  conda-forge
    libprotobuf-3.6.1   |      hdbcaa40_1001      4.0 MB  conda-forge
    parquet-cpp-1.5.1   |                 3        3 KB  conda-forge
    pyarrow-0.11.1      | py37hbbcf98d_1002       2.0 MB  conda-forge
    python_abi-3.7      |           1_cp37m        4 KB  conda-forge
    thrift-cpp-0.12.0   |      h0a07b25_1002      2.4 MB  conda-forge

    Total:                                       38.2 MB
{code}

The default version is pyarrow-0.11.1, while the conda repo actually has the latest version, 0.16.0 ([https://anaconda.org/conda-forge/pyarrow]). Specifying the version does not help: conda install -c conda-forge pyarrow=0.16.0

Workaround: I have to manually download the packages below from conda and then install them locally:

arrow-cpp-0.16.0-py37hb0edad2_0.tar.bz2
aws-sdk-cpp-1.7.164-h1f8afcc_0.tar.bz2
boost-cpp-1.70.0-h8e57a91_2.tar.bz2
brotli-1.0.7-he1b5a44_1000.tar.bz2
c-ares-1.15.0-h516909a_1001.tar.bz2
gflags-2.2.2-he1b5a44_1002.tar.bz2
glog-0.4.0-he1b5a44_1.tar.bz2
grpc-cpp-1.25.0-h213be95_2.tar.bz2
libprotobuf-3.11.3-h8b12597_0.tar.bz2
lz4-c-1.8.3-he1b5a44_1001.tar.bz2
parquet-cpp-1.5.1-1.tar.bz2
pyarrow-0.16.0-py37h8b68381_1.tar.bz2
re2-2020.01.01-he1b5a44_0.tar.bz2
snappy-1.1.8-he1b5a44_1.tar.bz2
thrift-cpp-0.12.0-hf3afdfd_1004.tar.bz2
zstd-1.4.4-h3b9ef0a_1.tar.bz2
[jira] [Created] (ARROW-8394) Typescript compiler errors for arrow d.ts files, when using es2015-esm package
Shyamal Shukla created ARROW-8394: Summary: Typescript compiler errors for arrow d.ts files, when using es2015-esm package Key: ARROW-8394 URL: https://issues.apache.org/jira/browse/ARROW-8394 Project: Apache Arrow Issue Type: Bug Components: JavaScript Affects Versions: 0.16.0 Reporter: Shyamal Shukla Attempting to use apache-arrow within a web application, but the typescript compiler throws the following errors in some of arrow's .d.ts files

{code}
import { Table } from "../node_modules/@apache-arrow/es2015-esm/Arrow";

export class SomeClass {
  .
  .
  constructor() {
    const t = Table.from('');
  }
}
{code}

*node_modules/@apache-arrow/es2015-esm/column.d.ts:14:22* - error TS2417: Class static side 'typeof Column' incorrectly extends base class static side 'typeof Chunked'. Types of property 'new' are incompatible.

*node_modules/@apache-arrow/es2015-esm/ipc/reader.d.ts:238:5* - error TS2717: Subsequent property declarations must have the same type. Property 'schema' must be of type 'Schema', but here has type 'Schema'. 238 schema: Schema;

*node_modules/@apache-arrow/es2015-esm/recordbatch.d.ts:17:18* - error TS2430: Interface 'RecordBatch' incorrectly extends interface 'StructVector'. The types of 'slice(...).clone' are incompatible between these types.

the tsconfig.json file looks like

{code}
{
  "compilerOptions": {
    "target": "ES6",
    "outDir": "dist",
    "baseUrl": "src/"
  },
  "exclude": ["dist"],
  "include": ["src/*.ts"]
}
{code}
[jira] [Created] (ARROW-8393) [C++][Gandiva] Make gandiva function registry case-insensitive
Projjal Chanda created ARROW-8393: Summary: [C++][Gandiva] Make gandiva function registry case-insensitive Key: ARROW-8393 URL: https://issues.apache.org/jira/browse/ARROW-8393 Project: Apache Arrow Issue Type: Task Components: C++ - Gandiva Reporter: Projjal Chanda Assignee: Projjal Chanda
[jira] [Created] (ARROW-8392) [Java] Fix overflow related corner cases for vector value comparison
Liya Fan created ARROW-8392: Summary: [Java] Fix overflow related corner cases for vector value comparison Key: ARROW-8392 URL: https://issues.apache.org/jira/browse/ARROW-8392 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: Liya Fan 1. Fix corner cases related to overflow. 2. Provide test cases for the corner cases.
[jira] [Created] (ARROW-8391) [C++] Implement row range read API for IPC file (and Feather)
Wes McKinney created ARROW-8391: Summary: [C++] Implement row range read API for IPC file (and Feather) Key: ARROW-8391 URL: https://issues.apache.org/jira/browse/ARROW-8391 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney The objective would be to be able to read a range of rows from the middle of a file. It's not as easy as it might sound, since all the record batch metadata must be examined to determine the start and end point of the row range.
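The metadata scan the ticket describes amounts to mapping a row range onto per-batch slices: walk the record batch lengths, and for each batch compute which of its rows fall inside the range. A hypothetical helper, independent of the actual IPC reader API:

```python
def batches_for_row_range(batch_lengths, start, stop):
    """Given per-record-batch row counts, return (batch_index, lo, hi)
    slices (batch-local, half-open) covering global rows [start, stop)."""
    out, offset = [], 0
    for idx, n in enumerate(batch_lengths):
        lo = max(start - offset, 0)
        hi = min(stop - offset, n)
        if lo < hi:  # this batch overlaps the requested range
            out.append((idx, lo, hi))
        offset += n
    return out
```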
[jira] [Created] (ARROW-8390) [R] Expose schema unification features
Neal Richardson created ARROW-8390: Summary: [R] Expose schema unification features Key: ARROW-8390 URL: https://issues.apache.org/jira/browse/ARROW-8390 Project: Apache Arrow Issue Type: New Feature Components: R Reporter: Neal Richardson Assignee: Neal Richardson Fix For: 0.17.0
[jira] [Created] (ARROW-8389) [Integration] Run tests in parallel
Antoine Pitrou created ARROW-8389: Summary: [Integration] Run tests in parallel Key: ARROW-8389 URL: https://issues.apache.org/jira/browse/ARROW-8389 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration, Integration Reporter: Antoine Pitrou This follows ARROW-8176.
[jira] [Created] (ARROW-8388) [C++] GCC 4.8 fails to move on return
Ben Kietzman created ARROW-8388: Summary: [C++] GCC 4.8 fails to move on return Key: ARROW-8388 URL: https://issues.apache.org/jira/browse/ARROW-8388 Project: Apache Arrow Issue Type: Bug Components: C++ Affects Versions: 0.16.0 Reporter: Ben Kietzman Assignee: Ben Kietzman Fix For: 0.17.0 See https://github.com/apache/arrow/pull/6883#issuecomment-611661733 This is a recurring problem which usually shows up as a broken nightly (the gandiva nightly jobs, specifically), along with similar issues due to gcc 4.8's incomplete handling of C++11. As long as someone depends on these, we should probably have an every-commit CI job which checks that we haven't introduced such a breakage.
[jira] [Created] (ARROW-8387) [rust] Make schema_to_fb public because it is very useful!
Max Burke created ARROW-8387: Summary: [rust] Make schema_to_fb public because it is very useful! Key: ARROW-8387 URL: https://issues.apache.org/jira/browse/ARROW-8387 Project: Apache Arrow Issue Type: Improvement Reporter: Max Burke Make schema_to_fb public because it is very useful!
[jira] [Created] (ARROW-8386) [Python] pyarrow.jvm raises error for empty Arrays
Bryan Cutler created ARROW-8386: Summary: [Python] pyarrow.jvm raises error for empty Arrays Key: ARROW-8386 URL: https://issues.apache.org/jira/browse/ARROW-8386 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.16.0 Reporter: Bryan Cutler Assignee: Bryan Cutler In the pyarrow.jvm module, when there is an empty array in Java, trying to create it in python raises a ValueError. This is because for an empty array, Java returns an empty list of buffers, then pyarrow.jvm attempts to create the array with pa.Array.from_buffers with an empty list.
[jira] [Created] (ARROW-8385) Crash on parquet.read_table on windows python 3.8.2
Geoff Quested-Joens created ARROW-8385: -- Summary: Crash on parquet.read_table on windows python 3.8.2 Key: ARROW-8385 URL: https://issues.apache.org/jira/browse/ARROW-8385 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.16.0 Environment: Windows 10, Python 3.8.2, pip 20.0.2; pip freeze -> numpy==1.18.2 pandas==1.0.3 pyarrow==0.16.0 python-dateutil==2.8.1 pytz==2019.3 six==1.14.0 Reporter: Geoff Quested-Joens Attachments: crash.parquet On reading a Parquet file with pyarrow, the program spontaneously exits with no thrown exception, on Windows only. Running the same setup on Linux (Debian 10 in Docker), the same Parquet file reads without issue. The following reproduces the crash in a Python 3.8.2 environment (env listed above, but essentially {{pip install pandas pyarrow}}):
{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def test_pandas_write_read():
    df_out = pd.DataFrame.from_dict([{"A": i} for i in range(3)])
    df_out.to_parquet("crash.parquet")
    df_in = pd.read_parquet("crash.parquet")
    print(df_in)

def test_arrow_write_read():
    df = pd.DataFrame.from_dict([{"A": i} for i in range(3)])
    table_out = pa.Table.from_pandas(df)
    pq.write_table(table_out, 'crash.parquet')
    table_in = pq.read_table('crash.parquet')
    print(table_in)

if __name__ == "__main__":
    test_pandas_write_read()
    test_arrow_write_read()
{code}
The interpreter never reaches the print statements, crashing somewhere in the call on line 252 of {{parquet.py}}; no error is thrown, just spontaneous program exit.
{code:python}
self.reader.read_all(...
{code}
In contrast, running the same code and Python environment on Debian 10, there is no error reading the Parquet files generated by the same Windows code. The sha256sums compare equal for the crash.parquet generated on Debian and on Windows, so something appears to be up with the read. Attached is the crash.parquet file generated on my machine.
Oddly, changing the {{range(3)}} to {{range(2)}} gets rid of the crash on Windows. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8384) [C++][Python] arrow/filesystem/hdfs.h and Python wrapper does not have an option for setting a path to a Kerberos ticket
Wes McKinney created ARROW-8384: --- Summary: [C++][Python] arrow/filesystem/hdfs.h and Python wrapper does not have an option for setting a path to a Kerberos ticket Key: ARROW-8384 URL: https://issues.apache.org/jira/browse/ARROW-8384 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Wes McKinney This feature seems to have been dropped. Is there a plan for migrating users to the new filesystem API? We have two different code paths now. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8383) [RUST] Easier random access to DictionaryArray keys and values
Jörn Horstmann created ARROW-8383: - Summary: [RUST] Easier random access to DictionaryArray keys and values Key: ARROW-8383 URL: https://issues.apache.org/jira/browse/ARROW-8383 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: Jörn Horstmann Currently it's not that clear how to access DictionaryArray keys and values using random indices. The `DictionaryArray::keys` method exposes an Iterator with an `nth` method, but this requires a mut reference and feels a little out of place compared to other ways of accessing arrow data. Another alternative seems to be the `From for PrimitiveArray` conversion, like so: `let keys: Int16Array = dictionary_array.data().into()`. This works fine but is not easily discoverable and also needs to be done outside of any loops for performance reasons. I'd like a method on `DictionaryArray` to directly get the key at some index:
```
pub fn key(&self, i: usize) -> &K
```
Ideally I'd also like an easier way to directly access values at some index, at least when those are primitive or string types:
```
pub fn value(&self, i: usize) -> &T
```
I'm not sure how or if that would be possible to implement with Rust generics. -- This message was sent by Atlassian Jira (v8.3.4#803005)
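The requested accessors can be sketched with a minimal, stdlib-only Python model of a dictionary-encoded array (the `DictArray` class and the `key`/`value` method names here are illustrative stand-ins, not the Rust API being proposed):

```python
# Minimal model of a dictionary-encoded array: integer keys index
# into a small table of distinct values, as in Arrow's DictionaryArray.
class DictArray:
    def __init__(self, keys, values):
        self.keys = keys      # e.g. the Int16 indices in the real array
        self.values = values  # the dictionary of distinct values

    def key(self, i):
        # Direct random access to the i-th key; no mutable iterator needed.
        return self.keys[i]

    def value(self, i):
        # Resolve the i-th element through the dictionary.
        return self.values[self.keys[i]]

arr = DictArray(keys=[0, 2, 1, 0], values=["a", "b", "c"])
print(arr.key(1))    # the raw index into the dictionary
print(arr.value(1))  # the decoded element
```

The point of the model is that both lookups are plain O(1) indexing, so a loop over random positions needs no per-iteration setup.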
[jira] [Created] (ARROW-8382) [C++][Dataset] Refactor WritePlan to decouple from Fragment/Scan/Partition classes
Francois Saint-Jacques created ARROW-8382: - Summary: [C++][Dataset] Refactor WritePlan to decouple from Fragment/Scan/Partition classes Key: ARROW-8382 URL: https://issues.apache.org/jira/browse/ARROW-8382 Project: Apache Arrow Issue Type: Improvement Reporter: Francois Saint-Jacques WritePlan should look like the following.
{code:c++}
class ARROW_DS_EXPORT WritePlan {
 public:
  /// Execute the WritePlan and return a FileSystemDataset as a result.
  Result Execute();

 protected:
  /// The schema of the Dataset which will be written
  std::shared_ptr schema;
  /// The format into which fragments will be written
  std::shared_ptr format;

  using SourceAndReader = std::pair;
  ///
  std::vector outputs;
};
{code}
* Refactor FileFormat::Write(FileSource destination, RecordBatchReader); not sure if it should take the output schema, or if the RecordBatchReader should already be of the right schema.
* Add a class/function that constructs SourceAndReader from Fragments, Partitioning, and a base path, and remove any Write/Fragment logic from partition.cc.
* Move Write() out of FileSystemDataset into WritePlan. It could take a FileSystemDatasetFactory to recreate the FileSystemDataset. This is a bonus, not a requirement.
* Simplify the writing routine to avoid the PathTree directory structure; it shouldn't be more complex than `for task in write_tasks: task()`. No path construction should happen there.
The effects are:
* Simplified WritePlan execution, abstracted away from path construction; it can write to multiple FileSystems and/or Buffers since it doesn't construct the FileSource.
* By virtue of using RecordBatchReader instead of Fragment, it isn't tied to writing from Fragments; it can take any construct that yields a RecordBatchReader. It also means that WritePlan doesn't have to know about any Scan-related classes.
* Writing can be done with or without partitioning; this logic is given to whoever generates the SourceAndReader list.
* Should be simpler to test.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8381) [C++][Dataset] Dataset writing should require a writer schema
Francois Saint-Jacques created ARROW-8381: - Summary: [C++][Dataset] Dataset writing should require a writer schema Key: ARROW-8381 URL: https://issues.apache.org/jira/browse/ARROW-8381 Project: Apache Arrow Issue Type: Bug Components: C++ - Dataset Reporter: Francois Saint-Jacques # Dataset writing should always take an explicit writer schema instead of the first fragment's schema. # The MakeWritePlanImpl should not try removing columns that are found in the partition, this is left to the caller by passing an explicit schema. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8380) [RUST] StringDictionaryBuilder not publicly exported from arrow::array
Jörn Horstmann created ARROW-8380: - Summary: [RUST] StringDictionaryBuilder not publicly exported from arrow::array Key: ARROW-8380 URL: https://issues.apache.org/jira/browse/ARROW-8380 Project: Apache Arrow Issue Type: Bug Components: Rust Reporter: Jörn Horstmann -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8379) [R] Investigate/fix thread safety issues (esp. Windows)
Neal Richardson created ARROW-8379: -- Summary: [R] Investigate/fix thread safety issues (esp. Windows) Key: ARROW-8379 URL: https://issues.apache.org/jira/browse/ARROW-8379 Project: Apache Arrow Issue Type: New Feature Components: R Reporter: Neal Richardson There have been a number of issues where the R bindings' multithreading has been implicated in unstable behavior (ARROW-7844 for example). In ARROW-8375 I disabled {{use_threads}} in the Windows tests, and it appeared that the mysterious Windows segfaults stopped. We should fix whatever the underlying issues are. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8378) [Python] "empty" dtype metadata leads to wrong Parquet column type
Diego Argueta created ARROW-8378: Summary: [Python] "empty" dtype metadata leads to wrong Parquet column type Key: ARROW-8378 URL: https://issues.apache.org/jira/browse/ARROW-8378 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.16.0 Environment: Python: 3.7.6 Pandas: 0.24.1, 0.25.3, 1.0.3 Pyarrow: 0.16.0 OS: OSX 10.15.3 Reporter: Diego Argueta Run the following code with Pandas 0.24.x-1.0.x, and PyArrow 0.16.0 on Python 3.7: {code:python} import pandas as pd import numpy as np df_1 = pd.DataFrame({'col': [None, None, None]}) df_1.col = df_1.col.astype(np.unicode_) df_1.to_parquet('right.parq', engine='pyarrow') series = pd.Series([None, None, None], dtype=np.unicode_) df_2 = pd.DataFrame({'col': series}) df_2.to_parquet('wrong.parq', engine='pyarrow') {code} Examine the Parquet column type for each file (I use [parquet-tools|https://github.com/wesleypeck/parquet-tools]). {{right.parq}} has the expected UTF-8 string type. {{wrong.parq}} has an {{INT32}}. The following metadata is stored in the Parquet files: {{right.parq}} {code:json} { "column_indexes": [], "columns": [ { "field_name": "col", "metadata": null, "name": "col", "numpy_type": "object", "pandas_type": "unicode" } ], "index_columns": [], "pandas_version": "0.24.1" } {code} {{wrong.parq}} {code:json} { "column_indexes": [], "columns": [ { "field_name": "col", "metadata": null, "name": "col", "numpy_type": "object", "pandas_type": "empty" } ], "index_columns": [], "pandas_version": "0.24.1" } {code} The difference between the two is that the {{pandas_type}} for the incorrect file is "empty" rather than the expected "unicode". PyArrow misinterprets this and defaults to a 32-bit integer column. The incorrect datatype will cause Redshift to reject the file when we try to read it because the column type in the file doesn't match the column type in the database table. 
I originally filed this as a bug in Pandas (see [this ticket|https://github.com/pandas-dev/pandas/issues/25326]) but they punted me over here because the dtype conversion is handled in PyArrow. I'm not sure how you'd handle this here. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8377) [CI][C++][R] Build and run C++ tests on Rtools build
Neal Richardson created ARROW-8377: -- Summary: [CI][C++][R] Build and run C++ tests on Rtools build Key: ARROW-8377 URL: https://issues.apache.org/jira/browse/ARROW-8377 Project: Apache Arrow Issue Type: New Feature Components: C++, Continuous Integration, R Reporter: Neal Richardson Maybe this will better identify our unexplained segfaults -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8376) [R] Add experimental interface to ScanTask/RecordBatch iterators
Neal Richardson created ARROW-8376: -- Summary: [R] Add experimental interface to ScanTask/RecordBatch iterators Key: ARROW-8376 URL: https://issues.apache.org/jira/browse/ARROW-8376 Project: Apache Arrow Issue Type: New Feature Components: R Reporter: Neal Richardson Assignee: Neal Richardson -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8375) [CI][R] Make Windows tests more verbose in case of segfault
Neal Richardson created ARROW-8375: -- Summary: [CI][R] Make Windows tests more verbose in case of segfault Key: ARROW-8375 URL: https://issues.apache.org/jira/browse/ARROW-8375 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration, R Reporter: Neal Richardson Assignee: Neal Richardson -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8374) [R] Table to vector of DictionaryType will error when Arrays don't have the same Dictionary per array
Francois Saint-Jacques created ARROW-8374: - Summary: [R] Table to vector of DictionaryType will error when Arrays don't have the same Dictionary per array Key: ARROW-8374 URL: https://issues.apache.org/jira/browse/ARROW-8374 Project: Apache Arrow Issue Type: Bug Reporter: Francois Saint-Jacques The conversion should unify the dictionaries before converting; otherwise the indices are simply broken. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8373) [GLib] Problems resolving gobject-introspection, arrow in Meson builds
Wes McKinney created ARROW-8373: --- Summary: [GLib] Problems resolving gobject-introspection, arrow in Meson builds Key: ARROW-8373 URL: https://issues.apache.org/jira/browse/ARROW-8373 Project: Apache Arrow Issue Type: Bug Components: GLib Reporter: Wes McKinney Fix For: 0.17.0 See example failure https://github.com/apache/arrow/pull/6872/checks?check_run_id=571082161 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8372) [C++] Add Result to table / record batch APIs
Antoine Pitrou created ARROW-8372: - Summary: [C++] Add Result to table / record batch APIs Key: ARROW-8372 URL: https://issues.apache.org/jira/browse/ARROW-8372 Project: Apache Arrow Issue Type: Sub-task Components: C++ Reporter: Micah Kornfield Assignee: Antoine Pitrou -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8371) [Crossbow] Implement and exercise sanity checks for tasks.yml
Krisztian Szucs created ARROW-8371: -- Summary: [Crossbow] Implement and exercise sanity checks for tasks.yml Key: ARROW-8371 URL: https://issues.apache.org/jira/browse/ARROW-8371 Project: Apache Arrow Issue Type: Task Components: Continuous Integration, Packaging Reporter: Krisztian Szucs Assignee: Krisztian Szucs See conversation at https://github.com/apache/arrow/pull/6868#issuecomment-610721717 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8370) [C++] Add Result to type / schema APIs
Antoine Pitrou created ARROW-8370: - Summary: [C++] Add Result to type / schema APIs Key: ARROW-8370 URL: https://issues.apache.org/jira/browse/ARROW-8370 Project: Apache Arrow Issue Type: Sub-task Components: C++ Reporter: Micah Kornfield Buffers, array builders (anything in the src/arrow root directory). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8369) [CI] Fix crossbow R group
Neal Richardson created ARROW-8369: -- Summary: [CI] Fix crossbow R group Key: ARROW-8369 URL: https://issues.apache.org/jira/browse/ARROW-8369 Project: Apache Arrow Issue Type: Bug Components: Continuous Integration Reporter: Neal Richardson Assignee: Neal Richardson This was broken in ARROW-8356 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8368) [Format] In C interface, clarify resource management for consumers needing only a subset of child fields in ArrowArray
Wes McKinney created ARROW-8368: --- Summary: [Format] In C interface, clarify resource management for consumers needing only a subset of child fields in ArrowArray Key: ARROW-8368 URL: https://issues.apache.org/jira/browse/ARROW-8368 Project: Apache Arrow Issue Type: Improvement Components: Format Reporter: Wes McKinney The current implication of the C Interface is that only moving a single child out of an ArrowArray is allowed. Questions: * Should it be allowed to move multiple children, as long as they are moved at the same time, and the parent is released after? * In the event that children have disjoint internal resources, should there be a clarification around moved children having their resources released independently? See mailing list discussion https://lists.apache.org/thread.html/r92b77e0fa7bed384daa377e2178bc8e8ca46103928598050341e40b1%40%3Cdev.arrow.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8367) [C++] Is FromString(..., pool) worthwhile
Ben Kietzman created ARROW-8367: --- Summary: [C++] Is FromString(..., pool) worthwhile Key: ARROW-8367 URL: https://issues.apache.org/jira/browse/ARROW-8367 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 0.16.0 Reporter: Ben Kietzman Fix For: 1.0.0 From [https://github.com/apache/arrow/pull/6863#discussion_r404913683] There are currently two overloads of {{Buffer::FromString}}: one takes an rvalue reference to string, the other takes a const reference and a MemoryPool. In the former case the string is simply moved into a Buffer subclass, while in the latter the MemoryPool is used to allocate space into which the string's contents are copied, which necessitates bubbling up the potential allocation failure. This seems gratuitous given we don't use {{std::string}} to store large quantities of data, so it should be fine to provide only
{code:java}
static std::unique_ptr FromString(std::string data);
{code}
and rely on {{std::string}}'s copy constructor when the argument is not an rvalue. In the case of a {{std::string}} which may/does contain large data and must be copied, tracking the copied memory with a MemoryPool does not require a great deal of boilerplate:
{code:java}
ARROW_ASSIGN_OR_RAISE(auto buffer, Buffer(large).CopySlice(0, large.size(), pool));
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8366) [Rust] Need to revert recent arrow-flight build change
Andy Grove created ARROW-8366: - Summary: [Rust] Need to revert recent arrow-flight build change Key: ARROW-8366 URL: https://issues.apache.org/jira/browse/ARROW-8366 Project: Apache Arrow Issue Type: Bug Components: Rust Reporter: Andy Grove Assignee: Andy Grove Fix For: 0.17.0 The PR [1] merged for ARROW-7794 causes problems with projects that have a dependency on this crate where the build.rs code becomes an infinite loop looking for a parent directory named "arrow" that doesn't exist. This PR simply reverts that change. I will need to find a better approach to resolving the original issue. [1] https://github.com/apache/arrow/pull/6858 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8365) arrow-cpp: Error when writing files to S3 larger than 5 GB
Juan Galvez created ARROW-8365: -- Summary: arrow-cpp: Error when writing files to S3 larger than 5 GB Key: ARROW-8365 URL: https://issues.apache.org/jira/browse/ARROW-8365 Project: Apache Arrow Issue Type: Bug Components: C++ Affects Versions: 0.16.0 Reporter: Juan Galvez When purely using the arrow-cpp library to write to S3, I get the following error when trying to write a large Arrow table to S3 (resulting in a file size larger than 5 GB): {{../src/arrow/io/interfaces.cc:219: Error ignored when destroying file of type N5arrow2fs12_GLOBAL__N_118ObjectOutputStreamE: IOError: When uploading part for key 'test01.parquet/part-00.parquet' in bucket 'test': AWS Error [code 100]: Unable to parse ExceptionName: EntityTooLarge Message: Your proposed upload exceeds the maximum allowed size with address : 52.219.100.32}} I have diagnosed the problem by looking at and modifying the code in *{{s3fs.cc}}*. The code uses multipart upload, with 5 MB chunks for the first 100 parts. After it has submitted the first 100 parts, it is supposed to increase the size of the chunks to 10 MB (the part upload threshold, or {{part_upload_threshold_}}). The issue is that the threshold is increased inside {{DoWrite}}, and {{DoWrite}} can be called multiple times before the current part is uploaded, which ultimately causes the threshold to keep getting increased indefinitely, so the last part ends up surpassing the 5 GB part-upload limit of AWS/S3. I'm fairly sure this issue, where the last part is much larger than it should be, can happen every time a multipart upload exceeds 100 parts, but the error is only thrown if the last part is larger than 5 GB, so it is only observed with very large uploads.
I can confirm that the bug does not happen if I move this:
{code:c++}
if (part_number_ % 100 == 0) {
  part_upload_threshold_ += kMinimumPartUpload;
}
{code}
into a different method, right before the line that does:
{code:c++}
++part_number_;
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
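The runaway-threshold behavior can be reproduced with a small stdlib-only simulation. The constants and structure are illustrative, loosely modeled on the description of s3fs.cc above, not the actual C++ code: the buggy variant bumps the threshold on every write call once the part counter is a multiple of 100, while the fixed variant bumps it exactly once per finished part.

```python
MIN_PART = 5  # stand-in for kMinimumPartUpload (5 MB in the report)

def max_part_size(n_parts, writes_per_part, fixed):
    """Return the largest part-size threshold seen while 'uploading'
    n_parts parts, with writes_per_part DoWrite calls buffered per part."""
    threshold = MIN_PART
    part_number = 1
    largest = 0
    for _ in range(n_parts):
        for _ in range(writes_per_part):
            # Each DoWrite call checks the threshold. Buggy placement:
            # the increment runs once per *write*, not once per part.
            if not fixed and part_number % 100 == 0:
                threshold += MIN_PART
        largest = max(largest, threshold)
        # Fixed placement: increment right before ++part_number_,
        # so it runs once per finished part.
        if fixed and part_number % 100 == 0:
            threshold += MIN_PART
        part_number += 1
    return largest

# With many DoWrite calls per part, the buggy threshold explodes,
# while the fixed one grows by a single step per 100 parts.
print(max_part_size(200, 50, fixed=False))
print(max_part_size(200, 50, fixed=True))
```

This matches the report: the blow-up only manifests once an upload passes 100 parts, and only errors out when the inflated final part crosses S3's 5 GB per-part limit.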
[jira] [Created] (ARROW-8364) Get Access to the type_to_type_id dictionary
Or created ARROW-8364: - Summary: Get Access to the type_to_type_id dictionary Key: ARROW-8364 URL: https://issues.apache.org/jira/browse/ARROW-8364 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Or Hi,
h3. *The Problem:*
Currently, if I try to serialize an object that can't be serialized by the default serialization context, I get a SerializationCallbackError. So the problem is that I have to attempt to serialize the object in order to know whether it is serializable by the package. That can be a very expensive operation for a simple check if the object contains a large amount of data.
h3. *The Requested Improvement / Feature:*
A function that checks whether the type of the object I'm about to serialize is serializable by the package (meaning it is registered in the type_to_type_id dictionary). -- This message was sent by Atlassian Jira (v8.3.4#803005)
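The requested check amounts to a dictionary membership test. A stdlib-only sketch of the idea (the toy `type_to_type_id` registry and the `is_serializable` name are hypothetical; pyarrow's actual registry lives inside its serialization context):

```python
# A toy registry mapping serializable types to type ids, standing in
# for the type_to_type_id dictionary of a serialization context.
type_to_type_id = {dict: "dict", list: "list", set: "set"}

def is_serializable(obj):
    # O(1) membership test: no need to attempt a full (and possibly
    # very expensive) serialization just to learn that it would fail.
    return type(obj) in type_to_type_id

print(is_serializable({"a": 1}))
print(is_serializable(object()))
```

The cost of the check is independent of the object's size, which is the point of the request: the current approach pays the full serialization cost just to discover failure.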
[jira] [Created] (ARROW-8363) [Archery] Comment bot should report any errors happening during crossbow submit
Krisztian Szucs created ARROW-8363: -- Summary: [Archery] Comment bot should report any errors happening during crossbow submit Key: ARROW-8363 URL: https://issues.apache.org/jira/browse/ARROW-8363 Project: Apache Arrow Issue Type: Task Components: Archery Reporter: Krisztian Szucs We already get feedback in the GitHub comment, but no error message. Example failure: https://github.com/apache/arrow/runs/567644496?check_suite_focus=true#step:5:42 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8362) [Crossbow] Ensure that the locally generated version is used in the docker tasks
Krisztian Szucs created ARROW-8362: -- Summary: [Crossbow] Ensure that the locally generated version is used in the docker tasks Key: ARROW-8362 URL: https://issues.apache.org/jira/browse/ARROW-8362 Project: Apache Arrow Issue Type: Task Components: Packaging Reporter: Krisztian Szucs Assignee: Krisztian Szucs Fix For: 0.17.0 An Arrow fork might not have the version tags, so the SCM-based version generation can't work. Pass the locally detected version to the docker builds. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8361) [C++] Add Result APIs to Buffer methods and functions
Antoine Pitrou created ARROW-8361: - Summary: [C++] Add Result APIs to Buffer methods and functions Key: ARROW-8361 URL: https://issues.apache.org/jira/browse/ARROW-8361 Project: Apache Arrow Issue Type: Sub-task Components: C++ Reporter: Micah Kornfield Assignee: Antoine Pitrou -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8360) Fixes date32 support for date/time functions
Yuan Zhou created ARROW-8360: Summary: Fixes date32 support for date/time functions Key: ARROW-8360 URL: https://issues.apache.org/jira/browse/ARROW-8360 Project: Apache Arrow Issue Type: Bug Components: C++ - Gandiva Reporter: Yuan Zhou Assignee: Yuan Zhou Gandiva date/time functions like extractYear[1] only work with millisecond precision; passing date32 values to these functions will yield wrong results. [1] https://github.com/apache/arrow/blob/6d92694d00aec08081ae1bfe06f0a265e141b1b7/cpp/src/gandiva/precompiled/time.cc#L75-L80 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8359) [C++/Python] Enable aarch64/ppc64le build in conda recipes
Uwe Korn created ARROW-8359: --- Summary: [C++/Python] Enable aarch64/ppc64le build in conda recipes Key: ARROW-8359 URL: https://issues.apache.org/jira/browse/ARROW-8359 Project: Apache Arrow Issue Type: Improvement Components: C++, Packaging, Python Reporter: Uwe Korn Fix For: 0.17.0 These two new arches were added in the conda recipes, we should also build them as nightlies. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8358) [C++] Fix -Wrange-loop-construct warnings in clang-11
Wes McKinney created ARROW-8358: --- Summary: [C++] Fix -Wrange-loop-construct warnings in clang-11 Key: ARROW-8358 URL: https://issues.apache.org/jira/browse/ARROW-8358 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Wes McKinney We might change one of our CI entries to use clang-11 so we get some more bleeding edge compiler warnings, to get out ahead of things -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8357) [Rust] [DataFusion] Dockerfile for CLI is missing format dir
Andy Grove created ARROW-8357: - Summary: [Rust] [DataFusion] Dockerfile for CLI is missing format dir Key: ARROW-8357 URL: https://issues.apache.org/jira/browse/ARROW-8357 Project: Apache Arrow Issue Type: Bug Components: Rust, Rust - DataFusion Reporter: Andy Grove Assignee: Andy Grove Fix For: 0.17.0 {code:java} error: failed to run custom build command for `arrow-flight v1.0.0-SNAPSHOT (/arrow/rust/arrow-flight)`Caused by: process didn't exit successfully: `/arrow/rust/target/release/build/arrow-flight-a0fb14daffea70f5/build-script-build` (exit code: 1) --- stderr Error: Custom { kind: Other, error: "protoc failed: ../../format: warning: directory does not exist.\nCould not make proto path relative: ../../format/Flight.proto: No such file or directory\n" }warning: build failed, waiting for other jobs to finish... error: failed to compile `datafusion v1.0.0-SNAPSHOT (/arrow/rust/datafusion)`, intermediate artifacts can be found at `/arrow/rust/target`Caused by: build failed {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8356) [Developer] Support * wildcards with "crossbow submit" via GitHub actions
Wes McKinney created ARROW-8356: --- Summary: [Developer] Support * wildcards with "crossbow submit" via GitHub actions Key: ARROW-8356 URL: https://issues.apache.org/jira/browse/ARROW-8356 Project: Apache Arrow Issue Type: Improvement Components: Developer Tools Reporter: Wes McKinney While the "group" feature can be useful, sometimes there is a set of builds that does not fit neatly into any particular group. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8355) [Python] Reduce the number of pandas dependent test cases in test_feather
Krisztian Szucs created ARROW-8355: -- Summary: [Python] Reduce the number of pandas dependent test cases in test_feather Key: ARROW-8355 URL: https://issues.apache.org/jira/browse/ARROW-8355 Project: Apache Arrow Issue Type: Task Components: Python Reporter: Krisztian Szucs Fix For: 1.0.0 See comment https://github.com/apache/arrow/pull/6849#discussion_r404160096 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8354) [C++][R] Segfault in test-dataset.r
Francois Saint-Jacques created ARROW-8354: - Summary: [C++][R] Segfault in test-dataset.r Key: ARROW-8354 URL: https://issues.apache.org/jira/browse/ARROW-8354 Project: Apache Arrow Issue Type: Improvement Components: C++ - Dataset, R Reporter: Francois Saint-Jacques See https://github.com/fsaintjacques/arrow/runs/564315427#step:6:2169 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8353) [C++] is_nullable maybe not initialized in parquet writer
Neal Richardson created ARROW-8353: -- Summary: [C++] is_nullable maybe not initialized in parquet writer Key: ARROW-8353 URL: https://issues.apache.org/jira/browse/ARROW-8353 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Neal Richardson From the Rtools build:
{code}
[ 84%] Building CXX object src/parquet/CMakeFiles/parquet_static.dir/column_reader.cc.obj
In file included from D:/a/arrow/arrow/cpp/src/arrow/io/concurrency.h:23:0,
                 from D:/a/arrow/arrow/cpp/src/arrow/io/memory.h:25,
                 from D:/a/arrow/arrow/cpp/src/parquet/platform.h:25,
                 from D:/a/arrow/arrow/cpp/src/parquet/arrow/writer.h:23,
                 from D:/a/arrow/arrow/cpp/src/parquet/arrow/writer.cc:18:
D:/a/arrow/arrow/cpp/src/arrow/result.h: In member function 'virtual arrow::Status parquet::arrow::FileWriterImpl::WriteColumnChunk(const std::shared_ptr&, int64_t, int64_t)':
D:/a/arrow/arrow/cpp/src/arrow/result.h:428:28: warning: 'is_nullable' may be used uninitialized in this function [-Wmaybe-uninitialized]
   auto result_name = (rexpr); \
                            ^
D:/a/arrow/arrow/cpp/src/parquet/arrow/writer.cc:430:10: note: 'is_nullable' was declared here
   bool is_nullable;
        ^
{code}
I'd give it a default value, but I don't know that it's that simple. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8352) [R] Add install_pyarrow()
Neal Richardson created ARROW-8352: -- Summary: [R] Add install_pyarrow() Key: ARROW-8352 URL: https://issues.apache.org/jira/browse/ARROW-8352 Project: Apache Arrow Issue Type: New Feature Reporter: Neal Richardson Assignee: Neal Richardson To facilitate installing for use with reticulate, including handling how to use the nightly packages. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8351) [R][CI] Store the Rtools-built Arrow C++ library as a build artifact
Neal Richardson created ARROW-8351: -- Summary: [R][CI] Store the Rtools-built Arrow C++ library as a build artifact Key: ARROW-8351 URL: https://issues.apache.org/jira/browse/ARROW-8351 Project: Apache Arrow Issue Type: New Feature Reporter: Neal Richardson Assignee: Neal Richardson To help with debugging unexplained segfaults. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8350) [Python] Implement to_numpy on ChunkedArray
Uwe Korn created ARROW-8350: --- Summary: [Python] Implement to_numpy on ChunkedArray Key: ARROW-8350 URL: https://issues.apache.org/jira/browse/ARROW-8350 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Uwe Korn We support {{to_numpy}} on Array instances but not on {{ChunkedArray}} instances. It would be quite useful to have it there as well, e.g. to support returning non-nanosecond datetime instances. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8349) [CI][NIGHTLY:gandiva-jar-osx] Use latest pygit2
Prudhvi Porandla created ARROW-8349: --- Summary: [CI][NIGHTLY:gandiva-jar-osx] Use latest pygit2 Key: ARROW-8349 URL: https://issues.apache.org/jira/browse/ARROW-8349 Project: Apache Arrow Issue Type: Bug Reporter: Prudhvi Porandla Assignee: Prudhvi Porandla Now that Homebrew provides a compatible libgit2 version, we can use the latest pygit2. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8348) [C++] Support optional sentinel values in primitive Array for nulls
Francois Saint-Jacques created ARROW-8348: - Summary: [C++] Support optional sentinel values in primitive Array for nulls Key: ARROW-8348 URL: https://issues.apache.org/jira/browse/ARROW-8348 Project: Apache Arrow Issue Type: Improvement Reporter: Francois Saint-Jacques This is an optional feature where a sentinel value is stored in null cells and is exposed via an accessor method, e.g. `optional Array::HasSentinel() const;`. This would allow zero-copy bi-directional conversion with R. -- This message was sent by Atlassian Jira (v8.3.4#803005)
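The proposal can be illustrated with a stdlib-only Python model (the class and method names here are hypothetical, not Arrow's API): null slots carry a sentinel value in the data buffer alongside the usual validity information, so a consumer such as R, which encodes NA as a sentinel, could read the raw buffer without rewriting it.

```python
# Model of a primitive array that stores a sentinel in null cells.
NA_SENTINEL = -2147483648  # e.g. R's NA_integer_ (INT32_MIN)

class SentinelArray:
    def __init__(self, data, validity):
        # Null cells hold the sentinel, so the raw data buffer can be
        # handed zero-copy to a consumer that recognizes it.
        self.data = [v if ok else NA_SENTINEL
                     for v, ok in zip(data, validity)]
        self.validity = validity

    def has_sentinel(self):
        # Analogous to the proposed optional Array::HasSentinel()
        # accessor: return the sentinel if any cell is null, else None.
        return NA_SENTINEL if not all(self.validity) else None

arr = SentinelArray([1, 0, 3], [True, False, True])
print(arr.data)
print(arr.has_sentinel())
```

The design choice being modeled: the validity bitmap stays authoritative for Arrow consumers, while the sentinel makes the data buffer directly consumable by sentinel-based systems.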
[jira] [Created] (ARROW-8347) [C++] Add Result to Array methods
Antoine Pitrou created ARROW-8347: - Summary: [C++] Add Result to Array methods Key: ARROW-8347 URL: https://issues.apache.org/jira/browse/ARROW-8347 Project: Apache Arrow Issue Type: Sub-task Components: C++ Reporter: Micah Kornfield Buffers, array builders (anything in the src/arrow root directory). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8346) [CI][Ruby] GLib/Ruby macOS build fails on zlib
Neal Richardson created ARROW-8346: -- Summary: [CI][Ruby] GLib/Ruby macOS build fails on zlib Key: ARROW-8346 URL: https://issues.apache.org/jira/browse/ARROW-8346 Project: Apache Arrow Issue Type: Bug Components: Continuous Integration, GLib Reporter: Neal Richardson Fix For: 0.17.0 See https://github.com/apache/arrow/runs/564610412 for example. {code} Using 'PKG_CONFIG_PATH' from environment with value: '/usr/local/lib/pkgconfig' Run-time dependency gobject-2.0 found: YES 2.64.1 Run-time dependency gio-2.0 found: NO (tried framework and cmake) c_glib/arrow-glib/meson.build:210:0: ERROR: Could not generate cargs for gio-2.0: Package zlib was not found in the pkg-config search path. Perhaps you should add the directory containing `zlib.pc' to the PKG_CONFIG_PATH environment variable Package 'zlib', required by 'gio-2.0', not found A full log can be found at /Users/runner/runners/2.168.0/work/arrow/arrow/build/c_glib/meson-logs/meson-log.txt {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8345) [Python] feather.read_table should not require pandas
Joris Van den Bossche created ARROW-8345: Summary: [Python] feather.read_table should not require pandas Key: ARROW-8345 URL: https://issues.apache.org/jira/browse/ARROW-8345 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche Assignee: Joris Van den Bossche Fix For: 0.17.0 We still check the pandas version, while pandas is not actually needed. Will do a quick fix. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8344) [C#] StringArray.Builder.Clear() corrupts subsequent array contents
Adam Szmigin created ARROW-8344: --- Summary: [C#] StringArray.Builder.Clear() corrupts subsequent array contents Key: ARROW-8344 URL: https://issues.apache.org/jira/browse/ARROW-8344 Project: Apache Arrow Issue Type: Bug Components: C# Affects Versions: 0.16.0 Environment: Windows 10 x64 Reporter: Adam Szmigin h1. Summary Using the {{Clear()}} method on a {{StringArray.Builder}} class causes all subsequent built arrays to contain strings consisting solely of whitespace. The below minimal example illustrates: {code:java} namespace ArrowStringArrayBuilderBug { using Apache.Arrow; using Apache.Arrow.Memory; public class Program { private static readonly NativeMemoryAllocator Allocator = new NativeMemoryAllocator(); public static void Main() { var builder = new StringArray.Builder(); AppendBuildPrint(builder, "Hello", "World"); builder.Clear(); AppendBuildPrint(builder, "Foo", "Bar"); } private static void AppendBuildPrint( StringArray.Builder builder, params string[] strings) { foreach (var elem in strings) builder.Append(elem); var arr = builder.Build(Allocator); System.Console.Write("Array contents: ["); for (var i = 0; i < arr.Length; i++) { if (i > 0) System.Console.Write(", "); System.Console.Write($"'{arr.GetString(i)}'"); } System.Console.WriteLine("]"); } } {code} h2. Expected Output {noformat} Array contents: ['Hello', 'World'] Array contents: ['Foo', 'Bar'] {noformat} h2. Actual Output {noformat} Array contents: ['Hello', 'World'] Array contents: [' ', ' '] {noformat} h1. Workaround The bug can be trivially worked around by constructing a new {{StringArray.Builder}} instead of calling {{Clear()}}. The issue ARROW-7040 mentions other issues with string arrays in C#, but I'm not sure if this is related or not. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8343) [GLib] Add GArrowRecordBatchIterator
Kenta Murata created ARROW-8343: --- Summary: [GLib] Add GArrowRecordBatchIterator Key: ARROW-8343 URL: https://issues.apache.org/jira/browse/ARROW-8343 Project: Apache Arrow Issue Type: New Feature Components: GLib Reporter: Kenta Murata Assignee: Kenta Murata -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8342) [Python] dask and kartothek integration tests are failing
Joris Van den Bossche created ARROW-8342: Summary: [Python] dask and kartothek integration tests are failing Key: ARROW-8342 URL: https://issues.apache.org/jira/browse/ARROW-8342 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche Fix For: 0.17.0 The integration tests for dask and kartothek, against both master and the latest released version of each, started failing over the last few days. Dask latest: https://circleci.com/gh/ursa-labs/crossbow/10629?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link Kartothek latest: https://circleci.com/gh/ursa-labs/crossbow/10604?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link I think both are related to the KeyValueMetadata changes (ARROW-8079). The kartothek one is clearly related, as it gives: TypeError: 'pyarrow.lib.KeyValueMetadata' object does not support item assignment I think the dask one is related to the "pandas" key now being present twice, so the "wrong" one is picked up. -- This message was sent by Atlassian Jira (v8.3.4#803005)
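The suspected dask failure mode can be reproduced in miniature: when metadata is modeled as an ordered sequence of key/value pairs, duplicate keys are representable, and a naive dict conversion silently keeps only one copy. A standalone sketch, not the pyarrow API:

```python
# Sketch of the duplicate-key hazard: KeyValueMetadata-style metadata is
# an ordered sequence of (key, value) pairs, so "pandas" can legally
# appear twice. A plain dict() conversion keeps the LAST occurrence,
# while a first-match scan returns the other one, so two consumers can
# disagree on which entry wins. Illustrative only, not the pyarrow API.

pairs = [("pandas", '{"schema": "old"}'),  # stale entry
         ("pandas", '{"schema": "new"}')]  # entry written later

as_dict = dict(pairs)  # silently drops the first "pandas" entry
first = next(v for k, v in pairs if k == "pandas")

assert as_dict["pandas"] == '{"schema": "new"}'
assert first == '{"schema": "old"}'  # a first-match lookup disagrees
```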
[jira] [Created] (ARROW-8341) [Packaging][deb] Fail to build by no disk space
Kouhei Sutou created ARROW-8341: --- Summary: [Packaging][deb] Fail to build by no disk space Key: ARROW-8341 URL: https://issues.apache.org/jira/browse/ARROW-8341 Project: Apache Arrow Issue Type: Improvement Components: Packaging Reporter: Kouhei Sutou Assignee: Kouhei Sutou -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8340) [Documentation] Sphinx documentation does not build with just-released Sphinx 3.0.0
Wes McKinney created ARROW-8340: --- Summary: [Documentation] Sphinx documentation does not build with just-released Sphinx 3.0.0 Key: ARROW-8340 URL: https://issues.apache.org/jira/browse/ARROW-8340 Project: Apache Arrow Issue Type: Bug Components: Documentation, Python Reporter: Wes McKinney Fix For: 0.17.0 I'll add a version pin in a docs PR I'm working on, but this needs to be fixed soon -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8339) [C++] Possibly allow null offsets and/or data buffer for BaseBinaryArray for 0-length arrays
Wes McKinney created ARROW-8339: --- Summary: [C++] Possibly allow null offsets and/or data buffer for BaseBinaryArray for 0-length arrays Key: ARROW-8339 URL: https://issues.apache.org/jira/browse/ARROW-8339 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Related to ARROW-8338. This issue was raised in ARROW-7008, but we maintained the status quo of requiring non-null buffers in both cases -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8338) [Format] Clarify whether 0-length variable offsets buffers are permissible for 0-length arrays in the IPC protocol
Wes McKinney created ARROW-8338: --- Summary: [Format] Clarify whether 0-length variable offsets buffers are permissible for 0-length arrays in the IPC protocol Key: ARROW-8338 URL: https://issues.apache.org/jira/browse/ARROW-8338 Project: Apache Arrow Issue Type: Improvement Components: Documentation, Format Reporter: Wes McKinney This aspect of the columnar format / IPC protocol remains slightly unclear. As written, it would suggest that an offsets buffer of length 1 containing a single value 0 is required. It may be better to allow this to be length zero (corresponding to a 0-size or null buffer in the implementation) -- This message was sent by Atlassian Jira (v8.3.4#803005)
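The convention in question: a variable-size binary array of length N carries N + 1 offsets, so value i spans bytes offsets[i] through offsets[i+1]. Read strictly, even an empty array needs a one-element offsets buffer containing 0. A plain-Python sketch of the two readings, not Arrow internals:

```python
# Sketch of the offsets convention for variable-size binary arrays:
# an array of length N carries N + 1 offsets, and value i spans
# offsets[i]:offsets[i+1]. For N == 0 the spec as written implies a
# single-element buffer [0]; the proposal is to also permit a 0-length
# (or null) offsets buffer, since an empty array has no value extents
# to describe. Plain Python, not Arrow internals.

def build_offsets(values):
    """N + 1 offsets for a list of byte strings."""
    offsets = [0]
    for v in values:
        offsets.append(offsets[-1] + len(v))
    return offsets

assert build_offsets([b"a", b"bc"]) == [0, 1, 3]
# Strict reading: an empty array still carries one offset.
assert build_offsets([]) == [0]
# The relaxed reading would also accept an empty offsets buffer here.
```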
[jira] [Created] (ARROW-8337) [Release] Verify release candidate wheels without using conda
Neal Richardson created ARROW-8337: -- Summary: [Release] Verify release candidate wheels without using conda Key: ARROW-8337 URL: https://issues.apache.org/jira/browse/ARROW-8337 Project: Apache Arrow Issue Type: Improvement Components: Developer Tools Reporter: Neal Richardson See final comments on ARROW-2880 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8336) [Packaging][deb] Use libthrift-dev on Debian 10 and Ubuntu 19.10 or later
Kouhei Sutou created ARROW-8336: --- Summary: [Packaging][deb] Use libthrift-dev on Debian 10 and Ubuntu 19.10 or later Key: ARROW-8336 URL: https://issues.apache.org/jira/browse/ARROW-8336 Project: Apache Arrow Issue Type: Improvement Components: Packaging Reporter: Kouhei Sutou Assignee: Kouhei Sutou -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8335) [Release] Add crossbow jobs to run release verification
Neal Richardson created ARROW-8335: -- Summary: [Release] Add crossbow jobs to run release verification Key: ARROW-8335 URL: https://issues.apache.org/jira/browse/ARROW-8335 Project: Apache Arrow Issue Type: Improvement Components: Developer Tools Reporter: Neal Richardson Assignee: Neal Richardson Fix For: 0.17.0 Workflow: edit the version number and rc number in the template in {{dev/release/github.verify.yml}}, make a PR, and do * {{@github-actions crossbow submit -g verify-rc}} to run everything * {{@github-actions crossbow submit -g verify-rc-wheel|source|binary}} to run those groups * Other groups at {{verify-rc-wheel|source-macos|ubuntu|windows}}, {{verify-rc-source-cpp|csharp|java|etc.}} * Individual workflows at e.g. {{verify-rc-wheel-windows}}, {{verify-rc-source-macos-csharp}}. We could break out the wheel verification by python version (maybe we should), but that requires changes to the verification scripts themselves. Running the main {{verify-rc}} group will put a ton of workflow svg badges on the PR so we can see at a glance what is passing and failing. If things fail when running everything, you can push fixes to the verification script on the branch and retry just the jobs that failed. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8334) Missing DATE32 in Gandiva
Dominik Durner created ARROW-8334: - Summary: Missing DATE32 in Gandiva Key: ARROW-8334 URL: https://issues.apache.org/jira/browse/ARROW-8334 Project: Apache Arrow Issue Type: Bug Components: C++ - Gandiva Reporter: Dominik Durner -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8333) [C++][CI] Always check that benchmarks compile in some C++ CI entry
Wes McKinney created ARROW-8333: --- Summary: [C++][CI] Always check that benchmarks compile in some C++ CI entry Key: ARROW-8333 URL: https://issues.apache.org/jira/browse/ARROW-8333 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Wes McKinney Fix For: 0.17.0 As exposed in ARROW-8331, apparently we do not check. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8332) [C++] Require Thrift compiler to use system libthrift for Parquet build
Kouhei Sutou created ARROW-8332: --- Summary: [C++] Require Thrift compiler to use system libthrift for Parquet build Key: ARROW-8332 URL: https://issues.apache.org/jira/browse/ARROW-8332 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Kouhei Sutou Assignee: Kouhei Sutou -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8331) [C++] arrow-compute-filter-benchmark fails to compile
Wes McKinney created ARROW-8331: --- Summary: [C++] arrow-compute-filter-benchmark fails to compile Key: ARROW-8331 URL: https://issues.apache.org/jira/browse/ARROW-8331 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Wes McKinney Fix For: 0.17.0 Are the benchmarks not being built in CI? {code} ../src/arrow/compute/kernels/filter_benchmark.cc:45:18: error: no matching function for call to 'Filter' ABORT_NOT_OK(Filter(&ctx, Datum(array), Datum(filter), &out)); ^~ ../src/arrow/testing/gtest_util.h:109:18: note: expanded from macro 'ABORT_NOT_OK' auto _res = (expr); \ ^~~~ ../src/arrow/compute/kernels/filter.h:65:8: note: candidate function not viable: requires 5 arguments, but 4 were provided Status Filter(FunctionContext* ctx, const Datum& values, const Datum& filter, ^ ../src/arrow/compute/kernels/filter_benchmark.cc:66:18: error: no matching function for call to 'Filter' ABORT_NOT_OK(Filter(&ctx, Datum(array), Datum(filter), &out)); ^~ ../src/arrow/testing/gtest_util.h:109:18: note: expanded from macro 'ABORT_NOT_OK' auto _res = (expr); \ ^~~~ ../src/arrow/compute/kernels/filter.h:65:8: note: candidate function not viable: requires 5 arguments, but 4 were provided Status Filter(FunctionContext* ctx, const Datum& values, const Datum& filter, ^ ../src/arrow/compute/kernels/filter_benchmark.cc:90:18: error: no matching function for call to 'Filter' ABORT_NOT_OK(Filter(&ctx, Datum(array), Datum(filter), &out)); ^~ ../src/arrow/testing/gtest_util.h:109:18: note: expanded from macro 'ABORT_NOT_OK' auto _res = (expr); \ ^~~~ ../src/arrow/compute/kernels/filter.h:65:8: note: candidate function not viable: requires 5 arguments, but 4 were provided Status Filter(FunctionContext* ctx, const Datum& values, const Datum& filter, ^ 3 errors generated. {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8330) [Documentation] The post release script generates the documentation with a development version
Krisztian Szucs created ARROW-8330: -- Summary: [Documentation] The post release script generates the documentation with a development version Key: ARROW-8330 URL: https://issues.apache.org/jira/browse/ARROW-8330 Project: Apache Arrow Issue Type: Task Components: Documentation Reporter: Krisztian Szucs Assignee: Krisztian Szucs Fix For: 0.17.0 See the current documentation page. Also regenerate the github page. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8329) [Documentation][C++] Undocumented FilterOptions argument in Filter kernel
Krisztian Szucs created ARROW-8329: -- Summary: [Documentation][C++] Undocumented FilterOptions argument in Filter kernel Key: ARROW-8329 URL: https://issues.apache.org/jira/browse/ARROW-8329 Project: Apache Arrow Issue Type: Task Components: C++, Documentation Reporter: Krisztian Szucs Assignee: Krisztian Szucs Fix For: 0.17.0 The documentation build fails, see https://github.com/apache/arrow/runs/558617620#step:6:1186 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8328) [C++] MSVC is not respecting warning-disable flags
Ben Kietzman created ARROW-8328: --- Summary: [C++] MSVC is not respecting warning-disable flags Key: ARROW-8328 URL: https://issues.apache.org/jira/browse/ARROW-8328 Project: Apache Arrow Issue Type: Bug Components: C++ Affects Versions: 0.16.0 Reporter: Ben Kietzman Assignee: Ben Kietzman Fix For: 1.0.0 We provide [warning-disabling flags to MSVC|https://github.com/apache/arrow/blob/72433c6/cpp/cmake_modules/SetupCxxFlags.cmake#L151-L153] including one which should disable all conversion warnings. However this is not completely effectual and Appveyor will still emit conversion warnings (which are then treated as errors), requiring insertion of otherwise unnecessary explicit casts or {{#pragma}}s (for example https://github.com/apache/arrow/pull/6820 ). Perhaps flag ordering is significant? In any case, as we have conversion warnings disabled for other compilers we should ensure they are completely disabled for MSVC as well. -- This message was sent by Atlassian Jira (v8.3.4#803005)