[jira] [Created] (ARROW-8317) [C++] grpc-cpp 1.28.0 from conda-forge causing Appveyor build to fail

2020-04-02 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8317:
---

 Summary: [C++] grpc-cpp 1.28.0 from conda-forge causing Appveyor 
build to fail
 Key: ARROW-8317
 URL: https://issues.apache.org/jira/browse/ARROW-8317
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.17.0


This started occurring in the last few hours, after the grpc-cpp 1.28.0 update 
was merged on conda-forge:

https://ci.appveyor.com/project/wesm/arrow/build/job/8oe0n4epkxegr21x



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8320) [Documentation][Format] Clarify (lack of) alignment requirements in C data interface

2020-04-02 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8320:
---

 Summary: [Documentation][Format] Clarify (lack of) alignment 
requirements in C data interface
 Key: ARROW-8320
 URL: https://issues.apache.org/jira/browse/ARROW-8320
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation, Format
Reporter: Wes McKinney
 Fix For: 0.17.0


The C data interface document should clarify that memory buffers are not 
required to start at aligned addresses.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8331) [C++] arrow-compute-filter-benchmark fails to compile

2020-04-03 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8331:
---

 Summary: [C++] arrow-compute-filter-benchmark fails to compile
 Key: ARROW-8331
 URL: https://issues.apache.org/jira/browse/ARROW-8331
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.17.0


Are the benchmarks not being built in CI?

{code}
../src/arrow/compute/kernels/filter_benchmark.cc:45:18: error: no matching 
function for call to 'Filter'
ABORT_NOT_OK(Filter(&ctx, Datum(array), Datum(filter), &out));
 ^~
../src/arrow/testing/gtest_util.h:109:18: note: expanded from macro 
'ABORT_NOT_OK'
auto _res = (expr); \
 ^~~~
../src/arrow/compute/kernels/filter.h:65:8: note: candidate function not 
viable: requires 5 arguments, but 4 were provided
Status Filter(FunctionContext* ctx, const Datum& values, const Datum& filter,
   ^
../src/arrow/compute/kernels/filter_benchmark.cc:66:18: error: no matching 
function for call to 'Filter'
ABORT_NOT_OK(Filter(&ctx, Datum(array), Datum(filter), &out));
 ^~
../src/arrow/testing/gtest_util.h:109:18: note: expanded from macro 
'ABORT_NOT_OK'
auto _res = (expr); \
 ^~~~
../src/arrow/compute/kernels/filter.h:65:8: note: candidate function not 
viable: requires 5 arguments, but 4 were provided
Status Filter(FunctionContext* ctx, const Datum& values, const Datum& filter,
   ^
../src/arrow/compute/kernels/filter_benchmark.cc:90:18: error: no matching 
function for call to 'Filter'
ABORT_NOT_OK(Filter(&ctx, Datum(array), Datum(filter), &out));
 ^~
../src/arrow/testing/gtest_util.h:109:18: note: expanded from macro 
'ABORT_NOT_OK'
auto _res = (expr); \
 ^~~~
../src/arrow/compute/kernels/filter.h:65:8: note: candidate function not 
viable: requires 5 arguments, but 4 were provided
Status Filter(FunctionContext* ctx, const Datum& values, const Datum& filter,
   ^
3 errors generated.
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8333) [C++][CI] Always check that benchmarks compile in some C++ CI entry

2020-04-03 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8333:
---

 Summary: [C++][CI] Always check that benchmarks compile in some C++ 
CI entry
 Key: ARROW-8333
 URL: https://issues.apache.org/jira/browse/ARROW-8333
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.17.0


As exposed by ARROW-8331, we apparently do not check this in CI.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8338) [Format] Clarify whether 0-length variable offsets buffers are permissible for 0-length arrays in the IPC protocol

2020-04-05 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8338:
---

 Summary: [Format] Clarify whether 0-length variable offsets 
buffers are permissible for 0-length arrays in the IPC protocol
 Key: ARROW-8338
 URL: https://issues.apache.org/jira/browse/ARROW-8338
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation, Format
Reporter: Wes McKinney


This aspect of the columnar format / IPC protocol remains slightly unclear. As 
written, it would suggest that an offsets buffer of length 1 containing a 
single value 0 is required. It may be better to allow this to be length zero 
(corresponding to a 0-size or null buffer in the implementation).
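
As a purely illustrative sketch (plain C++ vectors, not Arrow library code; 
the names are made up), these are the two candidate offsets-buffer layouts for 
a 0-length variable-size binary array:

{code}
// Illustration only: offsets buffer contents for a 0-length utf8/binary array.
#include <cstdint>
#include <vector>

// As the format text currently reads, a single offset value appears required:
std::vector<int32_t> offsets_as_written = {0};  // length-1 offsets buffer

// The proposed clarification would also permit an empty (or null) buffer:
std::vector<int32_t> offsets_proposed = {};     // 0-length offsets buffer
{code}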



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8339) [C++] Possibly allow null offsets and/or data buffer for BaseBinaryArray for 0-length arrays

2020-04-05 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8339:
---

 Summary: [C++] Possibly allow null offsets and/or data buffer for 
BaseBinaryArray for 0-length arrays
 Key: ARROW-8339
 URL: https://issues.apache.org/jira/browse/ARROW-8339
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


Related to ARROW-8338. This issue was raised in ARROW-7008, but we maintained 
the status quo of requiring non-null buffers in both cases.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8340) [Documentation] Sphinx documentation does not build with just-released Sphinx 3.0.0

2020-04-05 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8340:
---

 Summary: [Documentation] Sphinx documentation does not build with 
just-released Sphinx 3.0.0
 Key: ARROW-8340
 URL: https://issues.apache.org/jira/browse/ARROW-8340
 Project: Apache Arrow
  Issue Type: Bug
  Components: Documentation, Python
Reporter: Wes McKinney
 Fix For: 0.17.0


I'll add a version pin in a docs PR I'm working on, but this needs to be fixed 
soon



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8356) [Developer] Support * wildcards with "crossbow submit" via GitHub actions

2020-04-06 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8356:
---

 Summary: [Developer] Support * wildcards with "crossbow submit" 
via GitHub actions
 Key: ARROW-8356
 URL: https://issues.apache.org/jira/browse/ARROW-8356
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Developer Tools
Reporter: Wes McKinney


While the "group" feature can be useful, sometimes there is a group of builds 
that do not fit neatly into a particular group



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8358) [C++] Fix -Wrange-loop-construct warnings in clang-11

2020-04-06 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8358:
---

 Summary: [C++] Fix -Wrange-loop-construct warnings in clang-11 
 Key: ARROW-8358
 URL: https://issues.apache.org/jira/browse/ARROW-8358
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney


We might change one of our CI entries to use clang-11 so that we get more 
bleeding-edge compiler warnings and stay ahead of upcoming compiler changes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8368) [Format] In C interface, clarify resource management for consumers needing only a subset of child fields in ArrowArray

2020-04-07 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8368:
---

 Summary: [Format] In C interface, clarify resource management for 
consumers needing only a subset of child fields in ArrowArray
 Key: ARROW-8368
 URL: https://issues.apache.org/jira/browse/ARROW-8368
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Format
Reporter: Wes McKinney


The current implication of the C Interface is that only moving a single child 
out of an ArrowArray is allowed. 

Questions:

* Should it be allowed to move multiple children, as long as they are moved at 
the same time, and the parent is released after?
* In the event that children have disjoint internal resources, should there be 
a clarification around moved children having their resources released 
independently?

See mailing list discussion 
https://lists.apache.org/thread.html/r92b77e0fa7bed384daa377e2178bc8e8ca46103928598050341e40b1%40%3Cdev.arrow.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8373) [GLib] Problems resolving gobject-introspection, arrow in Meson builds

2020-04-08 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8373:
---

 Summary: [GLib] Problems resolving gobject-introspection, arrow in 
Meson builds
 Key: ARROW-8373
 URL: https://issues.apache.org/jira/browse/ARROW-8373
 Project: Apache Arrow
  Issue Type: Bug
  Components: GLib
Reporter: Wes McKinney
 Fix For: 0.17.0


See example failure 
https://github.com/apache/arrow/pull/6872/checks?check_run_id=571082161



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8384) [C++][Python] arrow/filesystem/hdfs.h and Python wrapper does not have an option for setting a path to a Kerberos ticket

2020-04-09 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8384:
---

 Summary: [C++][Python] arrow/filesystem/hdfs.h and Python wrapper 
does not have an option for setting a path to a Kerberos ticket
 Key: ARROW-8384
 URL: https://issues.apache.org/jira/browse/ARROW-8384
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney


This feature seems to have been dropped

Is there a plan for migrating users to the new filesystem API? We have two 
different code paths now



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8391) [C++] Implement row range read API for IPC file (and Feather)

2020-04-09 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8391:
---

 Summary: [C++] Implement row range read API for IPC file (and 
Feather)
 Key: ARROW-8391
 URL: https://issues.apache.org/jira/browse/ARROW-8391
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


The objective would be to be able to read a range of rows from the middle of a 
file. It's not as easy as it might sound, since all of the record batch 
metadata must be examined to determine the start and end points of the row 
range.
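
As a rough sketch of the planning step this implies (hypothetical names and 
plain C++, not the actual IPC reader internals): walk the per-batch row counts 
recorded in the metadata and compute which batches overlap the requested range 
and how to slice them.

{code}
#include <algorithm>
#include <cstdint>
#include <vector>

struct BatchSlice {
  int batch_index;   // which record batch in the file
  int64_t offset;    // row offset within that batch
  int64_t length;    // number of rows to take from that batch
};

// batch_row_counts: per-batch row counts obtained from the file metadata.
std::vector<BatchSlice> PlanRowRange(const std::vector<int64_t>& batch_row_counts,
                                     int64_t start_row, int64_t num_rows) {
  std::vector<BatchSlice> plan;
  const int64_t end_row = start_row + num_rows;
  int64_t batch_start = 0;
  for (int i = 0; i < static_cast<int>(batch_row_counts.size()); ++i) {
    const int64_t batch_end = batch_start + batch_row_counts[i];
    if (batch_end > start_row && batch_start < end_row) {
      const int64_t offset = std::max<int64_t>(0, start_row - batch_start);
      const int64_t length = std::min(batch_end, end_row) - (batch_start + offset);
      plan.push_back({i, offset, length});
    }
    batch_start = batch_end;
  }
  return plan;
}
{code}

Each planned slice would then be read and concatenated (or exposed lazily) to 
produce the requested rows.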



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8408) [Python] Add memory_map= toggle to pyarrow.feather.read_feather

2020-04-12 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8408:
---

 Summary: [Python] Add memory_map= toggle to 
pyarrow.feather.read_feather
 Key: ARROW-8408
 URL: https://issues.apache.org/jira/browse/ARROW-8408
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 0.17.0


I missed this in my prior patch



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8409) [R] Add arrow::cpu_count, arrow::set_cpu_count wrapper functions a la Python

2020-04-12 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8409:
---

 Summary: [R] Add arrow::cpu_count, arrow::set_cpu_count wrapper 
functions a la Python
 Key: ARROW-8409
 URL: https://issues.apache.org/jira/browse/ARROW-8409
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Wes McKinney
 Fix For: 0.17.0


While some people will configure these with {{$OMP_NUM_THREADS}}, it is useful 
to be able to configure the global thread pool dynamically



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8410) [C++] CMake fails on aarch64 systems that do not support -march=armv8-a+crc+crypto

2020-04-12 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8410:
---

 Summary: [C++] CMake fails on aarch64 systems that do not support 
-march=armv8-a+crc+crypto
 Key: ARROW-8410
 URL: https://issues.apache.org/jira/browse/ARROW-8410
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.17.0


I was trying to build the project on a rockpro64 system to look into something 
else and ran into this

{code}
-- Arrow build warning level: PRODUCTION
CMake Error at cmake_modules/SetupCxxFlags.cmake:332 (message):
  Unsupported arch flag: -march=armv8-a+crc+crypto.
Call Stack (most recent call first):
  CMakeLists.txt:398 (include)

-- Configuring incomplete, errors occurred!
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8411) [C++] gcc6 warning re: arrow::internal::ArgSort

2020-04-12 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8411:
---

 Summary: [C++] gcc6 warning re: arrow::internal::ArgSort
 Key: ARROW-8411
 URL: https://issues.apache.org/jira/browse/ARROW-8411
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney


Observed on a Debian platform with gcc6 base

{code}
In file included from /usr/include/c++/6/algorithm:62:0,
 from ../src/arrow/util/bit_util.h:55,
 from ../src/arrow/type_traits.h:26,
 from ../src/arrow/array.h:32,
 from ../src/arrow/compute/kernel.h:24,
 from ../src/arrow/dataset/filter.h:27,
 from ../src/arrow/dataset/partition.h:27,
 from 
/home/rock/code/arrow/cpp/src/arrow/dataset/partition.cc:18:
/usr/include/c++/6/bits/stl_algo.h: In function 'void 
std::__insertion_sort(_RandomAccessIterator, _RandomAccessIterator, _Compare) 
[with _RandomAccessIterator = __gnu_cxx::__normal_iterator >; _Compare = 
__gnu_cxx::__ops::_Iter_comp_iter&, Cmp&&) [with T = std::__cxx11::basic_string; Cmp = 
std::less >]:: >]':
/usr/include/c++/6/bits/stl_algo.h:1837:5: note: parameter passing for argument 
of type '__gnu_cxx::__normal_iterator >' will change in GCC 7.1
 __insertion_sort(_RandomAccessIterator __first,
 ^~~~
/usr/include/c++/6/bits/stl_algo.h:1837:5: note: parameter passing for argument 
of type '__gnu_cxx::__normal_iterator >' will change in GCC 7.1
In file included from /usr/include/c++/6/bits/stl_algo.h:61:0,
 from /usr/include/c++/6/algorithm:62,
 from ../src/arrow/util/bit_util.h:55,
 from ../src/arrow/type_traits.h:26,
 from ../src/arrow/array.h:32,
 from ../src/arrow/compute/kernel.h:24,
 from ../src/arrow/dataset/filter.h:27,
 from ../src/arrow/dataset/partition.h:27,
 from 
/home/rock/code/arrow/cpp/src/arrow/dataset/partition.cc:18:
/usr/include/c++/6/bits/stl_heap.h: In function 'void 
std::__adjust_heap(_RandomAccessIterator, _Distance, _Distance, _Tp, _Compare) 
[with _RandomAccessIterator = __gnu_cxx::__normal_iterator >; _Distance = int; _Tp = long long int; _Compare = 
__gnu_cxx::__ops::_Iter_comp_iter&, Cmp&&) [with T = std::__cxx11::basic_string; Cmp = 
std::less >]:: >]':
/usr/include/c++/6/bits/stl_heap.h:209:5: note: parameter passing for argument 
of type '__gnu_cxx::__normal_iterator >' will change in GCC 7.1
 __adjust_heap(_RandomAccessIterator __first, _Distance __holeIndex,
 ^
In file included from /usr/include/c++/6/algorithm:62:0,
 from ../src/arrow/util/bit_util.h:55,
 from ../src/arrow/type_traits.h:26,
 from ../src/arrow/array.h:32,
 from ../src/arrow/compute/kernel.h:24,
 from ../src/arrow/dataset/filter.h:27,
 from ../src/arrow/dataset/partition.h:27,
 from 
/home/rock/code/arrow/cpp/src/arrow/dataset/partition.cc:18:
/usr/include/c++/6/bits/stl_algo.h: In function 'void 
std::__introsort_loop(_RandomAccessIterator, _RandomAccessIterator, _Size, 
_Compare) [with _RandomAccessIterator = __gnu_cxx::__normal_iterator >; _Size = int; _Compare = 
__gnu_cxx::__ops::_Iter_comp_iter&, Cmp&&) [with T = std::__cxx11::basic_string; Cmp = 
std::less >]:: >]':
/usr/include/c++/6/bits/stl_algo.h:1937:5: note: parameter passing for argument 
of type '__gnu_cxx::__normal_iterator >' will change in GCC 7.1
 __introsort_loop(_RandomAccessIterator __first,
 ^~~~
/usr/include/c++/6/bits/stl_algo.h:1937:5: note: parameter passing for argument 
of type '__gnu_cxx::__normal_iterator >' will change in GCC 7.1
/usr/include/c++/6/bits/stl_algo.h:1951:4: note: parameter passing for argument 
of type '__gnu_cxx::__normal_iterator >' will change in GCC 7.1
std::__introsort_loop(__cut, __last, __depth_limit, __comp);
^~~
/usr/include/c++/6/bits/stl_algo.h: In function 'std::vector 
arrow::internal::ArgSort(const std::vector&, Cmp&&) [with T = 
std::__cxx11::basic_string; Cmp = 
std::less >]':
/usr/include/c++/6/bits/stl_algo.h:1882:4: note: parameter passing for argument 
of type '__gnu_cxx::__normal_iterator >' will change in GCC 7.1
std::__insertion_sort(__first, __first + int(_S_threshold), __comp);
^~~
/usr/include/c++/6/bits/stl_algo.h:1887:2: note: parameter passing for argument 
of type '__gnu_cxx::__normal_iterator >' will change in GCC 7.1
  std::__insertion_sort(__first, __last, __comp);
  ^~~
/usr/include/c++/6/bits/stl_algo.h:1965:4: note: parameter passing for argument 
of type '__gnu_cxx::__normal_iterator >' will change in GCC 7.1
std::__introsort_loop(__first, __last,
^~~
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8419) [C++] Default display for multi-choice define_option_string is misleading

2020-04-13 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8419:
---

 Summary: [C++] Default display for multi-choice 
define_option_string is misleading
 Key: ARROW-8419
 URL: https://issues.apache.org/jira/browse/ARROW-8419
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney


While working on ARROW-8410, I noticed:

{code}
--   ARROW_SIMD_LEVEL=AVX2 [default=NONE|SSE4_2|AVX2|AVX512]
--   SIMD compiler optimization level
--   ARROW_ARMV8_ARCH=armv8-a+crc+crypto [default=armv8-a|armv8-a+crc+crypto]
--   Arm64 arch and extensions
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8420) [C++] CMake fails to configure on armv7l platform (e.g. Raspberry Pi 3)

2020-04-13 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8420:
---

 Summary: [C++] CMake fails to configure on armv7l platform (e.g. 
Raspberry Pi 3)
 Key: ARROW-8420
 URL: https://issues.apache.org/jira/browse/ARROW-8420
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.17.0


Related to ARROW-8410, but I will probably resolve the ARMv7 issues in a 
separate PR.

{code}
$ cmake .. -DARROW_BUILD_TESTS=ON -DARROW_ORC=ON -DARROW_PARQUET=ON 
-DARROW_DEPENDENCY_SOURCE=BUNDLED -GNinja
-- Building using CMake version: 3.13.4
-- The C compiler identification is GNU 8.3.0
-- The CXX compiler identification is GNU 8.3.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Arrow version: 1.0.0 (full: '1.0.0-SNAPSHOT')
-- Arrow SO version: 100 (full: 100.0.0)
-- Found PkgConfig: /usr/bin/pkg-config (found version "0.29") 
-- clang-tidy not found
-- clang-format not found
-- Could NOT find ClangTools (missing: CLANG_FORMAT_BIN CLANG_TIDY_BIN) 
-- infer not found
-- Found Python3: /usr/bin/python3.7 (found version "3.7.3") found components:  
Interpreter 
-- Found cpplint executable at /home/pi/code/arrow/cpp/build-support/cpplint.py
-- Performing Test CXX_SUPPORTS_SSE4_2
-- Performing Test CXX_SUPPORTS_SSE4_2 - Failed
-- Performing Test CXX_SUPPORTS_AVX2
-- Performing Test CXX_SUPPORTS_AVX2 - Failed
-- Performing Test CXX_SUPPORTS_AVX512
-- Performing Test CXX_SUPPORTS_AVX512 - Failed
-- Arrow build warning level: PRODUCTION
CMake Error at cmake_modules/SetupCxxFlags.cmake:318 (message):
  SSE4.2 required but compiler doesn't support it.
Call Stack (most recent call first):
  CMakeLists.txt:399 (include)


-- Configuring incomplete, errors occurred!
See also "/home/pi/code/arrow/cpp/build/CMakeFiles/CMakeOutput.log".
See also "/home/pi/code/arrow/cpp/build/CMakeFiles/CMakeError.log".
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8470) [Python][R] Expose incremental write API for Feather files

2020-04-15 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8470:
---

 Summary: [Python][R] Expose incremental write API for Feather files
 Key: ARROW-8470
 URL: https://issues.apache.org/jira/browse/ARROW-8470
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python, R
Reporter: Wes McKinney


This is already available for writing IPC files, so this would mostly be an 
interface to that with the addition of logic to handle conversions from Python 
or R data frames and splitting the inputs based on the configured Feather 
chunksize



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8476) [C++] Create "libarrow_thrift" containing all code requiring the Thrift libraries

2020-04-15 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8476:
---

 Summary: [C++] Create "libarrow_thrift" containing all code 
requiring the Thrift libraries
 Key: ARROW-8476
 URL: https://issues.apache.org/jira/browse/ARROW-8476
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


The purpose of this is to avoid ever having to statically link libthrift.a 
into more than one shared library. Currently we are statically linking it into 
libparquet.so, but there are some other efforts (e.g. libarrow_hiveserver2, 
which I'd eventually like to become production-worthy) where Thrift symbols 
are required. By factoring the serialization code out into a helper library we 
avoid the linking conundrum.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8500) [C++] Use selection vectors in Filter implementation for record batches, tables

2020-04-17 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8500:
---

 Summary: [C++] Use selection vectors in Filter implementation for 
record batches, tables
 Key: ARROW-8500
 URL: https://issues.apache.org/jira/browse/ARROW-8500
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


The current implementation of {{Filter}} on RecordBatch, Table does redundant 
analysis of the filter array. It would be more efficient in most cases (i.e. 
whenever there are multiple columns) to convert the boolean array into a 
selection vector and then use {{Take}}.
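
A minimal sketch of the conversion step, assuming the builder-based C++ APIs 
(this is not the actual Filter implementation): build an index array from the 
boolean filter once, then call {{Take}} with it on each column.

{code}
#include <memory>
#include <arrow/api.h>

// Convert a boolean filter array into a selection vector of row indices.
// Nulls in the filter are treated as "drop" to keep the sketch simple.
arrow::Status FilterToSelectionVector(const arrow::BooleanArray& filter,
                                      std::shared_ptr<arrow::Array>* out) {
  arrow::Int64Builder indices;
  ARROW_RETURN_NOT_OK(indices.Reserve(filter.length()));
  for (int64_t i = 0; i < filter.length(); ++i) {
    if (filter.IsValid(i) && filter.Value(i)) {
      indices.UnsafeAppend(i);
    }
  }
  // The resulting indices array would then be passed to Take once per column.
  return indices.Finish(out);
}
{code}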



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8510) [Release][C++] Dataset unit tests are reported as failed in arrow-verify-release-candidate.bat

2020-04-18 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8510:
---

 Summary: [Release][C++] Dataset unit tests are reported as failed 
in arrow-verify-release-candidate.bat
 Key: ARROW-8510
 URL: https://issues.apache.org/jira/browse/ARROW-8510
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Developer Tools
Reporter: Wes McKinney
 Fix For: 1.0.0


The command used to execute the tests is {{ctest -VV  || exit /B 1}}. Here are 
the errors I see

https://gist.github.com/wesm/2526a540554f5ab7511d144133676ebc

The unit tests are not built for some reason despite {{ARROW_BUILD_TESTS=ON}} 
and {{ARROW_DATASET=ON}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8511) [Developer][Release] Windows release verification script does not halt if C++ compilation fails

2020-04-18 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8511:
---

 Summary: [Developer][Release] Windows release verification script 
does not halt if C++ compilation fails 
 Key: ARROW-8511
 URL: https://issues.apache.org/jira/browse/ARROW-8511
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


This made finding the issue in ARROW-8510 more difficult



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8512) [C++] Delete unused compute expr prototype code

2020-04-18 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8512:
---

 Summary: [C++] Delete unused compute expr prototype code
 Key: ARROW-8512
 URL: https://issues.apache.org/jira/browse/ARROW-8512
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


Most of the code added in 
https://github.com/apache/arrow/commit/08ca13f83f3d6dbd818c4280d619dae306aa9de5 
can be deleted. I may leave some of the "shape" types in case we can make use 
of those. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8513) [Python] Expose Take with Table input in Python

2020-04-18 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8513:
---

 Summary: [Python] Expose Take with Table input in Python
 Key: ARROW-8513
 URL: https://issues.apache.org/jira/browse/ARROW-8513
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Python
Reporter: Wes McKinney
 Fix For: 1.0.0


This is implemented in C++ but not exposed in the bindings



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8514) [Developer] Windows wheel verification script does not check Python 3.5

2020-04-18 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8514:
---

 Summary: [Developer] Windows wheel verification script does not 
check Python 3.5
 Key: ARROW-8514
 URL: https://issues.apache.org/jira/browse/ARROW-8514
 Project: Apache Arrow
  Issue Type: Bug
  Components: Developer Tools
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 1.0.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8517) [Developer][Release] Update Crossbow RC verification setup for changes since 0.16.0

2020-04-19 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8517:
---

 Summary: [Developer][Release] Update Crossbow RC verification 
setup for changes since 0.16.0
 Key: ARROW-8517
 URL: https://issues.apache.org/jira/browse/ARROW-8517
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Developer Tools
Reporter: Wes McKinney
Assignee: Neal Richardson
 Fix For: 0.17.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8518) [Python] Create tools to enable optional components (like Gandiva, Flight) to be built and deployed as separate Python packages

2020-04-19 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8518:
---

 Summary: [Python] Create tools to enable optional components (like 
Gandiva, Flight) to be built and deployed as separate Python packages
 Key: ARROW-8518
 URL: https://issues.apache.org/jira/browse/ARROW-8518
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging, Python
Reporter: Wes McKinney
 Fix For: 1.0.0


Our current monolithic approach to Python packaging isn't likely to be 
sustainable long-term.

At a high level, I would propose a structure like this:

{code}
pip install pyarrow  # core package containing libarrow, libarrow_python, and 
any other common bundled C++ library dependencies

pip install pyarrow-flight  # installs pyarrow, pyarrow_flight

pip install pyarrow-gandiva # installs pyarrow, pyarrow_gandiva
{code}

We can maintain the semantic appearance of a single {{pyarrow}} package by 
having thin API modules that would look like

{code}
CONTENTS OF pyarrow/flight.py

from pyarrow_flight import *
{code}

Obviously, this is more difficult to build and package:

* CMake and setup.py files must be refactored a bit so that we can reuse code 
between the parent and child packages
* Separate conda and wheel packages must be produced. With conda this seems 
more straightforward but since the child wheels depend on the parent core 
wheel, the build process seems more complicated

In any case, I don't think these challenges are insurmountable. This will have 
several benefits:

* Smaller installation footprint for simple use cases (though note we are STILL 
duplicating shared libraries in the wheels, which is quite bad)
* Less developer anxiety about expanding the scope of what Python code is 
shipped from apache/arrow. If in 5 years we are shipping 5 different Python 
wheels with each Apache Arrow release, that sounds completely fine to me. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8520) [Developer] Use .asf.yaml to direct GitHub notifications to e-mail lists and JIRA

2020-04-19 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8520:
---

 Summary: [Developer] Use .asf.yaml to direct GitHub notifications 
to e-mail lists and JIRA
 Key: ARROW-8520
 URL: https://issues.apache.org/jira/browse/ARROW-8520
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Developer Tools
Reporter: Wes McKinney
 Fix For: 1.0.0


Context: INFRA-20149



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8521) [Developer] Group Sub-task, Task, Test, and Wish issue types as "Improvement" in Changelog

2020-04-19 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8521:
---

 Summary: [Developer] Group Sub-task, Task, Test, and Wish issue 
types as "Improvement" in Changelog
 Key: ARROW-8521
 URL: https://issues.apache.org/jira/browse/ARROW-8521
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Developer Tools
Reporter: Wes McKinney


In my opinion this makes the changelog more readable. The "Wish" type in 
particular might cause people to not see improvements



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8522) [Developer] Add environment variable option to toggle whether ephemeral NodeJS is installed in release verification script

2020-04-19 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8522:
---

 Summary: [Developer] Add environment variable option to toggle 
whether ephemeral NodeJS is installed in release verification script
 Key: ARROW-8522
 URL: https://issues.apache.org/jira/browse/ARROW-8522
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Developer Tools
Reporter: Wes McKinney
 Fix For: 1.0.0


Currently the script fails if the user does not have a new enough NodeJS and 
associated utilities (npm/npx) on their system. 

The code to install NodeJS exists already but it was disabled in 
https://github.com/apache/arrow/commit/e570db9c45ca97f77c5633e5525c02f55dbb6c4b#diff-8cc7fa3ae5de30b356c17d7a4b59fe09
 because it caused problems when run in GitHub Actions



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8531) [C++] Deprecate ARROW_USE_SIMD CMake option

2020-04-20 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8531:
---

 Summary: [C++] Deprecate ARROW_USE_SIMD CMake option
 Key: ARROW-8531
 URL: https://issues.apache.org/jira/browse/ARROW-8531
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


This is superseded by the {{ARROW_SIMD_LEVEL}} option



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8578) [C++][Flight] Test executable failures due to "SO_REUSEPORT unavailable on compiling system"

2020-04-23 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8578:
---

 Summary: [C++][Flight] Test executable failures due to 
"SO_REUSEPORT unavailable on compiling system"
 Key: ARROW-8578
 URL: https://issues.apache.org/jira/browse/ARROW-8578
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, FlightRPC
Reporter: Wes McKinney
 Fix For: 1.0.0


Tried compiling and running this today  (with grpc 1.28.1)

{code}
$ release/arrow-flight-benchmark 
Using standalone server: false
Server running with pid 22385
Testing method: DoGet
Server host: localhost
Server port: 31337
E0423 21:54:15.174285695   22385 socket_utils_common_posix.cc:222] check for 
SO_REUSEPORT: {"created":"@1587696855.174280083","description":"SO_REUSEPORT 
unavailable on compiling 
system","file":"../src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":190}
Server host: localhost
{code}

my Linux kernel

{code}
$ uname -a
Linux 4.15.0-1079-oem #89-Ubuntu SMP Fri Mar 27 05:22:11 UTC 2020 x86_64 x86_64 
x86_64 GNU/Linux
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8599) [C++][Parquet] Optional parallel processing when writing Parquet files

2020-04-26 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8599:
---

 Summary: [C++][Parquet] Optional parallel processing when writing 
Parquet files
 Key: ARROW-8599
 URL: https://issues.apache.org/jira/browse/ARROW-8599
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


If we permit encoded columns in row groups to be buffered in memory rather than 
immediately written out to the {{OutputStream}}, then we can use multiple 
threads for the encoding / compression of columns. Combined with a separate 
thread to take the encoded columns and write them out to disk, this should 
yield substantially improved file write times.

This could be enabled through an option since it would increase memory use when 
writing. The memory use can be somewhat constrained by limiting the size of 
row groups.
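
A conceptual sketch of the buffering approach, using std:: threading and 
caller-supplied callbacks rather than the real Parquet writer internals (all 
names here are assumptions):

{code}
#include <functional>
#include <future>
#include <string>
#include <vector>

// encode: produces the encoded + compressed bytes for one column chunk.
// write: appends bytes to the OutputStream; called from a single thread,
//        in column order, so the resulting file layout is unchanged.
void WriteRowGroupInParallel(int num_columns,
                             const std::function<std::string(int)>& encode,
                             const std::function<void(const std::string&)>& write) {
  std::vector<std::future<std::string>> encoded;
  encoded.reserve(num_columns);
  for (int i = 0; i < num_columns; ++i) {
    // Encoding/compression of each column proceeds concurrently because the
    // results are buffered in memory instead of streamed out immediately.
    encoded.push_back(std::async(std::launch::async, encode, i));
  }
  for (auto& fut : encoded) {
    write(fut.get());
  }
}
{code}

The memory-use trade-off mentioned above shows up here as the buffered 
futures; bounding the row group size bounds how much encoded data is held 
before it is flushed.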



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8619) [C++] Use distinct Type::type values for interval types

2020-04-28 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8619:
---

 Summary: [C++] Use distinct Type::type values for interval types
 Key: ARROW-8619
 URL: https://issues.apache.org/jira/browse/ARROW-8619
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


This is a breaking API change, but {{MonthIntervalType}} and 
{{DayTimeIntervalType}} are different data types (and have different value 
sizes, which is not true of timestamps) and thus should be distinguished in 
the same way that DATE32 / DATE64 and TIME32 / TIME64 are.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8623) [C++][Gandiva] Reduce use of Boost, remove Boost headers from header files

2020-04-29 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8623:
---

 Summary: [C++][Gandiva] Reduce use of Boost, remove Boost headers 
from header files
 Key: ARROW-8623
 URL: https://issues.apache.org/jira/browse/ARROW-8623
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, C++ - Gandiva
Reporter: Wes McKinney
 Fix For: 1.0.0


Boost is currently a transitive dependency of many of Gandiva's public header 
files. I suggest the following:

* Do not include Boost transitively in any installed header file
* Reduce usages of Boost altogether

On the latter point, most usages of Boost can be trimmed by having a 
{{hash_combine}} function inside the Arrow codebase. See results of grepping 
the codebase

https://gist.github.com/wesm/190006d91628e6bf7c04deb596a52cff

It seems that Boost cannot be easily eliminated altogether at the present 
moment because of a use of Boost.Multiprecision ({{int256_t}}). At some point 
someone may want to implement sufficient 256-bit integer functions so that we 
don't have to depend on Boost.Multiprecision



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8626) [C++] Implement "round robin" scheduler interface to fixed-size ThreadPool

2020-04-29 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8626:
---

 Summary: [C++] Implement "round robin" scheduler interface to 
fixed-size ThreadPool 
 Key: ARROW-8626
 URL: https://issues.apache.org/jira/browse/ARROW-8626
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


Currently, when submitting tasks to a thread pool, they are all commingled in a 
common queue. When a new task submitter shows up, they must wait in the back of 
the line behind all other queued tasks.

A simple alternative to this would be round-robin scheduling, where each new 
consumer is assigned a unique integer id, and the scheduler / thread pool 
internally maintains the tasks associated with the consumer in separate queues.
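
A minimal sketch of that queueing policy (plain C++ with std:: containers, not 
the proposed ThreadPool API; the names are illustrative only, and a real 
implementation would also need a mutex around these structures):

{code}
#include <deque>
#include <functional>
#include <unordered_map>
#include <vector>

class RoundRobinQueue {
 public:
  using Task = std::function<void()>;

  // Register a new consumer and return its unique id.
  int NewConsumer() {
    int id = next_id_++;
    queues_[id];  // create an empty per-consumer queue
    order_.push_back(id);
    return id;
  }

  void Submit(int consumer_id, Task task) {
    queues_[consumer_id].push_back(std::move(task));
  }

  // Pop the next task, cycling through consumers so that a consumer that
  // arrives late does not wait behind every previously queued task.
  bool NextTask(Task* out) {
    for (size_t attempts = 0; attempts < order_.size(); ++attempts) {
      int id = order_[cursor_];
      cursor_ = (cursor_ + 1) % order_.size();
      std::deque<Task>& queue = queues_[id];
      if (!queue.empty()) {
        *out = std::move(queue.front());
        queue.pop_front();
        return true;
      }
    }
    return false;  // all queues are empty
  }

 private:
  int next_id_ = 0;
  size_t cursor_ = 0;
  std::vector<int> order_;
  std::unordered_map<int, std::deque<Task>> queues_;
};
{code}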



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8633) [C++] Add ValidateAscii function

2020-04-29 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8633:
---

 Summary: [C++] Add ValidateAscii function
 Key: ARROW-8633
 URL: https://issues.apache.org/jira/browse/ARROW-8633
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


In some cases, we want to be able to check whether it's safe to use functions 
that assume ASCII (like {{std::tolower}} or {{std::string::substr}}). This was 
implemented in a PR for ARROW-6131 that was not merged.
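
A minimal sketch of what such a helper could look like (this is not the 
implementation from the unmerged ARROW-6131 PR): valid ASCII means every byte 
has its high bit clear.

{code}
#include <cstdint>

inline bool ValidateAscii(const uint8_t* data, int64_t length) {
  uint8_t accumulated = 0;
  for (int64_t i = 0; i < length; ++i) {
    accumulated |= data[i];  // any byte >= 0x80 sets the high bit
  }
  return (accumulated & 0x80) == 0;
}
{code}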



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8635) [R] test-filesystem.R takes ~40 seconds to run?

2020-04-29 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8635:
---

 Summary: [R] test-filesystem.R takes ~40 seconds to run?
 Key: ARROW-8635
 URL: https://issues.apache.org/jira/browse/ARROW-8635
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: Wes McKinney
 Fix For: 1.0.0


{code}
✔ |  22   | Expressions
✔ | 107   | Feather [0.2 s]
✔ |   7   | Field
✔ |  40   | File system [38.1 s]
✔ |   6   | install_arrow()
✔ |  26   | JsonTableReader [0.1 s]
✔ |  24   | MessageReader
✔ |  12   | Message
✔ |  31   | Parquet file reading/writing [0.2 s]
⠏ |   0   | To/from Pythonvirtualenv: arrow-test
{code}

Is this expected? I assume it's related to S3 but that seems like a long time. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8660) [C++][Gandiva] Reduce dependence on Boost

2020-04-30 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8660:
---

 Summary: [C++][Gandiva] Reduce dependence on Boost
 Key: ARROW-8660
 URL: https://issues.apache.org/jira/browse/ARROW-8660
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, C++ - Gandiva
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 1.0.0


Remove Boost usages aside from Boost.Multiprecision



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8661) [C++][Gandiva] Reduce number of files and headers

2020-04-30 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8661:
---

 Summary: [C++][Gandiva] Reduce number of files and headers
 Key: ARROW-8661
 URL: https://issues.apache.org/jira/browse/ARROW-8661
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, C++ - Gandiva
Reporter: Wes McKinney
 Fix For: 1.0.0


I feel that the Gandiva subpackage is more Java-like in its code organization 
than the rest of the Arrow codebase, and it might be easier to navigate and 
develop with closely related code condensed into some larger headers and 
compilation units.

Additionally, it's not necessary to have a header file for each component of 
the function registry -- the registration functions can be declared in 
function_registry.h or function_registry_internal.h



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8667) [C++] Add multi-consumer Scheduler API to sit one layer above ThreadPool

2020-05-01 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8667:
---

 Summary: [C++] Add multi-consumer Scheduler API to sit one layer 
above ThreadPool
 Key: ARROW-8667
 URL: https://issues.apache.org/jira/browse/ARROW-8667
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


I believe we should define an abstraction to allow for custom resource 
allocation strategies (round robin, even time, etc.) to be devised for 
situations where there are different thread pool consumers that are working 
independently of each other.

Consider the classic nested parallelism scenario:

* Task A in thread 1 may issue N subtasks that run in parallel
* Task B in thread 2 may issue K subtasks

With our current ThreadPool abstraction, it is easy to conceive scenarios where 
either Task A or Task B trample each other. 

One approach to remedy this problem is to have an API like so:

{code}
// Inform the scheduler that you want to submit tasks that are "your tasks"
int consumer_id = scheduler->NewConsumer();

for (...) {
  Future fut = scheduler->Submit(consumer_id, DoWork, ...);
}

scheduler->FinishConsumer(consumer_id);
{code}

The idea is that the scheduler would maintain separate task queues for each 
consumer and e.g. track consumer-specific metrics of interest to determine how 
tasks are allocated.

The scheduler could have different logic to control tasks being assigned to 
worker threads:

* Round-robin
* Even-time allocation (run fewer tasks for consumers with "slow" tasks and 
more tasks from consumers with "fast" tasks -- though there are some nuances 
here like avoiding starving a consumer if they've been doing a lot of "slow" 
tasks and then a "fast" consumer shows up)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8670) [Format] Create reference implementations of IPC RecordBatch body compression from ARROW-300

2020-05-02 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8670:
---

 Summary: [Format] Create reference implementations of IPC 
RecordBatch body compression from ARROW-300 
 Key: ARROW-8670
 URL: https://issues.apache.org/jira/browse/ARROW-8670
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Format
Reporter: Wes McKinney
 Fix For: 1.0.0


Tracking JIRA for implementing ARROW-300 in the different language 
implementations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8671) [C++] Use IPC body compression metadata approved in ARROW-300

2020-05-02 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8671:
---

 Summary: [C++] Use IPC body compression metadata approved in 
ARROW-300 
 Key: ARROW-8671
 URL: https://issues.apache.org/jira/browse/ARROW-8671
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


This will adapt the existing code to use the new metadata, while maintaining 
backward compatibility code to recognize the "experimental" metadata written in 
0.17.0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8672) [Java] Implement RecordBatch IPC buffer compression from ARROW-300

2020-05-02 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8672:
---

 Summary: [Java] Implement RecordBatch IPC buffer compression from 
ARROW-300
 Key: ARROW-8672
 URL: https://issues.apache.org/jira/browse/ARROW-8672
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Java
Reporter: Wes McKinney
 Fix For: 1.0.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8673) [Go] Implement IPC RecordBatch body compression from ARROW-300

2020-05-02 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8673:
---

 Summary: [Go] Implement IPC RecordBatch body compression from 
ARROW-300
 Key: ARROW-8673
 URL: https://issues.apache.org/jira/browse/ARROW-8673
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Go
Reporter: Wes McKinney






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8675) [C#] Create implementation of ARROW-300 / IPC record batch body buffer compression

2020-05-02 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8675:
---

 Summary: [C#] Create implementation of ARROW-300 / IPC record 
batch body buffer compression
 Key: ARROW-8675
 URL: https://issues.apache.org/jira/browse/ARROW-8675
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: C#
Reporter: Wes McKinney






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8674) [JS] Implement IPC RecordBatch body buffer compression from ARROW-300

2020-05-02 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8674:
---

 Summary: [JS] Implement IPC RecordBatch body buffer compression 
from ARROW-300
 Key: ARROW-8674
 URL: https://issues.apache.org/jira/browse/ARROW-8674
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: JavaScript
Reporter: Wes McKinney


This may not be a hard requirement for JS because it would require pulling in 
implementations of LZ4 and ZSTD, which not all users may want.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8676) [Rust] Create implementation of IPC RecordBatch body buffer compression from ARROW-300

2020-05-02 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8676:
---

 Summary: [Rust] Create implementation of IPC RecordBatch body 
buffer compression from ARROW-300
 Key: ARROW-8676
 URL: https://issues.apache.org/jira/browse/ARROW-8676
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Reporter: Wes McKinney






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8683) [C++] Add option for user-defined version identifier for Arrow libraries

2020-05-03 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8683:
---

 Summary: [C++] Add option for user-defined version identifier for 
Arrow libraries
 Key: ARROW-8683
 URL: https://issues.apache.org/jira/browse/ARROW-8683
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


It would be useful to be able to "watermark" shared libraries with e.g. the git 
hash to determine the exact origin of a particular build of the project. The 
version identifier could default to the current git revision but be overridden 
in the CMake invocation



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8684) [Packaging][Python] "SystemError: Bad call flags in _PyMethodDef_RawFastCallDict" in Python 3.7.7 on macOS when using pyarrow wheel

2020-05-03 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8684:
---

 Summary: [Packaging][Python] "SystemError: Bad call flags in 
_PyMethodDef_RawFastCallDict" in Python 3.7.7 on macOS when using pyarrow wheel
 Key: ARROW-8684
 URL: https://issues.apache.org/jira/browse/ARROW-8684
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Wes McKinney
 Fix For: 1.0.0


[~npr] reported this on the 0.17.0 RC0 vote thread but I have confirmed it 
independently. It was also reported at

https://github.com/apache/arrow/issues/7082

Here are steps to reproduce on macOS:

{code}
conda create -yn py-3.7-defaults python=3.7 -c defaults
conda activate py-3.7-defaults
pip install pyarrow
{code}

Now open the Python interpreter, run {{import pyarrow}}, then exit the 
interpreter ({{python -c "import pyarrow"}} didn't trigger it for me):

{code}
$ python
Python 3.7.7 (default, Mar 26 2020, 10:32:53) 
[Clang 4.0.1 (tags/RELEASE_401/final)] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow
>>> 
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "pyarrow/types.pxi", line 2638, in 
pyarrow.lib._unregister_py_extension_types
SystemError: Bad call flags in _PyMethodDef_RawFastCallDict. METH_OLDARGS is no 
longer supported!
Segmentation fault: 11
{code}

It fails with Python 3.7.6 when using {{-c conda-forge}} also, so it is not 
particular to defaults.

Frustratingly, the problem doesn't exist in Python 3.7.4 but occurs for me with 
3.7.5, 3.7.6, and 3.7.7. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8700) [C++] static libgflags.a fails to link properly in gcc 4.x

2020-05-04 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8700:
---

 Summary: [C++] static libgflags.a fails to link properly in gcc 4.x
 Key: ARROW-8700
 URL: https://issues.apache.org/jira/browse/ARROW-8700
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney


I am seeing this with gcc 4.8 on Ubuntu 18.04

{code}
$ ninja
[55/179] Linking CXX executable release/arrow-json-integration-test
FAILED: release/arrow-json-integration-test 
: && /usr/bin/ccache /usr/bin/g++-4.8  -O3 -DNDEBUG  -Wall -Wno-attributes 
-msse4.2  -O3 -DNDEBUG  -rdynamic 
src/arrow/ipc/CMakeFiles/arrow-json-integration-test.dir/json_integration_test.cc.o
  -o release/arrow-json-integration-test  
-Wl,-rpath,/home/wesm/code/arrow/cpp/build-4.8/release 
release/libarrow_testing.so.18.0.0 release/libarrow.so.18.0.0 -ldl 
release//libgtest_main.so release//libgtest.so release//libgmock.so 
boost_ep-prefix/src/boost_ep/stage/lib/libboost_filesystem.a 
boost_ep-prefix/src/boost_ep/stage/lib/libboost_system.a -ldl 
../bundled/gflags_ep-prefix/src/gflags_ep/lib/libgflags.a 
jemalloc_ep-prefix/src/jemalloc_ep/dist//lib/libjemalloc_pic.a -pthread -lrt 
-lpthread && :
src/arrow/ipc/CMakeFiles/arrow-json-integration-test.dir/json_integration_test.cc.o:
 In function `_GLOBAL__sub_I__ZN3fLS11FLAGS_arrowE':
json_integration_test.cc:(.text.startup+0x1cc): undefined reference to 
`google::FlagRegisterer::FlagRegisterer(char const*, char const*, 
char const*, std::string*, std::string*)'
json_integration_test.cc:(.text.startup+0x275): undefined reference to 
`google::FlagRegisterer::FlagRegisterer(char const*, char const*, 
char const*, std::string*, std::string*)'
json_integration_test.cc:(.text.startup+0x317): undefined reference to 
`google::FlagRegisterer::FlagRegisterer(char const*, char const*, 
char const*, std::string*, std::string*)'
collect2: error: ld returned 1 exit status
[88/179] Building CXX object 
src/arrow/ipc/CMakeFiles/arrow-ipc-read-write-test.dir/read_write_test.cc.o
ninja: build stopped: subcommand failed.
{code}

CMake invocation

{code}
$ cmake .. -GNinja -DARROW_GANDIVA=ON -DARROW_CSV=ON -DARROW_BUILD_TESTS=ON 
-DARROW_BUILD_BENCHMARKS=ON
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8706) [C++][Parquet] Tracking JIRA for PARQUET-1857 (unencrypted INT16_MAX Parquet row group limit)

2020-05-05 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8706:
---

 Summary: [C++][Parquet] Tracking JIRA for PARQUET-1857 
(unencrypted INT16_MAX Parquet row group limit)
 Key: ARROW-8706
 URL: https://issues.apache.org/jira/browse/ARROW-8706
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0, 0.17.1


JIRA to make sure this patch gets included in a patch release



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8712) [R] Expose strptime timestamp parsing in read_csv conversion options

2020-05-05 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8712:
---

 Summary: [R] Expose strptime timestamp parsing in read_csv 
conversion options
 Key: ARROW-8712
 URL: https://issues.apache.org/jira/browse/ARROW-8712
 Project: Apache Arrow
  Issue Type: New Feature
  Components: R
Reporter: Wes McKinney
 Fix For: 1.0.0


Follow up to ARROW-8111



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8711) [Python] Expose strptime timestamp parsing in read_csv conversion options

2020-05-05 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8711:
---

 Summary: [Python] Expose strptime timestamp parsing in read_csv 
conversion options
 Key: ARROW-8711
 URL: https://issues.apache.org/jira/browse/ARROW-8711
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Python
Reporter: Wes McKinney
 Fix For: 1.0.0


Follow up to ARROW-8111



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8727) [C++] Do not require struct-initialization of StringConverter to parse strings to other types

2020-05-06 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8727:
---

 Summary: [C++] Do not require struct-initialization of 
StringConverter to parse strings to other types
 Key: ARROW-8727
 URL: https://issues.apache.org/jira/browse/ARROW-8727
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


I ran into this issue while working on refactoring kernels. 
{{StringConverter}} must be initialized to be able to support parametric 
types like Timestamp, but this produces an awkwardness and possibly a 
performance penalty (I haven't measured yet) in inlined functions. 

In any case, I'm refactoring everything to be static and non-stateful.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8746) [Python][Documentation] Add column limit recommendations Parquet page

2020-05-09 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8746:
---

 Summary: [Python][Documentation] Add column limit recommendations 
Parquet page
 Key: ARROW-8746
 URL: https://issues.apache.org/jira/browse/ARROW-8746
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation, Python
Reporter: Wes McKinney


Users would be well advised not to write files with large numbers (> 1000) of 
columns.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8750) [Python] pyarrow.feather.write_feather does not default to lz4 compression if it's available

2020-05-09 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8750:
---

 Summary: [Python] pyarrow.feather.write_feather does not default 
to lz4 compression if it's available
 Key: ARROW-8750
 URL: https://issues.apache.org/jira/browse/ARROW-8750
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Wes McKinney
 Fix For: 1.0.0, 0.17.1


This was my intention but I seem to have implemented it incorrectly



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8762) [C++][Gandiva] Replace Gandiva's BitmapAnd with common implementation

2020-05-11 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8762:
---

 Summary: [C++][Gandiva] Replace Gandiva's BitmapAnd with common 
implementation
 Key: ARROW-8762
 URL: https://issues.apache.org/jira/browse/ARROW-8762
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, C++ - Gandiva
Reporter: Wes McKinney
 Fix For: 1.0.0


Now that the arrow/util/bit_util.h implementation has been optimized, we should 
just use that one



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8769) [C++] Add convenience methods to access fields by name in StructScalar

2020-05-11 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8769:
---

 Summary: [C++] Add convenience methods to access fields by name in 
StructScalar
 Key: ARROW-8769
 URL: https://issues.apache.org/jira/browse/ARROW-8769
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


This would improve the usability of this type.
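
A sketch of what such an accessor could look like, written here as a free 
function and assuming {{StructScalar}} exposes its child scalars as {{value}} 
(the final API and names may differ):

{code}
#include <memory>
#include <string>
#include <arrow/result.h>
#include <arrow/scalar.h>
#include <arrow/status.h>
#include <arrow/type.h>

arrow::Result<std::shared_ptr<arrow::Scalar>> FieldByName(
    const arrow::StructScalar& scalar, const std::string& name) {
  const auto& struct_type = static_cast<const arrow::StructType&>(*scalar.type);
  const int index = struct_type.GetFieldIndex(name);
  if (index == -1) {
    return arrow::Status::KeyError("no field named '", name, "' in StructScalar");
  }
  return scalar.value[index];
}
{code}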



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8792) [C++] Improved declarative compute function / kernel development framework, normalize calling conventions

2020-05-13 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8792:
---

 Summary: [C++] Improved declarative compute function / kernel 
development framework, normalize calling conventions
 Key: ARROW-8792
 URL: https://issues.apache.org/jira/browse/ARROW-8792
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 1.0.0


I'm working on a significant revamp of the way that kernels are implemented in 
the project as discussed on the mailing list. PR to follow within the next week 
or sooner

A brief list of features:

* Kernel selection that takes into account the shape of inputs (whether Scalar 
or Array, so you can provide an implementation just for Arrays and a separate 
one just for Scalars if you want)
* More customizable / less monolithic type-to-kernel dispatch
* Browsable function registry (see all available kernels and their input type 
signatures)
* Central code path for type-checking and argument validation
* Central code path for kernel execution on ChunkedArray inputs

There are a lot of JIRAs in the backlog that will follow from this work, so I 
will attach those to this issue for visibility, but this issue will cover the 
initial refactoring work to port the existing code to the new framework 
without altering existing features.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8793) [C++] BitUtil::SetBitsTo probably doesn't need to be inline

2020-05-13 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8793:
---

 Summary: [C++] BitUtil::SetBitsTo probably doesn't need to be 
inline
 Key: ARROW-8793
 URL: https://issues.apache.org/jira/browse/ARROW-8793
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


Inlining this function probably does not yield meaningful performance benefits



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8800) [C++] Split arrow::ChunkedArray into arrow/chunked_array.h

2020-05-14 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8800:
---

 Summary: [C++] Split arrow::ChunkedArray into arrow/chunked_array.h
 Key: ARROW-8800
 URL: https://issues.apache.org/jira/browse/ARROW-8800
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


There are plenty of scenarios where ChunkedArray is used separately from Table, so 
it would probably make sense to split up the headers, implementation, and unit 
tests.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8823) [C++] Compute aggregate compression ratio when producing compressed IPC body messages

2020-05-16 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8823:
---

 Summary: [C++] Compute aggregate compression ratio when producing 
compressed IPC body messages
 Key: ARROW-8823
 URL: https://issues.apache.org/jira/browse/ARROW-8823
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


It would be beneficial to know the exact bytes-on-wire savings once the message 
has been produced. Since this computation is relatively trivial, it would not add 
overhead to the IPC write hot path.
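
For context, a sketch of how the ratio can be measured externally today with pyarrow 
(assuming this version exposes {{IpcWriteOptions}} with a {{compression}} argument); 
the point of this issue is for the writer to report the aggregate ratio itself:

{code}
import pyarrow as pa

batch = pa.record_batch([pa.array([1] * 100_000)], names=["x"])

def serialized_size(options):
    # Write one batch to an in-memory stream and return the byte count.
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, batch.schema, options=options) as writer:
        writer.write_batch(batch)
    return sink.getvalue().size

raw = serialized_size(pa.ipc.IpcWriteOptions())
compressed = serialized_size(pa.ipc.IpcWriteOptions(compression="zstd"))
print("aggregate compression ratio:", raw / compressed)
{code}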



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8863) [C++] Array constructors must set ArrayData::null_count to 0 when there is no validity bitmap

2020-05-19 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8863:
---

 Summary: [C++] Array constructors must set ArrayData::null_count 
to 0 when there is no validity bitmap
 Key: ARROW-8863
 URL: https://issues.apache.org/jira/browse/ARROW-8863
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


Many type-specific array constructors incorrectly set the null count to 
unknown. It would be better to set it to 0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8866) [C++] Split Type::UNION into Type::SPARSE_UNION and Type::DENSE_UNION

2020-05-19 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8866:
---

 Summary: [C++] Split Type::UNION into Type::SPARSE_UNION and 
Type::DENSE_UNION
 Key: ARROW-8866
 URL: https://issues.apache.org/jira/browse/ARROW-8866
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


Similar to the recent {{Type::INTERVAL}} split, having these two array types, which 
have different memory layouts, under the same {{Type::type}} value makes function 
dispatch somewhat more complicated. This is less critical than the INTERVAL case, 
so it may not be urgent, but it seems like a good pre-1.0 change.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8876) [C++] Implement casts from date types to Timestamp

2020-05-20 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8876:
---

 Summary: [C++] Implement casts from date types to Timestamp
 Key: ARROW-8876
 URL: https://issues.apache.org/jira/browse/ARROW-8876
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


Discovered the absence of this while refactoring cast.cc. Since we can cast 
Timestamp -> date, we should be able to cast the other way
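
A small pyarrow illustration of the asymmetry (the second cast is the desired 
behavior and may raise until this is implemented):

{code}
import pyarrow as pa

ts = pa.array([0], type=pa.timestamp("s"))
print(ts.cast(pa.date32()))            # timestamp -> date: already supported

days = pa.array([1], type=pa.date32())
print(days.cast(pa.timestamp("s")))    # date -> timestamp: requested here
{code}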



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8891) [C++] Split non-cast compute kernels into a separate shared library

2020-05-22 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8891:
---

 Summary: [C++] Split non-cast compute kernels into a separate 
shared library
 Key: ARROW-8891
 URL: https://issues.apache.org/jira/browse/ARROW-8891
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


Since we are going to implement a lot more precompiled kernels, I am not sure 
it makes sense to require all of them to be compiled unconditionally just to 
get access to {{compute::Cast}}, which is needed in many different contexts.

After ARROW-8792 is merged, I would suggest creating a plugin hook for adding a 
bundle of kernels from a shared library outside of libarrow.so, and then moving all 
of the object code other than Cast to something like libarrow_compute.so. Then we 
could change the CMake flags so that the Cast kernels are always compiled (?) and 
make building the additional kernels package a separate opt-in.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8892) [C++][CI] CI builds for MSVC do not build benchmarks

2020-05-22 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8892:
---

 Summary: [C++][CI] CI builds for MSVC do not build benchmarks
 Key: ARROW-8892
 URL: https://issues.apache.org/jira/browse/ARROW-8892
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


We must ensure that our benchmarks always build on Windows

I'm fixing these errors for example in ARROW-8792

{code}
C:/Users/wesmc/code/arrow/cpp/src/parquet/encoding_benchmark.cc(249): error 
C2220: warning treated as error - no 'object' file generated
C:/Users/wesmc/code/arrow/cpp/src/parquet/encoding_benchmark.cc(256): note: see 
reference to function template instantiation 'void 
parquet::BM_PlainEncodingSpaced(benchmark::State &)' 
being compiled
C:/Users/wesmc/code/arrow/cpp/src/parquet/encoding_benchmark.cc(249): warning 
C4244: 'argument': conversion from 'const int64_t' to 'int', possible loss of 
data
C:/Users/wesmc/code/arrow/cpp/src/parquet/encoding_benchmark.cc(292): warning 
C4244: 'argument': conversion from 'const int64_t' to 'int', possible loss of 
data
C:/Users/wesmc/code/arrow/cpp/src/parquet/encoding_benchmark.cc(306): note: see 
reference to function template instantiation 'void 
parquet::BM_PlainDecodingSpaced(benchmark::State &)' 
being compiled
C:/Users/wesmc/code/arrow/cpp/src/parquet/encoding_benchmark.cc(299): warning 
C4244: 'argument': conversion from 'int64_t' to 'int', possible loss of data
C:/Users/wesmc/code/arrow/cpp/src/parquet/encoding_benchmark.cc(300): warning 
C4244: 'argument': conversion from 'const int64_t' to 'int', possible loss of 
data
[11/67] Linking CXX executable release\arrow-ipc-read-write-benchmark.exe
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8893) [R] Fix cpplint issues introduced by ARROW-8885

2020-05-22 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8893:
---

 Summary: [R] Fix cpplint issues introduced by ARROW-8885
 Key: ARROW-8893
 URL: https://issues.apache.org/jira/browse/ARROW-8893
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: Wes McKinney
 Fix For: 1.0.0


{code}
(arrow-3.7) 12:34 ~/code/arrow/r $ ./lint.sh 
/home/wesm/code/arrow/r/src/arrow_types.h:20:  Include the directory when 
naming .h files  [build/include_subdir] [4]
/home/wesm/code/arrow/r/src/arrow_types.h:66:  Add #include <utility> for 
forward  [build/include_what_you_use] [4]
/home/wesm/code/arrow/r/src/arrow_types.h:83:  Add #include <vector> for 
vector<>  [build/include_what_you_use] [4]
/home/wesm/code/arrow/r/src/arrow_types.h:95:  Add #include <limits> for 
numeric_limits<>  [build/include_what_you_use] [4]
/home/wesm/code/arrow/r/src/arrow_types.h:110:  Add #include <memory> for 
shared_ptr<>  [build/include_what_you_use] [4]

/home/wesm/code/arrow/r/src/arrow_exports.h:22:  Include the directory when 
naming .h files  [build/include_subdir] [4]
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8894) [C++] C++ array kernels framework and execution buildout (umbrella issue)

2020-05-22 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8894:
---

 Summary: [C++] C++ array kernels framework and execution buildout 
(umbrella issue)
 Key: ARROW-8894
 URL: https://issues.apache.org/jira/browse/ARROW-8894
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney


In the wake of ARROW-8792, this issue is to serve as an umbrella issue for 
follow up work and associated "buildout" which includes things like:

* Implementation of many new function types and adding new kernel cases to 
existing functions
* Adding implicit casting functionality to function execution
* Creation of "bound" physical arrays expressions
* Pipeline execution (executing multiple kernels while eliminating temporary 
allocation)
* Parallel execution of scalar and aggregate kernels (including parallel 
execution of pipelined kernels)

There are quite a few existing JIRAs in the project that I'll attach to this issue, 
and I'll open plenty more as things occur to me, to help organize the work.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8895) [C++] Add C++ unit tests for filter function on temporal type inputs, including timestamps

2020-05-22 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8895:
---

 Summary: [C++] Add C++ unit tests for filter function on temporal 
type inputs, including timestamps
 Key: ARROW-8895
 URL: https://issues.apache.org/jira/browse/ARROW-8895
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


These are used in R but not tested in C++, so I only found out that I had 
missed adding the kernels to the Filter VectorFunction when running the R test 
suite
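
For reference, the kind of case the missing C++ tests should cover, shown here 
through pyarrow:

{code}
import pyarrow as pa
import pyarrow.compute as pc

# Filter a timestamp array with a boolean mask.
ts = pa.array([1, 2, 3], type=pa.timestamp("s"))
mask = pa.array([True, False, True])
print(pc.filter(ts, mask))
{code}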



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8896) [C++] Reimplement dictionary unpacking in Cast kernels using Take

2020-05-22 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8896:
---

 Summary: [C++] Reimplement dictionary unpacking in Cast kernels 
using Take
 Key: ARROW-8896
 URL: https://issues.apache.org/jira/browse/ARROW-8896
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


As suggested by [~apitrou] this should yield less code to maintain
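
The equivalence this relies on, sketched in pyarrow: unpacking a dictionary array is 
just a take of the dictionary values by the indices.

{code}
import pyarrow as pa
import pyarrow.compute as pc

darr = pa.array(["a", "b", "a", None]).dictionary_encode()

# Casting away the dictionary encoding ...
via_cast = darr.cast(pa.string())
# ... is equivalent to taking the dictionary values by the indices.
via_take = pc.take(darr.dictionary, darr.indices)

assert via_cast.equals(via_take)
{code}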



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8897) [C++] Determine strategy for propagating failures in initializing built-in function registry in arrow/compute

2020-05-22 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8897:
---

 Summary: [C++] Determine strategy for propagating failures in 
initializing built-in function registry in arrow/compute
 Key: ARROW-8897
 URL: https://issues.apache.org/jira/browse/ARROW-8897
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


As discussed on https://github.com/apache/arrow/pull/7240, we are using 
{{DCHECK_OK}} to check statuses when initializing the built-in registry. 

We could propagate failures by changing {{arrow::compute::GetFunctionRegistry}} 
to return Result, but there may be other ways



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8898) [C++] Determine desirable maximum length for ExecBatch in pipelined and parallel execution of kernels

2020-05-22 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8898:
---

 Summary: [C++] Determine desirable maximum length for ExecBatch in 
pipelined and parallel execution of kernels
 Key: ARROW-8898
 URL: https://issues.apache.org/jira/browse/ARROW-8898
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


Maximum lengths like 16K or 64K seem to be popular, but we should write our own 
benchmarks so that we can justify the choice of default chunksize



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8901) [C++] Reduce number of take kernels

2020-05-22 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8901:
---

 Summary: [C++] Reduce number of take kernels
 Key: ARROW-8901
 URL: https://issues.apache.org/jira/browse/ARROW-8901
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


After ARROW-8792 we can observe that we are generating 312 take kernels

{code}
In [1]: import pyarrow.compute as pc
  

In [2]: reg = pc.function_registry()
  

In [3]: reg.get_function('take')
  
Out[3]: 
arrow.compute.Function
kind: vector
num_kernels: 312
{code}

You can see them all here: 
https://gist.github.com/wesm/c3085bf40fa2ee5e555204f8c65b4ad5

It's probably going to be sufficient to only support int16, int32, and int64 
index types for almost all types and insert implicit casts (once we implement 
implicit-cast-insertion into the execution code) for other index types. If we 
determine that there is some performance hot path where we need to specialize 
for other index types, then we can always do that.

Additionally, we should be able to collapse the date/time kernels since we're 
just moving memory.
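
A quick illustration of the index-type point above: every integer index type is 
accepted today, but most of them could be served by an implicit cast to one of a few 
canonical index types instead of a dedicated kernel each.

{code}
import pyarrow as pa
import pyarrow.compute as pc

values = pa.array(["a", "b", "c"])
for idx_type in (pa.int8(), pa.uint32(), pa.int64()):
    # Each index type currently dispatches to its own take kernel.
    indices = pa.array([2, 0], type=idx_type)
    print(idx_type, pc.take(values, indices))
{code}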



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8903) [C++] Implement optimized "unsafe take" for use with selection vectors for kernel execution

2020-05-22 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8903:
---

 Summary: [C++] Implement optimized "unsafe take" for use with 
selection vectors for kernel execution
 Key: ARROW-8903
 URL: https://issues.apache.org/jira/browse/ARROW-8903
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


Selection vectors constructed from filters do not need to be subjected to 
boundschecking and the other safety checks that a usual invocation of {{take}} 
performs. So, based on the type width of a selection vector (uint16?), we should 
implement highly streamlined take implementations that also take into account that 
selection vectors are monotonic by construction.
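
The closest existing knob is disabling boundschecking on {{take}} when the indices 
are known to be valid (assuming this pyarrow exposes the {{boundscheck}} option); 
the "unsafe take" proposed here would go further by also exploiting monotonicity:

{code}
import pyarrow as pa
import pyarrow.compute as pc

values = pa.array([10, 20, 30, 40])
# A selection vector as produced by a filter: valid, monotonic indices.
selection = pa.array([0, 2, 3], type=pa.uint16())
print(pc.take(values, selection, boundscheck=False))
{code}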



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8904) [Python] Fix usages of deprecated C++ APIs related to child/field

2020-05-22 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8904:
---

 Summary: [Python] Fix usages of deprecated C++ APIs related to 
child/field
 Key: ARROW-8904
 URL: https://issues.apache.org/jira/browse/ARROW-8904
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney
 Fix For: 1.0.0


{code}
-- Running cmake --build for pyarrow
cmake --build . --config debug -- -j16
[19/20] Building CXX object CMakeFiles/lib.dir/lib.cpp.o
lib.cpp:20265:85: warning: 'num_children' is deprecated: Use num_fields() 
[-Wdeprecated-declarations]
  __pyx_t_1 = __pyx_f_7pyarrow_3lib__normalize_index(__pyx_v_i, 
__pyx_v_self->type->num_children()); if (unlikely(__pyx_t_1 == 
((Py_ssize_t)-1L))) __PYX_ERR(1, 119, __pyx_L1_error)

^
/home/wesm/local/include/arrow/type.h:263:3: note: 'num_children' has been 
explicitly marked deprecated here
  ARROW_DEPRECATED("Use num_fields()")
  ^
/home/wesm/local/include/arrow/util/macros.h:104:48: note: expanded from macro 
'ARROW_DEPRECATED'
#  define ARROW_DEPRECATED(...) __attribute__((deprecated(__VA_ARGS__)))
   ^
lib.cpp:20276:76: warning: 'child' is deprecated: Use field(i) 
[-Wdeprecated-declarations]
  __pyx_t_2 = 
__pyx_f_7pyarrow_3lib_pyarrow_wrap_field(__pyx_v_self->type->child(__pyx_v_index));
 if (unlikely(!__pyx_t_2)) __PYX_ERR(1, 120, __pyx_L1_error)
   ^
/home/wesm/local/include/arrow/type.h:251:3: note: 'child' has been explicitly 
marked deprecated here
  ARROW_DEPRECATED("Use field(i)")
  ^
/home/wesm/local/include/arrow/util/macros.h:104:48: note: expanded from macro 
'ARROW_DEPRECATED'
#  define ARROW_DEPRECATED(...) __attribute__((deprecated(__VA_ARGS__)))
   ^
lib.cpp:20507:56: warning: 'num_children' is deprecated: Use num_fields() 
[-Wdeprecated-declarations]
  __pyx_t_1 = __Pyx_PyInt_From_int(__pyx_v_self->type->num_children()); if 
(unlikely(!__pyx_t_1)) __PYX_ERR(1, 139, __pyx_L1_error)
   ^
/home/wesm/local/include/arrow/type.h:263:3: note: 'num_children' has been 
explicitly marked deprecated here
  ARROW_DEPRECATED("Use num_fields()")
  ^
/home/wesm/local/include/arrow/util/macros.h:104:48: note: expanded from macro 
'ARROW_DEPRECATED'
#  define ARROW_DEPRECATED(...) __attribute__((deprecated(__VA_ARGS__)))
   ^
lib.cpp:23361:44: warning: 'num_children' is deprecated: Use num_fields() 
[-Wdeprecated-declarations]
  __pyx_r = __pyx_v_self->__pyx_base.type->num_children();
   ^
/home/wesm/local/include/arrow/type.h:263:3: note: 'num_children' has been 
explicitly marked deprecated here
  ARROW_DEPRECATED("Use num_fields()")
  ^
/home/wesm/local/include/arrow/util/macros.h:104:48: note: expanded from macro 
'ARROW_DEPRECATED'
#  define ARROW_DEPRECATED(...) __attribute__((deprecated(__VA_ARGS__)))
   ^
lib.cpp:24039:44: warning: 'num_children' is deprecated: Use num_fields() 
[-Wdeprecated-declarations]
  __pyx_r = __pyx_v_self->__pyx_base.type->num_children();
   ^
/home/wesm/local/include/arrow/type.h:263:3: note: 'num_children' has been 
explicitly marked deprecated here
  ARROW_DEPRECATED("Use num_fields()")
  ^
/home/wesm/local/include/arrow/util/macros.h:104:48: note: expanded from macro 
'ARROW_DEPRECATED'
#  define ARROW_DEPRECATED(...) __attribute__((deprecated(__VA_ARGS__)))
   ^
lib.cpp:58220:37: warning: 'child' is deprecated: Use field(pos) 
[-Wdeprecated-declarations]
  __pyx_v_child = __pyx_v_self->ap->child(__pyx_v_child_id);
^
/home/wesm/local/include/arrow/array.h:1281:3: note: 'child' has been 
explicitly marked deprecated here
  ARROW_DEPRECATED("Use field(pos)")
  ^
/home/wesm/local/include/arrow/util/macros.h:104:48: note: expanded from macro 
'ARROW_DEPRECATED'
#  define ARROW_DEPRECATED(...) __attribute__((deprecated(__VA_ARGS__)))
   ^
lib.cpp:58956:74: warning: 'children' is deprecated: Use fields() 
[-Wdeprecated-declarations]
  __pyx_v_child_fields = 
__pyx_v_self->__pyx_base.__pyx_base.type->type->children();
 ^
/home/wesm/local/include/arrow/type.h:257:3: note: 'children' has been 
explicitly marked deprecated here
  ARROW_DEPRECATED("Use fields()")
  ^
/home/wesm/local/include/arrow/util/macros.h:104:48: note: expanded from macro 
'ARROW_DEPRECATED'
#  define ARROW_DEPRECATED(...) __attribute__((deprecated(__VA_ARGS__)))
{code}

[jira] [Created] (ARROW-8905) [C++] Collapse Take APIs from 8 to 1 or 2

2020-05-22 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8905:
---

 Summary: [C++] Collapse Take APIs from 8 to 1 or 2
 Key: ARROW-8905
 URL: https://issues.apache.org/jira/browse/ARROW-8905
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


There are currently 8 {{Take}} functions with different function signatures. 
Fewer functions would make life easier for binding developers



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8916) [Python] Add relevant glue for implementing each kind of FunctionOptions

2020-05-24 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8916:
---

 Summary: [Python] Add relevant glue for implementing each kind of 
FunctionOptions
 Key: ARROW-8916
 URL: https://issues.apache.org/jira/browse/ARROW-8916
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney
 Fix For: 1.0.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8917) [C++] Add compute::Function subclass for invoking certain kernels on RecordBatch/Table-valued inputs

2020-05-24 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8917:
---

 Summary: [C++] Add compute::Function subclass for invoking certain 
kernels on RecordBatch/Table-valued inputs
 Key: ARROW-8917
 URL: https://issues.apache.org/jira/browse/ARROW-8917
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


This will enable bindings to invoke such functions (like take, filter) like

{code}
call_function('take', [table, indices])
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8918) [C++] Add cast "metafunction" to FunctionRegistry that addresses dispatching to appropriate type-specific CastFunction

2020-05-24 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8918:
---

 Summary: [C++] Add cast "metafunction" to FunctionRegistry that 
addresses dispatching to appropriate type-specific CastFunction
 Key: ARROW-8918
 URL: https://issues.apache.org/jira/browse/ARROW-8918
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


By setting the output type in {{CastOptions}}, we can write

{code}
call_function("cast", [arg], cast_options)
{code}

This simplifies use of casting for binding developers
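
A Python-side sketch of the same idea (assuming {{CastOptions}} and 
{{call_function}} are exposed like this in pyarrow): the options object carries the 
target type, so a single "cast" entry point suffices.

{code}
import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array([1, 2, 3], type=pa.int64())
# The target type travels in the options, not in the function name.
opts = pc.CastOptions(target_type=pa.float64())
print(pc.call_function("cast", [arr], opts))
{code}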



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8919) [C++] Add "DispatchBest" APIs to compute::Function that selects a kernel that may require implicit casts to invoke

2020-05-24 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8919:
---

 Summary: [C++] Add "DispatchBest" APIs to compute::Function that 
selects a kernel that may require implicit casts to invoke
 Key: ARROW-8919
 URL: https://issues.apache.org/jira/browse/ARROW-8919
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


Currently we have "DispatchExact" which requires an exact match of input types. 
"DispatchBest" would permit kernel selection with implicit casts required. 
Since multiple kernels may be valid when allowing implicit casts, we will need 
to break ties by estimating the "cost" of the implicit casts. For example, 
casting int8 to int32 is "less expensive" than implicitly casting to int64
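
A sketch of the behavior this enables, via pyarrow (mixed-type inputs will likely 
be rejected until implicit casts are actually inserted):

{code}
import pyarrow as pa
import pyarrow.compute as pc

a = pa.array([1, 2], type=pa.int8())
b = pa.array([3, 4], type=pa.int32())
# DispatchBest should pick the int32 kernel, implicitly casting `a` (cheaper
# than promoting both inputs to int64).
print(pc.add(a, b).type)
{code}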



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8920) [CI] ARM Travis CI build is failing with archery "case_sensitive" error

2020-05-24 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8920:
---

 Summary: [CI] ARM Travis CI build is failing with archery 
"case_sensitive" error
 Key: ARROW-8920
 URL: https://issues.apache.org/jira/browse/ARROW-8920
 Project: Apache Arrow
  Issue Type: Bug
  Components: CI
Reporter: Wes McKinney
 Fix For: 1.0.0


See https://travis-ci.org/github/apache/arrow/jobs/690602409

{code}
Traceback (most recent call last):
  File "/home/travis/.local/bin/archery", line 11, in 
load_entry_point('archery', 'console_scripts', 'archery')()
  File "/usr/local/lib/python3.6/dist-packages/pkg_resources/__init__.py", line 
490, in load_entry_point
return get_distribution(dist).load_entry_point(group, name)
  File "/usr/local/lib/python3.6/dist-packages/pkg_resources/__init__.py", line 
2853, in load_entry_point
return ep.load()
  File "/usr/local/lib/python3.6/dist-packages/pkg_resources/__init__.py", line 
2453, in load
return self.resolve()
  File "/usr/local/lib/python3.6/dist-packages/pkg_resources/__init__.py", line 
2459, in resolve
module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "/home/travis/build/apache/arrow/dev/archery/archery/cli.py", line 100, 
in 
case_sensitive=False)
TypeError: __init__() got an unexpected keyword argument 'case_sensitive'
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8921) [C++] Add "TypeResolver" class interface to replace current OutputType::Resolver pattern

2020-05-24 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8921:
---

 Summary: [C++] Add "TypeResolver" class interface to replace 
current OutputType::Resolver pattern
 Key: ARROW-8921
 URL: https://issues.apache.org/jira/browse/ARROW-8921
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


Like the {{TypeMatcher}} for extensible input type checking, TypeResolver will 
allow more flexibility with respect to the output type resolution rule. 
Currently the resolver function is defined as

{code}
using Resolver = std::function<Result<ValueDescr>(
    KernelContext*, const std::vector<ValueDescr>&)>;
{code}

By changing to a {{TypeResolver}} interface with a virtual Resolve function, we can 
also provide better human-readability when printing kernel signatures (via a 
{{TypeResolver::ToString}} method) and permit TypeResolvers to be compared.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8922) [C++] Implement example string scalar kernel function to assist with string kernels buildout per ARROW-555

2020-05-24 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8922:
---

 Summary: [C++] Implement example string scalar kernel function to 
assist with string kernels buildout per ARROW-555
 Key: ARROW-8922
 URL: https://issues.apache.org/jira/browse/ARROW-8922
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


I will write a patch to provide an example of creating a string-input 
string-output kernel for executing scalar-valued string functions
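
From the caller's side, such a kernel would be invoked like any other registered 
scalar function ("ascii_upper" is used here only as an illustrative name; 
availability depends on the pyarrow version):

{code}
import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array(["hello", "world", None])
# Nulls pass through; the kernel applies element-wise to valid strings.
print(pc.call_function("ascii_upper", [arr]))
{code}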



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8923) [C++] Improve usability of arrow::compute::CallFunction by moving ExecContext* argument to end and adding default

2020-05-24 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8923:
---

 Summary: [C++] Improve usability of arrow::compute::CallFunction 
by moving ExecContext* argument to end and adding default
 Key: ARROW-8923
 URL: https://issues.apache.org/jira/browse/ARROW-8923
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8926) [C++] Improve docstrings in new public APIs in arrow/compute and fix miscellaneous typos

2020-05-24 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8926:
---

 Summary: [C++] Improve docstrings in new public APIs in 
arrow/compute and fix miscellaneous typos
 Key: ARROW-8926
 URL: https://issues.apache.org/jira/browse/ARROW-8926
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 1.0.0


I've noticed some imprecise language while reading the headers and some other 
opportunities for improvement



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8928) [C++] Measure microperformance associated with data structure access interactions with arrow::compute::ExecBatch

2020-05-24 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8928:
---

 Summary: [C++] Measure microperformance associated with data 
structure access interactions with arrow::compute::ExecBatch
 Key: ARROW-8928
 URL: https://issues.apache.org/jira/browse/ARROW-8928
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


{{arrow::compute::ExecBatch}} uses a vector of {{arrow::Datum}} to contain a 
collection of ArrayData and Scalar objects for kernel execution. It would be 
helpful to know how many nanoseconds of overhead are associated with basic 
interactions with this data structure, for example the cost of using our vendored 
variant, and other such questions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8929) [C++] Change compute::Arity:VarArgs min_args default to 0

2020-05-24 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8929:
---

 Summary: [C++] Change compute::Arity:VarArgs min_args default to 0
 Key: ARROW-8929
 URL: https://issues.apache.org/jira/browse/ARROW-8929
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


The issue of minimum number of arguments is separate from providing an 
{{InputType}} for input type checking. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8930) [C++] libz.so linking error with liborc.a

2020-05-24 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8930:
---

 Summary: [C++] libz.so linking error with liborc.a
 Key: ARROW-8930
 URL: https://issues.apache.org/jira/browse/ARROW-8930
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Continuous Integration
Reporter: Wes McKinney
 Fix For: 1.0.0


This is failing in the Travis CI ARM build

https://travis-ci.org/github/apache/arrow/jobs/690722203

{code}
: && /usr/bin/ccache /usr/bin/c++  -Wno-noexcept-type  
-fdiagnostics-color=always -ggdb -O0  -Wall -Wno-conversion 
-Wno-sign-conversion -Wno-unused-variable -Werror -march=armv8-a  -g  -rdynamic 
src/arrow/adapters/orc/CMakeFiles/arrow-orc-adapter-test.dir/adapter_test.cc.o  
-o debug/arrow-orc-adapter-test  -Wl,-rpath,/build/cpp/debug  
debug/libarrow_testing.a  debug/libarrow.a  debug//libgtest_maind.so  
debug//libgtestd.so  /usr/lib/aarch64-linux-gnu/libsnappy.so.1.1.8  
/usr/lib/aarch64-linux-gnu/liblz4.so  /usr/lib/aarch64-linux-gnu/libz.so  
-lpthread  -ldl  orc_ep-install/lib/liborc.a  
/usr/lib/aarch64-linux-gnu/libssl.so  /usr/lib/aarch64-linux-gnu/libcrypto.so  
/usr/lib/aarch64-linux-gnu/libbrotlienc.so  
/usr/lib/aarch64-linux-gnu/libbrotlidec.so  
/usr/lib/aarch64-linux-gnu/libbrotlicommon.so  
/usr/lib/aarch64-linux-gnu/libbz2.so  /usr/lib/aarch64-linux-gnu/libzstd.so  
/usr/lib/aarch64-linux-gnu/libprotobuf.so  
/usr/lib/aarch64-linux-gnu/libglog.so  
jemalloc_ep-prefix/src/jemalloc_ep/dist//lib/libjemalloc_pic.a  -pthread  -lrt 
&& :
/usr/bin/ld: orc_ep-install/lib/liborc.a(Compression.cc.o): undefined reference 
to symbol 'inflateEnd'
/usr/bin/ld: /usr/lib/aarch64-linux-gnu/libz.so: error adding symbols: DSO 
missing from command line
collect2: error: ld returned 1 exit status
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8934) [C++] Add timestamp subtract kernel aliased to int64 subtract implementation

2020-05-25 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8934:
---

 Summary: [C++] Add timestamp subtract kernel aliased to int64 
subtract implementation
 Key: ARROW-8934
 URL: https://issues.apache.org/jira/browse/ARROW-8934
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


We can use the same scalar exec function for int64 subtraction as well as 
{{(array[TIMESTAMP], array[TIMESTAMP]) -> duration}}. 
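
The desired behavior, sketched in pyarrow (this may raise until the aliased kernel 
exists):

{code}
import pyarrow as pa
import pyarrow.compute as pc

start = pa.array([0, 100], type=pa.timestamp("s"))
stop = pa.array([60, 160], type=pa.timestamp("s"))
# Same bit-level computation as int64 subtraction; result type is duration[s].
print(pc.subtract(stop, start))
{code}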



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8935) [Python] Add necessary plumbing to enable Numba-generated functions to be registered as functions in the global C++ function/kernels registry

2020-05-25 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8935:
---

 Summary: [Python] Add necessary plumbing to enable Numba-generated 
functions to be registered as functions in the global C++ function/kernels 
registry
 Key: ARROW-8935
 URL: https://issues.apache.org/jira/browse/ARROW-8935
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Python
Reporter: Wes McKinney






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8936) [C++] Parallelize execution of arrow::compute::ScalarFunction

2020-05-25 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8936:
---

 Summary: [C++] Parallelize execution of 
arrow::compute::ScalarFunction
 Key: ARROW-8936
 URL: https://issues.apache.org/jira/browse/ARROW-8936
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8937) [C++] Add "parse_strptime" function for string to timestamp conversions using the kernels framework

2020-05-25 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8937:
---

 Summary: [C++] Add "parse_strptime" function for string to 
timestamp conversions using the kernels framework
 Key: ARROW-8937
 URL: https://issues.apache.org/jira/browse/ARROW-8937
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney


This should be relatively straightforward to implement using the new kernels 
framework
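
A sketch of the desired usage (the kernel name and option spelling are illustrative; 
this issue only proposes the function):

{code}
import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array(["2020-05-25 12:34:56"])
ts = pc.strptime(arr, format="%Y-%m-%d %H:%M:%S", unit="s")
print(ts.type)   # timestamp[s]
{code}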



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8933) [C++] Reduce generated code in vector_hash.cc

2020-05-25 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8933:
---

 Summary: [C++] Reduce generated code in vector_hash.cc
 Key: ARROW-8933
 URL: https://issues.apache.org/jira/browse/ARROW-8933
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


Since hashing doesn't need to know about logical types, we can do the following:

* Use same generated code for both BinaryType and StringType
* Use same generated code for primitive types having the same byte width

These two changes should reduce binary size and improve compilation speed



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8938) [R] Provide binding and argument packing to use arrow::compute::CallFunction to use any compute kernel from R dynamically

2020-05-25 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8938:
---

 Summary: [R] Provide binding and argument packing to use 
arrow::compute::CallFunction to use any compute kernel from R dynamically
 Key: ARROW-8938
 URL: https://issues.apache.org/jira/browse/ARROW-8938
 Project: Apache Arrow
  Issue Type: New Feature
  Components: R
Reporter: Wes McKinney
 Fix For: 1.0.0


This will drastically simplify exposing new functions to R users



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8939) [C++] Arrow C++ Data Frame-style programming interface for analytics (umbrella issue)

2020-05-25 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8939:
---

 Summary: [C++] Arrow C++ Data Frame-style programming interface 
for analytics (umbrella issue)
 Key: ARROW-8939
 URL: https://issues.apache.org/jira/browse/ARROW-8939
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney


This is an umbrella issue for the "C++ Data Frame" project that has been 
discussed on the mailing list with the following Google docs overview

https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit

I will attach issues to this JIRA to help organize and track the project as we 
make progress.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8945) [Python] An independent Cython package for projects that want to program against the C data interface

2020-05-26 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8945:
---

 Summary: [Python] An independent Cython package for projects that 
want to program against the C data interface
 Key: ARROW-8945
 URL: https://issues.apache.org/jira/browse/ARROW-8945
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Python
Reporter: Wes McKinney


I've been thinking it would be useful to have a minimal Cython package, call it 
"cyarrow", containing some pxd files and a small amount of compiled pyx code (using 
a C compiler only). It would enable projects written in Cython to interact with 
Arrow datasets in minimal ways (for example, iterating over their values or 
interacting with dictionary-encoded/categorical arrays) that don't amount to 
reimplementing the "hard stuff", for which they would want to use pyarrow or the 
C++ library instead. Otherwise, every Python project that has compiled Cython code 
and wants to use the C data interface would have to create its own minimal 
implementation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

