[jira] [Created] (ARROW-9122) [C++] Adapt ascii_lower/ascii_upper bulk transforms to work on sliced arrays

2020-06-13 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-9122:
---

 Summary: [C++] Adapt ascii_lower/ascii_upper bulk transforms to 
work on sliced arrays
 Key: ARROW-9122
 URL: https://issues.apache.org/jira/browse/ARROW-9122
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 1.0.0


See comments at https://github.com/apache/arrow/pull/7418#discussion_r439754427

Also add unit tests to verify that only the referenced data slice has been 
transformed in the result
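
A minimal sketch of what "only the referenced data slice" means for a bulk transform, assuming a simplified string layout. `StringSlice` and `AsciiUpperSlice` are hypothetical names for illustration and are not the Arrow API:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a sliced string array. `offsets` has one more
// entry than the parent has values; [offset, offset + length) is the slice.
struct StringSlice {
  const std::vector<int32_t>* offsets;  // value boundaries in the parent data
  const std::string* data;              // parent character data
  int64_t offset;                       // first value in the slice
  int64_t length;                       // number of values in the slice
};

// Uppercase only the bytes referenced by the slice, in one pass over the
// contiguous byte range, and return them as a new buffer.
inline std::string AsciiUpperSlice(const StringSlice& s) {
  assert(s.offset >= 0 && s.length >= 0);
  assert(static_cast<std::size_t>(s.offset + s.length) < s.offsets->size());
  const int32_t begin = (*s.offsets)[static_cast<std::size_t>(s.offset)];
  const int32_t end =
      (*s.offsets)[static_cast<std::size_t>(s.offset + s.length)];
  std::string out(s.data->begin() + begin, s.data->begin() + end);
  for (char& c : out) {
    if (c >= 'a' && c <= 'z') c = static_cast<char>(c - 32);
  }
  return out;
}
```

A unit test along these lines would check that the output holds exactly the bytes between the slice's first and last offsets and that the parent data is untouched.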



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9118) [C++] Add more general BoundsCheck function that also checks for arbitrary lower limits in integer arrays

2020-06-12 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-9118:
---

 Summary: [C++] Add more general BoundsCheck function that also 
checks for arbitrary lower limits in integer arrays
 Key: ARROW-9118
 URL: https://issues.apache.org/jira/browse/ARROW-9118
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 1.0.0


See ARROW-9083. The current {{IndexBoundsCheck}} is specialized to skip a 
comparison for unsigned integers and uses 0 as the lower bound for signed 
integers. This could be generalized so that we could check e.g. if int64 values 
will fit in the int32 range
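
A rough sketch of the generalization, assuming inclusive lower and upper limits. `AllInRange` and `FitsInInt32` are hypothetical names, not existing Arrow functions:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Returns true if every value lies in [lower, upper]. Hypothetical helper.
template <typename T>
bool AllInRange(const T* values, std::size_t length, T lower, T upper) {
  assert(lower <= upper);
  bool ok = true;
  for (std::size_t i = 0; i < length; ++i) {
    // Accumulate with bitwise AND so the loop body has no early exit.
    ok &= (values[i] >= lower) & (values[i] <= upper);
  }
  return ok;
}

// Example use: can these int64 values be narrowed to int32?
inline bool FitsInInt32(const int64_t* values, std::size_t length) {
  return AllInRange<int64_t>(values, length,
                             std::numeric_limits<int32_t>::min(),
                             std::numeric_limits<int32_t>::max());
}
```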





[jira] [Created] (ARROW-9115) [C++] Process data buffers in batch in ascii_lower / ascii_upper kernels rather than using string_view value iteration

2020-06-12 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-9115:
---

 Summary: [C++] Process data buffers in batch in ascii_lower / 
ascii_upper kernels rather than using string_view value iteration
 Key: ARROW-9115
 URL: https://issues.apache.org/jira/browse/ARROW-9115
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 1.0.0


Also add a benchmark





[jira] [Created] (ARROW-9092) [C++] gandiva-decimal-test hangs with LLVM 9

2020-06-10 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-9092:
---

 Summary: [C++] gandiva-decimal-test hangs with LLVM 9
 Key: ARROW-9092
 URL: https://issues.apache.org/jira/browse/ARROW-9092
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney


I built Gandiva C++ unittests with LLVM 9 on Ubuntu 18.04 and 
gandiva-decimal-test hangs forever





[jira] [Created] (ARROW-9091) [C++] Utilize function's default options when passing no options to CallFunction to a function that requires them

2020-06-10 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-9091:
---

 Summary: [C++] Utilize function's default options when passing no 
options to CallFunction to a function that requires them
 Key: ARROW-9091
 URL: https://issues.apache.org/jira/browse/ARROW-9091
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


Otherwise benign usage of {{CallFunction}} can cause an unintuitive segfault in 
some cases
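
To illustrate the proposed behavior, here is a toy model in which the call path substitutes the function's own defaults when the caller passes no options. All types and names here (`FunctionSketch`, `ScaleOptions`, `CallFunctionSketch`) are hypothetical and only mimic the shape of the problem:

```cpp
#include <cassert>

struct FunctionOptions {
  virtual ~FunctionOptions() = default;
};

// Hypothetical options type for a hypothetical "scale" function.
struct ScaleOptions : FunctionOptions {
  int factor = 3;
};

struct FunctionSketch {
  const FunctionOptions* default_options;  // null if the function takes none
  int (*impl)(int arg, const FunctionOptions* options);
};

// The implementation dereferences its options, so a null pointer would crash.
inline int ScaleImpl(int arg, const FunctionOptions* options) {
  assert(options != nullptr);
  return arg * static_cast<const ScaleOptions*>(options)->factor;
}

inline const FunctionSketch& ScaleFunction() {
  static const ScaleOptions kDefaults{};
  static const FunctionSketch fn{&kDefaults, &ScaleImpl};
  return fn;
}

// Fall back to the function's defaults when the caller passes nothing.
inline int CallFunctionSketch(const FunctionSketch& fn, int arg,
                              const FunctionOptions* options = nullptr) {
  if (options == nullptr) options = fn.default_options;
  return fn.impl(arg, options);
}
```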





[jira] [Created] (ARROW-9085) [C++][CI] Appveyor CI test failures

2020-06-09 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-9085:
---

 Summary: [C++][CI] Appveyor CI test failures
 Key: ARROW-9085
 URL: https://issues.apache.org/jira/browse/ARROW-9085
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


See 
https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/33417919

These seem to have been introduced by 

https://github.com/apache/arrow/commit/b058cf0d1c26ad7984c104bb84322cc7dcc66f00





[jira] [Created] (ARROW-9080) [C++] arrow::AllocateBuffer returns a Result<std::unique_ptr<Buffer>>

2020-06-09 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-9080:
---

 Summary: [C++] arrow::AllocateBuffer returns a 
Result<std::unique_ptr<Buffer>>
 Key: ARROW-9080
 URL: https://issues.apache.org/jira/browse/ARROW-9080
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


This seemed counterintuitive to me since using Buffers almost anywhere requires 
a shared_ptr
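
For reference, a caller that needs shared ownership can convert at the call site, since a `std::unique_ptr` rvalue converts implicitly to `std::shared_ptr`. The sketch below uses a hypothetical `BufferSketch` type and allocator rather than the Arrow classes:

```cpp
#include <cstddef>
#include <memory>

// Hypothetical stand-in for a buffer type.
struct BufferSketch {
  std::size_t size;
};

// Allocator that hands back unique ownership.
inline std::unique_ptr<BufferSketch> AllocateSketch(std::size_t size) {
  return std::unique_ptr<BufferSketch>(new BufferSketch{size});
}

// Caller that wants shared ownership: the unique_ptr rvalue converts
// implicitly, so no explicit wrapping is needed.
inline std::shared_ptr<BufferSketch> AllocateShared(std::size_t size) {
  return AllocateSketch(size);
}
```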





[jira] [Created] (ARROW-9075) [C++] Optimize Filter implementation

2020-06-08 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-9075:
---

 Summary: [C++] Optimize Filter implementation
 Key: ARROW-9075
 URL: https://issues.apache.org/jira/browse/ARROW-9075
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 1.0.0


I split this off from ARROW-5760 





[jira] [Created] (ARROW-9067) [C++] Create reusable branchless / vectorized index boundschecking functions

2020-06-08 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-9067:
---

 Summary: [C++] Create reusable branchless / vectorized index 
boundschecking functions
 Key: ARROW-9067
 URL: https://issues.apache.org/jira/browse/ARROW-9067
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


It is possible to do branch-free index boundschecking in batches for better 
performance. 

I am implementing this as part of the Take/Filter optimization (so please wait 
until I have PRs up for this work), but these functions can be moved somewhere 
more general purpose and used in places where we are currently boundschecking 
inside inner loops.
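
As an illustration of the idea (not the code from the pending PRs), one way to check indices in batches is to fold the per-index comparisons into a single flag and branch once per batch. `IndicesInBounds` and the batch size of 16 are assumptions made for this sketch:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns true if all indices are in [0, upper_limit). Hypothetical helper.
inline bool IndicesInBounds(const int64_t* indices, std::size_t n,
                            int64_t upper_limit) {
  assert(upper_limit >= 0);
  const uint64_t limit = static_cast<uint64_t>(upper_limit);
  constexpr std::size_t kBatch = 16;
  std::size_t i = 0;
  for (; i + kBatch <= n; i += kBatch) {
    bool out_of_bounds = false;
    for (std::size_t j = 0; j < kBatch; ++j) {
      // Casting to unsigned maps negative indices to huge values, so one
      // comparison covers both the lower and the upper bound.
      out_of_bounds |= static_cast<uint64_t>(indices[i + j]) >= limit;
    }
    if (out_of_bounds) return false;  // one branch per batch
  }
  // Remaining indices that do not fill a whole batch.
  for (; i < n; ++i) {
    if (static_cast<uint64_t>(indices[i]) >= limit) return false;
  }
  return true;
}
```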





[jira] [Created] (ARROW-9045) [C++] Improve and expand Take/Filter benchmarks

2020-06-05 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-9045:
---

 Summary: [C++] Improve and expand Take/Filter benchmarks
 Key: ARROW-9045
 URL: https://issues.apache.org/jira/browse/ARROW-9045
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


I'm putting this up as a separate patch for review





[jira] [Created] (ARROW-9043) [Go] Temporarily copy LICENSE.txt to go/

2020-06-05 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-9043:
---

 Summary: [Go] Temporarily copy LICENSE.txt to go/
 Key: ARROW-9043
 URL: https://issues.apache.org/jira/browse/ARROW-9043
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Go
Reporter: Wes McKinney
 Fix For: 1.0.0


{{go mod}} needs to find a license file in the root of the Go module. In the 
future "go mod" may be able to follow symlinks in which case this can be 
replaced by a symlink.





[jira] [Created] (ARROW-9034) [C++] Implement binary (two bitmap) version of BitBlockCounter

2020-06-04 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-9034:
---

 Summary: [C++] Implement binary (two bitmap) version of 
BitBlockCounter
 Key: ARROW-9034
 URL: https://issues.apache.org/jira/browse/ARROW-9034
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 1.0.0


The current BitBlockCounter from ARROW-9029 is useful for unary operations. 
Some operations involve multiple bitmaps and so it's useful to be able to 
determine the block popcounts of the AND of the respective words in the 
bitmaps. So each returned block would contain the number of bits that are set 
in both bitmaps at the same locations
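
A simplified sketch of the per-block computation, assuming both bitmaps are given as whole 64-bit words with no bit offset and no trailing partial word. `AndBlockPopcounts` is a hypothetical name:

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// For each 64-bit word position, count the bits set in both bitmaps.
inline std::vector<int16_t> AndBlockPopcounts(
    const std::vector<uint64_t>& left, const std::vector<uint64_t>& right) {
  assert(left.size() == right.size());
  std::vector<int16_t> counts;
  counts.reserve(left.size());
  for (std::size_t i = 0; i < left.size(); ++i) {
    const std::bitset<64> both(left[i] & right[i]);
    counts.push_back(static_cast<int16_t>(both.count()));
  }
  return counts;
}
```

A count of 64 means the whole block is valid in both inputs, and a count of 0 means the block can be skipped.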





[jira] [Created] (ARROW-9033) [Python] Add tests to verify that one can build a C++ extension against the manylinux1 wheels

2020-06-03 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-9033:
---

 Summary: [Python] Add tests to verify that one can build a C++ 
extension against the manylinux1 wheels
 Key: ARROW-9033
 URL: https://issues.apache.org/jira/browse/ARROW-9033
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney


Some projects want to be able to use the Python wheels to build other Python 
packages with C++ extensions that need to link against libarrow.so. It would be 
great if someone would add automated tests to ensure that our wheel builds can 
be used successfully in this fashion. 





[jira] [Created] (ARROW-9032) [C++] Split arrow/util/bit_util.h into multiple header files

2020-06-03 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-9032:
---

 Summary: [C++] Split arrow/util/bit_util.h into multiple header 
files
 Key: ARROW-9032
 URL: https://issues.apache.org/jira/browse/ARROW-9032
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


This header has grown quite large and any given compilation unit's use of it is 
likely limited to only a couple of functions or classes. I suspect it would 
improve compilation time to split up this header into a few headers organized 
by frequency of code use. 





[jira] [Created] (ARROW-9031) [R] Implement conversion from Type::UINT64 to R vector

2020-06-03 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-9031:
---

 Summary: [R] Implement conversion from Type::UINT64 to R vector
 Key: ARROW-9031
 URL: https://issues.apache.org/jira/browse/ARROW-9031
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Wes McKinney
 Fix For: 1.0.0


This case is not handled in array_to_vector.cpp





[jira] [Created] (ARROW-9030) [Python] Clean up some usages of pyarrow.compat, move some common functions/symbols to lib.pyx

2020-06-03 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-9030:
---

 Summary: [Python] Clean up some usages of pyarrow.compat, move 
some common functions/symbols to lib.pyx
 Key: ARROW-9030
 URL: https://issues.apache.org/jira/browse/ARROW-9030
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney


I started doing this while looking into ARROW-4633





[jira] [Created] (ARROW-9029) [C++] Implement BitmapScanner interface to accelerate processing of mostly-not-null data

2020-06-03 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-9029:
---

 Summary: [C++] Implement BitmapScanner interface to accelerate 
processing of mostly-not-null data
 Key: ARROW-9029
 URL: https://issues.apache.org/jira/browse/ARROW-9029
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 1.0.0


In analytics, it is common for data to be all not-null or mostly not-null. Data 
with > 50% nulls tends to be more exceptional. In this light, our 
{{BitmapReader}} class which allows iteration of each bit in a bitmap can be 
wasteful for mostly set validity bitmaps.

I propose instead a new interface for use in kernel implementations, for lack 
of a better term {{BitmapScanner}}. This works as follows:

* Uses popcount to accumulate consecutive 64-bit words from a bitmap where all 
values are set, up to some limit (e.g. anywhere from 8 to 128 words -- we can 
use benchmarks to determine what is a good limit). The length of this "all-on" 
run is returned to the caller in a single function call, so that this "run" of 
data can be processed without any bit-by-bit bitmap checking
* If a word containing unset bits is encountered, the scanner will similarly 
accumulate non-full words until the next full word is encountered or a limit is 
hit. The length of this "has nulls" run is returned to the caller, which then 
proceeds bit-by-bit to process the data

For data with a lot of nulls, this may degrade performance somewhat but 
probably not that much empirically. However, data that is mostly-not-null 
should benefit from this. 

This BitmapScanner utility can probably also be used to accelerate the 
implementation of Filter for mostly-not-null data
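
The scanning scheme described above could look roughly like the following. This is a sketch under simplifying assumptions: the bitmap is a vector of whole 64-bit words, a word counts as "full" when it equals all-ones (equivalent to a popcount of 64), and the names `BitRun` and `BitmapScannerSketch` are made up for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct BitRun {
  int64_t length;  // number of bits in the run
  bool all_set;    // true: no nulls in the run, no per-bit checks needed
};

class BitmapScannerSketch {
 public:
  // `words` must outlive the scanner; `max_words` caps the length of a run.
  BitmapScannerSketch(const std::vector<uint64_t>& words,
                      std::size_t max_words)
      : words_(words), max_words_(max_words) {
    assert(max_words > 0);
  }

  // Returns the next run; length == 0 signals the end of the bitmap.
  BitRun Next() {
    if (pos_ == words_.size()) return {0, false};
    const bool full = IsFull(words_[pos_]);
    std::size_t n = 0;
    // Extend the run while words are of the same kind (full or not full).
    while (pos_ < words_.size() && n < max_words_ &&
           IsFull(words_[pos_]) == full) {
      ++pos_;
      ++n;
    }
    return {static_cast<int64_t>(n) * 64, full};
  }

 private:
  static bool IsFull(uint64_t word) { return word == ~uint64_t{0}; }

  const std::vector<uint64_t>& words_;
  std::size_t max_words_;
  std::size_t pos_ = 0;
};
```

A kernel would loop on `Next()`, taking a fast path with no validity checks when `all_set` is true and falling back to bit-by-bit processing otherwise.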





[jira] [Created] (ARROW-9018) [C++] Remove APIs that were deprecated in 0.17.x and prior

2020-06-02 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-9018:
---

 Summary: [C++] Remove APIs that were deprecated in 0.17.x and prior
 Key: ARROW-9018
 URL: https://issues.apache.org/jira/browse/ARROW-9018
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0








[jira] [Created] (ARROW-9006) [C++] Use Cast kernels to implement Scalar::Parse and Scalar::CastTo

2020-06-01 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-9006:
---

 Summary: [C++] Use Cast kernels to implement Scalar::Parse and 
Scalar::CastTo
 Key: ARROW-9006
 URL: https://issues.apache.org/jira/browse/ARROW-9006
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


We should not maintain distinct (and possibly differently behaving) 
implementations of elementwise array casting and scalar casting. The new 
kernels framework provides for relatively easily generating kernels that can 
process arrays or scalars. 





[jira] [Created] (ARROW-9003) [C++] Add VectorFunction wrapping arrow::Concatenate

2020-06-01 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-9003:
---

 Summary: [C++] Add VectorFunction wrapping arrow::Concatenate
 Key: ARROW-9003
 URL: https://issues.apache.org/jira/browse/ARROW-9003
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


This would be a varargs function 





[jira] [Created] (ARROW-9001) [R] Box outputs as correct type in call_function

2020-06-01 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-9001:
---

 Summary: [R] Box outputs as correct type in call_function
 Key: ARROW-9001
 URL: https://issues.apache.org/jira/browse/ARROW-9001
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Wes McKinney
 Fix For: 1.0.0


This would prevent segfaults by putting the SEXP in the wrong kind of R6 
container





[jira] [Created] (ARROW-8999) [Python][C++] Non-deterministic segfault in "AMD64 MacOS 10.15 Python 3.7" build

2020-06-01 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8999:
---

 Summary: [Python][C++] Non-deterministic segfault in "AMD64 MacOS 
10.15 Python 3.7" build
 Key: ARROW-8999
 URL: https://issues.apache.org/jira/browse/ARROW-8999
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Wes McKinney
 Fix For: 1.0.0


I've been seeing this segfault periodically over the last week. Does anyone have an 
idea what might be wrong?

https://github.com/apache/arrow/pull/7273/checks?check_run_id=717249862





[jira] [Created] (ARROW-8998) [Python] Make NumPy an optional runtime dependency

2020-06-01 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8998:
---

 Summary: [Python] Make NumPy an optional runtime dependency
 Key: ARROW-8998
 URL: https://issues.apache.org/jira/browse/ARROW-8998
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Python
Reporter: Wes McKinney


Since in the relatively near future, one will be able to do non-trivial 
analytical operations and query processing natively on Arrow data structures 
through pyarrow, it does not make sense to require users to always install 
NumPy when they install pyarrow. I propose to split the NumPy-dependent parts 
of libarrow_python into a libarrow_numpy (which also must be bundled) and 
move this part of the codebase into a separate Cython module.

This refactoring should be relatively painless though there may be a number of 
packaging details to chase up since this would introduce a new shared library 
to be installed in various packaging targets. 





[jira] [Created] (ARROW-8995) [C++] Scalar formatting code used in array/diff.cc should be reusable

2020-05-31 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8995:
---

 Summary: [C++] Scalar formatting code used in array/diff.cc should 
be reusable
 Key: ARROW-8995
 URL: https://issues.apache.org/jira/browse/ARROW-8995
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


Formatting Array values as strings is not specific to the diff.cc code, so it 
may make sense to move this code elsewhere where it can be used generally 
(perhaps a method like {{Array::FormatValue}}?). 





[jira] [Created] (ARROW-8994) [C++] Disable include-what-you-use cpplint lint checks

2020-05-31 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8994:
---

 Summary: [C++] Disable include-what-you-use cpplint lint checks
 Key: ARROW-8994
 URL: https://issues.apache.org/jira/browse/ARROW-8994
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


If we want to be serious about IWYU, it would be better to use IWYU directly. 
The minimal IWYU checks that cpplint does can be a nuisance rather than addressing the 
problem holistically





[jira] [Created] (ARROW-8991) [C++][Compute] Add scalar_hash function

2020-05-31 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8991:
---

 Summary: [C++][Compute] Add scalar_hash function
 Key: ARROW-8991
 URL: https://issues.apache.org/jira/browse/ARROW-8991
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


The purpose of this function is to compute 32- or 64-bit hash values for each 
cell in an Array. Hashes for nested types can be computed recursively by 
combining the hash values of their children
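
A small sketch of the recursive combination step for a struct-like value, assuming the per-row hashes of each child have already been computed. The mixing function is an illustrative choice (a boost-style `hash_combine` with a 64-bit constant), and both function names are hypothetical:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Mix one child hash into an accumulated hash. Illustrative mixing only.
inline uint64_t CombineHashes(uint64_t seed, uint64_t h) {
  return seed ^ (h + 0x9e3779b97f4a7c15ULL + (seed << 6) + (seed >> 2));
}

// Given per-row hashes for each child column, produce one hash per row.
inline std::vector<uint64_t> HashStructRows(
    const std::vector<std::vector<uint64_t>>& child_hashes) {
  assert(!child_hashes.empty());
  const std::size_t n = child_hashes[0].size();
  std::vector<uint64_t> out(n, 0);
  for (const auto& child : child_hashes) {
    assert(child.size() == n);
    for (std::size_t i = 0; i < n; ++i) {
      out[i] = CombineHashes(out[i], child[i]);
    }
  }
  return out;
}
```

Deeper nesting would apply the same step bottom-up: hash the leaves, combine them into their parent, and repeat.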





[jira] [Created] (ARROW-8990) [C++] Benchmark hash table against thirdparty options, possibly vendor a thirdparty hash table library

2020-05-31 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8990:
---

 Summary: [C++] Benchmark hash table against thirdparty options, 
possibly vendor a thirdparty hash table library
 Key: ARROW-8990
 URL: https://issues.apache.org/jira/browse/ARROW-8990
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


While we have our own hash table implementation, it would be worthwhile to set 
up some benchmarks so that we can compare against std::unordered_map and some 
other thirdparty libraries for hash tables to know whether we should possibly 
use a thirdparty library. See e.g.

https://tessil.github.io/2016/08/29/benchmark-hopscotch-map.html





[jira] [Created] (ARROW-8989) [C++] Document available functions in FunctionRegistry

2020-05-31 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8989:
---

 Summary: [C++] Document available functions in FunctionRegistry
 Key: ARROW-8989
 URL: https://issues.apache.org/jira/browse/ARROW-8989
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


Create a compute page in the C++ section of the Sphinx docs and make a list of 
the available functions and what they do





[jira] [Created] (ARROW-8985) [Format] Add "byte width" field with default of 16 to Decimal Flatbuffers type for forward compatibility

2020-05-30 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8985:
---

 Summary: [Format] Add "byte width" field with default of 16 to 
Decimal Flatbuffers type for forward compatibility
 Key: ARROW-8985
 URL: https://issues.apache.org/jira/browse/ARROW-8985
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Format
Reporter: Wes McKinney
 Fix For: 1.0.0


This will permit larger or smaller decimals to be added to the format later 
without having to add a new Type union value





[jira] [Created] (ARROW-8970) [C++] Reduce shared library code size (umbrella issue)

2020-05-27 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8970:
---

 Summary: [C++] Reduce shared library code size (umbrella issue)
 Key: ARROW-8970
 URL: https://issues.apache.org/jira/browse/ARROW-8970
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


We're reaching a point where we may need to be careful about decisions that 
increase code size:

* Instantiating too many templates for code that isn't performance sensitive
* Inlining functions that don't need to be inline

Code size tends to correlate also with compilation times, but not always.

I'll use this umbrella issue to organize issues related to reducing compiled 
code size





[jira] [Created] (ARROW-8969) [C++] Reduce generated code in compute/kernels/scalar_compare.cc

2020-05-27 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8969:
---

 Summary: [C++] Reduce generated code in 
compute/kernels/scalar_compare.cc
 Key: ARROW-8969
 URL: https://issues.apache.org/jira/browse/ARROW-8969
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


We are instantiating templates in this module for cases that, byte-wise, do the 
exact same comparison. For example:

* For equals, not_equals, we can use the same 32-bit/64-bit comparison kernels 
for signed int / unsigned int / floating point types of the same byte width
* TimestampType can reuse int64 kernels, similarly for other date/time types
* BinaryType/StringType can share kernels

etc.
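
A sketch of the sharing idea for the equals case, assuming types where value equality coincides with byte equality (integers and integer-backed types such as timestamps are the cases covered here). The function names are hypothetical; the point is that only one kernel is instantiated per byte width:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// One equality kernel per byte width, operating on raw bytes.
template <std::size_t kWidth>
std::vector<bool> EqualFixedWidth(const void* left, const void* right,
                                  std::size_t length) {
  const auto* l = static_cast<const unsigned char*>(left);
  const auto* r = static_cast<const unsigned char*>(right);
  std::vector<bool> out(length);
  for (std::size_t i = 0; i < length; ++i) {
    out[i] = std::memcmp(l + i * kWidth, r + i * kWidth, kWidth) == 0;
  }
  return out;
}

// Thin typed wrapper: int32_t and uint32_t both dispatch to the 4-byte
// kernel, int64_t and timestamp-like types to the 8-byte kernel.
template <typename T>
std::vector<bool> Equal(const std::vector<T>& left,
                        const std::vector<T>& right) {
  assert(left.size() == right.size());
  return EqualFixedWidth<sizeof(T)>(left.data(), right.data(), left.size());
}
```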





[jira] [Created] (ARROW-8966) [C++] Move arrow::ArrayData to a separate header file

2020-05-27 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8966:
---

 Summary: [C++] Move arrow::ArrayData to a separate header file
 Key: ARROW-8966
 URL: https://issues.apache.org/jira/browse/ARROW-8966
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


There are code modules (such as compute kernels) that only require ArrayData 
for doing computations, so pulling in all the code in array.h is not necessary. 
There are probably other code paths that might benefit from this also. 





[jira] [Created] (ARROW-8961) [C++] Vendor utf8proc library

2020-05-26 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8961:
---

 Summary: [C++] Vendor utf8proc library
 Key: ARROW-8961
 URL: https://issues.apache.org/jira/browse/ARROW-8961
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


This is a minimal MIT-licensed library for UTF-8 data processing originally 
developed for use in Julia

https://github.com/JuliaStrings/utf8proc





[jira] [Created] (ARROW-8956) [C++] arrow::ScalarEquals returns false when values are both null

2020-05-26 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8956:
---

 Summary: [C++] arrow::ScalarEquals returns false when values are 
both null
 Key: ARROW-8956
 URL: https://issues.apache.org/jira/browse/ARROW-8956
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney


I wasn't sure if this was deliberate but it appeared while writing unit tests 
and so wanted to check what was the intention before changing it. 





[jira] [Created] (ARROW-8955) [C++] Use kernels for casting Scalar values instead of bespoke implementation

2020-05-26 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8955:
---

 Summary: [C++] Use kernels for casting Scalar values instead of 
bespoke implementation
 Key: ARROW-8955
 URL: https://issues.apache.org/jira/browse/ARROW-8955
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


See details of casting in arrow/scalar.cc





[jira] [Created] (ARROW-8951) [C++] Fix compiler warning in compute/kernels/scalar_cast_temporal.cc

2020-05-26 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8951:
---

 Summary: [C++] Fix compiler warning in 
compute/kernels/scalar_cast_temporal.cc
 Key: ARROW-8951
 URL: https://issues.apache.org/jira/browse/ARROW-8951
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


The kernel functor can return an uninitialized value on errors

{code}
../src/arrow/compute/kernels/scalar_cast_temporal.cc: In member function ‘OUT 
arrow::compute::internal::ParseTimestamp::Call(arrow::compute::KernelContext*, 
ARG0) const [with OUT = long int; ARG0 = 
nonstd::sv_lite::basic_string_view]’:
../src/arrow/compute/kernels/scalar_cast_temporal.cc:267:12: warning: ‘result’ 
may be used uninitialized in this function [-Wmaybe-uninitialized]
 return result;
{code}
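
One common way to address this class of warning is to give the result a defined value before any error path can return it. The sketch below shows the pattern with a hypothetical digit parser and error flag; it is not the actual ParseTimestamp functor:

```cpp
#include <cstdint>
#include <string>

// Hypothetical stand-in for the error state carried by a kernel context.
struct ErrorFlag {
  bool failed = false;
};

// Parses a string of ASCII digits. On error, sets the flag and returns a
// defined value (0) instead of whatever happened to be on the stack.
// Overflow is not handled in this sketch.
inline int64_t ParseDigitsSketch(const std::string& s, ErrorFlag* ctx) {
  int64_t result = 0;  // initialized up front, so every return is defined
  bool ok = !s.empty();
  int64_t value = 0;
  for (char c : s) {
    if (c < '0' || c > '9') {
      ok = false;
      break;
    }
    value = value * 10 + (c - '0');
  }
  if (ok) {
    result = value;
  } else {
    ctx->failed = true;
  }
  return result;
}
```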





[jira] [Created] (ARROW-8945) [Python] An independent Cython package for projects that want to program against the C data interface

2020-05-26 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8945:
---

 Summary: [Python] An independent Cython package for projects that 
want to program against the C data interface
 Key: ARROW-8945
 URL: https://issues.apache.org/jira/browse/ARROW-8945
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Python
Reporter: Wes McKinney


I've been thinking it would be useful to have a minimal Cython package, call it 
"cyarrow", containing some pxd files and a small amount of compiled pyx code 
(using a C compiler only) that enables projects written in Cython to interact 
with Arrow datasets in minimal ways (for example, iterating over their values, 
interacting with dictionary-encoded/categorical arrays) that don't amount to 
reimplementation of the "hard stuff" where they would want to utilize pyarrow 
or the C++ library instead. Otherwise, every Python project that has compiled 
code in Cython and wants to use the C interface would have to create their own 
minimal implementation. 





[jira] [Created] (ARROW-8939) [C++] Arrow C++ Data Frame-style programming interface for analytics (umbrella issue)

2020-05-25 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8939:
---

 Summary: [C++] Arrow C++ Data Frame-style programming interface 
for analytics (umbrella issue)
 Key: ARROW-8939
 URL: https://issues.apache.org/jira/browse/ARROW-8939
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney


This is an umbrella issue for the "C++ Data Frame" project that has been 
discussed on the mailing list with the following Google docs overview

https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit

I will attach issues to this JIRA to help organize and track the project as we 
make progress.





[jira] [Created] (ARROW-8938) [R] Provide binding and argument packing to use arrow::compute::CallFunction to use any compute kernel from R dynamically

2020-05-25 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8938:
---

 Summary: [R] Provide binding and argument packing to use 
arrow::compute::CallFunction to use any compute kernel from R dynamically
 Key: ARROW-8938
 URL: https://issues.apache.org/jira/browse/ARROW-8938
 Project: Apache Arrow
  Issue Type: New Feature
  Components: R
Reporter: Wes McKinney
 Fix For: 1.0.0


This will drastically simplify exposing new functions to R users





[jira] [Created] (ARROW-8933) [C++] Reduce generated code in vector_hash.cc

2020-05-25 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8933:
---

 Summary: [C++] Reduce generated code in vector_hash.cc
 Key: ARROW-8933
 URL: https://issues.apache.org/jira/browse/ARROW-8933
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


Since hashing doesn't need to know about logical types, we can do the following:

* Use same generated code for both BinaryType and StringType
* Use same generated code for primitive types having the same byte width

These two changes should reduce binary size and improve compilation speed





[jira] [Created] (ARROW-8937) [C++] Add "parse_strptime" function for string to timestamp conversions using the kernels framework

2020-05-25 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8937:
---

 Summary: [C++] Add "parse_strptime" function for string to 
timestamp conversions using the kernels framework
 Key: ARROW-8937
 URL: https://issues.apache.org/jira/browse/ARROW-8937
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney


This should be relatively straightforward to implement using the new kernels 
framework





[jira] [Created] (ARROW-8936) [C++] Parallelize execution of arrow::compute::ScalarFunction

2020-05-25 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8936:
---

 Summary: [C++] Parallelize execution of 
arrow::compute::ScalarFunction
 Key: ARROW-8936
 URL: https://issues.apache.org/jira/browse/ARROW-8936
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0








[jira] [Created] (ARROW-8935) [Python] Add necessary plumbing to enable Numba-generated functions to be registered as functions in the global C++ function/kernels registry

2020-05-25 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8935:
---

 Summary: [Python] Add necessary plumbing to enable Numba-generated 
functions to be registered as functions in the global C++ function/kernels 
registry
 Key: ARROW-8935
 URL: https://issues.apache.org/jira/browse/ARROW-8935
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Python
Reporter: Wes McKinney








[jira] [Created] (ARROW-8934) [C++] Add timestamp subtract kernel aliased to int64 subtract implementation

2020-05-25 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8934:
---

 Summary: [C++] Add timestamp subtract kernel aliased to int64 
subtract implementation
 Key: ARROW-8934
 URL: https://issues.apache.org/jira/browse/ARROW-8934
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


We can use the same scalar exec function for int64 subtraction as well as 
{{(array[TIMESTAMP], array[TIMESTAMP]) -> duration}}. 
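
A toy illustration of the aliasing, assuming timestamps and durations are both stored as int64 counts of the same time unit. `KernelSketch`, the signature strings, and `MakeSubtractKernels` are hypothetical and do not reflect the real registry types:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Element-wise int64 subtraction. Overflow is not handled in this sketch.
inline void SubtractInt64(const int64_t* left, const int64_t* right,
                          int64_t* out, std::size_t length) {
  for (std::size_t i = 0; i < length; ++i) out[i] = left[i] - right[i];
}

using ExecFn = void (*)(const int64_t*, const int64_t*, int64_t*,
                        std::size_t);

// Hypothetical kernel record: a signature plus the function that runs it.
struct KernelSketch {
  std::string signature;
  ExecFn exec;
};

// Both signatures point at the same compiled function, so the timestamp
// kernel adds no generated code.
inline std::vector<KernelSketch> MakeSubtractKernels() {
  return {{"(int64, int64) -> int64", &SubtractInt64},
          {"(timestamp, timestamp) -> duration", &SubtractInt64}};
}
```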





[jira] [Created] (ARROW-8930) [C++] libz.so linking error with liborc.a

2020-05-24 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8930:
---

 Summary: [C++] libz.so linking error with liborc.a
 Key: ARROW-8930
 URL: https://issues.apache.org/jira/browse/ARROW-8930
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Continuous Integration
Reporter: Wes McKinney
 Fix For: 1.0.0


This is failing in the Travis CI ARM build

https://travis-ci.org/github/apache/arrow/jobs/690722203

{code}
: && /usr/bin/ccache /usr/bin/c++  -Wno-noexcept-type  
-fdiagnostics-color=always -ggdb -O0  -Wall -Wno-conversion 
-Wno-sign-conversion -Wno-unused-variable -Werror -march=armv8-a  -g  -rdynamic 
src/arrow/adapters/orc/CMakeFiles/arrow-orc-adapter-test.dir/adapter_test.cc.o  
-o debug/arrow-orc-adapter-test  -Wl,-rpath,/build/cpp/debug  
debug/libarrow_testing.a  debug/libarrow.a  debug//libgtest_maind.so  
debug//libgtestd.so  /usr/lib/aarch64-linux-gnu/libsnappy.so.1.1.8  
/usr/lib/aarch64-linux-gnu/liblz4.so  /usr/lib/aarch64-linux-gnu/libz.so  
-lpthread  -ldl  orc_ep-install/lib/liborc.a  
/usr/lib/aarch64-linux-gnu/libssl.so  /usr/lib/aarch64-linux-gnu/libcrypto.so  
/usr/lib/aarch64-linux-gnu/libbrotlienc.so  
/usr/lib/aarch64-linux-gnu/libbrotlidec.so  
/usr/lib/aarch64-linux-gnu/libbrotlicommon.so  
/usr/lib/aarch64-linux-gnu/libbz2.so  /usr/lib/aarch64-linux-gnu/libzstd.so  
/usr/lib/aarch64-linux-gnu/libprotobuf.so  
/usr/lib/aarch64-linux-gnu/libglog.so  
jemalloc_ep-prefix/src/jemalloc_ep/dist//lib/libjemalloc_pic.a  -pthread  -lrt 
&& :
/usr/bin/ld: orc_ep-install/lib/liborc.a(Compression.cc.o): undefined reference 
to symbol 'inflateEnd'
/usr/bin/ld: /usr/lib/aarch64-linux-gnu/libz.so: error adding symbols: DSO 
missing from command line
collect2: error: ld returned 1 exit status
{code}





[jira] [Created] (ARROW-8929) [C++] Change compute::Arity::VarArgs min_args default to 0

2020-05-24 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8929:
---

 Summary: [C++] Change compute::Arity::VarArgs min_args default to 0
 Key: ARROW-8929
 URL: https://issues.apache.org/jira/browse/ARROW-8929
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


The issue of minimum number of arguments is separate from providing an 
{{InputType}} for input type checking. 





[jira] [Created] (ARROW-8928) [C++] Measure microperformance associated with data structure access interactions with arrow::compute::ExecBatch

2020-05-24 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8928:
---

 Summary: [C++] Measure microperformance associated with data 
structure access interactions with arrow::compute::ExecBatch
 Key: ARROW-8928
 URL: https://issues.apache.org/jira/browse/ARROW-8928
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


{{arrow::compute::ExecBatch}} uses a vector of {{arrow::Datum}} to contain a 
collection of ArrayData and Scalar objects for kernel execution. It would be 
helpful to know how many nanoseconds of overhead are associated with basic 
interactions with this data structure to know the cost of using our vendored 
variant, and other such issues. 





[jira] [Created] (ARROW-8926) [C++] Improve docstrings in new public APIs in arrow/compute and fix miscellaneous typos

2020-05-24 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8926:
---

 Summary: [C++] Improve docstrings in new public APIs in 
arrow/compute and fix miscellaneous typos
 Key: ARROW-8926
 URL: https://issues.apache.org/jira/browse/ARROW-8926
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 1.0.0


I've noticed some imprecise language while reading the headers and some other 
opportunities for improvement





[jira] [Created] (ARROW-8923) [C++] Improve usability of arrow::compute::CallFunction by moving ExecContext* argument to end and adding default

2020-05-24 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8923:
---

 Summary: [C++] Improve usability of arrow::compute::CallFunction 
by moving ExecContext* argument to end and adding default
 Key: ARROW-8923
 URL: https://issues.apache.org/jira/browse/ARROW-8923
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0








[jira] [Created] (ARROW-8922) [C++] Implement example string scalar kernel function to assist with string kernels buildout per ARROW-555

2020-05-24 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8922:
---

 Summary: [C++] Implement example string scalar kernel function to 
assist with string kernels buildout per ARROW-555
 Key: ARROW-8922
 URL: https://issues.apache.org/jira/browse/ARROW-8922
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


I will write a patch to provide an example of creating a string-input 
string-output kernel for executing scalar-valued string functions





[jira] [Created] (ARROW-8921) [C++] Add "TypeResolver" class interface to replace current OutputType::Resolver pattern

2020-05-24 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8921:
---

 Summary: [C++] Add "TypeResolver" class interface to replace 
current OutputType::Resolver pattern
 Key: ARROW-8921
 URL: https://issues.apache.org/jira/browse/ARROW-8921
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


Like the {{TypeMatcher}} for extensible input type checking, TypeResolver will 
allow more flexibility with respect to the output type resolution rule. 
Currently the resolver function is defined as

{code}
using Resolver =
    std::function<Result<ValueDescr>(KernelContext*, const std::vector<ValueDescr>&)>;
{code}

By changing to a {{TypeResolver}} interface with a virtual Resolve function, we 
also can provide for better human-readability when printing kernel signatures 
(by having {{TypeResolver::ToString}}) and permitting TypeResolvers to be 
compared
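
As a rough illustration of the shape such an interface might take, here is a plain-Python sketch rather than the C++ design itself. The subclasses {{FirstArgType}} and {{FixedType}} and the helper {{signature_to_string}} are hypothetical names chosen for the example.

```python
from abc import ABC, abstractmethod


class TypeResolver(ABC):
    """Hypothetical interface: computes an output type from the input types."""

    @abstractmethod
    def resolve(self, input_types):
        """Return the output type name for the given input type names."""

    @abstractmethod
    def to_string(self):
        """Human-readable description used when printing signatures."""

    def __eq__(self, other):
        # Resolvers compare equal when they are the same kind with the same state.
        return type(self) is type(other) and self.__dict__ == other.__dict__


class FirstArgType(TypeResolver):
    """Output type equals the type of the argument at `index`."""

    def __init__(self, index=0):
        self.index = index

    def resolve(self, input_types):
        return input_types[self.index]

    def to_string(self):
        return f"type_of(arg{self.index})"


class FixedType(TypeResolver):
    """Output type is always the same, regardless of the inputs."""

    def __init__(self, type_name):
        self.type_name = type_name

    def resolve(self, input_types):
        return self.type_name

    def to_string(self):
        return self.type_name


def signature_to_string(input_types, resolver):
    """Render a kernel signature such as '(int32, int32) -> bool'."""
    return f"({', '.join(input_types)}) -> {resolver.to_string()}"
```

A bare {{std::function}} can do neither of the last two things: it cannot describe itself and cannot be compared for equality, which is the motivation for moving to an interface.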





[jira] [Created] (ARROW-8920) [CI] ARM Travis CI build is failing with archery "case_sensitive" error

2020-05-24 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8920:
---

 Summary: [CI] ARM Travis CI build is failing with archery 
"case_sensitive" error
 Key: ARROW-8920
 URL: https://issues.apache.org/jira/browse/ARROW-8920
 Project: Apache Arrow
  Issue Type: Bug
  Components: CI
Reporter: Wes McKinney
 Fix For: 1.0.0


See https://travis-ci.org/github/apache/arrow/jobs/690602409

{code}
Traceback (most recent call last):
  File "/home/travis/.local/bin/archery", line 11, in <module>
    load_entry_point('archery', 'console_scripts', 'archery')()
  File "/usr/local/lib/python3.6/dist-packages/pkg_resources/__init__.py", line 490, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/usr/local/lib/python3.6/dist-packages/pkg_resources/__init__.py", line 2853, in load_entry_point
    return ep.load()
  File "/usr/local/lib/python3.6/dist-packages/pkg_resources/__init__.py", line 2453, in load
    return self.resolve()
  File "/usr/local/lib/python3.6/dist-packages/pkg_resources/__init__.py", line 2459, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "/home/travis/build/apache/arrow/dev/archery/archery/cli.py", line 100, in <module>
    case_sensitive=False)
TypeError: __init__() got an unexpected keyword argument 'case_sensitive'
{code}





[jira] [Created] (ARROW-8919) [C++] Add "DispatchBest" APIs to compute::Function that selects a kernel that may require implicit casts to invoke

2020-05-24 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8919:
---

 Summary: [C++] Add "DispatchBest" APIs to compute::Function that 
selects a kernel that may require implicit casts to invoke
 Key: ARROW-8919
 URL: https://issues.apache.org/jira/browse/ARROW-8919
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


Currently we have "DispatchExact" which requires an exact match of input types. 
"DispatchBest" would permit kernel selection with implicit casts required. 
Since multiple kernels may be valid when allowing implicit casts, we will need 
to break ties by estimating the "cost" of the implicit casts. For example, 
casting int8 to int32 is "less expensive" than implicitly casting to int64
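
A minimal sketch of the tie-breaking idea in plain Python, not Arrow code. It assumes the "cost" of an implicit integer cast is simply the number of bits added by widening and that narrowing is never done implicitly; both the cost model and the function names are assumptions made for illustration.

```python
# Bit widths of the signed integer types used in this sketch.
INT_WIDTH = {"int8": 8, "int16": 16, "int32": 32, "int64": 64}


def cast_cost(from_type, to_type):
    """Cost of implicitly casting from_type to to_type, or None if not allowed."""
    if from_type == to_type:
        return 0
    if from_type in INT_WIDTH and to_type in INT_WIDTH:
        widening = INT_WIDTH[to_type] - INT_WIDTH[from_type]
        return widening if widening > 0 else None  # never narrow implicitly
    return None


def dispatch_exact(kernels, arg_types):
    """Return the kernel signature matching arg_types exactly, else None."""
    for signature in kernels:
        if tuple(signature) == tuple(arg_types):
            return signature
    return None


def dispatch_best(kernels, arg_types):
    """Return the applicable signature with the lowest total cast cost, else None."""
    best, best_cost = None, None
    for signature in kernels:
        if len(signature) != len(arg_types):
            continue
        costs = [cast_cost(a, p) for a, p in zip(arg_types, signature)]
        if any(c is None for c in costs):
            continue  # at least one argument cannot be cast implicitly
        total = sum(costs)
        if best_cost is None or total < best_cost:
            best, best_cost = signature, total
    return best


kernels = [("int32", "int32"), ("int64", "int64")]
```

With these two kernels, {{(int8, int8)}} has no exact match, both kernels are reachable through widening casts, and the int32 kernel is preferred because it adds fewer bits.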





[jira] [Created] (ARROW-8918) [C++] Add cast "metafunction" to FunctionRegistry that addresses dispatching to appropriate type-specific CastFunction

2020-05-24 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8918:
---

 Summary: [C++] Add cast "metafunction" to FunctionRegistry that 
addresses dispatching to appropriate type-specific CastFunction
 Key: ARROW-8918
 URL: https://issues.apache.org/jira/browse/ARROW-8918
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


By setting the output type in {{CastOptions}}, we can write

{code}
call_function("cast", [arg], cast_options)
{code}

This simplifies use of casting for binding developers
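
A toy model of the dispatch in plain Python, not the Arrow implementation: a single registered "cast" entry reads the target type from the options and forwards to a type-specific cast function. {{CAST_FUNCTIONS}}, {{REGISTRY}} and the two cast helpers are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class CastOptions:
    """Carries the requested output type, as in the proposal above."""
    to_type: str


# Hypothetical type-specific cast functions, keyed by output type.
def cast_to_int64(values):
    return [int(v) for v in values]


def cast_to_utf8(values):
    return [str(v) for v in values]


CAST_FUNCTIONS = {"int64": cast_to_int64, "utf8": cast_to_utf8}


def cast_metafunction(args, options):
    """Look up the cast function for options.to_type and delegate to it."""
    (values,) = args
    try:
        cast_fn = CAST_FUNCTIONS[options.to_type]
    except KeyError:
        raise NotImplementedError(f"no cast to {options.to_type}") from None
    return cast_fn(values)


# The registry exposes one name; callers never pick a CastFunction themselves.
REGISTRY = {"cast": cast_metafunction}


def call_function(name, args, options=None):
    return REGISTRY[name](args, options)
```

A binding then only needs the generic call path plus a way to build the options object.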





[jira] [Created] (ARROW-8917) [C++] Add compute::Function subclass for invoking certain kernels on RecordBatch/Table-valued inputs

2020-05-24 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8917:
---

 Summary: [C++] Add compute::Function subclass for invoking certain 
kernels on RecordBatch/Table-valued inputs
 Key: ARROW-8917
 URL: https://issues.apache.org/jira/browse/ARROW-8917
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


This will enable bindings to invoke such functions (e.g. take, filter) with a call like

{code}
call_function('take', [table, indices])
{code}
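
One way to picture this in plain Python, assuming a table is just a mapping from column name to a list of values (an assumption of this sketch, not the Arrow data model): the table-level function applies the array-level kernel to every column. All function names here are hypothetical.

```python
def take_array(values, indices):
    """Array-level kernel: gather values at the given positions."""
    return [values[i] for i in indices]


def take_table(table, indices):
    """Table-level wrapper: apply the array kernel to each column."""
    return {name: take_array(column, indices) for name, column in table.items()}


def call_function(name, args):
    """Single entry point that accepts either an array or a table as input."""
    if name != "take":
        raise KeyError(name)
    data, indices = args
    if isinstance(data, dict):
        return take_table(data, indices)
    return take_array(data, indices)


table = {"a": [10, 20, 30], "b": ["x", "y", "z"]}
result = call_function("take", [table, [2, 0]])
```

The caller uses the same {{call_function}} spelling whether the first argument is a single array or a whole table.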





[jira] [Created] (ARROW-8916) [Python] Add relevant glue for implementing each kind of FunctionOptions

2020-05-24 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8916:
---

 Summary: [Python] Add relevant glue for implementing each kind of 
FunctionOptions
 Key: ARROW-8916
 URL: https://issues.apache.org/jira/browse/ARROW-8916
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney
 Fix For: 1.0.0








[jira] [Created] (ARROW-8905) [C++] Collapse Take APIs from 8 to 1 or 2

2020-05-22 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8905:
---

 Summary: [C++] Collapse Take APIs from 8 to 1 or 2
 Key: ARROW-8905
 URL: https://issues.apache.org/jira/browse/ARROW-8905
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


There are currently 8 {{Take}} functions with different function signatures. 
Fewer functions would make life easier for binding developers





[jira] [Created] (ARROW-8904) [Python] Fix usages of deprecated C++ APIs related to child/field

2020-05-22 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8904:
---

 Summary: [Python] Fix usages of deprecated C++ APIs related to 
child/field
 Key: ARROW-8904
 URL: https://issues.apache.org/jira/browse/ARROW-8904
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney
 Fix For: 1.0.0


{code}
-- Running cmake --build for pyarrow
cmake --build . --config debug -- -j16
[19/20] Building CXX object CMakeFiles/lib.dir/lib.cpp.o
lib.cpp:20265:85: warning: 'num_children' is deprecated: Use num_fields() 
[-Wdeprecated-declarations]
  __pyx_t_1 = __pyx_f_7pyarrow_3lib__normalize_index(__pyx_v_i, 
__pyx_v_self->type->num_children()); if (unlikely(__pyx_t_1 == 
((Py_ssize_t)-1L))) __PYX_ERR(1, 119, __pyx_L1_error)

^
/home/wesm/local/include/arrow/type.h:263:3: note: 'num_children' has been 
explicitly marked deprecated here
  ARROW_DEPRECATED("Use num_fields()")
  ^
/home/wesm/local/include/arrow/util/macros.h:104:48: note: expanded from macro 
'ARROW_DEPRECATED'
#  define ARROW_DEPRECATED(...) __attribute__((deprecated(__VA_ARGS__)))
   ^
lib.cpp:20276:76: warning: 'child' is deprecated: Use field(i) 
[-Wdeprecated-declarations]
  __pyx_t_2 = 
__pyx_f_7pyarrow_3lib_pyarrow_wrap_field(__pyx_v_self->type->child(__pyx_v_index));
 if (unlikely(!__pyx_t_2)) __PYX_ERR(1, 120, __pyx_L1_error)
   ^
/home/wesm/local/include/arrow/type.h:251:3: note: 'child' has been explicitly 
marked deprecated here
  ARROW_DEPRECATED("Use field(i)")
  ^
/home/wesm/local/include/arrow/util/macros.h:104:48: note: expanded from macro 
'ARROW_DEPRECATED'
#  define ARROW_DEPRECATED(...) __attribute__((deprecated(__VA_ARGS__)))
   ^
lib.cpp:20507:56: warning: 'num_children' is deprecated: Use num_fields() 
[-Wdeprecated-declarations]
  __pyx_t_1 = __Pyx_PyInt_From_int(__pyx_v_self->type->num_children()); if 
(unlikely(!__pyx_t_1)) __PYX_ERR(1, 139, __pyx_L1_error)
   ^
/home/wesm/local/include/arrow/type.h:263:3: note: 'num_children' has been 
explicitly marked deprecated here
  ARROW_DEPRECATED("Use num_fields()")
  ^
/home/wesm/local/include/arrow/util/macros.h:104:48: note: expanded from macro 
'ARROW_DEPRECATED'
#  define ARROW_DEPRECATED(...) __attribute__((deprecated(__VA_ARGS__)))
   ^
lib.cpp:23361:44: warning: 'num_children' is deprecated: Use num_fields() 
[-Wdeprecated-declarations]
  __pyx_r = __pyx_v_self->__pyx_base.type->num_children();
   ^
/home/wesm/local/include/arrow/type.h:263:3: note: 'num_children' has been 
explicitly marked deprecated here
  ARROW_DEPRECATED("Use num_fields()")
  ^
/home/wesm/local/include/arrow/util/macros.h:104:48: note: expanded from macro 
'ARROW_DEPRECATED'
#  define ARROW_DEPRECATED(...) __attribute__((deprecated(__VA_ARGS__)))
   ^
lib.cpp:24039:44: warning: 'num_children' is deprecated: Use num_fields() 
[-Wdeprecated-declarations]
  __pyx_r = __pyx_v_self->__pyx_base.type->num_children();
   ^
/home/wesm/local/include/arrow/type.h:263:3: note: 'num_children' has been 
explicitly marked deprecated here
  ARROW_DEPRECATED("Use num_fields()")
  ^
/home/wesm/local/include/arrow/util/macros.h:104:48: note: expanded from macro 
'ARROW_DEPRECATED'
#  define ARROW_DEPRECATED(...) __attribute__((deprecated(__VA_ARGS__)))
   ^
lib.cpp:58220:37: warning: 'child' is deprecated: Use field(pos) 
[-Wdeprecated-declarations]
  __pyx_v_child = __pyx_v_self->ap->child(__pyx_v_child_id);
^
/home/wesm/local/include/arrow/array.h:1281:3: note: 'child' has been 
explicitly marked deprecated here
  ARROW_DEPRECATED("Use field(pos)")
  ^
/home/wesm/local/include/arrow/util/macros.h:104:48: note: expanded from macro 
'ARROW_DEPRECATED'
#  define ARROW_DEPRECATED(...) __attribute__((deprecated(__VA_ARGS__)))
   ^
lib.cpp:58956:74: warning: 'children' is deprecated: Use fields() 
[-Wdeprecated-declarations]
  __pyx_v_child_fields = 
__pyx_v_self->__pyx_base.__pyx_base.type->type->children();
 ^
/home/wesm/local/include/arrow/type.h:257:3: note: 'children' has been 
explicitly marked deprecated here
  ARROW_DEPRECATED("Use fields()")
  ^
/home/wesm/local/include/arrow/util/macros.h:104:48: note: expanded from macro 
'ARROW_DEPRECATED'
#  define ARROW_DEPRECATED(...) __attribute__((deprecated(__VA_ARGS__)))

[jira] [Created] (ARROW-8903) [C++] Implement optimized "unsafe take" for use with selection vectors for kernel execution

2020-05-22 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8903:
---

 Summary: [C++] Implement optimized "unsafe take" for use with 
selection vectors for kernel execution
 Key: ARROW-8903
 URL: https://issues.apache.org/jira/browse/ARROW-8903
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


Selection vectors constructed from filters do not need to be subjected to 
bounds checking and other such safety checks as are present with a usual 
invocation of {{take}}. So based on the type width of a selection vector 
(uint16?) we should implement highly streamlined take implementations that 
additionally take into consideration that selection vectors are monotonic by 
construction
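
A plain-Python sketch of the distinction, not Arrow code. It assumes the selection vector is produced from a boolean filter over the same data, so every index is in range and the indices are strictly increasing. Python itself still checks list indexing internally; the point is only which checks the kernel performs. The function names are hypothetical.

```python
def selection_vector_from_filter(mask):
    """Positions where the filter is true: in range and increasing by construction."""
    return [i for i, keep in enumerate(mask) if keep]


def take_checked(values, indices):
    """General-purpose take: validates every index before using it."""
    n = len(values)
    out = []
    for i in indices:
        if i < 0 or i >= n:
            raise IndexError(f"index {i} out of bounds for length {n}")
        out.append(values[i])
    return out


def take_unsafe(values, selection):
    """Streamlined take: the caller guarantees a valid selection vector."""
    return [values[i] for i in selection]


values = ["a", "b", "c", "d"]
selection = selection_vector_from_filter([True, False, True, True])
```

Because the selection is monotonic, a real implementation could also read the source buffer strictly forward, which is what makes a specialized code path attractive.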





[jira] [Created] (ARROW-8901) [C++] Reduce number of take kernels

2020-05-22 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8901:
---

 Summary: [C++] Reduce number of take kernels
 Key: ARROW-8901
 URL: https://issues.apache.org/jira/browse/ARROW-8901
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


After ARROW-8792 we can observe that we are generating 312 take kernels

{code}
In [1]: import pyarrow.compute as pc

In [2]: reg = pc.function_registry()

In [3]: reg.get_function('take')
Out[3]: 
arrow.compute.Function
kind: vector
num_kernels: 312
{code}

You can see them all here: 
https://gist.github.com/wesm/c3085bf40fa2ee5e555204f8c65b4ad5

It's probably going to be sufficient to only support int16, int32, and int64 
index types for almost all types and insert implicit casts (once we implement 
implicit-cast-insertion into the execution code) for other index types. If we 
determine that there is some performance hot path where we need to specialize 
for other index types, then we can always do that.

Additionally, we should be able to collapse the date/time kernels since we're 
just moving memory.





[jira] [Created] (ARROW-8898) [C++] Determine desirable maximum length for ExecBatch in pipelined and parallel execution of kernels

2020-05-22 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8898:
---

 Summary: [C++] Determine desirable maximum length for ExecBatch in 
pipelined and parallel execution of kernels
 Key: ARROW-8898
 URL: https://issues.apache.org/jira/browse/ARROW-8898
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


Maximum lengths like 16K or 64K seem to be popular, but we should write our own 
benchmarks so that we can justify the choice of default chunksize





[jira] [Created] (ARROW-8897) [C++] Determine strategy for propagating failures in initializing built-in function registry in arrow/compute

2020-05-22 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8897:
---

 Summary: [C++] Determine strategy for propagating failures in 
initializing built-in function registry in arrow/compute
 Key: ARROW-8897
 URL: https://issues.apache.org/jira/browse/ARROW-8897
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


As discussed on https://github.com/apache/arrow/pull/7240, we are using 
{{DCHECK_OK}} to check statuses when initializing the built-in registry. 

We could propagate failures by changing {{arrow::compute::GetFunctionRegistry}} 
to return Result, but there may be other ways





[jira] [Created] (ARROW-8896) [C++] Reimplement dictionary unpacking in Cast kernels using Take

2020-05-22 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8896:
---

 Summary: [C++] Reimplement dictionary unpacking in Cast kernels 
using Take
 Key: ARROW-8896
 URL: https://issues.apache.org/jira/browse/ARROW-8896
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


As suggested by [~apitrou] this should yield less code to maintain





[jira] [Created] (ARROW-8895) [C++] Add C++ unit tests for filter function on temporal type inputs, including timestamps

2020-05-22 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8895:
---

 Summary: [C++] Add C++ unit tests for filter function on temporal 
type inputs, including timestamps
 Key: ARROW-8895
 URL: https://issues.apache.org/jira/browse/ARROW-8895
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


These are used in R but not tested in C++, so I only found out that I had 
missed adding the kernels to the Filter VectorFunction when running the R test 
suite





[jira] [Created] (ARROW-8894) [C++] C++ array kernels framework and execution buildout (umbrella issue)

2020-05-22 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8894:
---

 Summary: [C++] C++ array kernels framework and execution buildout 
(umbrella issue)
 Key: ARROW-8894
 URL: https://issues.apache.org/jira/browse/ARROW-8894
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney


In the wake of ARROW-8792, this issue is to serve as an umbrella issue for 
follow up work and associated "buildout" which includes things like:

* Implementation of many new function types and adding new kernel cases to 
existing functions
* Adding implicit casting functionality to function execution
* Creation of "bound" physical array expressions
* Pipeline execution (executing multiple kernels while eliminating temporary 
allocation)
* Parallel execution of scalar and aggregate kernels (including parallel 
execution of pipelined kernels)

There are quite a few existing JIRAs in the project that I'll attach to this 
issue, and I'll open plenty more issues as things occur to me to help organize 
the work. 





[jira] [Created] (ARROW-8893) [R] Fix cpplint issues introduced by ARROW-8885

2020-05-22 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8893:
---

 Summary: [R] Fix cpplint issues introduced by ARROW-8885
 Key: ARROW-8893
 URL: https://issues.apache.org/jira/browse/ARROW-8893
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: Wes McKinney
 Fix For: 1.0.0


{code}
(arrow-3.7) 12:34 ~/code/arrow/r $ ./lint.sh 
/home/wesm/code/arrow/r/src/arrow_types.h:20:  Include the directory when naming .h files  [build/include_subdir] [4]
/home/wesm/code/arrow/r/src/arrow_types.h:66:  Add #include <utility> for forward  [build/include_what_you_use] [4]
/home/wesm/code/arrow/r/src/arrow_types.h:83:  Add #include <vector> for vector<>  [build/include_what_you_use] [4]
/home/wesm/code/arrow/r/src/arrow_types.h:95:  Add #include <limits> for numeric_limits<>  [build/include_what_you_use] [4]
/home/wesm/code/arrow/r/src/arrow_types.h:110:  Add #include <memory> for shared_ptr<>  [build/include_what_you_use] [4]

/home/wesm/code/arrow/r/src/arrow_exports.h:22:  Include the directory when naming .h files  [build/include_subdir] [4]
{code}





[jira] [Created] (ARROW-8892) [C++][CI] CI builds for MSVC do not build benchmarks

2020-05-22 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8892:
---

 Summary: [C++][CI] CI builds for MSVC do not build benchmarks
 Key: ARROW-8892
 URL: https://issues.apache.org/jira/browse/ARROW-8892
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


We must ensure that our benchmarks always build on Windows

I'm fixing these errors for example in ARROW-8792

{code}
C:/Users/wesmc/code/arrow/cpp/src/parquet/encoding_benchmark.cc(249): error 
C2220: warning treated as error - no 'object' file generated
C:/Users/wesmc/code/arrow/cpp/src/parquet/encoding_benchmark.cc(256): note: see 
reference to function template instantiation 'void 
parquet::BM_PlainEncodingSpaced(benchmark::State &)' 
being compiled
C:/Users/wesmc/code/arrow/cpp/src/parquet/encoding_benchmark.cc(249): warning 
C4244: 'argument': conversion from 'const int64_t' to 'int', possible loss of 
data
C:/Users/wesmc/code/arrow/cpp/src/parquet/encoding_benchmark.cc(292): warning 
C4244: 'argument': conversion from 'const int64_t' to 'int', possible loss of 
data
C:/Users/wesmc/code/arrow/cpp/src/parquet/encoding_benchmark.cc(306): note: see 
reference to function template instantiation 'void 
parquet::BM_PlainDecodingSpaced(benchmark::State &)' 
being compiled
C:/Users/wesmc/code/arrow/cpp/src/parquet/encoding_benchmark.cc(299): warning 
C4244: 'argument': conversion from 'int64_t' to 'int', possible loss of data
C:/Users/wesmc/code/arrow/cpp/src/parquet/encoding_benchmark.cc(300): warning 
C4244: 'argument': conversion from 'const int64_t' to 'int', possible loss of 
data
[11/67] Linking CXX executable release\arrow-ipc-read-write-benchmark.exe
{code}





[jira] [Created] (ARROW-8891) [C++] Split non-cast compute kernels into a separate shared library

2020-05-22 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8891:
---

 Summary: [C++] Split non-cast compute kernels into a separate 
shared library
 Key: ARROW-8891
 URL: https://issues.apache.org/jira/browse/ARROW-8891
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


Since we are going to implement a lot more precompiled kernels, I am not sure 
it makes sense to require all of them to be compiled unconditionally just to 
get access to {{compute::Cast}}, which is needed in many different contexts.

After ARROW-8792 is merged, I would suggest creating a plugin hook for adding a 
bundle of kernels from a shared library outside of libarrow.so, and then moving 
all the object code outside of Cast to something like libarrow_compute.so. Then 
we can change the CMake flags to compile Cast kernels always (?) and then opt 
in to building the additional kernels package separately





[jira] [Created] (ARROW-8876) [C++] Implement casts from date types to Timestamp

2020-05-20 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8876:
---

 Summary: [C++] Implement casts from date types to Timestamp
 Key: ARROW-8876
 URL: https://issues.apache.org/jira/browse/ARROW-8876
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


Discovered the absence of this while refactoring cast.cc. Since we can cast 
Timestamp -> date, we should be able to cast the other way
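
The arithmetic involved is small. The sketch below, in plain Python rather than Arrow code, assumes a date value counts days since the UNIX epoch and a timestamp counts seconds, milliseconds, microseconds or nanoseconds since the same epoch. The function names are hypothetical, and flooring in the reverse direction is a choice made for this sketch.

```python
from datetime import date

SECONDS_PER_DAY = 86_400
UNITS_PER_SECOND = {"s": 1, "ms": 1_000, "us": 1_000_000, "ns": 1_000_000_000}


def date32_to_timestamp(days, unit):
    """Days since epoch -> timestamp at midnight of that day, in `unit`."""
    return days * SECONDS_PER_DAY * UNITS_PER_SECOND[unit]


def timestamp_to_date32(value, unit):
    """Timestamp in `unit` -> days since epoch (floor, so pre-epoch values work)."""
    return value // (SECONDS_PER_DAY * UNITS_PER_SECOND[unit])


# Day number of the date this issue was filed.
days = (date(2020, 5, 20) - date(1970, 1, 1)).days
as_seconds = date32_to_timestamp(days, "s")
as_millis = date32_to_timestamp(days, "ms")
```

Going from date to timestamp is a pure multiplication, so unlike the existing direction it loses no information, though it can overflow int64 for nanosecond units on dates far from the epoch.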





[jira] [Created] (ARROW-8866) [C++] Split Type::UNION into Type::SPARSE_UNION and Type::DENSE_UNION

2020-05-19 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8866:
---

 Summary: [C++] Split Type::UNION into Type::SPARSE_UNION and 
Type::DENSE_UNION
 Key: ARROW-8866
 URL: https://issues.apache.org/jira/browse/ARROW-8866
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


Similar to the recent {{Type::INTERVAL}} split, having these two array types 
which have different memory layouts under the same {{Type::type}} value makes 
function dispatch somewhat more complicated. This issue is less critical than 
INTERVAL, so this may not be urgent, but it seems like a good pre-1.0 change





[jira] [Created] (ARROW-8823) [C++] Compute aggregate compression ratio when producing compressed IPC body messages

2020-05-16 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8823:
---

 Summary: [C++] Compute aggregate compression ratio when producing 
compressed IPC body messages
 Key: ARROW-8823
 URL: https://issues.apache.org/jira/browse/ARROW-8823
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


It would be beneficial to know the exact bytes-on-wire savings once the message 
has been produced. Since this computation would be relatively trivial it would 
not add overhead to the IPC write hot path. 
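
A minimal sketch of the bookkeeping in plain Python. It uses {{zlib}} only as a stand-in codec because it ships with the standard library; the function names are hypothetical and nothing here is the IPC writer's API. The point is that the ratio needs just two running byte counters updated while each buffer is compressed.

```python
import zlib


def compress_buffers(buffers):
    """Compress each buffer and keep running totals of raw and compressed bytes."""
    compressed = []
    raw_bytes = 0
    compressed_bytes = 0
    for buf in buffers:
        out = zlib.compress(buf)
        raw_bytes += len(buf)
        compressed_bytes += len(out)
        compressed.append(out)
    return compressed, raw_bytes, compressed_bytes


def compression_ratio(raw_bytes, compressed_bytes):
    """Aggregate ratio over the whole message body (raw / compressed)."""
    if compressed_bytes == 0:
        return float("nan")
    return raw_bytes / compressed_bytes


# Two highly compressible body buffers standing in for a record batch.
buffers = [bytes(4096), b"abc" * 1000]
compressed, raw_bytes, compressed_bytes = compress_buffers(buffers)
ratio = compression_ratio(raw_bytes, compressed_bytes)
```

The bytes-on-wire saving is then {{raw_bytes - compressed_bytes}}, available as soon as the last buffer has been written.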





[jira] [Created] (ARROW-8800) [C++] Split arrow::ChunkedArray into arrow/chunked_array.h

2020-05-14 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8800:
---

 Summary: [C++] Split arrow::ChunkedArray into arrow/chunked_array.h
 Key: ARROW-8800
 URL: https://issues.apache.org/jira/browse/ARROW-8800
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


There are plenty of scenarios where ChunkedArray is used separately from Table, 
so it would probably make sense to split up the headers, implementation, and 
unit tests





[jira] [Created] (ARROW-8793) [C++] BitUtil::SetBitsTo probably doesn't need to be inline

2020-05-13 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8793:
---

 Summary: [C++] BitUtil::SetBitsTo probably doesn't need to be 
inline
 Key: ARROW-8793
 URL: https://issues.apache.org/jira/browse/ARROW-8793
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


Inlining this function probably does not yield meaningful performance benefits





[jira] [Created] (ARROW-8792) [C++] Improved declarative compute function / kernel development framework, normalize calling conventions

2020-05-13 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8792:
---

 Summary: [C++] Improved declarative compute function / kernel 
development framework, normalize calling conventions
 Key: ARROW-8792
 URL: https://issues.apache.org/jira/browse/ARROW-8792
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 1.0.0


I'm working on a significant revamp of the way that kernels are implemented in 
the project as discussed on the mailing list. PR to follow within the next week 
or sooner

A brief list of features:

* Kernel selection that takes into account the shape of inputs (whether Scalar 
or Array, so you can provide an implementation just for Arrays and a separate 
one just for Scalars if you want)
* More customizable / less monolithic type-to-kernel dispatch
* Browsable function registry (see all available kernels and their input type 
signatures)
* Central code path for type-checking and argument validation
* Central code path for kernel execution on ChunkedArray inputs

There are a lot of JIRAs in the backlog that will follow from this work, so I 
will attach those to this issue for visibility, but this issue will cover the 
initial refactoring work to port the existing code to the new framework without 
altering existing features.





[jira] [Created] (ARROW-8769) [C++] Add convenience methods to access fields by name in StructScalar

2020-05-11 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8769:
---

 Summary: [C++] Add convenience methods to access fields by name in 
StructScalar
 Key: ARROW-8769
 URL: https://issues.apache.org/jira/browse/ARROW-8769
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


This would improve usability of this type
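
A sketch of what such accessors could look like, written in plain Python with a toy class that merely shares the name {{StructScalar}}. The method name {{field}} and the rule that an ambiguous name raises an error are assumptions of this sketch, not the C++ API.

```python
class StructScalar:
    """Toy stand-in for a struct-typed scalar: parallel field names and values."""

    def __init__(self, field_names, values):
        if len(field_names) != len(values):
            raise ValueError("need exactly one value per field")
        self.field_names = list(field_names)
        self.values = list(values)

    def field(self, ref):
        """Look up a child value by position (int) or by field name (str)."""
        if isinstance(ref, int):
            return self.values[ref]
        matches = [i for i, name in enumerate(self.field_names) if name == ref]
        if not matches:
            raise KeyError(f"no field named {ref!r}")
        if len(matches) > 1:
            # A struct type may repeat a field name; refuse to guess.
            raise KeyError(f"field name {ref!r} is ambiguous")
        return self.values[matches[0]]


point = StructScalar(["x", "y"], [1.5, -2.0])
```

Without a by-name accessor, callers have to find the field index on the type first and then index into the child values themselves.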





[jira] [Created] (ARROW-8762) [C++][Gandiva] Replace Gandiva's BitmapAnd with common implementation

2020-05-11 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8762:
---

 Summary: [C++][Gandiva] Replace Gandiva's BitmapAnd with common 
implementation
 Key: ARROW-8762
 URL: https://issues.apache.org/jira/browse/ARROW-8762
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, C++ - Gandiva
Reporter: Wes McKinney
 Fix For: 1.0.0


Now that the arrow/util/bit_util.h implementation has been optimized, we should 
just use that one





[jira] [Created] (ARROW-8750) [Python] pyarrow.feather.write_feather does not default to lz4 compression if it's available

2020-05-09 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8750:
---

 Summary: [Python] pyarrow.feather.write_feather does not default 
to lz4 compression if it's available
 Key: ARROW-8750
 URL: https://issues.apache.org/jira/browse/ARROW-8750
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Wes McKinney
 Fix For: 1.0.0, 0.17.1


This was my intention but I seem to have implemented it incorrectly





[jira] [Created] (ARROW-8746) [Python][Documentation] Add column limit recommendations Parquet page

2020-05-09 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8746:
---

 Summary: [Python][Documentation] Add column limit recommendations 
Parquet page
 Key: ARROW-8746
 URL: https://issues.apache.org/jira/browse/ARROW-8746
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation, Python
Reporter: Wes McKinney


Users would be well advised not to write files with large numbers (> 1000) of 
columns





[jira] [Created] (ARROW-8727) [C++] Do not require struct-initialization of StringConverter to parse strings to other types

2020-05-06 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8727:
---

 Summary: [C++] Do not require struct-initialization of 
StringConverter to parse strings to other types
 Key: ARROW-8727
 URL: https://issues.apache.org/jira/browse/ARROW-8727
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


I ran into this issue while working on refactoring kernels. 
{{StringConverter}} must be initialized to be able to support parametric 
types like Timestamp, but this produces an awkwardness and possibly a 
performance penalty (I haven't measured yet) in inlined functions. 

In any case, I'm refactoring everything to be static non-stateful
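
A minimal sketch of what "static non-stateful" parsing could look like. All 
names here ({{ParseUInt32Sketch}}, {{TimeUnitSketch}}, 
{{ParseEpochSecondsSketch}}) are hypothetical and are not the refactored Arrow 
API; the point is only that the type parameter travels as an argument rather 
than as state in an initialized converter object.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stateless parser: no instance has to be constructed, so the
// call can be inlined without carrying converter state around.
struct ParseUInt32Sketch {
  static bool Convert(const char* s, std::size_t length, uint32_t* out) {
    if (length == 0) return false;
    uint64_t value = 0;
    for (std::size_t i = 0; i < length; ++i) {
      if (s[i] < '0' || s[i] > '9') return false;  // reject non-digits
      value = value * 10 + static_cast<uint64_t>(s[i] - '0');
      if (value > 0xFFFFFFFFULL) return false;  // would overflow uint32_t
    }
    *out = static_cast<uint32_t>(value);
    return true;
  }
};

// For a parametric type, the parameter (here a made-up time unit) is passed
// as an argument instead of being stored in an initialized converter object.
enum class TimeUnitSketch { SECOND, MILLI };

inline bool ParseEpochSecondsSketch(const char* s, std::size_t length,
                                    TimeUnitSketch unit, int64_t* out) {
  uint32_t seconds = 0;
  if (!ParseUInt32Sketch::Convert(s, length, &seconds)) return false;
  const int64_t scale = (unit == TimeUnitSketch::MILLI) ? 1000 : 1;
  *out = static_cast<int64_t>(seconds) * scale;
  return true;
}
```

Whether this actually removes the suspected performance penalty would still 
need to be measured.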





[jira] [Created] (ARROW-8711) [Python] Expose strptime timestamp parsing in read_csv conversion options

2020-05-05 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8711:
---

 Summary: [Python] Expose strptime timestamp parsing in read_csv 
conversion options
 Key: ARROW-8711
 URL: https://issues.apache.org/jira/browse/ARROW-8711
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Python
Reporter: Wes McKinney
 Fix For: 1.0.0


Follow up to ARROW-8111





[jira] [Created] (ARROW-8712) [R] Expose strptime timestamp parsing in read_csv conversion options

2020-05-05 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8712:
---

 Summary: [R] Expose strptime timestamp parsing in read_csv 
conversion options
 Key: ARROW-8712
 URL: https://issues.apache.org/jira/browse/ARROW-8712
 Project: Apache Arrow
  Issue Type: New Feature
  Components: R
Reporter: Wes McKinney
 Fix For: 1.0.0


Follow up to ARROW-8111





[jira] [Created] (ARROW-8706) [C++][Parquet] Tracking JIRA for PARQUET-1857 (unencrypted INT16_MAX Parquet row group limit)

2020-05-05 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8706:
---

 Summary: [C++][Parquet] Tracking JIRA for PARQUET-1857 
(unencrypted INT16_MAX Parquet row group limit)
 Key: ARROW-8706
 URL: https://issues.apache.org/jira/browse/ARROW-8706
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0, 0.17.1


JIRA to make sure this patch gets included in a patch release





[jira] [Created] (ARROW-8700) [C++] static libgflags.a fails to link properly in gcc 4.x

2020-05-04 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8700:
---

 Summary: [C++] static libgflags.a fails to link properly in gcc 4.x
 Key: ARROW-8700
 URL: https://issues.apache.org/jira/browse/ARROW-8700
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney


I am seeing this with gcc 4.8 on Ubuntu 18.04

{code}
$ ninja
[55/179] Linking CXX executable release/arrow-json-integration-test
FAILED: release/arrow-json-integration-test 
: && /usr/bin/ccache /usr/bin/g++-4.8  -O3 -DNDEBUG  -Wall -Wno-attributes 
-msse4.2  -O3 -DNDEBUG  -rdynamic 
src/arrow/ipc/CMakeFiles/arrow-json-integration-test.dir/json_integration_test.cc.o
  -o release/arrow-json-integration-test  
-Wl,-rpath,/home/wesm/code/arrow/cpp/build-4.8/release 
release/libarrow_testing.so.18.0.0 release/libarrow.so.18.0.0 -ldl 
release//libgtest_main.so release//libgtest.so release//libgmock.so 
boost_ep-prefix/src/boost_ep/stage/lib/libboost_filesystem.a 
boost_ep-prefix/src/boost_ep/stage/lib/libboost_system.a -ldl 
../bundled/gflags_ep-prefix/src/gflags_ep/lib/libgflags.a 
jemalloc_ep-prefix/src/jemalloc_ep/dist//lib/libjemalloc_pic.a -pthread -lrt 
-lpthread && :
src/arrow/ipc/CMakeFiles/arrow-json-integration-test.dir/json_integration_test.cc.o:
 In function `_GLOBAL__sub_I__ZN3fLS11FLAGS_arrowE':
json_integration_test.cc:(.text.startup+0x1cc): undefined reference to 
`google::FlagRegisterer::FlagRegisterer(char const*, char const*, 
char const*, std::string*, std::string*)'
json_integration_test.cc:(.text.startup+0x275): undefined reference to 
`google::FlagRegisterer::FlagRegisterer(char const*, char const*, 
char const*, std::string*, std::string*)'
json_integration_test.cc:(.text.startup+0x317): undefined reference to 
`google::FlagRegisterer::FlagRegisterer(char const*, char const*, 
char const*, std::string*, std::string*)'
collect2: error: ld returned 1 exit status
[88/179] Building CXX object 
src/arrow/ipc/CMakeFiles/arrow-ipc-read-write-test.dir/read_write_test.cc.o
ninja: build stopped: subcommand failed.
{code}

CMake invocation

{code}
$ cmake .. -GNinja -DARROW_GANDIVA=ON -DARROW_CSV=ON -DARROW_BUILD_TESTS=ON 
-DARROW_BUILD_BENCHMARKS=ON
{code}





[jira] [Created] (ARROW-8684) [Packaging][Python] "SystemError: Bad call flags in _PyMethodDef_RawFastCallDict" in Python 3.7.7 on macOS when using pyarrow wheel

2020-05-03 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8684:
---

 Summary: [Packaging][Python] "SystemError: Bad call flags in 
_PyMethodDef_RawFastCallDict" in Python 3.7.7 on macOS when using pyarrow wheel
 Key: ARROW-8684
 URL: https://issues.apache.org/jira/browse/ARROW-8684
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Wes McKinney
 Fix For: 1.0.0


[~npr] reported this on the 0.17.0 RC0 vote thread but I have confirmed it 
independently. It was also reported at

https://github.com/apache/arrow/issues/7082

Here are steps to reproduce on macOS:

{code}
conda create -yn py-3.7-defaults python=3.7 -c defaults
conda activate py-3.7-defaults
pip install pyarrow
{code}

Now open the Python interpreter, run {{import pyarrow}}, then exit the 
interpreter ({{python -c "import pyarrow"}} didn't trigger it for me):

{code}
$ python
Python 3.7.7 (default, Mar 26 2020, 10:32:53) 
[Clang 4.0.1 (tags/RELEASE_401/final)] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow
>>> 
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "pyarrow/types.pxi", line 2638, in 
pyarrow.lib._unregister_py_extension_types
SystemError: Bad call flags in _PyMethodDef_RawFastCallDict. METH_OLDARGS is no 
longer supported!
Segmentation fault: 11
{code}

It fails with Python 3.7.6 when using {{-c conda-forge}} also, so it is not 
particular to defaults.

Frustratingly, the problem doesn't exist in Python 3.7.4 but occurs for me with 
3.7.5, 3.7.6, and 3.7.7. 





[jira] [Created] (ARROW-8683) [C++] Add option for user-defined version identifier for Arrow libraries

2020-05-03 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8683:
---

 Summary: [C++] Add option for user-defined version identifier for 
Arrow libraries
 Key: ARROW-8683
 URL: https://issues.apache.org/jira/browse/ARROW-8683
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


It would be useful to be able to "watermark" shared libraries with e.g. the git 
hash to determine the exact origin of a particular build of the project. The 
version identifier could default to the current git revision but be overridden 
in the CMake invocation





[jira] [Created] (ARROW-8676) [Rust] Create implementation of IPC RecordBatch body buffer compression from ARROW-300

2020-05-02 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8676:
---

 Summary: [Rust] Create implementation of IPC RecordBatch body 
buffer compression from ARROW-300
 Key: ARROW-8676
 URL: https://issues.apache.org/jira/browse/ARROW-8676
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Reporter: Wes McKinney








[jira] [Created] (ARROW-8674) [JS] Implement IPC RecordBatch body buffer compression from ARROW-300

2020-05-02 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8674:
---

 Summary: [JS] Implement IPC RecordBatch body buffer compression 
from ARROW-300
 Key: ARROW-8674
 URL: https://issues.apache.org/jira/browse/ARROW-8674
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: JavaScript
Reporter: Wes McKinney


This may not be a hard requirement for JS because it would require pulling in 
implementations of LZ4 and ZSTD, which not all users may want





[jira] [Created] (ARROW-8675) [C#] Create implementation of ARROW-300 / IPC record batch body buffer compression

2020-05-02 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8675:
---

 Summary: [C#] Create implementation of ARROW-300 / IPC record 
batch body buffer compression
 Key: ARROW-8675
 URL: https://issues.apache.org/jira/browse/ARROW-8675
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: C#
Reporter: Wes McKinney








[jira] [Created] (ARROW-8673) [Go] Implement IPC RecordBatch body compression from ARROW-300

2020-05-02 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8673:
---

 Summary: [Go] Implement IPC RecordBatch body compression from 
ARROW-300
 Key: ARROW-8673
 URL: https://issues.apache.org/jira/browse/ARROW-8673
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Go
Reporter: Wes McKinney






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8671) [C++] Use IPC body compression metadata approved in ARROW-300

2020-05-02 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8671:
---

 Summary: [C++] Use IPC body compression metadata approved in 
ARROW-300 
 Key: ARROW-8671
 URL: https://issues.apache.org/jira/browse/ARROW-8671
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


This will adapt the existing code to use the new metadata, while maintaining 
backward compatibility code to recognize the "experimental" metadata written in 
0.17.0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8672) [Java] Implement RecordBatch IPC buffer compression from ARROW-300

2020-05-02 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8672:
---

 Summary: [Java] Implement RecordBatch IPC buffer compression from 
ARROW-300
 Key: ARROW-8672
 URL: https://issues.apache.org/jira/browse/ARROW-8672
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Java
Reporter: Wes McKinney
 Fix For: 1.0.0








[jira] [Created] (ARROW-8670) [Format] Create reference implementations of IPC RecordBatch body compression from ARROW-300

2020-05-02 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8670:
---

 Summary: [Format] Create reference implementations of IPC 
RecordBatch body compression from ARROW-300 
 Key: ARROW-8670
 URL: https://issues.apache.org/jira/browse/ARROW-8670
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Format
Reporter: Wes McKinney
 Fix For: 1.0.0


Tracking JIRA for implementing ARROW-300 in different programming languages





[jira] [Created] (ARROW-8667) [C++] Add multi-consumer Scheduler API to sit one layer above ThreadPool

2020-05-01 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8667:
---

 Summary: [C++] Add multi-consumer Scheduler API to sit one layer 
above ThreadPool
 Key: ARROW-8667
 URL: https://issues.apache.org/jira/browse/ARROW-8667
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


I believe we should define an abstraction to allow for custom resource 
allocation strategies (round robin, even time, etc.) to be devised for 
situations where there are different thread pool consumers that are working 
independently of each other.

Consider the classic nested parallelism scenario:

* Task A in thread 1 may issue N subtasks that run in parallel
* Task B in thread 2 may issue K subtasks

With our current ThreadPool abstraction, it is easy to conceive of scenarios 
where Task A and Task B trample each other. 

One approach to remedy this problem is to have an API like so:

{code}
// Inform the scheduler that you want to submit tasks that are "your tasks"
int consumer_id = scheduler->NewConsumer();

for (...) {
  Future fut = scheduler->Submit(consumer_id, DoWork, ...);
}

scheduler->FinishConsumer(consumer_id);
{code}

The idea is that the scheduler would maintain separate task queues for each 
consumer and e.g. track consumer-specific metrics of interest to determine how 
tasks are allocated.

The scheduler could have different logic to control tasks being assigned to 
worker threads:

* Round-robin
* Even-time allocation (run fewer tasks for consumers with "slow" tasks and 
more tasks from consumers with "fast" tasks -- though there are some nuances 
here like avoiding starving a consumer if they've been doing a lot of "slow" 
tasks and then a "fast" consumer shows up)





[jira] [Created] (ARROW-8661) [C++][Gandiva] Reduce number of files and headers

2020-04-30 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8661:
---

 Summary: [C++][Gandiva] Reduce number of files and headers
 Key: ARROW-8661
 URL: https://issues.apache.org/jira/browse/ARROW-8661
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, C++ - Gandiva
Reporter: Wes McKinney
 Fix For: 1.0.0


I feel that the Gandiva subpackage is more Java-like in its code organization 
than the rest of the Arrow codebase, and it might be easier to navigate and 
develop with closely related code condensed into some larger headers and 
compilation units.

Additionally, it's not necessary to have a header file for each component of 
the function registry -- the registration functions can be declared in 
function_registry.h or function_registry_internal.h



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8660) [C++][Gandiva] Reduce dependence on Boost

2020-04-30 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8660:
---

 Summary: [C++][Gandiva] Reduce dependence on Boost
 Key: ARROW-8660
 URL: https://issues.apache.org/jira/browse/ARROW-8660
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, C++ - Gandiva
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 1.0.0


Remove Boost usages aside from Boost.Multiprecision





[jira] [Created] (ARROW-8635) [R] test-filesystem.R takes ~40 seconds to run?

2020-04-29 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8635:
---

 Summary: [R] test-filesystem.R takes ~40 seconds to run?
 Key: ARROW-8635
 URL: https://issues.apache.org/jira/browse/ARROW-8635
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: Wes McKinney
 Fix For: 1.0.0


{code}
✔ |  22   | Expressions
✔ | 107   | Feather [0.2 s]
✔ |   7   | Field
✔ |  40   | File system [38.1 s]
✔ |   6   | install_arrow()
✔ |  26   | JsonTableReader [0.1 s]
✔ |  24   | MessageReader
✔ |  12   | Message
✔ |  31   | Parquet file reading/writing [0.2 s]
⠏ |   0   | To/from Pythonvirtualenv: arrow-test
{code}

Is this expected? I assume it's related to S3 but that seems like a long time. 





[jira] [Created] (ARROW-8633) [C++] Add ValidateAscii function

2020-04-29 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8633:
---

 Summary: [C++] Add ValidateAscii function
 Key: ARROW-8633
 URL: https://issues.apache.org/jira/browse/ARROW-8633
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


In some cases, we want to be able to check whether it's safe to use functions 
that assume ASCII (like {{std::tolower}} or {{std::string::substr}}). This was 
implemented in a PR for ARROW-6131 that was not merged
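
For illustration, a minimal sketch of such a check. {{ValidateAsciiSketch}} is 
a hypothetical name and this is not the code from the unmerged PR; it relies 
only on the fact that ASCII bytes never have the high bit set.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical helper: returns true when every byte is in the ASCII range
// (0x00-0x7F), i.e. no byte has its high bit set.
inline bool ValidateAsciiSketch(const uint8_t* data, std::size_t size) {
  std::size_t i = 0;
  // Test eight bytes at a time by masking the high bit of each byte in a word.
  for (; i + 8 <= size; i += 8) {
    uint64_t word;
    std::memcpy(&word, data + i, sizeof(word));
    if ((word & 0x8080808080808080ULL) != 0) return false;
  }
  // Handle the remaining tail bytes one by one.
  for (; i < size; ++i) {
    if ((data[i] & 0x80) != 0) return false;
  }
  return true;
}

inline bool ValidateAsciiSketch(const std::string& s) {
  return ValidateAsciiSketch(reinterpret_cast<const uint8_t*>(s.data()),
                             s.size());
}
```

A kernel could call this once per data buffer and fall back to a slower, 
encoding-aware path when it returns false.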





[jira] [Created] (ARROW-8626) [C++] Implement "round robin" scheduler interface to fixed-size ThreadPool

2020-04-29 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8626:
---

 Summary: [C++] Implement "round robin" scheduler interface to 
fixed-size ThreadPool 
 Key: ARROW-8626
 URL: https://issues.apache.org/jira/browse/ARROW-8626
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


Currently, when submitting tasks to a thread pool, they are all commingled in a 
common queue. When a new task submitter shows up, they must wait in the back of 
the line behind all other queued tasks.

A simple alternative to this would be round-robin scheduling, where each new 
consumer is assigned a unique integer id, and the scheduler / thread pool 
internally maintains the tasks associated with each consumer in separate queues. 





[jira] [Created] (ARROW-8623) [C++][Gandiva] Reduce use of Boost, remove Boost headers from header files

2020-04-29 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8623:
---

 Summary: [C++][Gandiva] Reduce use of Boost, remove Boost headers 
from header files
 Key: ARROW-8623
 URL: https://issues.apache.org/jira/browse/ARROW-8623
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, C++ - Gandiva
Reporter: Wes McKinney
 Fix For: 1.0.0


Boost is currently a transitive dependency of many of Gandiva's public header 
files. I suggest the following:

* Do not include Boost transitively in any installed header file
* Reduce usages of Boost altogether

On the latter point, most usages of Boost can be trimmed by having a 
{{hash_combine}} function inside the Arrow codebase. See results of grepping 
the codebase

https://gist.github.com/wesm/190006d91628e6bf7c04deb596a52cff
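
As a sketch of what an in-tree replacement might look like: the names 
{{HashCombineSketch}} and {{HashPairSketch}} are hypothetical, and the mixing 
step follows the commonly cited Boost-style formula rather than anything 
already in the Arrow codebase.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical stand-in for boost::hash_combine: mixes the hash of `value`
// into `seed` using the commonly cited Boost-style formula.
template <typename T>
inline void HashCombineSketch(std::size_t& seed, const T& value) {
  seed ^= std::hash<T>{}(value) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

// Example use: hash a (name, arity) pair, e.g. as a lookup key.
inline std::size_t HashPairSketch(const std::string& name, int arity) {
  std::size_t seed = 0;
  HashCombineSketch(seed, name);
  HashCombineSketch(seed, arity);
  return seed;
}
```

This only needs {{std::hash}}, so no Boost header would have to be included 
for it.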

It seems that Boost cannot be easily eliminated altogether at the present 
moment because of a use of Boost.Multiprecision ({{int256_t}}). At some point 
someone may want to implement sufficient 256-bit integer functions so that we 
don't have to depend on Boost.Multiprecision





[jira] [Created] (ARROW-8619) [C++] Use distinct Type::type values for interval types

2020-04-28 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8619:
---

 Summary: [C++] Use distinct Type::type values for interval types
 Key: ARROW-8619
 URL: https://issues.apache.org/jira/browse/ARROW-8619
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


This is a breaking API change, but {{MonthIntervalType}} and 
{{DayTimeIntervalType}} are different data types (and have different value 
sizes, which is not true of timestamps) and thus should be distinguished in the 
same way that DATE32 / DATE64 are distinguished, or TIME32 / TIME64 are 
distinguished
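
To illustrate the point about value sizes, a small sketch with a made-up, 
abbreviated enum ({{TypeIdSketch}} is not Arrow's actual {{Type::type}} list, 
and the enumerator names are assumptions). The byte widths reflect my 
understanding of the columnar format: a month interval is a 32-bit value, a 
day-time interval is a pair of 32-bit values.

```cpp
// Hypothetical, abbreviated type-id enum for illustration only.
enum class TypeIdSketch {
  DATE32,
  DATE64,
  TIME32,
  TIME64,
  INTERVAL_MONTHS,
  INTERVAL_DAY_TIME
};

// With distinct ids, the value width follows from the id alone, with no need
// to inspect the concrete interval type.
inline int ValueByteWidthSketch(TypeIdSketch id) {
  switch (id) {
    case TypeIdSketch::DATE32:
    case TypeIdSketch::TIME32:
    case TypeIdSketch::INTERVAL_MONTHS:
      return 4;
    case TypeIdSketch::DATE64:
    case TypeIdSketch::TIME64:
    case TypeIdSketch::INTERVAL_DAY_TIME:
      return 8;
  }
  return 0;  // unreachable for valid ids
}
```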




