[jira] [Created] (ARROW-8427) [C++][Dataset] Do not ignore file paths with underscore/dot when full path was specified

2020-04-13 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8427:


 Summary: [C++][Dataset] Do not ignore file paths with 
underscore/dot when full path was specified
 Key: ARROW-8427
 URL: https://issues.apache.org/jira/browse/ARROW-8427
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Joris Van den Bossche
 Fix For: 0.17.0


Currently, when passing a list of file paths to FileSystemDatasetFactory, files 
that have a parent directory whose name starts with an underscore or dot are 
skipped. Since the file paths were passed as an explicit list, we should maybe 
not skip them.

By comparison, when specifying a directory (Selector), only child directories 
are checked for skipping, not parent directories.
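As a sketch of the intended behavior, in pure Python with hypothetical helper names (the real logic lives in the C++ dataset discovery code), the underscore/dot filter would only apply to files discovered through a Selector, never to explicitly listed paths:

{code:python}
from pathlib import PurePosixPath

def is_hidden_path(path: str) -> bool:
    """True if any parent directory name starts with '_' or '.'
    (illustrative re-implementation of the discovery filter)."""
    return any(part.startswith(("_", "."))
               for part in PurePosixPath(path).parent.parts)

def select_files(paths, explicit):
    """Keep explicitly listed paths as-is; filter only discovered ones."""
    if explicit:
        return list(paths)
    return [p for p in paths if not is_hidden_path(p)]
{code}

With this behavior, select_files(["data/_staging/part-0.parquet"], explicit=True) would keep a file that Selector-based discovery skips.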



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8426) [Rust] [Parquet] Add support for writing dictionary types

2020-04-13 Thread Andy Grove (Jira)
Andy Grove created ARROW-8426:
-

 Summary: [Rust] [Parquet] Add support for writing dictionary types
 Key: ARROW-8426
 URL: https://issues.apache.org/jira/browse/ARROW-8426
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Andy Grove








[jira] [Created] (ARROW-8425) [Rust] [Parquet] Add support for writing timestamp types

2020-04-13 Thread Andy Grove (Jira)
Andy Grove created ARROW-8425:
-

 Summary: [Rust] [Parquet] Add support for writing timestamp types
 Key: ARROW-8425
 URL: https://issues.apache.org/jira/browse/ARROW-8425
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Andy Grove








[jira] [Created] (ARROW-8423) [Rust] [Parquet] Add support for writing integer types

2020-04-13 Thread Andy Grove (Jira)
Andy Grove created ARROW-8423:
-

 Summary: [Rust] [Parquet] Add support for writing integer types
 Key: ARROW-8423
 URL: https://issues.apache.org/jira/browse/ARROW-8423
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Andy Grove








[jira] [Created] (ARROW-8424) [Rust] [Parquet] Add support for writing floating point types

2020-04-13 Thread Andy Grove (Jira)
Andy Grove created ARROW-8424:
-

 Summary: [Rust] [Parquet] Add support for writing floating point 
types
 Key: ARROW-8424
 URL: https://issues.apache.org/jira/browse/ARROW-8424
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Andy Grove








[jira] [Created] (ARROW-8422) [Rust] Implement function to convert Arrow schema to Parquet schema

2020-04-13 Thread Andy Grove (Jira)
Andy Grove created ARROW-8422:
-

 Summary: [Rust] Implement function to convert Arrow schema to 
Parquet schema
 Key: ARROW-8422
 URL: https://issues.apache.org/jira/browse/ARROW-8422
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Andy Grove


Implement function to convert Arrow schema to Parquet schema





[jira] [Created] (ARROW-8421) [Rust] [Parquet] Implement parquet writer

2020-04-13 Thread Andy Grove (Jira)
Andy Grove created ARROW-8421:
-

 Summary: [Rust] [Parquet] Implement parquet writer
 Key: ARROW-8421
 URL: https://issues.apache.org/jira/browse/ARROW-8421
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Andy Grove
 Fix For: 1.0.0


This is the parent story. See subtasks for more information.





[jira] [Created] (ARROW-8420) [C++] CMake fails to configure on armv7l platform (e.g. Raspberry Pi 3)

2020-04-13 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8420:
---

 Summary: [C++] CMake fails to configure on armv7l platform (e.g. 
Raspberry Pi 3)
 Key: ARROW-8420
 URL: https://issues.apache.org/jira/browse/ARROW-8420
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.17.0


Related to ARROW-8410, but the ARMv7 issues will probably be resolved in a 
separate PR.

{code}
$ cmake .. -DARROW_BUILD_TESTS=ON -DARROW_ORC=ON -DARROW_PARQUET=ON 
-DARROW_DEPENDENCY_SOURCE=BUNDLED -GNinja
-- Building using CMake version: 3.13.4
-- The C compiler identification is GNU 8.3.0
-- The CXX compiler identification is GNU 8.3.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Arrow version: 1.0.0 (full: '1.0.0-SNAPSHOT')
-- Arrow SO version: 100 (full: 100.0.0)
-- Found PkgConfig: /usr/bin/pkg-config (found version "0.29") 
-- clang-tidy not found
-- clang-format not found
-- Could NOT find ClangTools (missing: CLANG_FORMAT_BIN CLANG_TIDY_BIN) 
-- infer not found
-- Found Python3: /usr/bin/python3.7 (found version "3.7.3") found components:  
Interpreter 
-- Found cpplint executable at /home/pi/code/arrow/cpp/build-support/cpplint.py
-- Performing Test CXX_SUPPORTS_SSE4_2
-- Performing Test CXX_SUPPORTS_SSE4_2 - Failed
-- Performing Test CXX_SUPPORTS_AVX2
-- Performing Test CXX_SUPPORTS_AVX2 - Failed
-- Performing Test CXX_SUPPORTS_AVX512
-- Performing Test CXX_SUPPORTS_AVX512 - Failed
-- Arrow build warning level: PRODUCTION
CMake Error at cmake_modules/SetupCxxFlags.cmake:318 (message):
  SSE4.2 required but compiler doesn't support it.
Call Stack (most recent call first):
  CMakeLists.txt:399 (include)


-- Configuring incomplete, errors occurred!
See also "/home/pi/code/arrow/cpp/build/CMakeFiles/CMakeOutput.log".
See also "/home/pi/code/arrow/cpp/build/CMakeFiles/CMakeError.log".
{code}





[jira] [Created] (ARROW-8419) [C++] Default display for multi-choice define_option_string is misleading

2020-04-13 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8419:
---

 Summary: [C++] Default display for multi-choice 
define_option_string is misleading
 Key: ARROW-8419
 URL: https://issues.apache.org/jira/browse/ARROW-8419
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney


While working on ARROW-8410, I noticed:

{code}
--   ARROW_SIMD_LEVEL=AVX2 [default=NONE|SSE4_2|AVX2|AVX512]
--   SIMD compiler optimization level
--   ARROW_ARMV8_ARCH=armv8-a+crc+crypto [default=armv8-a|armv8-a+crc+crypto]
--   Arm64 arch and extensions
{code}





[jira] [Created] (ARROW-8418) [Python] partition_filename_cb in write_to_dataset should be passed additional keyword arguments rather than just keys

2020-04-13 Thread Varun Patil (Jira)
Varun Patil created ARROW-8418:
--

 Summary: [Python] partition_filename_cb in write_to_dataset should 
be passed additional keyword arguments rather than just keys
 Key: ARROW-8418
 URL: https://issues.apache.org/jira/browse/ARROW-8418
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Varun Patil


I recently had a requirement where I would have liked to construct a filename 
based on additional context from Apache Airflow (specifically execution_date).

It would be nice to pass the additional kwargs to *partition_filename_cb* so 
that the filename can be constructed using additional information rather than 
just the keys used for partitioning.

I believe the fix should be as simple as passing kwargs to the 
*partition_filename_cb* inside *write_to_dataset.*
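A sketch of the proposed change (all names here are illustrative, not the actual pyarrow signatures): write_to_dataset would forward its extra keyword arguments to the callback in addition to the partition keys:

{code:python}
def write_partition(partition_keys, partition_filename_cb, **kwargs):
    """Hypothetical core of the change: forward **kwargs to the callback."""
    return partition_filename_cb(partition_keys, **kwargs)

# An example callback that uses extra context (e.g. an Airflow execution date):
def make_filename(keys, execution_date=None):
    suffix = f"-{execution_date}" if execution_date else ""
    return "-".join(str(k) for k in keys) + suffix + ".parquet"
{code}

Existing callbacks that only accept the keys would keep working as long as the extra kwargs default to nothing being passed.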





[jira] [Created] (ARROW-8417) [Packaging] Move the manylinux crossbow wheel builds to GitHub Actions

2020-04-13 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8417:
--

 Summary: [Packaging] Move the manylinux crossbow wheel builds to 
GitHub Actions
 Key: ARROW-8417
 URL: https://issues.apache.org/jira/browse/ARROW-8417
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs


To free up some bandwidth on Azure for the conda jobs.





[jira] [Created] (ARROW-8416) [Python] Provide a "feather" alias in the dataset API

2020-04-13 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8416:


 Summary: [Python] Provide a "feather" alias in the dataset API
 Key: ARROW-8416
 URL: https://issues.apache.org/jira/browse/ARROW-8416
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 0.17.0


I don't know what the plans are on the C++ side (ARROW-7586), but for 0.17, I 
think it would be nice if we can at least support {{ds.dataset(..., 
format="feather")}} (instead of needing to tell people to do {{ds.dataset(..., 
format="ipc")}} to read feather files).





[jira] [Created] (ARROW-8415) [C++] fix compilation error with GCC 4.8

2020-04-13 Thread Prudhvi Porandla (Jira)
Prudhvi Porandla created ARROW-8415:
---

 Summary: [C++] fix compilation error with GCC 4.8
 Key: ARROW-8415
 URL: https://issues.apache.org/jira/browse/ARROW-8415
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Prudhvi Porandla
Assignee: Prudhvi Porandla








[jira] [Created] (ARROW-8414) [Python] Non-deterministic row order failure in test_parquet.py

2020-04-13 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8414:


 Summary: [Python] Non-deterministic row order failure in 
test_parquet.py
 Key: ARROW-8414
 URL: https://issues.apache.org/jira/browse/ARROW-8414
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche
Assignee: Joris Van den Bossche
 Fix For: 0.17.0








[jira] [Created] (ARROW-8413) Refactor DefLevelsToBitmap

2020-04-12 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-8413:
--

 Summary: Refactor DefLevelsToBitmap
 Key: ARROW-8413
 URL: https://issues.apache.org/jira/browse/ARROW-8413
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: C++
Reporter: Micah Kornfield
Assignee: Micah Kornfield


The current code should be split apart and made more efficient, consolidating 
the logic needed to support all nesting combinations.

We need to be able to pass in an arbitrary minimum definition level to prune 
away elements that aren't included in lists.

The functionality is also somewhat replicated in the struct reading code; the 
two paths should be consolidated.





[jira] [Created] (ARROW-8412) [C++][Gandiva] Fix gandiva date_diff function definitions

2020-04-12 Thread Projjal Chanda (Jira)
Projjal Chanda created ARROW-8412:
-

 Summary: [C++][Gandiva] Fix gandiva date_diff function definitions
 Key: ARROW-8412
 URL: https://issues.apache.org/jira/browse/ARROW-8412
 Project: Apache Arrow
  Issue Type: Task
  Components: C++ - Gandiva
Reporter: Projjal Chanda
Assignee: Projjal Chanda


The current Gandiva date function definitions for date_diff and date_sub take 
an integer as the first argument and a date as the second:

date_diff(10, d) = d - 10, which seems unintuitive.
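For illustration only (using Python's datetime rather than Gandiva's precompiled functions), the intuitive definition takes the date first and the integer offset second:

{code:python}
from datetime import date, timedelta

def date_sub(d: date, days: int) -> date:
    """Intuitive argument order: date first, offset second."""
    return d - timedelta(days=days)

def date_diff(d: date, days: int) -> date:
    """Same semantics as date_sub; in the current Gandiva definitions
    the arguments are reversed, i.e. date_diff(10, d)."""
    return d - timedelta(days=days)
{code}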





[jira] [Created] (ARROW-8411) [C++] gcc6 warning re: arrow::internal::ArgSort

2020-04-12 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8411:
---

 Summary: [C++] gcc6 warning re: arrow::internal::ArgSort
 Key: ARROW-8411
 URL: https://issues.apache.org/jira/browse/ARROW-8411
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney


Observed on a Debian platform with gcc6 base

{code}
In file included from /usr/include/c++/6/algorithm:62:0,
 from ../src/arrow/util/bit_util.h:55,
 from ../src/arrow/type_traits.h:26,
 from ../src/arrow/array.h:32,
 from ../src/arrow/compute/kernel.h:24,
 from ../src/arrow/dataset/filter.h:27,
 from ../src/arrow/dataset/partition.h:27,
 from 
/home/rock/code/arrow/cpp/src/arrow/dataset/partition.cc:18:
/usr/include/c++/6/bits/stl_algo.h: In function 'void 
std::__insertion_sort(_RandomAccessIterator, _RandomAccessIterator, _Compare) 
[with _RandomAccessIterator = __gnu_cxx::__normal_iterator >; _Compare = 
__gnu_cxx::__ops::_Iter_comp_iter&, Cmp&&) [with T = std::__cxx11::basic_string; Cmp = 
std::less >]:: >]':
/usr/include/c++/6/bits/stl_algo.h:1837:5: note: parameter passing for argument 
of type '__gnu_cxx::__normal_iterator >' will change in GCC 7.1
 __insertion_sort(_RandomAccessIterator __first,
 ^~~~
/usr/include/c++/6/bits/stl_algo.h:1837:5: note: parameter passing for argument 
of type '__gnu_cxx::__normal_iterator >' will change in GCC 7.1
In file included from /usr/include/c++/6/bits/stl_algo.h:61:0,
 from /usr/include/c++/6/algorithm:62,
 from ../src/arrow/util/bit_util.h:55,
 from ../src/arrow/type_traits.h:26,
 from ../src/arrow/array.h:32,
 from ../src/arrow/compute/kernel.h:24,
 from ../src/arrow/dataset/filter.h:27,
 from ../src/arrow/dataset/partition.h:27,
 from 
/home/rock/code/arrow/cpp/src/arrow/dataset/partition.cc:18:
/usr/include/c++/6/bits/stl_heap.h: In function 'void 
std::__adjust_heap(_RandomAccessIterator, _Distance, _Distance, _Tp, _Compare) 
[with _RandomAccessIterator = __gnu_cxx::__normal_iterator >; _Distance = int; _Tp = long long int; _Compare = 
__gnu_cxx::__ops::_Iter_comp_iter&, Cmp&&) [with T = std::__cxx11::basic_string; Cmp = 
std::less >]:: >]':
/usr/include/c++/6/bits/stl_heap.h:209:5: note: parameter passing for argument 
of type '__gnu_cxx::__normal_iterator >' will change in GCC 7.1
 __adjust_heap(_RandomAccessIterator __first, _Distance __holeIndex,
 ^
In file included from /usr/include/c++/6/algorithm:62:0,
 from ../src/arrow/util/bit_util.h:55,
 from ../src/arrow/type_traits.h:26,
 from ../src/arrow/array.h:32,
 from ../src/arrow/compute/kernel.h:24,
 from ../src/arrow/dataset/filter.h:27,
 from ../src/arrow/dataset/partition.h:27,
 from 
/home/rock/code/arrow/cpp/src/arrow/dataset/partition.cc:18:
/usr/include/c++/6/bits/stl_algo.h: In function 'void 
std::__introsort_loop(_RandomAccessIterator, _RandomAccessIterator, _Size, 
_Compare) [with _RandomAccessIterator = __gnu_cxx::__normal_iterator >; _Size = int; _Compare = 
__gnu_cxx::__ops::_Iter_comp_iter&, Cmp&&) [with T = std::__cxx11::basic_string; Cmp = 
std::less >]:: >]':
/usr/include/c++/6/bits/stl_algo.h:1937:5: note: parameter passing for argument 
of type '__gnu_cxx::__normal_iterator >' will change in GCC 7.1
 __introsort_loop(_RandomAccessIterator __first,
 ^~~~
/usr/include/c++/6/bits/stl_algo.h:1937:5: note: parameter passing for argument 
of type '__gnu_cxx::__normal_iterator >' will change in GCC 7.1
/usr/include/c++/6/bits/stl_algo.h:1951:4: note: parameter passing for argument 
of type '__gnu_cxx::__normal_iterator >' will change in GCC 7.1
std::__introsort_loop(__cut, __last, __depth_limit, __comp);
^~~
/usr/include/c++/6/bits/stl_algo.h: In function 'std::vector 
arrow::internal::ArgSort(const std::vector&, Cmp&&) [with T = 
std::__cxx11::basic_string; Cmp = 
std::less >]':
/usr/include/c++/6/bits/stl_algo.h:1882:4: note: parameter passing for argument 
of type '__gnu_cxx::__normal_iterator >' will change in GCC 7.1
std::__insertion_sort(__first, __first + int(_S_threshold), __comp);
^~~
/usr/include/c++/6/bits/stl_algo.h:1887:2: note: parameter passing for argument 
of type '__gnu_cxx::__normal_iterator >' will change in GCC 7.1
  std::__insertion_sort(__first, __last, __comp);
  ^~~
/usr/include/c++/6/bits/stl_algo.h:1965:4: note: parameter passing for argument 
of type '__gnu_cxx::__normal_iterator >' will change in GCC 7.1
std::__introsort_loop(__first, __last,
^~~
{code}





[jira] [Created] (ARROW-8410) [C++] CMake fails on aarch64 systems that do not support -march=armv8-a+crc+crypto

2020-04-12 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8410:
---

 Summary: [C++] CMake fails on aarch64 systems that do not support 
-march=armv8-a+crc+crypto
 Key: ARROW-8410
 URL: https://issues.apache.org/jira/browse/ARROW-8410
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.17.0


I was trying to build the project on a rockpro64 system to look into something 
else and ran into this

{code}
-- Arrow build warning level: PRODUCTION
CMake Error at cmake_modules/SetupCxxFlags.cmake:332 (message):
  Unsupported arch flag: -march=armv8-a+crc+crypto.
Call Stack (most recent call first):
  CMakeLists.txt:398 (include)

-- Configuring incomplete, errors occurred!
{code}





[jira] [Created] (ARROW-8409) [R] Add arrow::cpu_count, arrow::set_cpu_count wrapper functions a la Python

2020-04-12 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8409:
---

 Summary: [R] Add arrow::cpu_count, arrow::set_cpu_count wrapper 
functions a la Python
 Key: ARROW-8409
 URL: https://issues.apache.org/jira/browse/ARROW-8409
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Wes McKinney
 Fix For: 0.17.0


While some people will configure these with {{$OMP_NUM_THREADS}}, it is useful 
to be able to configure the global thread pool dynamically





[jira] [Created] (ARROW-8408) [Python] Add memory_map= toggle to pyarrow.feather.read_feather

2020-04-12 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8408:
---

 Summary: [Python] Add memory_map= toggle to 
pyarrow.feather.read_feather
 Key: ARROW-8408
 URL: https://issues.apache.org/jira/browse/ARROW-8408
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 0.17.0


I missed this in my prior patch





[jira] [Created] (ARROW-8407) [Rust] Add rustdoc for Dictionary type

2020-04-12 Thread Andy Grove (Jira)
Andy Grove created ARROW-8407:
-

 Summary: [Rust] Add rustdoc for Dictionary type
 Key: ARROW-8407
 URL: https://issues.apache.org/jira/browse/ARROW-8407
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 0.17.0


Add rustdoc for Dictionary type





[jira] [Created] (ARROW-8406) [Python] FileSystem.from_uri erases the drive on Windows

2020-04-12 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8406:
--

 Summary: [Python] FileSystem.from_uri erases the drive on Windows
 Key: ARROW-8406
 URL: https://issues.apache.org/jira/browse/ARROW-8406
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Reporter: Krisztian Szucs


{code:python}
path = 
"C:\Users\VssAdministrator\AppData\Local\Temp\pytest-of-VssAdministrator\pytest-0\test_construct_from_single_fil0\single-file"
_, path = FileSystem.from_uri(path)
path == 
"/Users/VssAdministrator/AppData/Local/Temp/pytest-of-VssAdministrator/pytest-0/test_construct_from_single_fil0/single-file"
{code}
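The root cause can be illustrated with Python's standard library alone (a standalone illustration, not pyarrow code): a bare Windows path is not a valid URI, and a generic URI parser treats the drive letter as a URI scheme, which is why it disappears.

{code:python}
from urllib.parse import urlparse

# A bare Windows path parsed as a URI: "C:" looks like a scheme, so the
# drive letter is swallowed and only the path component remains.
parsed = urlparse("C:/Users/VssAdministrator/single-file")
print(parsed.scheme)  # "c"
print(parsed.path)    # "/Users/VssAdministrator/single-file"
{code}

FileSystem.from_uri would need to detect drive-letter paths and treat them as local filesystem paths instead of URIs.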





[jira] [Created] (ARROW-8405) [Gandiva][UDF] Support complex datatype for UDF return type.

2020-04-12 Thread ZMZ91 (Jira)
ZMZ91 created ARROW-8405:


 Summary: [Gandiva][UDF] Support complex datatype for UDF return 
type.
 Key: ARROW-8405
 URL: https://issues.apache.org/jira/browse/ARROW-8405
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++ - Gandiva
Reporter: ZMZ91


Is it possible to return a complex datatype from a UDF, like a vector or even a 
dictionary? I checked 
[https://github.com/apache/arrow/blob/master/cpp/src/gandiva/precompiled/types.h]
 and found that the types used there are all basic datatypes.





[jira] [Created] (ARROW-8404) Read and write dataset description in both R and Python

2020-04-11 Thread Vincent Nijs (Jira)
Vincent Nijs created ARROW-8404:
---

 Summary: Read and write dataset description in both R and Python
 Key: ARROW-8404
 URL: https://issues.apache.org/jira/browse/ARROW-8404
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Vincent Nijs


Below is a feature request for Feather. Wes suggested opening an issue here. The 
idea is to add metadata to a data frame to store and display information about 
the data (e.g., variable descriptions, data source, main company contact for the 
data, changes, etc.). For a simple example in R (+ shiny) that uses a 
"description" attribute in markdown format and then renders it as HTML when 
loaded, see the link below (see the description for the diamonds data).

[https://vnijs.shinyapps.io/radiant]

Having a data format that works for both R and Python *and* maintains 
attributes like a data description would be great!

[https://github.com/wesm/feather/issues/328]





[jira] [Created] (ARROW-8403) [C++] Add ToString() to ChunkedArray, Table and RecordBatch

2020-04-11 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-8403:
---

 Summary: [C++] Add ToString() to ChunkedArray, Table and 
RecordBatch
 Key: ARROW-8403
 URL: https://issues.apache.org/jira/browse/ARROW-8403
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou








[jira] [Created] (ARROW-8402) [Java] Support ValidateFull methods in Java

2020-04-11 Thread Liya Fan (Jira)
Liya Fan created ARROW-8402:
---

 Summary: [Java] Support ValidateFull methods in Java
 Key: ARROW-8402
 URL: https://issues.apache.org/jira/browse/ARROW-8402
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


We need to support ValidateFull methods in Java, just like we do in C++. 
This is required by ARROW-5926.





[jira] [Created] (ARROW-8401) [C++] Add AVX2/AVX512 version of ByteStreamSplitDecode/ByteStreamSplitEncode

2020-04-10 Thread Frank Du (Jira)
Frank Du created ARROW-8401:
---

 Summary: [C++] Add AVX2/AVX512 version of 
ByteStreamSplitDecode/ByteStreamSplitEncode
 Key: ARROW-8401
 URL: https://issues.apache.org/jira/browse/ARROW-8401
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Frank Du
Assignee: Frank Du


Add AVX2/AVX512 versions of ByteStreamSplitDecode/ByteStreamSplitEncode; they 
should be similar to the SSE implementation.





[jira] [Created] (ARROW-8400) [Python][Dataset] Infer the filesystem from the first path if multiple paths are passed to dataset()

2020-04-10 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8400:
--

 Summary: [Python][Dataset] Infer the filesystem from the first 
path if multiple paths are passed to dataset()
 Key: ARROW-8400
 URL: https://issues.apache.org/jira/browse/ARROW-8400
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Krisztian Szucs


See conversation https://github.com/apache/arrow/pull/6505#discussion_r406677317





[jira] [Created] (ARROW-8399) [Rust] Extend memory alignments to include other architectures

2020-04-10 Thread Mahmut Bulut (Jira)
Mahmut Bulut created ARROW-8399:
---

 Summary: [Rust] Extend memory alignments to include other 
architectures
 Key: ARROW-8399
 URL: https://issues.apache.org/jira/browse/ARROW-8399
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Mahmut Bulut
Assignee: Mahmut Bulut


Currently, the allocation alignment is fixed at 64 bytes. This suits most 
architectures, but not all L1D prefetching systems are the same: some 
architectures, like x86_64, use double-line prefetching. Include a matrix of 
alignment values to cover the different cache alignments.
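A sketch of what such a matrix could look like (the alignment values here are illustrative assumptions, not measured requirements):

{code:python}
# Hypothetical per-architecture alignment table. x86_64 is given 128 bytes
# on the assumption that its adjacent-line prefetcher pulls cache lines in pairs.
ALIGNMENTS = {
    "x86_64": 128,
    "aarch64": 64,
    "powerpc64": 128,
}

def alignment_for(arch: str, default: int = 64) -> int:
    """Fall back to 64 bytes for architectures not in the table."""
    return ALIGNMENTS.get(arch, default)
{code}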



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8398) [Python] Remove deprecation warnings originating from python tests

2020-04-10 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8398:
--

 Summary: [Python] Remove deprecation warnings originating from 
python tests
 Key: ARROW-8398
 URL: https://issues.apache.org/jira/browse/ARROW-8398
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs


See build log 
https://travis-ci.org/github/ursa-labs/crossbow/builds/673385834#L6846





[jira] [Created] (ARROW-8397) [C++] Fail to compile aggregate_test.cc on Ubuntu 16.04

2020-04-10 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8397:
--

 Summary: [C++] Fail to compile aggregate_test.cc on Ubuntu 16.04
 Key: ARROW-8397
 URL: https://issues.apache.org/jira/browse/ARROW-8397
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs


See build log 
https://app.circleci.com/pipelines/github/ursa-labs/crossbow/31122/workflows/b250d378-52a8-4d15-9909-96474fa38482/jobs/10840





[jira] [Created] (ARROW-8396) [Rust] Remove libc from dependencies

2020-04-10 Thread Mahmut Bulut (Jira)
Mahmut Bulut created ARROW-8396:
---

 Summary: [Rust] Remove libc from dependencies
 Key: ARROW-8396
 URL: https://issues.apache.org/jira/browse/ARROW-8396
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Mahmut Bulut


The code that made libc calls has been removed, but the dependency is still 
declared. We can remove it before the next release.





[jira] [Created] (ARROW-8395) [Python] conda install pyarrow defaults to 0.11.1 not 0.16.0

2020-04-10 Thread dwang (Jira)
dwang created ARROW-8395:


 Summary: [Python] conda install pyarrow defaults to 0.11.1 not 
0.16.0
 Key: ARROW-8395
 URL: https://issues.apache.org/jira/browse/ARROW-8395
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
 Environment: ubuntu 16, ubuntu 18, anaconda 2020.02 x64
Reporter: dwang


When installing pyarrow in a clean Linux conda environment (2020.02):
{code:java}
conda install -c conda-forge pyarrow

The following packages will be downloaded:

    package            |             build
    -------------------|------------------
    arrow-cpp-0.11.1   | py37h0e61e49_1004     6.3 MB  conda-forge
    boost-cpp-1.68.0   | h11c811c_1000        20.5 MB  conda-forge
    conda-4.8.3        | py37hc8dfbb8_1        3.0 MB  conda-forge
    libprotobuf-3.6.1  | hdbcaa40_1001         4.0 MB  conda-forge
    parquet-cpp-1.5.1  | 3                       3 KB  conda-forge
    pyarrow-0.11.1     | py37hbbcf98d_1002     2.0 MB  conda-forge
    python_abi-3.7     | 1_cp37m                 4 KB  conda-forge
    thrift-cpp-0.12.0  | h0a07b25_1002         2.4 MB  conda-forge

                                     Total:  38.2 MB
{code}
The default version is pyarrow-0.11.1, while the conda-forge repo actually has 
the latest version, 0.16.0 ( [https://anaconda.org/conda-forge/pyarrow] ).

Specifying the version does not help:

conda install -c conda-forge pyarrow=0.16.0

Workaround:

I had to manually download the packages below from conda and then install them 
locally:

arrow-cpp-0.16.0-py37hb0edad2_0.tar.bz2
aws-sdk-cpp-1.7.164-h1f8afcc_0.tar.bz2
boost-cpp-1.70.0-h8e57a91_2.tar.bz2
brotli-1.0.7-he1b5a44_1000.tar.bz2
c-ares-1.15.0-h516909a_1001.tar.bz2
gflags-2.2.2-he1b5a44_1002.tar.bz2
glog-0.4.0-he1b5a44_1.tar.bz2
grpc-cpp-1.25.0-h213be95_2.tar.bz2
libprotobuf-3.11.3-h8b12597_0.tar.bz2
lz4-c-1.8.3-he1b5a44_1001.tar.bz2
parquet-cpp-1.5.1-1.tar.bz2
pyarrow-0.16.0-py37h8b68381_1.tar.bz2
re2-2020.01.01-he1b5a44_0.tar.bz2
snappy-1.1.8-he1b5a44_1.tar.bz2
thrift-cpp-0.12.0-hf3afdfd_1004.tar.bz2
zstd-1.4.4-h3b9ef0a_1.tar.bz2

 





[jira] [Created] (ARROW-8394) Typescript compiler errors for arrow d.ts files, when using es2015-esm package

2020-04-10 Thread Shyamal Shukla (Jira)
Shyamal Shukla created ARROW-8394:
-

 Summary: Typescript compiler errors for arrow d.ts files, when 
using es2015-esm package
 Key: ARROW-8394
 URL: https://issues.apache.org/jira/browse/ARROW-8394
 Project: Apache Arrow
  Issue Type: Bug
  Components: JavaScript
Affects Versions: 0.16.0
Reporter: Shyamal Shukla


Attempting to use apache-arrow within a web application, but the TypeScript 
compiler throws the following errors in some of Arrow's .d.ts files:

import { Table } from "../node_modules/@apache-arrow/es2015-esm/Arrow";
export class SomeClass {
    ...
    constructor() {
        const t = Table.from('');
    }
}
*node_modules/@apache-arrow/es2015-esm/column.d.ts:14:22* - error TS2417: Class 
static side 'typeof Column' incorrectly extends base class static side 'typeof 
Chunked'. Types of property 'new' are incompatible.

*node_modules/@apache-arrow/es2015-esm/ipc/reader.d.ts:238:5* - error TS2717: 
Subsequent property declarations must have the same type. Property 'schema' 
must be of type 'Schema', but here has type 'Schema'.

238 schema: Schema;

*node_modules/@apache-arrow/es2015-esm/recordbatch.d.ts:17:18* - error TS2430: 
Interface 'RecordBatch' incorrectly extends interface 'StructVector'. The types 
of 'slice(...).clone' are incompatible between these types.

the tsconfig.json file looks like

{
  "compilerOptions": {
    "target": "ES6",
    "outDir": "dist",
    "baseUrl": "src/"
  },
  "exclude": ["dist"],
  "include": ["src/*.ts"]
}





[jira] [Created] (ARROW-8393) [C++][Gandiva] Make gandiva function registry case-insensitive

2020-04-10 Thread Projjal Chanda (Jira)
Projjal Chanda created ARROW-8393:
-

 Summary: [C++][Gandiva] Make gandiva function registry 
case-insensitive
 Key: ARROW-8393
 URL: https://issues.apache.org/jira/browse/ARROW-8393
 Project: Apache Arrow
  Issue Type: Task
  Components: C++ - Gandiva
Reporter: Projjal Chanda
Assignee: Projjal Chanda








[jira] [Created] (ARROW-8392) [Java] Fix overflow related corner cases for vector value comparison

2020-04-09 Thread Liya Fan (Jira)
Liya Fan created ARROW-8392:
---

 Summary: [Java] Fix overflow related corner cases for vector value 
comparison
 Key: ARROW-8392
 URL: https://issues.apache.org/jira/browse/ARROW-8392
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Liya Fan


1. Fix corner cases related to overflow.
2. Provide test cases for the corner cases. 





[jira] [Created] (ARROW-8391) [C++] Implement row range read API for IPC file (and Feather)

2020-04-09 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8391:
---

 Summary: [C++] Implement row range read API for IPC file (and 
Feather)
 Key: ARROW-8391
 URL: https://issues.apache.org/jira/browse/ARROW-8391
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


The objective would be to be able to read a range of rows from the middle of a 
file. It's not as easy as it might sound, since all the record batch metadata 
must be examined to determine the start and end points of the row range.
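The metadata walk can be sketched as follows (pure Python, assuming the per-batch row counts have already been read from the file footer; names are illustrative):

{code:python}
def batches_for_row_range(batch_lengths, start, stop):
    """Given record-batch row counts (from file footer metadata), return
    (batch_index, local_start, local_stop) slices covering rows [start, stop)."""
    out, offset = [], 0
    for i, n in enumerate(batch_lengths):
        lo, hi = max(start, offset), min(stop, offset + n)
        if lo < hi:
            out.append((i, lo - offset, hi - offset))
        offset += n
    return out
{code}

Only the batches in the returned list would need to be read and sliced; the others can be skipped entirely.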





[jira] [Created] (ARROW-8390) [R] Expose schema unification features

2020-04-09 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-8390:
--

 Summary: [R] Expose schema unification features
 Key: ARROW-8390
 URL: https://issues.apache.org/jira/browse/ARROW-8390
 Project: Apache Arrow
  Issue Type: New Feature
  Components: R
Reporter: Neal Richardson
Assignee: Neal Richardson
 Fix For: 0.17.0








[jira] [Created] (ARROW-8389) [Integration] Run tests in parallel

2020-04-09 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-8389:
-

 Summary: [Integration] Run tests in parallel
 Key: ARROW-8389
 URL: https://issues.apache.org/jira/browse/ARROW-8389
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration, Integration
Reporter: Antoine Pitrou


This follows ARROW-8176.





[jira] [Created] (ARROW-8388) [C++] GCC 4.8 fails to move on return

2020-04-09 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-8388:
---

 Summary: [C++] GCC 4.8 fails to move on return
 Key: ARROW-8388
 URL: https://issues.apache.org/jira/browse/ARROW-8388
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.16.0
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 0.17.0


See https://github.com/apache/arrow/pull/6883#issuecomment-611661733

This is a recurring problem which usually shows up as a broken nightly (the 
gandiva nightly jobs, specifically), along with similar issues due to GCC 4.8's 
incomplete handling of C++11. As long as someone depends on these toolchains, we 
should probably have an every-commit CI job which checks that we haven't 
introduced such a breakage.





[jira] [Created] (ARROW-8387) [rust] Make schema_to_fb public because it is very useful!

2020-04-09 Thread Max Burke (Jira)
Max Burke created ARROW-8387:


 Summary: [rust] Make schema_to_fb public because it is very useful!
 Key: ARROW-8387
 URL: https://issues.apache.org/jira/browse/ARROW-8387
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Max Burke


Make schema_to_fb public because it is very useful!





[jira] [Created] (ARROW-8386) [Python] pyarrow.jvm raises error for empty Arrays

2020-04-09 Thread Bryan Cutler (Jira)
Bryan Cutler created ARROW-8386:
---

 Summary: [Python] pyarrow.jvm raises error for empty Arrays
 Key: ARROW-8386
 URL: https://issues.apache.org/jira/browse/ARROW-8386
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.16.0
Reporter: Bryan Cutler
Assignee: Bryan Cutler


In the pyarrow.jvm module, when there is an empty array in Java, trying to 
create it in Python raises a ValueError. This is because, for an empty array, 
Java returns an empty list of buffers, and pyarrow.jvm then attempts to create 
the array with pa.Array.from_buffers and that empty list.





[jira] [Created] (ARROW-8385) Crash on parquet.read_table on windows python 3.82

2020-04-09 Thread Geoff Quested-Joens (Jira)
Geoff Quested-Joens created ARROW-8385:
--

 Summary: Crash on parquet.read_table on windows python 3.82
 Key: ARROW-8385
 URL: https://issues.apache.org/jira/browse/ARROW-8385
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.16.0
 Environment: Window 10 
python 3.8.2 pip 20.0.2
pip freeze ->
numpy==1.18.2
pandas==1.0.3
pyarrow==0.16.0
python-dateutil==2.8.1
pytz==2019.3
six==1.14.0
Reporter: Geoff Quested-Joens
 Attachments: crash.parquet

On reading a Parquet file using pyarrow, the program spontaneously exits with 
no thrown exceptions, on Windows only. Testing the same setup on Linux (Debian 
10 in Docker), the same Parquet file is read without issue.

The following can reproduce the crash in a Python 3.8.2 environment (env 
listed below, but it is essentially pip install pandas and pyarrow).
{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def test_pandas_write_read():
    df_out = pd.DataFrame.from_dict([{"A": i} for i in range(3)])
    df_out.to_parquet("crash.parquet")
    df_in = pd.read_parquet("crash.parquet")
    print(df_in)

def test_arrow_write_read():
    df = pd.DataFrame.from_dict([{"A": i} for i in range(3)])
    table_out = pa.Table.from_pandas(df)
    pq.write_table(table_out, 'crash.parquet')
    table_in = pq.read_table('crash.parquet')
    print(table_in)

if __name__ == "__main__":
    test_pandas_write_read()
    test_arrow_write_read()
{code}
The interpreter never reaches the print statements, crashing somewhere in the 
call on line 252 of {{parquet.py}}; no error is thrown, just spontaneous program 
exit.
{code:python}
self.reader.read_all(...
{code}
In contrast running the same code and python environment in debian 10 there is 
no error reading the parquet files generated by the same windows code. The 
sha2sum compare equal for the crash.parquet generated running on debian and 
windows so something appears to be up with the read. Attached is the 
crash.parquet file generated on my machine.

Oddly, changing the {{range(3)}} to {{range(2)}} gets rid of the crash on 
Windows.





[jira] [Created] (ARROW-8384) [C++][Python] arrow/filesystem/hdfs.h and Python wrapper does not have an option for setting a path to a Kerberos ticket

2020-04-09 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8384:
---

 Summary: [C++][Python] arrow/filesystem/hdfs.h and Python wrapper 
does not have an option for setting a path to a Kerberos ticket
 Key: ARROW-8384
 URL: https://issues.apache.org/jira/browse/ARROW-8384
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney


This feature seems to have been dropped.

Is there a plan for migrating users to the new filesystem API? We have two 
different code paths now.





[jira] [Created] (ARROW-8383) [RUST] Easier random access to DictionaryArray keys and values

2020-04-09 Thread Jira
Jörn Horstmann created ARROW-8383:
-

 Summary: [RUST] Easier random access to DictionaryArray keys and 
values
 Key: ARROW-8383
 URL: https://issues.apache.org/jira/browse/ARROW-8383
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Jörn Horstmann


Currently it's not that clear how to access DictionaryArray keys and values 
using random indices.

The `DictionaryArray::keys` method exposes an Iterator with an `nth` method, 
but this requires a mut reference and feels a little bit out of place compared 
to other methods of accessing arrow data.

Another alternative seems to be to use the `From<ArrayDataRef> for 
PrimitiveArray<T>` conversion, like so: `let keys: Int16Array = 
dictionary_array.data().into()`. This seems to work fine but is not easily 
discoverable, and also needs to be done outside of any loops for performance 
reasons.

I'd like methods on `DictionaryArray` to directly get the key at some index

```
 pub fn key(&self, i: usize) -> &K
```

Ideally I'd also like an easier way to directly access values at some index, at 
least when those are primitive or string types.

```
pub fn value(&self, i: usize) -> &T
```

I'm not sure how or if that would be possible to implement with rust generics.

 





[jira] [Created] (ARROW-8382) [C++][Dataset] Refactor WritePlan to decouple from Fragment/Scan/Partition classes

2020-04-09 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-8382:
-

 Summary: [C++][Dataset] Refactor WritePlan to decouple from 
Fragment/Scan/Partition classes 
 Key: ARROW-8382
 URL: https://issues.apache.org/jira/browse/ARROW-8382
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Francois Saint-Jacques


WritePlan should look like the following. 

{code:c++}
class ARROW_DS_EXPORT WritePlan {
 public:
  /// Execute the WritePlan and return a FileSystemDataset as a result.
  Result<std::shared_ptr<FileSystemDataset>> Execute();

 protected:
  /// The schema of the Dataset which will be written
  std::shared_ptr<Schema> schema;

  /// The format into which fragments will be written
  std::shared_ptr<FileFormat> format;

  using SourceAndReader = std::pair<FileSource, std::shared_ptr<RecordBatchReader>>;
  ///
  std::vector<SourceAndReader> outputs;
};
{code}

* Refactor FileFormat::Write(FileSource destination, RecordBatchReader); not 
sure if it should take the output schema, or whether the RecordBatchReader 
should already have the right schema.
* Add a class/function that constructs SourceAndReader from Fragments, 
Partitioning and a base path, and remove any Write/Fragment logic from 
partition.cc.
* Move Write() out of FileSystemDataset into WritePlan. It could take a 
FileSystemDatasetFactory to recreate the FileSystemDataset. This is a bonus, 
not a requirement.
* Simplify the writing routine to avoid the PathTree directory structure; it 
shouldn't be more complex than `for task in write_tasks: task()`. No path 
construction should happen there.

The effects are:
* Simplified WritePlan execution, abstracted away from path construction; it 
can write to multiple FileSystems and/or Buffers since it doesn't construct 
the FileSource.
* By virtue of using RecordBatchReader instead of Fragment, writing isn't tied 
to Fragments; it can take any construct that yields a RecordBatchReader. It 
also means that WritePlan doesn't have to know about any Scan-related classes.
* Writing can be done with or without partitioning; this logic is given to 
whoever generates the SourceAndReader list.
* Should be simpler to test.





[jira] [Created] (ARROW-8381) [C++][Dataset] Dataset writing should require a writer schema

2020-04-09 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-8381:
-

 Summary: [C++][Dataset] Dataset writing should require a writer 
schema
 Key: ARROW-8381
 URL: https://issues.apache.org/jira/browse/ARROW-8381
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++ - Dataset
Reporter: Francois Saint-Jacques


# Dataset writing should always take an explicit writer schema instead of the 
first fragment's schema.
# The MakeWritePlanImpl should not try removing columns that are found in the 
partition; this is left to the caller by passing an explicit schema.







[jira] [Created] (ARROW-8380) [RUST] StringDictionaryBuilder not publicly exported from arrow::array

2020-04-09 Thread Jira
Jörn Horstmann created ARROW-8380:
-

 Summary: [RUST] StringDictionaryBuilder not publicly exported from 
arrow::array
 Key: ARROW-8380
 URL: https://issues.apache.org/jira/browse/ARROW-8380
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Reporter: Jörn Horstmann








[jira] [Created] (ARROW-8379) [R] Investigate/fix thread safety issues (esp. Windows)

2020-04-08 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-8379:
--

 Summary: [R] Investigate/fix thread safety issues (esp. Windows)
 Key: ARROW-8379
 URL: https://issues.apache.org/jira/browse/ARROW-8379
 Project: Apache Arrow
  Issue Type: New Feature
  Components: R
Reporter: Neal Richardson


There have been a number of issues where the R bindings' multithreading has 
been implicated in unstable behavior (ARROW-7844 for example). In ARROW-8375 I 
disabled {{use_threads}} in the Windows tests, and it appeared that the 
mysterious Windows segfaults stopped. We should fix whatever the underlying 
issues are.





[jira] [Created] (ARROW-8378) [Python] "empty" dtype metadata leads to wrong Parquet column type

2020-04-08 Thread Diego Argueta (Jira)
Diego Argueta created ARROW-8378:


 Summary: [Python] "empty" dtype metadata leads to wrong Parquet 
column type
 Key: ARROW-8378
 URL: https://issues.apache.org/jira/browse/ARROW-8378
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.16.0
 Environment: Python: 3.7.6
Pandas: 0.24.1, 0.25.3, 1.0.3
Pyarrow: 0.16.0
OS: OSX 10.15.3
Reporter: Diego Argueta


Run the following code with Pandas 0.24.x-1.0.x, and PyArrow 0.16.0 on Python 
3.7:
{code:python}
import pandas as pd
import numpy as np

df_1 = pd.DataFrame({'col': [None, None, None]})
df_1.col = df_1.col.astype(np.unicode_)
df_1.to_parquet('right.parq', engine='pyarrow')

series = pd.Series([None, None, None], dtype=np.unicode_)
df_2 = pd.DataFrame({'col': series})
df_2.to_parquet('wrong.parq', engine='pyarrow')
{code}
Examine the Parquet column type for each file (I use 
[parquet-tools|https://github.com/wesleypeck/parquet-tools]). {{right.parq}} 
has the expected UTF-8 string type. {{wrong.parq}} has an {{INT32}}.

The following metadata is stored in the Parquet files:

{{right.parq}}
{code:json}
{
  "column_indexes": [],
  "columns": [
{
  "field_name": "col",
  "metadata": null,
  "name": "col",
  "numpy_type": "object",
  "pandas_type": "unicode"
}
  ],
  "index_columns": [],
  "pandas_version": "0.24.1"
}
{code}
{{wrong.parq}}
{code:json}
{
  "column_indexes": [],
  "columns": [
{
  "field_name": "col",
  "metadata": null,
  "name": "col",
  "numpy_type": "object",
  "pandas_type": "empty"
}
  ],
  "index_columns": [],
  "pandas_version": "0.24.1"
}
{code}
The difference between the two is that the {{pandas_type}} for the incorrect 
file is "empty" rather than the expected "unicode". PyArrow misinterprets this 
and defaults to a 32-bit integer column.

The incorrect datatype will cause Redshift to reject the file when we try to 
read it because the column type in the file doesn't match the column type in 
the database table.

I originally filed this as a bug in Pandas (see [this 
ticket|https://github.com/pandas-dev/pandas/issues/25326]) but they punted me 
over here because the dtype conversion is handled in PyArrow. I'm not sure how 
you'd handle this here.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8377) [CI][C++][R] Build and run C++ tests on Rtools build

2020-04-08 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-8377:
--

 Summary: [CI][C++][R] Build and run C++ tests on Rtools build
 Key: ARROW-8377
 URL: https://issues.apache.org/jira/browse/ARROW-8377
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++, Continuous Integration, R
Reporter: Neal Richardson


Maybe this will better identify our unexplained segfaults.





[jira] [Created] (ARROW-8376) [R] Add experimental interface to ScanTask/RecordBatch iterators

2020-04-08 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-8376:
--

 Summary: [R] Add experimental interface to ScanTask/RecordBatch 
iterators
 Key: ARROW-8376
 URL: https://issues.apache.org/jira/browse/ARROW-8376
 Project: Apache Arrow
  Issue Type: New Feature
  Components: R
Reporter: Neal Richardson
Assignee: Neal Richardson








[jira] [Created] (ARROW-8375) [CI][R] Make Windows tests more verbose in case of segfault

2020-04-08 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-8375:
--

 Summary: [CI][R] Make Windows tests more verbose in case of 
segfault
 Key: ARROW-8375
 URL: https://issues.apache.org/jira/browse/ARROW-8375
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration, R
Reporter: Neal Richardson
Assignee: Neal Richardson








[jira] [Created] (ARROW-8374) [R] Table to vector of DictonaryType will error when Arrays don't have the same Dictionary per array

2020-04-08 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-8374:
-

 Summary: [R] Table to vector of DictonaryType will error when 
Arrays don't have the same Dictionary per array
 Key: ARROW-8374
 URL: https://issues.apache.org/jira/browse/ARROW-8374
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Francois Saint-Jacques


The conversion should unify the dictionaries before converting; otherwise the 
indices are simply broken.





[jira] [Created] (ARROW-8373) [GLib] Problems resolving gobject-introspection, arrow in Meson builds

2020-04-08 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8373:
---

 Summary: [GLib] Problems resolving gobject-introspection, arrow in 
Meson builds
 Key: ARROW-8373
 URL: https://issues.apache.org/jira/browse/ARROW-8373
 Project: Apache Arrow
  Issue Type: Bug
  Components: GLib
Reporter: Wes McKinney
 Fix For: 0.17.0


See example failure 
https://github.com/apache/arrow/pull/6872/checks?check_run_id=571082161





[jira] [Created] (ARROW-8372) [C++] Add Result to table / record batch APIs

2020-04-08 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-8372:
-

 Summary: [C++] Add Result to table / record batch APIs
 Key: ARROW-8372
 URL: https://issues.apache.org/jira/browse/ARROW-8372
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: C++
Reporter: Micah Kornfield
Assignee: Antoine Pitrou








[jira] [Created] (ARROW-8371) [Crossbow] Implement and exercise sanity checks for tasks.yml

2020-04-08 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8371:
--

 Summary: [Crossbow] Implement and exercise sanity checks for 
tasks.yml 
 Key: ARROW-8371
 URL: https://issues.apache.org/jira/browse/ARROW-8371
 Project: Apache Arrow
  Issue Type: Task
  Components: Continuous Integration, Packaging
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs


See conversation at 
https://github.com/apache/arrow/pull/6868#issuecomment-610721717





[jira] [Created] (ARROW-8370) [C++] Add Result to type / schema APIs

2020-04-08 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-8370:
-

 Summary: [C++] Add Result to type / schema APIs
 Key: ARROW-8370
 URL: https://issues.apache.org/jira/browse/ARROW-8370
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: C++
Reporter: Micah Kornfield


Buffers, Array builders (anything in the parent src/arrow root directory).





[jira] [Created] (ARROW-8369) [CI] Fix crossbow R group

2020-04-07 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-8369:
--

 Summary: [CI] Fix crossbow R group
 Key: ARROW-8369
 URL: https://issues.apache.org/jira/browse/ARROW-8369
 Project: Apache Arrow
  Issue Type: Bug
  Components: Continuous Integration
Reporter: Neal Richardson
Assignee: Neal Richardson


This was broken in ARROW-8356





[jira] [Created] (ARROW-8368) [Format] In C interface, clarify resource management for consumers needing only a subset of child fields in ArrowArray

2020-04-07 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8368:
---

 Summary: [Format] In C interface, clarify resource management for 
consumers needing only a subset of child fields in ArrowArray
 Key: ARROW-8368
 URL: https://issues.apache.org/jira/browse/ARROW-8368
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Format
Reporter: Wes McKinney


The current implication of the C Interface is that only moving a single child 
out of an ArrowArray is allowed. 

Questions:

* Should it be allowed to move multiple children, as long as they are moved at 
the same time, and the parent is released after?
* In the event that children have disjoint internal resources, should there be 
a clarification around moved children having their resources released 
independently?

See mailing list discussion 
https://lists.apache.org/thread.html/r92b77e0fa7bed384daa377e2178bc8e8ca46103928598050341e40b1%40%3Cdev.arrow.apache.org%3E





[jira] [Created] (ARROW-8367) [C++] Is FromString(..., pool) worthwhile

2020-04-07 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-8367:
---

 Summary: [C++] Is FromString(..., pool) worthwhile
 Key: ARROW-8367
 URL: https://issues.apache.org/jira/browse/ARROW-8367
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.16.0
Reporter: Ben Kietzman
 Fix For: 1.0.0


From [https://github.com/apache/arrow/pull/6863#discussion_r404913683]

There are currently two overloads of {{Buffer::FromString}}: one which takes an 
rvalue reference to string, and another which takes a const reference and a 
MemoryPool. In the former case the string is simply moved into a Buffer 
subclass, while in the latter the MemoryPool is used to allocate space into 
which the string's contents are copied, which necessitates bubbling up the 
potential allocation failure. This seems gratuitous given we don't use 
{{std::string}} to store large quantities, so it should be fine to provide only
{code:c++}
  static std::unique_ptr<Buffer> FromString(std::string data);
{code}
and rely on {{std::string}}'s copy constructor when the argument is not an 
rvalue.

In the case of a {{std::string}} which may/does contain large data and must be 
copied, tracking the copied memory with a MemoryPool does not require a great 
deal of boilerplate:
{code:c++}
ARROW_ASSIGN_OR_RAISE(auto buffer,
  Buffer(large).CopySlice(0, large.size(), pool));
{code}





[jira] [Created] (ARROW-8366) [Rust] Need to revert recent arrow-flight build change

2020-04-07 Thread Andy Grove (Jira)
Andy Grove created ARROW-8366:
-

 Summary: [Rust] Need to revert recent arrow-flight build change
 Key: ARROW-8366
 URL: https://issues.apache.org/jira/browse/ARROW-8366
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 0.17.0


The PR [1] merged for ARROW-7794 causes problems for projects that have a 
dependency on this crate, where the build.rs code becomes an infinite loop 
looking for a parent directory named "arrow" that doesn't exist.

This PR simply reverts that change. I will need to find a better approach to 
resolving the original issue.

 [1] https://github.com/apache/arrow/pull/6858





[jira] [Created] (ARROW-8365) arrow-cpp: Error when writing files to S3 larger than 5 GB

2020-04-07 Thread Juan Galvez (Jira)
Juan Galvez created ARROW-8365:
--

 Summary: arrow-cpp: Error when writing files to S3 larger than 5 GB
 Key: ARROW-8365
 URL: https://issues.apache.org/jira/browse/ARROW-8365
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.16.0
Reporter: Juan Galvez


When purely using the arrow-cpp library to write to S3, I get the following 
error when trying to write a large Arrow table to S3 (resulting in a file size 
larger than 5 GB):

{{../src/arrow/io/interfaces.cc:219: Error ignored when destroying file of type 
N5arrow2fs12_GLOBAL__N_118ObjectOutputStreamE: IOError: When uploading part for 
key 'test01.parquet/part-00.parquet' in bucket 'test': AWS Error [code 100]: 
Unable to parse ExceptionName: EntityTooLarge Message: Your proposed upload 
exceeds the maximum allowed size with address : 52.219.100.32}}

I have diagnosed the problem by looking at and modifying the code in 
*{{s3fs.cc}}*. The code uses multipart upload, and uses 5 MB chunks for the 
first 100 parts. After it has submitted the first 100 parts, it is supposed to 
increase the size of the chunks to 10 MB (the part upload threshold or 
{{part_upload_threshold_}}). The issue is that the threshold is increased 
inside {{DoWrite}}, and {{DoWrite}} can be called multiple times before the 
current part is uploaded, which ultimately causes the threshold to keep getting 
increased indefinitely, and the last part ends up surpassing the 5 GB part 
upload limit of AWS/S3.

This issue, where the last part is much larger than it should be, can happen 
(I'm pretty sure) every time a multi-part upload exceeds 100 parts, but the 
error is only thrown if the last part is larger than 5 GB. Therefore it is only 
observed with very large uploads.

I can confirm that the bug does not happen if I move this:

{code:c++}
if (part_number_ % 100 == 0) {
  part_upload_threshold_ += kMinimumPartUpload;
}
{code}

into a different method, right before the line that does {{++part_number_}}.
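A toy model of the bookkeeping bug (a hypothetical simulation, not the actual s3fs.cc code): when the threshold bump happens inside DoWrite, which may run several times per uploaded part, the threshold grows once per call; bumping it once per uploaded part keeps it bounded:

```python
KMIN = 5 * 1024 * 1024  # 5 MiB minimum part size

def final_threshold(parts, do_write_calls_per_part, bump_in_do_write):
    threshold = KMIN
    part_number = 1
    for _ in range(parts):
        for _ in range(do_write_calls_per_part):
            if bump_in_do_write and part_number % 100 == 0:
                threshold += KMIN  # buggy: bumped once per DoWrite call
        if not bump_in_do_write and part_number % 100 == 0:
            threshold += KMIN      # fixed: bumped once per uploaded part
        part_number += 1
    return threshold

buggy = final_threshold(parts=200, do_write_calls_per_part=4, bump_in_do_write=True)
fixed = final_threshold(parts=200, do_write_calls_per_part=4, bump_in_do_write=False)
print(buggy // KMIN, fixed // KMIN)  # prints "9 3"
```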

 





[jira] [Created] (ARROW-8364) Get Access to the type_to_type_id dictionary

2020-04-07 Thread Or (Jira)
Or created ARROW-8364:
-

 Summary: Get Access to the type_to_type_id dictionary
 Key: ARROW-8364
 URL: https://issues.apache.org/jira/browse/ARROW-8364
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Or


Hi,
h3. *The Problem:*

Currently, if I try to serialize an object that can't be serialized by the 
default serialization context, I get a SerializationCallbackError. So the 
problem is that I have to try to serialize the object in order to know whether 
it is serializable by the package. That can be a very expensive operation for a 
simple check if the object contains a large amount of data.

h3. *The Requested Improvement / Feature:*

A function that checks whether the type of the object I'm about to serialize is 
serializable by the package (meaning it is registered in the type_to_type_id 
dictionary).
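A sketch of the requested check over a plain type-to-type-id registry (all names here are hypothetical illustrations; pyarrow's actual `type_to_type_id` mapping is an internal attribute of its SerializationContext, not a public API):

```python
class SerializationRegistry:
    """Toy stand-in for a serialization context's type registry."""

    def __init__(self):
        self.type_to_type_id = {}

    def register_type(self, typ, type_id):
        self.type_to_type_id[typ] = type_id

    def is_serializable(self, obj):
        # Cheap membership check: no serialization work is performed,
        # which is the whole point of the requested feature.
        return type(obj) in self.type_to_type_id

registry = SerializationRegistry()
registry.register_type(complex, "py_complex")
print(registry.is_serializable(1 + 2j))    # prints "True"
print(registry.is_serializable(object()))  # prints "False"
```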
 





[jira] [Created] (ARROW-8363) [Archery] Comment bot should report any errors happening during crossbow submit

2020-04-07 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8363:
--

 Summary: [Archery] Comment bot should report any errors happening 
during crossbow submit
 Key: ARROW-8363
 URL: https://issues.apache.org/jira/browse/ARROW-8363
 Project: Apache Arrow
  Issue Type: Task
  Components: Archery
Reporter: Krisztian Szucs


We already get feedback in the GitHub comment, but no error message. 

 

Example failure 
https://github.com/apache/arrow/runs/567644496?check_suite_focus=true#step:5:42





[jira] [Created] (ARROW-8362) [Crossbow] Ensure that the locally generated version is used in the docker tasks

2020-04-07 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8362:
--

 Summary: [Crossbow] Ensure that the locally generated version is 
used in the docker tasks
 Key: ARROW-8362
 URL: https://issues.apache.org/jira/browse/ARROW-8362
 Project: Apache Arrow
  Issue Type: Task
  Components: Packaging
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs
 Fix For: 0.17.0


An Arrow fork might not have the version tags, so the SCM-based version 
generation can't work.

Pass the locally detected version to the docker builds.





[jira] [Created] (ARROW-8361) [C++] Add Result APIs to Buffer methods and functions

2020-04-07 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-8361:
-

 Summary: [C++] Add Result APIs to Buffer methods and functions
 Key: ARROW-8361
 URL: https://issues.apache.org/jira/browse/ARROW-8361
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: C++
Reporter: Micah Kornfield
Assignee: Antoine Pitrou








[jira] [Created] (ARROW-8360) Fixes date32 support for date/time functions

2020-04-06 Thread Yuan Zhou (Jira)
Yuan Zhou created ARROW-8360:


 Summary: Fixes date32 support for date/time functions
 Key: ARROW-8360
 URL: https://issues.apache.org/jira/browse/ARROW-8360
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++ - Gandiva
Reporter: Yuan Zhou
Assignee: Yuan Zhou


Gandiva date/time functions like extractYear[1] only work with milliseconds; 
passing date32 values to these functions will give wrong results.


[1]https://github.com/apache/arrow/blob/6d92694d00aec08081ae1bfe06f0a265e141b1b7/cpp/src/gandiva/precompiled/time.cc#L75-L80
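The failure mode can be illustrated in plain Python (an illustration of the unit mismatch, not Gandiva code): a date32 value counts days since the epoch, so feeding it to a function that expects milliseconds since the epoch lands near 1970:

```python
from datetime import date, datetime, timezone

d = date(2020, 4, 6)
days = (d - date(1970, 1, 1)).days  # date32 representation: days since epoch
millis = days * 86_400_000          # correct conversion to milliseconds

def extract_year_from_millis(ms):
    # Mimics a millisecond-based extractYear.
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc).year

print(extract_year_from_millis(millis))  # prints "2020"
print(extract_year_from_millis(days))    # prints "1970" (days misread as millis)
```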






[jira] [Created] (ARROW-8359) [C++/Python] Enable aarch64/ppc64le build in conda recipes

2020-04-06 Thread Uwe Korn (Jira)
Uwe Korn created ARROW-8359:
---

 Summary: [C++/Python] Enable aarch64/ppc64le build in conda recipes
 Key: ARROW-8359
 URL: https://issues.apache.org/jira/browse/ARROW-8359
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Packaging, Python
Reporter: Uwe Korn
 Fix For: 0.17.0


These two new arches were added in the conda recipes; we should also build them 
as nightlies.





[jira] [Created] (ARROW-8358) [C++] Fix -Wrange-loop-construct warnings in clang-11

2020-04-06 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8358:
---

 Summary: [C++] Fix -Wrange-loop-construct warnings in clang-11 
 Key: ARROW-8358
 URL: https://issues.apache.org/jira/browse/ARROW-8358
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney


We might change one of our CI entries to use clang-11 so we get some more 
bleeding-edge compiler warnings, to get out ahead of things.





[jira] [Created] (ARROW-8357) [Rust] [DataFusion] Dockerfile for CLI is missing format dir

2020-04-06 Thread Andy Grove (Jira)
Andy Grove created ARROW-8357:
-

 Summary: [Rust] [DataFusion] Dockerfile for CLI is missing format 
dir
 Key: ARROW-8357
 URL: https://issues.apache.org/jira/browse/ARROW-8357
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust, Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 0.17.0


{code:java}
error: failed to run custom build command for `arrow-flight v1.0.0-SNAPSHOT 
(/arrow/rust/arrow-flight)`Caused by:
  process didn't exit successfully: 
`/arrow/rust/target/release/build/arrow-flight-a0fb14daffea70f5/build-script-build`
 (exit code: 1)
--- stderr
Error: Custom { kind: Other, error: "protoc failed: ../../format: warning: 
directory does not exist.\nCould not make proto path relative: 
../../format/Flight.proto: No such file or directory\n" }warning: build failed, 
waiting for other jobs to finish...
error: failed to compile `datafusion v1.0.0-SNAPSHOT (/arrow/rust/datafusion)`, 
intermediate artifacts can be found at `/arrow/rust/target`Caused by:
  build failed
 {code}





[jira] [Created] (ARROW-8356) [Developer] Support * wildcards with "crossbow submit" via GitHub actions

2020-04-06 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8356:
---

 Summary: [Developer] Support * wildcards with "crossbow submit" 
via GitHub actions
 Key: ARROW-8356
 URL: https://issues.apache.org/jira/browse/ARROW-8356
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Developer Tools
Reporter: Wes McKinney


While the "group" feature can be useful, sometimes there is a set of builds 
that does not fit neatly into any particular group.





[jira] [Created] (ARROW-8355) [Python] Reduce the number of pandas dependent test cases in test_feather

2020-04-06 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8355:
--

 Summary: [Python] Reduce the number of pandas dependent test cases 
in test_feather
 Key: ARROW-8355
 URL: https://issues.apache.org/jira/browse/ARROW-8355
 Project: Apache Arrow
  Issue Type: Task
  Components: Python
Reporter: Krisztian Szucs
 Fix For: 1.0.0


See comment https://github.com/apache/arrow/pull/6849#discussion_r404160096





[jira] [Created] (ARROW-8354) [C++][R] Segfault in test-dataset.r

2020-04-06 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-8354:
-

 Summary: [C++][R] Segfault in test-dataset.r
 Key: ARROW-8354
 URL: https://issues.apache.org/jira/browse/ARROW-8354
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Dataset, R
Reporter: Francois Saint-Jacques


See https://github.com/fsaintjacques/arrow/runs/564315427#step:6:2169





[jira] [Created] (ARROW-8353) [C++] is_nullable maybe not initialized in parquet writer

2020-04-06 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-8353:
--

 Summary: [C++] is_nullable maybe not initialized in parquet writer
 Key: ARROW-8353
 URL: https://issues.apache.org/jira/browse/ARROW-8353
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Neal Richardson


From the Rtools build:

{code}
[ 84%] Building CXX object 
src/parquet/CMakeFiles/parquet_static.dir/column_reader.cc.obj
In file included from D:/a/arrow/arrow/cpp/src/arrow/io/concurrency.h:23:0,
 from D:/a/arrow/arrow/cpp/src/arrow/io/memory.h:25,
 from D:/a/arrow/arrow/cpp/src/parquet/platform.h:25,
 from D:/a/arrow/arrow/cpp/src/parquet/arrow/writer.h:23,
 from D:/a/arrow/arrow/cpp/src/parquet/arrow/writer.cc:18:
D:/a/arrow/arrow/cpp/src/arrow/result.h: In member function 'virtual 
arrow::Status parquet::arrow::FileWriterImpl::WriteColumnChunk(const 
std::shared_ptr&, int64_t, int64_t)':
D:/a/arrow/arrow/cpp/src/arrow/result.h:428:28: warning: 'is_nullable' may be 
used uninitialized in this function [-Wmaybe-uninitialized]
   auto result_name = (rexpr);   \
^
D:/a/arrow/arrow/cpp/src/parquet/arrow/writer.cc:430:10: note: 'is_nullable' 
was declared here
 bool is_nullable;
  ^
{code}

I'd give it a default value, but IDK that it's that simple.





[jira] [Created] (ARROW-8352) [R] Add install_pyarrow()

2020-04-06 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-8352:
--

 Summary: [R] Add install_pyarrow()
 Key: ARROW-8352
 URL: https://issues.apache.org/jira/browse/ARROW-8352
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Neal Richardson
Assignee: Neal Richardson


To facilitate installation for use with reticulate, including handling how to 
use the nightly packages.





[jira] [Created] (ARROW-8351) [R][CI] Store the Rtools-built Arrow C++ library as a build artifact

2020-04-06 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-8351:
--

 Summary: [R][CI] Store the Rtools-built Arrow C++ library as a 
build artifact
 Key: ARROW-8351
 URL: https://issues.apache.org/jira/browse/ARROW-8351
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Neal Richardson
Assignee: Neal Richardson


To help with debugging unexplained segfaults.





[jira] [Created] (ARROW-8350) [Python] Implement to_numpy on ChunkedArray

2020-04-06 Thread Uwe Korn (Jira)
Uwe Korn created ARROW-8350:
---

 Summary: [Python] Implement to_numpy on ChunkedArray
 Key: ARROW-8350
 URL: https://issues.apache.org/jira/browse/ARROW-8350
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Uwe Korn


We support {{to_numpy}} on Array instances but not on {{ChunkedArray}} 
instances. It would be quite useful to have it there as well, e.g. to support 
returning non-nanosecond datetime instances.
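A minimal sketch of what a ChunkedArray conversion could do: each chunk is an Array that already supports {{to_numpy}}, so convert chunk by chunk and concatenate. The ndarrays below stand in for per-chunk {{to_numpy()}} results; this is an illustration, not pyarrow's implementation.

```python
import numpy as np

# Stand-ins for the results of calling to_numpy() on each chunk.
chunk_results = [np.array([1, 2]), np.array([3, 4]), np.array([5])]

def chunked_to_numpy(chunk_results):
    """Concatenate per-chunk conversions into one contiguous ndarray."""
    return np.concatenate(chunk_results)

print(chunked_to_numpy(chunk_results))  # [1 2 3 4 5]
```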





[jira] [Created] (ARROW-8349) [CI][NIGHTLY:gandiva-jar-osx] Use latest pygit2

2020-04-06 Thread Prudhvi Porandla (Jira)
Prudhvi Porandla created ARROW-8349:
---

 Summary: [CI][NIGHTLY:gandiva-jar-osx] Use latest pygit2
 Key: ARROW-8349
 URL: https://issues.apache.org/jira/browse/ARROW-8349
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Prudhvi Porandla
Assignee: Prudhvi Porandla


Now that Homebrew provides a compatible libgit2 version, we can use the latest pygit2.





[jira] [Created] (ARROW-8348) [C++] Support optional sentinel values in primitive Array for nulls

2020-04-06 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-8348:
-

 Summary: [C++] Support optional sentinel values in primitive Array 
for nulls
 Key: ARROW-8348
 URL: https://issues.apache.org/jira/browse/ARROW-8348
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Francois Saint-Jacques


This is an optional feature where a sentinel value is stored in null cells and 
is exposed via an accessor method, e.g. `optional Array::HasSentinel() 
const;`. This would allow zero-copy bi-directional conversion with R.
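The sentinel idea can be sketched in plain Python. R marks a null integer with the in-band value INT32_MIN ({{NA_integer_}}), so storing that sentinel in null cells lets a buffer be handed to R without rewriting it. Function names here are illustrative, not the proposed C++ API:

```python
INT32_NA = -2**31  # R's NA_integer_ sentinel (INT32_MIN)

def encode_nulls(values):
    """Put the sentinel in null cells (alongside the usual validity bitmap)."""
    return [INT32_NA if v is None else v for v in values]

def decode_nulls(buffer):
    """Recover nulls from sentinel-marked cells."""
    return [None if v == INT32_NA else v for v in buffer]

assert decode_nulls(encode_nulls([1, None, 3])) == [1, None, 3]
```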





[jira] [Created] (ARROW-8347) [C++] Add Result to Array methods

2020-04-06 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-8347:
-

 Summary: [C++] Add Result to Array methods
 Key: ARROW-8347
 URL: https://issues.apache.org/jira/browse/ARROW-8347
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: C++
Reporter: Micah Kornfield


Buffers, Array builders (anything in the src/arrow root directory).





[jira] [Created] (ARROW-8346) [CI][Ruby] GLib/Ruby macOS build fails on zlib

2020-04-06 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-8346:
--

 Summary: [CI][Ruby] GLib/Ruby macOS build fails on zlib
 Key: ARROW-8346
 URL: https://issues.apache.org/jira/browse/ARROW-8346
 Project: Apache Arrow
  Issue Type: Bug
  Components: Continuous Integration, GLib
Reporter: Neal Richardson
 Fix For: 0.17.0


See https://github.com/apache/arrow/runs/564610412 for example.

{code}
Using 'PKG_CONFIG_PATH' from environment with value: '/usr/local/lib/pkgconfig'
Run-time dependency gobject-2.0 found: YES 2.64.1
Run-time dependency gio-2.0 found: NO (tried framework and cmake)

c_glib/arrow-glib/meson.build:210:0: ERROR: Could not generate cargs for 
gio-2.0:
Package zlib was not found in the pkg-config search path.
Perhaps you should add the directory containing `zlib.pc'
to the PKG_CONFIG_PATH environment variable
Package 'zlib', required by 'gio-2.0', not found


A full log can be found at 
/Users/runner/runners/2.168.0/work/arrow/arrow/build/c_glib/meson-logs/meson-log.txt
{code}
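The failing step is pkg-config resolution: when resolving `zlib`, pkg-config scans each directory on `PKG_CONFIG_PATH` for a `zlib.pc` file, and the error above means no directory on that path contained one. A simplified sketch of that lookup:

```python
import os

def find_pc_file(name, search_path):
    """Return the first <name>.pc found on a PKG_CONFIG_PATH-style path."""
    for directory in search_path.split(os.pathsep):
        candidate = os.path.join(directory, name + ".pc")
        if os.path.isfile(candidate):
            return candidate
    return None
```

Appending the directory that actually contains `zlib.pc` to `PKG_CONFIG_PATH` (as the error message suggests) would make this lookup succeed.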





[jira] [Created] (ARROW-8345) [Python] feather.read_table should not require pandas

2020-04-06 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8345:


 Summary: [Python] feather.read_table should not require pandas
 Key: ARROW-8345
 URL: https://issues.apache.org/jira/browse/ARROW-8345
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche
Assignee: Joris Van den Bossche
 Fix For: 0.17.0


We still check the pandas version, while pandas is not actually needed. Will do 
a quick fix.
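The usual shape of such a fix is the optional-dependency pattern: only the code path that actually returns a pandas object checks for pandas. The function bodies and names below are illustrative, not pyarrow's real code:

```python
import importlib.util

def read_table(source):
    # Pure-Arrow path: pandas is never imported or checked.
    return ("table", source)

def read_feather(source):
    # Only this pandas-returning path verifies the dependency.
    if importlib.util.find_spec("pandas") is None:
        raise ImportError("read_feather returns a pandas.DataFrame; "
                          "install pandas or use read_table instead")
    return ("dataframe", source)
```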





[jira] [Created] (ARROW-8344) [C#] StringArray.Builder.Clear() corrupts subsequent array contents

2020-04-06 Thread Adam Szmigin (Jira)
Adam Szmigin created ARROW-8344:
---

 Summary: [C#] StringArray.Builder.Clear() corrupts subsequent 
array contents
 Key: ARROW-8344
 URL: https://issues.apache.org/jira/browse/ARROW-8344
 Project: Apache Arrow
  Issue Type: Bug
  Components: C#
Affects Versions: 0.16.0
 Environment: Windows 10 x64
Reporter: Adam Szmigin


h1. Summary

Using the {{Clear()}} method on a {{StringArray.Builder}} causes all 
subsequently built arrays to contain strings consisting solely of whitespace. 
The minimal example below illustrates this:
{code:java}
namespace ArrowStringArrayBuilderBug
{
using Apache.Arrow;
using Apache.Arrow.Memory;

public class Program
{
private static readonly NativeMemoryAllocator Allocator
= new NativeMemoryAllocator();

public static void Main()
{
var builder = new StringArray.Builder();
AppendBuildPrint(builder, "Hello", "World");
builder.Clear();
AppendBuildPrint(builder, "Foo", "Bar");
}

private static void AppendBuildPrint(
StringArray.Builder builder, params string[] strings)
{
foreach (var elem in strings)
builder.Append(elem);

var arr = builder.Build(Allocator);
System.Console.Write("Array contents: [");
for (var i = 0; i < arr.Length; i++)
{
if (i > 0) System.Console.Write(", ");
System.Console.Write($"'{arr.GetString(i)}'");
}
System.Console.WriteLine("]");
}
}
{code}
h2. Expected Output
{noformat}
Array contents: ['Hello', 'World']
Array contents: ['Foo', 'Bar']
{noformat}
h2. Actual Output
{noformat}
Array contents: ['Hello', 'World']
Array contents: ['   ', '   ']
{noformat}
h1. Workaround

The bug can be trivially worked around by constructing a new 
{{StringArray.Builder}} instead of calling {{Clear()}}.

The issue ARROW-7040 mentions other issues with string arrays in C#, but I'm 
not sure if this is related or not.





[jira] [Created] (ARROW-8343) [GLib] Add GArrowRecordBatchIterator

2020-04-06 Thread Kenta Murata (Jira)
Kenta Murata created ARROW-8343:
---

 Summary: [GLib] Add GArrowRecordBatchIterator
 Key: ARROW-8343
 URL: https://issues.apache.org/jira/browse/ARROW-8343
 Project: Apache Arrow
  Issue Type: New Feature
  Components: GLib
Reporter: Kenta Murata
Assignee: Kenta Murata








[jira] [Created] (ARROW-8342) [Python] dask and kartothek integration tests are failing

2020-04-05 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8342:


 Summary: [Python] dask and kartothek integration tests are failing
 Key: ARROW-8342
 URL: https://issues.apache.org/jira/browse/ARROW-8342
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 0.17.0


The integration tests for both dask and kartothek, for both master and the 
latest released version of each, started failing in the last few days.

Dask latest: 
https://circleci.com/gh/ursa-labs/crossbow/10629?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link
 
Kartothek latest: 
https://circleci.com/gh/ursa-labs/crossbow/10604?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link

I think both are related to the KeyValueMetadata changes (ARROW-8079).

The kartothek one is clearly related, as it gives: TypeError: 
'pyarrow.lib.KeyValueMetadata' object does not support item assignment

And I think the dask one is related to the "pandas" key now being present 
twice, and therefore it is using the "wrong" one.
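Since KeyValueMetadata is immutable, the failing kartothek pattern (in-place item assignment) has to become copy-modify-replace: convert to a plain dict, update that, and pass the dict wherever metadata is accepted. A plain-dict stand-in shows the shape of the workaround:

```python
metadata = {b"pandas": b"{}"}   # stand-in for an immutable schema.metadata
updated = dict(metadata)        # copy into a mutable dict
updated[b"extra"] = b"42"       # item assignment now works on the copy
assert metadata == {b"pandas": b"{}"}  # the original is untouched
```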






[jira] [Created] (ARROW-8341) [Packaging][deb] Fail to build by no disk space

2020-04-05 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-8341:
---

 Summary: [Packaging][deb] Fail to build by no disk space
 Key: ARROW-8341
 URL: https://issues.apache.org/jira/browse/ARROW-8341
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou








[jira] [Created] (ARROW-8340) [Documentation] Sphinx documentation does not build with just-released Sphinx 3.0.0

2020-04-05 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8340:
---

 Summary: [Documentation] Sphinx documentation does not build with 
just-released Sphinx 3.0.0
 Key: ARROW-8340
 URL: https://issues.apache.org/jira/browse/ARROW-8340
 Project: Apache Arrow
  Issue Type: Bug
  Components: Documentation, Python
Reporter: Wes McKinney
 Fix For: 0.17.0


I'll add a version pin in a docs PR I'm working on, but this needs to be fixed 
soon





[jira] [Created] (ARROW-8339) [C++] Possibly allow null offsets and/or data buffer for BaseBinaryArray for 0-length arrays

2020-04-05 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8339:
---

 Summary: [C++] Possibly allow null offsets and/or data buffer for 
BaseBinaryArray for 0-length arrays
 Key: ARROW-8339
 URL: https://issues.apache.org/jira/browse/ARROW-8339
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


Related to ARROW-8338. This issue was raised in ARROW-7008, but we maintained 
the status quo of requiring non-null buffers in both cases.





[jira] [Created] (ARROW-8338) [Format] Clarify whether 0-length variable offsets buffers are permissible for 0-length arrays in the IPC protocol

2020-04-05 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8338:
---

 Summary: [Format] Clarify whether 0-length variable offsets 
buffers are permissible for 0-length arrays in the IPC protocol
 Key: ARROW-8338
 URL: https://issues.apache.org/jira/browse/ARROW-8338
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation, Format
Reporter: Wes McKinney


This aspect of the columnar format / IPC protocol remains slightly unclear. As 
written, it would suggest that an offsets buffer of length 1 containing a 
single value 0 is required. It may be better to allow this to be length zero 
(corresponding to a 0-size or null buffer in the implementation).
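A worked example of the layout under discussion: an array of n variable-size binary values carries n + 1 offsets, so a 0-length array degenerates to the single-entry offsets buffer [0] that the text says is currently implied.

```python
def build_offsets(values):
    """Compute the offsets buffer for a variable-size binary array."""
    offsets = [0]
    for v in values:
        offsets.append(offsets[-1] + len(v))
    return offsets

assert build_offsets([b"ab", b"", b"cde"]) == [0, 2, 2, 5]
assert build_offsets([]) == [0]  # the 0-length case the clarification targets
```

Allowing a zero-length offsets buffer would mean accepting `[]` in place of `[0]` for the empty-array case.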





[jira] [Created] (ARROW-8337) [Release] Verify release candidate wheels without using conda

2020-04-05 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-8337:
--

 Summary: [Release] Verify release candidate wheels without using 
conda
 Key: ARROW-8337
 URL: https://issues.apache.org/jira/browse/ARROW-8337
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Developer Tools
Reporter: Neal Richardson


See final comments on ARROW-2880





[jira] [Created] (ARROW-8336) [Packaging][deb] Use libthrift-dev on Debian 10 and Ubuntu 19.10 or later

2020-04-04 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-8336:
---

 Summary: [Packaging][deb] Use libthrift-dev on Debian 10 and 
Ubuntu 19.10 or later
 Key: ARROW-8336
 URL: https://issues.apache.org/jira/browse/ARROW-8336
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou








[jira] [Created] (ARROW-8335) [Release] Add crossbow jobs to run release verification

2020-04-04 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-8335:
--

 Summary: [Release] Add crossbow jobs to run release verification
 Key: ARROW-8335
 URL: https://issues.apache.org/jira/browse/ARROW-8335
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Developer Tools
Reporter: Neal Richardson
Assignee: Neal Richardson
 Fix For: 0.17.0


Workflow: edit version number and rc number in template in 
{{dev/release/github.verify.yml}}, make PR, and do 

* {{@github-actions crossbow submit -g verify-rc}} to run everything
* {{@github-actions crossbow submit -g verify-rc-wheel|source|binary}} to run 
those groups
* Other groups at {{verify-rc-wheel|source-macos|ubuntu|windows}}, 
{{verify-rc-source-cpp|csharp|java|etc.}}
* Individual workflows at e.g. {{verify-rc-wheel-windows}}, 
{{verify-rc-source-macos-csharp}}. We could break out the wheel verification by 
python version (maybe we should), but that requires changes to the verification 
scripts themselves.

Running the main {{verify-rc}} group will put a ton of workflow svg badges on 
the PR so we can see at a glance what is passing and failing. If things fail 
when running everything, we can push fixes for the verification scripts to the 
branch and retry just the jobs that failed.
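The hierarchical group naming above can be sketched as a prefix match from a group name to its member workflows; the workflow names below are illustrative only:

```python
WORKFLOWS = [
    "verify-rc-wheel-windows",
    "verify-rc-source-macos-csharp",
    "verify-rc-binary-ubuntu",
]

def expand_group(group, workflows=WORKFLOWS):
    """Return every workflow whose name falls under the given group."""
    return [w for w in workflows if w.startswith(group)]

print(expand_group("verify-rc-wheel"))  # ['verify-rc-wheel-windows']
```

This is why {{verify-rc}} runs everything while {{verify-rc-wheel}} narrows to the wheel jobs: each group is just a longer prefix of the workflow names.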





[jira] [Created] (ARROW-8334) Missing DATE32 in Gandiva

2020-04-03 Thread Dominik Durner (Jira)
Dominik Durner created ARROW-8334:
-

 Summary: Missing DATE32 in Gandiva
 Key: ARROW-8334
 URL: https://issues.apache.org/jira/browse/ARROW-8334
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++ - Gandiva
Reporter: Dominik Durner








[jira] [Created] (ARROW-8333) [C++][CI] Always check that benchmarks compile in some C++ CI entry

2020-04-03 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8333:
---

 Summary: [C++][CI] Always check that benchmarks compile in some C++ 
CI entry
 Key: ARROW-8333
 URL: https://issues.apache.org/jira/browse/ARROW-8333
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.17.0


As exposed in ARROW-8331, apparently we do not check. 





[jira] [Created] (ARROW-8332) [C++] Require Thrift compiler to use system libthrift for Parquet build

2020-04-03 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-8332:
---

 Summary: [C++] Require Thrift compiler to use system libthrift for 
Parquet build
 Key: ARROW-8332
 URL: https://issues.apache.org/jira/browse/ARROW-8332
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou








[jira] [Created] (ARROW-8331) [C++] arrow-compute-filter-benchmark fails to compile

2020-04-03 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8331:
---

 Summary: [C++] arrow-compute-filter-benchmark fails to compile
 Key: ARROW-8331
 URL: https://issues.apache.org/jira/browse/ARROW-8331
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.17.0


Are the benchmarks not being built in CI?

{code}
../src/arrow/compute/kernels/filter_benchmark.cc:45:18: error: no matching 
function for call to 'Filter'
ABORT_NOT_OK(Filter(&ctx, Datum(array), Datum(filter), &out));
 ^~
../src/arrow/testing/gtest_util.h:109:18: note: expanded from macro 
'ABORT_NOT_OK'
auto _res = (expr); \
 ^~~~
../src/arrow/compute/kernels/filter.h:65:8: note: candidate function not 
viable: requires 5 arguments, but 4 were provided
Status Filter(FunctionContext* ctx, const Datum& values, const Datum& filter,
   ^
../src/arrow/compute/kernels/filter_benchmark.cc:66:18: error: no matching 
function for call to 'Filter'
ABORT_NOT_OK(Filter(&ctx, Datum(array), Datum(filter), &out));
 ^~
../src/arrow/testing/gtest_util.h:109:18: note: expanded from macro 
'ABORT_NOT_OK'
auto _res = (expr); \
 ^~~~
../src/arrow/compute/kernels/filter.h:65:8: note: candidate function not 
viable: requires 5 arguments, but 4 were provided
Status Filter(FunctionContext* ctx, const Datum& values, const Datum& filter,
   ^
../src/arrow/compute/kernels/filter_benchmark.cc:90:18: error: no matching 
function for call to 'Filter'
ABORT_NOT_OK(Filter(&ctx, Datum(array), Datum(filter), &out));
 ^~
../src/arrow/testing/gtest_util.h:109:18: note: expanded from macro 
'ABORT_NOT_OK'
auto _res = (expr); \
 ^~~~
../src/arrow/compute/kernels/filter.h:65:8: note: candidate function not 
viable: requires 5 arguments, but 4 were provided
Status Filter(FunctionContext* ctx, const Datum& values, const Datum& filter,
   ^
3 errors generated.
{code}





[jira] [Created] (ARROW-8330) [Documentation] The post release script generates the documentation with a development version

2020-04-03 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8330:
--

 Summary: [Documentation] The post release script generates the 
documentation with a development version
 Key: ARROW-8330
 URL: https://issues.apache.org/jira/browse/ARROW-8330
 Project: Apache Arrow
  Issue Type: Task
  Components: Documentation
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs
 Fix For: 0.17.0


See the current documentation page. Also regenerate the github page.





[jira] [Created] (ARROW-8329) [Documentation][C++] Undocumented FilterOptions argument in Filter kernel

2020-04-03 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8329:
--

 Summary: [Documentation][C++] Undocumented FilterOptions argument 
in Filter kernel
 Key: ARROW-8329
 URL: https://issues.apache.org/jira/browse/ARROW-8329
 Project: Apache Arrow
  Issue Type: Task
  Components: C++, Documentation
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs
 Fix For: 0.17.0


The documentation build fails, see 
https://github.com/apache/arrow/runs/558617620#step:6:1186





[jira] [Created] (ARROW-8328) [C++] MSVC is not respecting warning-disable flags

2020-04-03 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-8328:
---

 Summary: [C++] MSVC is not respecting warning-disable flags
 Key: ARROW-8328
 URL: https://issues.apache.org/jira/browse/ARROW-8328
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.16.0
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 1.0.0


We provide [warning-disabling flags to 
MSVC|https://github.com/apache/arrow/blob/72433c6/cpp/cmake_modules/SetupCxxFlags.cmake#L151-L153]
 including one which should disable all conversion warnings. However, this is 
not completely effective, and Appveyor will still emit conversion warnings 
(which are then treated as errors), requiring the insertion of otherwise 
unnecessary explicit casts or {{#pragma}}s (for example 
https://github.com/apache/arrow/pull/6820 ).

Perhaps flag ordering is significant? In any case, since we have conversion 
warnings disabled for other compilers, we should ensure they are completely 
disabled for MSVC as well.




