[jira] [Updated] (ARROW-5448) [CI] MinGW build failures on AppVeyor

2019-05-30 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-5448:
--
Labels: pull-request-available  (was: )

> [CI] MinGW build failures on AppVeyor
> -
>
> Key: ARROW-5448
> URL: https://issues.apache.org/jira/browse/ARROW-5448
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Continuous Integration
>Reporter: Antoine Pitrou
>Assignee: Kouhei Sutou
>Priority: Blocker
>  Labels: pull-request-available
>
> Apparently the Numpy package is broken. See 
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/24922425/job/9yoq08uepk5p6dwb
> {code}
> -- Found PythonLibs: C:/msys64/mingw32/lib/libpython3.7m.dll.a
> CMake Error at cmake_modules/FindNumPy.cmake:62 (message):
>   NumPy import failure:
>   Traceback (most recent call last):
> File 
> "C:/msys64/mingw32/lib/python3.7/site-packages\numpy\core\__init__.py", line 
> 40, in <module>
>   from . import multiarray
> File 
> "C:/msys64/mingw32/lib/python3.7/site-packages\numpy\core\multiarray.py", 
> line 12, in <module>
>   from . import overrides
> File 
> "C:/msys64/mingw32/lib/python3.7/site-packages\numpy\core\overrides.py", line 
> 6, in <module>
>   from numpy.core._multiarray_umath import (
>   ImportError: DLL load failed: The specified module could not be found.
>   
> {code}
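A minimal sketch of the kind of probe FindNumPy.cmake performs (the use of
subprocess and the bare "python" interpreter name are illustrative assumptions;
the CMake module shells out to Python in a similar way):

{code}
import subprocess

# Ask the interpreter to import NumPy, as FindNumPy.cmake does; on the broken
# AppVeyor image the import itself raises the DLL-load error quoted above.
result = subprocess.run(
    ["python", "-c", "import numpy; print(numpy.get_include())"],
    capture_output=True, text=True,
)
if result.returncode != 0:
    print(result.stderr)  # surfaces the ImportError traceback
{code}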



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1957) [Python] Write nanosecond timestamps using new NANO LogicalType Parquet unit

2019-05-30 Thread TP Boudreau (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852680#comment-16852680
 ] 

TP Boudreau commented on ARROW-1957:


Yes, thanks for assigning it.

> [Python] Write nanosecond timestamps using new NANO LogicalType Parquet unit
> 
>
> Key: ARROW-1957
> URL: https://issues.apache.org/jira/browse/ARROW-1957
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.8.0
> Environment: Python 3.6.4.  Mac OSX and CentOS Linux release 
> 7.3.1611.  Pandas 0.21.1 .
>Reporter: Jordan Samuels
>Assignee: TP Boudreau
>Priority: Minor
>  Labels: parquet
> Fix For: 0.14.0
>
>
> The following code
> {code}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> n=3
> df = pd.DataFrame({'x': range(n)}, index=pd.DatetimeIndex(start='2017-01-01', 
> freq='1n', periods=n))
> pq.write_table(pa.Table.from_pandas(df), '/tmp/t.parquet'){code}
> results in:
> {{ArrowInvalid: Casting from timestamp[ns] to timestamp[us] would lose data: 
> 14832288001}}
> The desired effect is that we can save nanosecond resolution without losing 
> precision (e.g. conversion to ms).  Note that if {{freq='1u'}} is used, the 
> code runs properly.
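Until the NANO LogicalType is wired up, two workarounds are possible with the
existing {{write_table}} options (a sketch mirroring the report above; option
names as in current pyarrow):

{code}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

n = 3
df = pd.DataFrame({'x': range(n)},
                  index=pd.DatetimeIndex(start='2017-01-01', freq='1n', periods=n))
table = pa.Table.from_pandas(df)

# Lossy but explicit: truncate nanoseconds to microseconds.
pq.write_table(table, '/tmp/t_us.parquet',
               coerce_timestamps='us', allow_truncated_timestamps=True)

# Lossless today: store nanoseconds via the deprecated INT96 encoding.
pq.write_table(table, '/tmp/t_int96.parquet',
               use_deprecated_int96_timestamps=True)
{code}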



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-1837) [Java] Unable to read unsigned integers outside signed range for bit width in integration tests

2019-05-30 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield reassigned ARROW-1837:
--

Assignee: Micah Kornfield

> [Java] Unable to read unsigned integers outside signed range for bit width in 
> integration tests
> ---
>
> Key: ARROW-1837
> URL: https://issues.apache.org/jira/browse/ARROW-1837
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Wes McKinney
>Assignee: Micah Kornfield
>Priority: Blocker
>  Labels: columnar-format-1.0
> Fix For: 0.14.0
>
> Attachments: generated_primitive.json
>
>
> I believe this was introduced recently (perhaps in the refactors); a problem 
> where the integration tests weren't being properly run hid the error from us.
> see https://github.com/apache/arrow/pull/1294#issuecomment-345553066



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-5429) [Java] Provide alternative buffer allocation policy

2019-05-30 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-5429.

   Resolution: Fixed
Fix Version/s: 0.14.0

Issue resolved by pull request 4400
[https://github.com/apache/arrow/pull/4400]

> [Java] Provide alternative buffer allocation policy
> ---
>
> Key: ARROW-5429
> URL: https://issues.apache.org/jira/browse/ARROW-5429
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> The current buffer allocation policy works like this:
>  * If the requested buffer size is greater than or equal to the chunk size, 
> the buffer size will be as is.
>  * If the requested size is within the chunk size, the buffer size will be 
> rounded to the next power of 2.
> This policy can lead to waste of memory in some cases. For example, if we 
> request a buffer of size 10MB, Arrow will round the buffer size to 16 MB. If 
> we only need 10 MB, this will lead to a waste of (16 - 10) / 10 = 60% of 
> memory.
> So in this proposal, we provide another policy: the rounded buffer size must 
> be a multiple of some memory unit, like (32 KB). This policy has two benefits:
>  # The wasted memory cannot exceed one memory unit (32 KB), which is much 
> smaller than the power-of-two policy.
>  # This is the memory allocation policy adopted by some computation engines 
> (e.g. Apache Flink). 
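To make the arithmetic concrete, a sketch of the two rounding policies in plain
Python (illustrative only, not the Java allocator code):

{code}
def round_pow2(size):
    # Current policy within the chunk size: round up to the next power of 2.
    n = 1
    while n < size:
        n <<= 1
    return n

def round_unit(size, unit=32 * 1024):
    # Proposed policy: round up to a multiple of a memory unit (32 KB here).
    return -(-size // unit) * unit  # ceiling division

request = 10 * 1024 * 1024           # a 10 MB request
print(round_pow2(request))           # 16 MB: (16 - 10) / 10 = 60% overhead
print(round_unit(request))           # 10 MB exactly; waste is bounded by 32 KB
{code}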



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-5420) [Java] Implement or remove getCurrentSizeInBytes in VariableWidthVector

2019-05-30 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-5420.

   Resolution: Fixed
Fix Version/s: 0.14.0

Issue resolved by pull request 4390
[https://github.com/apache/arrow/pull/4390]

> [Java] Implement or remove getCurrentSizeInBytes in VariableWidthVector
> ---
>
> Key: ARROW-5420
> URL: https://issues.apache.org/jira/browse/ARROW-5420
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> Now VariableWidthVector#getCurrentSizeInBytes doesn't seem to have been 
> implemented. We should implement it or just remove it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5458) Apache Arrow parallel CRC32c computation optimization

2019-05-30 Thread Yuqi Gu (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852610#comment-16852610
 ] 

Yuqi Gu commented on ARROW-5458:


PR: https://github.com/apache/arrow/pull/4427

> Apache Arrow parallel CRC32c computation optimization
> -
>
> Key: ARROW-5458
> URL: https://issues.apache.org/jira/browse/ARROW-5458
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yuqi Gu
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> ARMv8 defines the VMULL/PMULL crypto instructions.
> This patch optimizes the CRC32c calculation with these instructions when
> available, rather than with the original linear CRC instructions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5458) Apache Arrow parallel CRC32c computation optimization

2019-05-30 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-5458:
--
Labels: pull-request-available  (was: )

> Apache Arrow parallel CRC32c computation optimization
> -
>
> Key: ARROW-5458
> URL: https://issues.apache.org/jira/browse/ARROW-5458
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yuqi Gu
>Priority: Minor
>  Labels: pull-request-available
>
> ARMv8 defines the VMULL/PMULL crypto instructions.
> This patch optimizes the CRC32c calculation with these instructions when
> available, rather than with the original linear CRC instructions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5458) Apache Arrow parallel CRC32c computation optimization

2019-05-30 Thread Yuqi Gu (JIRA)
Yuqi Gu created ARROW-5458:
--

 Summary: Apache Arrow parallel CRC32c computation optimization
 Key: ARROW-5458
 URL: https://issues.apache.org/jira/browse/ARROW-5458
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Yuqi Gu


ARMv8 defines the VMULL/PMULL crypto instructions.
This patch optimizes the CRC32c calculation with these instructions when
available, rather than with the original linear CRC instructions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5457) [GLib][Plasma] Environment variable name for test is wrong

2019-05-30 Thread Kouhei Sutou (JIRA)
Kouhei Sutou created ARROW-5457:
---

 Summary: [GLib][Plasma] Environment variable name for test is wrong
 Key: ARROW-5457
 URL: https://issues.apache.org/jira/browse/ARROW-5457
 Project: Apache Arrow
  Issue Type: Bug
  Components: GLib
Affects Versions: 0.13.0
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5457) [GLib][Plasma] Environment variable name for test is wrong

2019-05-30 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-5457:
--
Labels: pull-request-available  (was: )

> [GLib][Plasma] Environment variable name for test is wrong
> --
>
> Key: ARROW-5457
> URL: https://issues.apache.org/jira/browse/ARROW-5457
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: GLib
>Affects Versions: 0.13.0
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5456) [GLib][Plasma] Installed plasma-glib may be used on building document

2019-05-30 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-5456:
--
Labels: pull-request-available  (was: )

> [GLib][Plasma] Installed plasma-glib may be used on building document
> -
>
> Key: ARROW-5456
> URL: https://issues.apache.org/jira/browse/ARROW-5456
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: GLib
>Affects Versions: 0.13.0
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5456) [GLib][Plasma] Installed plasma-glib may be used on building document

2019-05-30 Thread Kouhei Sutou (JIRA)
Kouhei Sutou created ARROW-5456:
---

 Summary: [GLib][Plasma] Installed plasma-glib may be used on 
building document
 Key: ARROW-5456
 URL: https://issues.apache.org/jira/browse/ARROW-5456
 Project: Apache Arrow
  Issue Type: Bug
  Components: GLib
Affects Versions: 0.13.0
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5073) [C++] Build toolchain support for libcurl

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5073:

Fix Version/s: (was: 0.14.0)
   0.15.0

> [C++] Build toolchain support for libcurl
> -
>
> Key: ARROW-5073
> URL: https://issues.apache.org/jira/browse/ARROW-5073
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: filesystem
> Fix For: 0.15.0
>
>
> libcurl can be used in a number of different situations (e.g. TensorFlow uses 
> it for GCS interactions 
> https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/platform/cloud/gcs_file_system.cc)
>  so this will likely be required once we begin to tackle that problem



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5344) [C++] Use ArrayDataVisitor in implementation of dictionary unpacking in compute/kernels/cast.cc

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5344:

Fix Version/s: (was: 0.14.0)

> [C++] Use ArrayDataVisitor in implementation of dictionary unpacking in 
> compute/kernels/cast.cc
> ---
>
> Key: ARROW-5344
> URL: https://issues.apache.org/jira/browse/ARROW-5344
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> Follow-up to code review from ARROW-3144



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-5334) [C++] Add "Type" to names of arrow::Integer, arrow::FloatingPoint classes for consistency

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-5334:
---

Assignee: Wes McKinney

> [C++] Add "Type" to names of arrow::Integer, arrow::FloatingPoint classes for 
> consistency
> -
>
> Key: ARROW-5334
> URL: https://issues.apache.org/jira/browse/ARROW-5334
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.14.0
>
>
> These intermediate classes used for template metaprogramming (in particular, 
> {{std::is_base_of}}) have names inconsistent with the rest of the data types. 
> For clarity, I think we should add "Type" to these class names and others 
> like them.
> Please do after ARROW-3144



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2446) [C++] SliceBuffer on CudaBuffer should return CudaBuffer

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2446:

Fix Version/s: (was: 0.14.0)

> [C++] SliceBuffer on CudaBuffer should return CudaBuffer
> 
>
> Key: ARROW-2446
> URL: https://issues.apache.org/jira/browse/ARROW-2446
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, GPU
>Affects Versions: 0.9.0
>Reporter: Antoine Pitrou
>Priority: Major
>
> Currently {{SliceBuffer}} on a {{CudaBuffer}} returns a plain {{Buffer}} 
> instance, which is dangerous for unsuspecting consumers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4845) [C++] Compiler warnings on Windows MingW64

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4845:

Summary: [C++] Compiler warnings on Windows MingW64  (was: Compiler 
warnings on Windows)

> [C++] Compiler warnings on Windows MingW64
> --
>
> Key: ARROW-4845
> URL: https://issues.apache.org/jira/browse/ARROW-4845
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 0.12.1
>Reporter: Jeroen
>Priority: Major
> Fix For: 0.14.0
>
>
> I am seeing the warnings below when compiling the R bindings on Windows. Most 
> of these seem easy to fix (comparing int with size_t or int32 with int64).
> {code}
> array.cpp: In function 'Rcpp::LogicalVector Array__Mask(const 
> std::shared_ptr&)':
> array.cpp:102:24: warning: comparison of integer expressions of different 
> signedness: 'size_t' {aka 'long long unsigned int'} and 'int64_t' {aka 'long 
> long int'} [-Wsign-compare]
>for (size_t i = 0; i < array->length(); i++, bitmap_reader.Next()) {
>   ~~^
> /mingw64/bin/g++  -std=gnu++11 -I"C:/PROGRA~1/R/R-testing/include" -DNDEBUG 
> -DARROW_STATIC -I"C:/R/library/Rcpp/include"-O2 -Wall  -mtune=generic 
> -c array__to_vector.cpp -o array__to_vector.o
> array__to_vector.cpp: In member function 'virtual arrow::Status 
> arrow::r::Converter_Boolean::Ingest_some_nulls(SEXP, const 
> std::shared_ptr&, R_xlen_t, R_xlen_t) const':
> array__to_vector.cpp:254:28: warning: comparison of integer expressions of 
> different signedness: 'size_t' {aka 'long long unsigned int'} and 'R_xlen_t' 
> {aka 'long long int'} [-Wsign-compare]
>for (size_t i = 0; i < n; i++, data_reader.Next(), null_reader.Next(), 
> ++p_data) {
>   ~~^~~
> array__to_vector.cpp:258:28: warning: comparison of integer expressions of 
> different signedness: 'size_t' {aka 'long long unsigned int'} and 'R_xlen_t' 
> {aka 'long long int'} [-Wsign-compare]
>for (size_t i = 0; i < n; i++, data_reader.Next(), ++p_data) {
>   ~~^~~
> array__to_vector.cpp: In member function 'virtual arrow::Status 
> arrow::r::Converter_Decimal::Ingest_some_nulls(SEXP, const 
> std::shared_ptr&, R_xlen_t, R_xlen_t) const':
> array__to_vector.cpp:473:28: warning: comparison of integer expressions of 
> different signedness: 'size_t' {aka 'long long unsigned int'} and 'R_xlen_t' 
> {aka 'long long int'} [-Wsign-compare]
>for (size_t i = 0; i < n; i++, bitmap_reader.Next(), ++p_data) {
>   ~~^~~
> array__to_vector.cpp:478:28: warning: comparison of integer expressions of 
> different signedness: 'size_t' {aka 'long long unsigned int'} and 'R_xlen_t' 
> {aka 'long long int'} [-Wsign-compare]
>for (size_t i = 0; i < n; i++, ++p_data) {
>   ~~^~~
> array__to_vector.cpp: In member function 'virtual arrow::Status 
> arrow::r::Converter_Int64::Ingest_some_nulls(SEXP, const 
> std::shared_ptr&, R_xlen_t, R_xlen_t) const':
> array__to_vector.cpp:515:28: warning: comparison of integer expressions of 
> different signedness: 'size_t' {aka 'long long unsigned int'} and 'R_xlen_t' 
> {aka 'long long int'} [-Wsign-compare]
>for (size_t i = 0; i < n; i++, bitmap_reader.Next(), ++p_data) {
>   ~~^~~
> array__to_vector.cpp: In instantiation of 'arrow::Status 
> arrow::r::SomeNull_Ingest(SEXP, R_xlen_t, R_xlen_t, const array_value_type*, 
> const std::shared_ptr&, Lambda) [with int RTYPE = 14; 
> array_value_type = long long int; Lambda = 
> arrow::r::Converter_Date64::Ingest_some_nulls(SEXP, const 
> std::shared_ptr&, R_xlen_t, R_xlen_t) const::; 
> SEXP = SEXPREC*; R_xlen_t = long long int]':
> array__to_vector.cpp:366:77:   required from here
> array__to_vector.cpp:116:26: warning: comparison of integer expressions of 
> different signedness: 'size_t' {aka 'long long unsigned int'} and 'R_xlen_t' 
> {aka 'long long int'} [-Wsign-compare]
>  for (size_t i = 0; i < n; i++, bitmap_reader.Next(), ++p_data, 
> ++p_values) {
> ~~^~~
> array__to_vector.cpp: In instantiation of 'arrow::Status 
> arrow::r::SomeNull_Ingest(SEXP, R_xlen_t, R_xlen_t, const array_value_type*, 
> const std::shared_ptr&, Lambda) [with int RTYPE = 13; 
> array_value_type = unsigned char; Lambda = 
> arrow::r::Converter_Dictionary::Ingest_some_nulls_Impl(SEXP, const 
> std::shared_ptr&, R_xlen_t, R_xlen_t) const [with Type = 
> arrow::UInt8Type; SEXP = SEXPREC*; R_xlen_t = long long 
> int]::; SEXP = SEXPREC*; R_xlen_t = long long int]':
> array__to_vector.cpp:341:47:   required from 'arrow::Status 
> arrow::r::Converter_Dictionary::Ingest_some_nulls_Impl(SEXP, const 
> std::shared_ptr&, R_xlen_t, R_xlen_t) const [with Type = 
> arrow::UInt8Type; 

[jira] [Updated] (ARROW-4838) [C++] Implement safe Make constructor

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4838:

Fix Version/s: (was: 0.14.0)

> [C++] Implement safe Make constructor
> -
>
> Key: ARROW-4838
> URL: https://issues.apache.org/jira/browse/ARROW-4838
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Francois Saint-Jacques
>Priority: Major
>
> The following classes need validating constructors:
> * ArrayData
> * ChunkedArray
> * RecordBatch
> * Column
> * Table



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-5448) [CI] MinGW build failures on AppVeyor

2019-05-30 Thread Kouhei Sutou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou reassigned ARROW-5448:
---

Assignee: Kouhei Sutou

> [CI] MinGW build failures on AppVeyor
> -
>
> Key: ARROW-5448
> URL: https://issues.apache.org/jira/browse/ARROW-5448
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Continuous Integration
>Reporter: Antoine Pitrou
>Assignee: Kouhei Sutou
>Priority: Blocker
>
> Apparently the Numpy package is broken. See 
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/24922425/job/9yoq08uepk5p6dwb
> {code}
> -- Found PythonLibs: C:/msys64/mingw32/lib/libpython3.7m.dll.a
> CMake Error at cmake_modules/FindNumPy.cmake:62 (message):
>   NumPy import failure:
>   Traceback (most recent call last):
> File 
> "C:/msys64/mingw32/lib/python3.7/site-packages\numpy\core\__init__.py", line 
> 40, in <module>
>   from . import multiarray
> File 
> "C:/msys64/mingw32/lib/python3.7/site-packages\numpy\core\multiarray.py", 
> line 12, in <module>
>   from . import overrides
> File 
> "C:/msys64/mingw32/lib/python3.7/site-packages\numpy\core\overrides.py", line 
> 6, in <module>
>   from numpy.core._multiarray_umath import (
>   ImportError: DLL load failed: The specified module could not be found.
>   
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4752) [Rust] Add explicit SIMD vectorization for the divide kernel

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4752:

Fix Version/s: (was: 0.14.0)

> [Rust] Add explicit SIMD vectorization for the divide kernel
> 
>
> Key: ARROW-4752
> URL: https://issues.apache.org/jira/browse/ARROW-4752
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Paddy Horan
>Assignee: Paddy Horan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4701) [C++] Add JSON chunker benchmarks

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4701:

Fix Version/s: (was: 0.14.0)

> [C++] Add JSON chunker benchmarks
> -
>
> Key: ARROW-4701
> URL: https://issues.apache.org/jira/browse/ARROW-4701
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Benjamin Kietzman
>Assignee: Benjamin Kietzman
>Priority: Minor
>
> The JSON chunker is not currently benchmarked or tested, but it is a 
> necessary component of a multithreaded reader.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4757) [C++] Nested chunked array support

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4757:

Fix Version/s: (was: 0.14.0)

> [C++] Nested chunked array support
> --
>
> Key: ARROW-4757
> URL: https://issues.apache.org/jira/browse/ARROW-4757
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Philipp Moritz
>Priority: Major
>
> Dear all,
> I'm currently trying to lift the 2GB limit on the python serialization. For 
> this, I implemented a chunked union builder to split the array into smaller 
> arrays.
> However, some of the children of the union array can be ListArrays, which can 
> themselves contain UnionArrays which can contain ListArrays etc. I'm at a bit 
> of a loss how to handle this. In principle I'd like to chunk the children 
> too. However, currently UnionArrays can only have children of type Array, and 
> there is no way to treat a chunked array (which is a vector of Arrays) as an 
> Array to store it as a child of a UnionArray. Any ideas how to best support 
> this use case?
> -- Philipp.
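The mismatch is visible from Python: a ChunkedArray is a sequence of Arrays
rather than an Array, so it cannot stand in where a child Array is required
(a small illustration, not the serialization code itself):

{code}
import pyarrow as pa

chunked = pa.chunked_array([[1, 2], [3]])
print(chunked.num_chunks)             # 2
print(isinstance(chunked, pa.Array))  # False: unusable as a UnionArray child
{code}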



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4709) [C++] Optimize for ordered JSON fields

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4709:

Fix Version/s: (was: 0.14.0)

> [C++] Optimize for ordered JSON fields
> --
>
> Key: ARROW-4709
> URL: https://issues.apache.org/jira/browse/ARROW-4709
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Benjamin Kietzman
>Assignee: Benjamin Kietzman
>Priority: Minor
>
> Fields appear consistently ordered in most JSON data in the wild, but the 
> JSON parser currently looks fields up in a hash table. The ordering can 
> probably be exploited to yield better performance when looking up field 
> indices
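A sketch of the idea in plain Python (illustrative only, not the parser code):
when fields arrive in a consistent order, trying the next expected index first
avoids the hash lookup on the fast path:

{code}
class FieldIndex:
    def __init__(self, names):
        self.names = list(names)
        self.by_name = {n: i for i, n in enumerate(names)}
        self.next_guess = 0

    def lookup(self, name):
        i = self.next_guess
        if i < len(self.names) and self.names[i] == name:
            self.next_guess = i + 1  # fast path: the expected order held
            return i
        i = self.by_name[name]       # slow path: fall back to the hash table
        self.next_guess = i + 1
        return i

index = FieldIndex(['a', 'b', 'c'])
print([index.lookup(n) for n in ['a', 'b', 'c']])  # [0, 1, 2], no hash lookups
{code}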



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4695) [JS] Tests timing out on Travis

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4695:

Fix Version/s: (was: 0.14.0)

> [JS] Tests timing out on Travis
> ---
>
> Key: ARROW-4695
> URL: https://issues.apache.org/jira/browse/ARROW-4695
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, JavaScript
>Affects Versions: JS-0.4.0
>Reporter: Brian Hulette
>Priority: Major
>  Labels: ci-failure, travis-ci
>
> Example build: https://travis-ci.org/apache/arrow/jobs/498967250
> JS tests sometimes fail with the following message:
> {noformat}
> > apache-arrow@ test /home/travis/build/apache/arrow/js
> > NODE_NO_WARNINGS=1 gulp test
> [22:14:01] Using gulpfile ~/build/apache/arrow/js/gulpfile.js
> [22:14:01] Starting 'test'...
> [22:14:01] Starting 'test:ts'...
> [22:14:49] Finished 'test:ts' after 47 s
> [22:14:49] Starting 'test:src'...
> [22:15:27] Finished 'test:src' after 38 s
> [22:15:27] Starting 'test:apache-arrow'...
> No output has been received in the last 10m0s, this potentially indicates a 
> stalled build or something wrong with the build itself.
> Check the details on how to adjust your build configuration on: 
> https://docs.travis-ci.com/user/common-build-problems/#Build-times-out-because-no-output-was-received
> The build has been terminated
> {noformat}
> I thought maybe we were just running up against some time limit, but that 
> particular build was terminated at 22:25:27, exactly ten minutes after the 
> last output, at 22:15:27. So it does seem like the build is somehow stalling.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4668) [C++] Support GCP BigQuery Storage API

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4668:

Fix Version/s: (was: 0.14.0)
   0.15.0

> [C++] Support GCP BigQuery Storage API
> --
>
> Key: ARROW-4668
> URL: https://issues.apache.org/jira/browse/ARROW-4668
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: filesystem
> Fix For: 0.15.0
>
>
> Docs: [https://cloud.google.com/bigquery/docs/reference/storage/] 
> Need to investigate the best way to do this; maybe just see if we can build 
> our client on GCP (once a protobuf definition is published to 
> https://github.com/googleapis/googleapis/tree/master/google)?
>  
> This will serve as a parent issue, and sub-issues will be added for subtasks 
> if necessary.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4677) [Python] serialization does not consider ndarray endianness

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4677:

Fix Version/s: (was: 0.14.0)

> [Python] serialization does not consider ndarray endianness
> ---
>
> Key: ARROW-4677
> URL: https://issues.apache.org/jira/browse/ARROW-4677
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.12.1
> Environment: * pyarrow 0.12.1
> * numpy 1.16.1
> * Python 3.7.0
> * Intel Core i7-7820HQ
> * (macOS 10.13.6)
>Reporter: Gabe Joseph
>Priority: Minor
>
> {{pa.serialize}} does not appear to properly encode the endianness of 
> multi-byte data:
> {code}
> # roundtrip.py 
> import numpy as np
> import pyarrow as pa
> arr = np.array([1], dtype=np.dtype('>i2'))
> buf = pa.serialize(arr).to_buffer()
> result = pa.deserialize(buf)
> print(f"Original: {arr.dtype.str}, deserialized: {result.dtype.str}")
> np.testing.assert_array_equal(arr, result)
> {code}
> {code}
> $ pipenv run python roundtrip.py
> Original: >i2, deserialized: <i2
> Traceback (most recent call last):
>   File "roundtrip.py", line 10, in 
> np.testing.assert_array_equal(arr, result)
>   File 
> "/Users/gabejoseph/.local/share/virtualenvs/arrow-roundtrip-1xVSuBtp/lib/python3.7/site-packages/numpy/testing/_private/utils.py",
>  line 896, in assert_array_equal
> verbose=verbose, header='Arrays are not equal')
>   File 
> "/Users/gabejoseph/.local/share/virtualenvs/arrow-roundtrip-1xVSuBtp/lib/python3.7/site-packages/numpy/testing/_private/utils.py",
>  line 819, in assert_array_compare
> raise AssertionError(msg)
> AssertionError: 
> Arrays are not equal
> Mismatch: 100%
> Max absolute difference: 255
> Max relative difference: 0.99609375
>  x: array([1], dtype=int16)
>  y: array([256], dtype=int16)
> {code}
> The data of the deserialized array is identical (big-endian), but the dtype 
> Arrow assigns to it doesn't reflect its endianness (presumably uses the 
> system endianness, which is little).
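Until this is fixed, a workaround (a sketch) is to byte-swap to native order
before serializing, so that the dtype Arrow assigns matches the data:

{code}
import numpy as np
import pyarrow as pa

arr = np.array([1], dtype=np.dtype('>i2'))

# Convert to native byte order; values are preserved, only the layout changes.
native = arr.astype(arr.dtype.newbyteorder('='))
buf = pa.serialize(native).to_buffer()
result = pa.deserialize(buf)
np.testing.assert_array_equal(arr, result)  # passes: the values round-trip
{code}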



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4633) [Python] ParquetFile.read(use_threads=False) creates ThreadPool anyway

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4633:

Fix Version/s: (was: 0.14.0)

> [Python] ParquetFile.read(use_threads=False) creates ThreadPool anyway
> --
>
> Key: ARROW-4633
> URL: https://issues.apache.org/jira/browse/ARROW-4633
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.11.1, 0.12.0
> Environment: Linux, Python 3.7.1, pyarrow.__version__ = 0.12.0
>Reporter: Taylor Johnson
>Priority: Minor
>  Labels: newbie, parquet
>
> The following code seems to suggest that ParquetFile.read(use_threads=False) 
> still creates a ThreadPool.  This is observed in 
> ParquetFile.read_row_group(use_threads=False) as well. 
> This does not appear to be a problem in 
> pyarrow.Table.to_pandas(use_threads=False).
> I've tried tracing the error.  Starting in python/pyarrow/parquet.py, both 
> ParquetReader.read_all() and ParquetReader.read_row_group() pass the 
> use_threads input along to self.reader which is a ParquetReader imported from 
> _parquet.pyx
> Following the calls into python/pyarrow/_parquet.pyx, we see that 
> ParquetReader.read_all() and ParquetReader.read_row_group() have the 
> following code which seems a bit suspicious
> {quote}if use_threads:
>     self.set_use_threads(use_threads)
> {quote}
> Why not just always call self.set_use_threads(use_threads)?
> The ParquetReader.set_use_threads simply calls 
> self.reader.get().set_use_threads(use_threads).  This self.reader is assigned 
> as unique_ptr[FileReader].  I think this points to 
> cpp/src/parquet/arrow/reader.cc, but I'm not sure about that.  The 
> FileReader::Impl::ReadRowGroup logic looks ok, as a call to 
> ::arrow::internal::GetCpuThreadPool() is only called if use_threads is True.  
> The same is true for ReadTable.
> So when is the ThreadPool getting created?
> Example code:
> --
> {quote}import pandas as pd
> import psutil
> import pyarrow as pa
> import pyarrow.parquet as pq
> use_threads=False
> p=psutil.Process()
> print('Starting with {} threads'.format(p.num_threads()))
> df = pd.DataFrame({'x':[0]})
> table = pa.Table.from_pandas(df)
> print('After table creation, {} threads'.format(p.num_threads()))
> df = table.to_pandas(use_threads=use_threads)
> print('table.to_pandas(use_threads={}), {} threads'.format(use_threads, 
> p.num_threads()))
> writer = pq.ParquetWriter('tmp.parquet', table.schema)
> writer.write_table(table)
> writer.close()
> print('After writing parquet file, {} threads'.format(p.num_threads()))
> pf = pq.ParquetFile('tmp.parquet')
> print('After ParquetFile, {} threads'.format(p.num_threads()))
> df = pf.read(use_threads=use_threads).to_pandas()
> print('After pf.read(use_threads={}), {} threads'.format(use_threads, 
> p.num_threads()))
> {quote}
> ---
> $ python pyarrow_test.py
> Starting with 1 threads
> After table creation, 1 threads
> table.to_pandas(use_threads=False), 1 threads
> After writing parquet file, 1 threads
> After ParquetFile, 1 threads
> After pf.read(use_threads=False), 5 threads
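A toy model of why the quoted conditional matters (plain Python, not the actual
Cython; the enabled-by-default state is an assumption consistent with the
thread counts above):

{code}
class ToyReader:
    def __init__(self):
        self.threads_enabled = True   # suppose the underlying default is on

    def read(self, use_threads):
        if use_threads:               # the pattern from _parquet.pyx
            self.set_use_threads(use_threads)
        # with the conditional, use_threads=False never reaches the setter

    def set_use_threads(self, flag):
        self.threads_enabled = flag

r = ToyReader()
r.read(use_threads=False)
print(r.threads_enabled)  # True: the False was silently dropped
{code}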



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4649) [C++/CI/R] Add nightly job that builds `brew install apache-arrow --HEAD`

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4649:

Fix Version/s: (was: 0.14.0)
   0.15.0

> [C++/CI/R] Add nightly job that builds `brew install apache-arrow --HEAD`
> -
>
> Key: ARROW-4649
> URL: https://issues.apache.org/jira/browse/ARROW-4649
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Continuous Integration, R
>Reporter: Uwe L. Korn
>Priority: Major
>  Labels: nightly, travis-ci
> Fix For: 0.15.0
>
>
> Now that we have an Arrow homebrew formula again, and since we may want it as 
> a simple setup for R Arrow users, we should add a nightly crossbow task that 
> checks whether it still builds fine.
> To implement this, one should write a new travis.yml like 
> [https://github.com/apache/arrow/blob/master/dev/tasks/python-wheels/travis.osx.yml]
>  that calls {{brew install apache-arrow --HEAD}}. This task should then be 
> added to https://github.com/apache/arrow/blob/master/dev/tasks/tests.yml so 
> that it is executed as part of the nightly chain.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4649) [C++/CI/R] Add nightly job that builds `brew install apache-arrow --HEAD`

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4649:

Labels: nightly travis-ci  (was: travis-ci)

> [C++/CI/R] Add nightly job that builds `brew install apache-arrow --HEAD`
> -
>
> Key: ARROW-4649
> URL: https://issues.apache.org/jira/browse/ARROW-4649
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Continuous Integration, R
>Reporter: Uwe L. Korn
>Priority: Major
>  Labels: nightly, travis-ci
> Fix For: 0.14.0
>
>
> Now that we have an Arrow homebrew formula again, and since we may want it as 
> a simple setup for R Arrow users, we should add a nightly crossbow task that 
> checks whether it still builds fine.
> To implement this, one should write a new travis.yml like 
> [https://github.com/apache/arrow/blob/master/dev/tasks/python-wheels/travis.osx.yml]
>  that calls {{brew install apache-arrow --HEAD}}. This task should then be 
> added to https://github.com/apache/arrow/blob/master/dev/tasks/tests.yml so 
> that it is executed as part of the nightly chain.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4661) [C++] Consolidate random string generators for use in benchmarks and unittests

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4661:

Fix Version/s: (was: 0.14.0)

> [C++] Consolidate random string generators for use in benchmarks and unittests
> --
>
> Key: ARROW-4661
> URL: https://issues.apache.org/jira/browse/ARROW-4661
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Hatem Helal
>Assignee: Hatem Helal
>Priority: Minor
>
> This was discussed in here:
> [https://github.com/apache/arrow/pull/3721]
> For testing/benchmarking dictionary encoding its useful to control the number 
> of repeated values and it would also be good to optionally include null 
> values.  The ability to provide a custom alphabet would be handy for 
> generating strings with unicode characters.
>  
> Also note that a simple PRNG should be used as the group has observed 
> performance trouble with Mersenne Twister.
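A sketch of the generator's proposed knobs in plain Python (the real utility
would be C++; all names here are illustrative):

{code}
import random

def random_strings(n, alphabet, min_len, max_len,
                   unique_values=None, null_probability=0.0, seed=42):
    # Seeded PRNG; the C++ version should prefer a cheaper engine than
    # Mersenne Twister, per the note above.
    rng = random.Random(seed)

    def one():
        k = rng.randint(min_len, max_len)
        return ''.join(rng.choice(alphabet) for _ in range(k))

    # A small value pool yields repeated values for dictionary-encoding tests.
    pool = [one() for _ in range(unique_values)] if unique_values else None
    out = []
    for _ in range(n):
        if rng.random() < null_probability:
            out.append(None)
        else:
            out.append(rng.choice(pool) if pool is not None else one())
    return out

# Custom alphabet (including a unicode character), repeats, and nulls:
print(random_strings(8, 'abcé', 1, 4, unique_values=2, null_probability=0.25))
{code}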



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4648) [C++/Question] Naming/organizational inconsistencies in cpp codebase

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4648:

Fix Version/s: (was: 0.14.0)
   0.15.0

> [C++/Question] Naming/organizational inconsistencies in cpp codebase
> 
>
> Key: ARROW-4648
> URL: https://issues.apache.org/jira/browse/ARROW-4648
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Krisztian Szucs
>Priority: Major
> Fix For: 0.15.0
>
>
> Even after my eyes got used to the codebase, I still find the naming and/or 
> code organization inconsistent.
> h2. File Formats
> Arrow already supports a couple of file formats, namely parquet, feather, 
> json, csv, orc, but their placement in the codebase is quite odd:
> - parquet: src/parquet
> - feather: src/arrow/ipc/feather
> - orc: src/arrow/adapters/orc
> - csv: src/arrow/csv
> - json: src/arrow/json
> I might misunderstand the purpose of these sources, but I'd expect them to be 
> organized under the same roof.
> h2. Inter-Process-Communication vs. Flight
> From the ipc name, I'd expect it to cover Flight's functionality.
> Flight's placement is a bit odd too: because it has its own codename, it 
> should be placed under cpp/src, like parquet, plasma, or gandiva.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4631) [C++] Implement serial version of sort computational kernel

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4631:

Fix Version/s: (was: 0.14.0)
   0.15.0

> [C++] Implement serial version of sort computational kernel
> ---
>
> Key: ARROW-4631
> URL: https://issues.apache.org/jira/browse/ARROW-4631
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Affects Versions: 0.13.0
>Reporter: Areg Melik-Adamyan
>Assignee: Areg Melik-Adamyan
>Priority: Major
>  Labels: analytics
> Fix For: 0.15.0
>
>
> Implement serial version of sort computational kernel.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4591) [Rust] Add explicit SIMD vectorization for aggregation ops in "array_ops"

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4591:

Fix Version/s: (was: 0.14.0)

> [Rust] Add explicit SIMD vectorization for aggregation ops in "array_ops"
> -
>
> Key: ARROW-4591
> URL: https://issues.apache.org/jira/browse/ARROW-4591
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Paddy Horan
>Assignee: Paddy Horan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4575) [Python] Add Python Flight implementation to integration testing

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4575:

Fix Version/s: (was: 0.14.0)

> [Python] Add Python Flight implementation to integration testing
> 
>
> Key: ARROW-4575
> URL: https://issues.apache.org/jira/browse/ARROW-4575
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: FlightRPC, Integration, Python
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: flight
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4567) [C++] Convert Scalar values to Array values with length 1

2019-05-30 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852594#comment-16852594
 ] 

Wes McKinney commented on ARROW-4567:
-

cc [~fsaintjacques]

> [C++] Convert Scalar values to Array values with length 1
> -
>
> Key: ARROW-4567
> URL: https://issues.apache.org/jira/browse/ARROW-4567
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.14.0
>
>
> A common approach to performing operations on both scalar and array values is 
> to treat a Scalar as an array of length 1. For example, we cannot currently 
> use our Cast kernels to cast a Scalar. It would be senseless to create 
> separate kernel implementations specialized for a single value, and much 
> easier to promote a scalar to an Array, execute the kernel, then unbox the 
> result back into a Scalar
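The promote/execute/unbox pattern, sketched from Python for illustration (the
real work would live in the C++ kernel layer):

{code}
import pyarrow as pa

def cast_scalar(value, from_type, to_type):
    arr = pa.array([value], type=from_type)  # promote scalar to length-1 array
    result = arr.cast(to_type)               # reuse the existing array kernel
    return result[0]                         # unbox the result into a scalar

print(cast_scalar(42, pa.int64(), pa.float64()))  # 42.0
{code}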



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4515) [C++, lint] Use clang-format more efficiently in `check-format` target

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4515:

Fix Version/s: (was: 0.14.0)

> [C++, lint] Use clang-format more efficiently in `check-format` target
> --
>
> Key: ARROW-4515
> URL: https://issues.apache.org/jira/browse/ARROW-4515
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Benjamin Kietzman
>Assignee: Benjamin Kietzman
>Priority: Minor
>
> `clang-format` supports command line option `-output-replacements-xml` which 
> (in the case of no required changes) outputs:
> ```
> <?xml version='1.0'?>
> <replacements xml:space='preserve' incomplete_format='false'>
> </replacements>
> ```
> Using this option during `check-format` instead of using python to compute a 
> diff between formatted and on-disk should speed up that target significantly
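A minimal sketch of the proposed check (the binary name and file path are
assumptions): clang-format emits one <replacement> element per needed edit, so
any occurrence of "<replacement " means the file is not format-clean:

{code}
import subprocess

def needs_formatting(path, clang_format="clang-format"):
    xml = subprocess.check_output(
        [clang_format, "-style=file", "-output-replacements-xml", path],
        text=True,
    )
    # "<replacement " (with the trailing space) never matches the
    # "<replacements>" wrapper element, only actual edits.
    return "<replacement " in xml

print(needs_formatting("src/arrow/array.cc"))
{code}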



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4534) [Rust] Build JSON reader for reading record batches from line-delimited JSON files

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4534:

Fix Version/s: (was: 0.14.0)

> [Rust] Build JSON reader for reading record batches from line-delimited JSON 
> files
> --
>
> Key: ARROW-4534
> URL: https://issues.apache.org/jira/browse/ARROW-4534
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Affects Versions: 0.12.0
>Reporter: Neville Dipale
>Priority: Major
>
> Similar to ARROW-694, this is an umbrella issue for supporting reading JSON 
> line-delimited files in Arrow.
> I have a reference implementation at 
> https://github.com/nevi-me/rust-dataframe/blob/io/json/src/io/json.rs where 
> I'm building a Rust-based dataframe library using Arrow.
> I'd like us to have feature parity with CPP at some point.
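For parity reference, this is how the C++ reader is exposed in Python (a
sketch; the file name is illustrative, and pyarrow.json availability depends
on the pyarrow version):

{code}
import pyarrow.json as pj

# Reads line-delimited JSON (one document per line) into an Arrow table.
table = pj.read_json("events.jsonl")
print(table.schema)
{code}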



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4470) [Python] Pyarrow using considerable more memory when reading partitioned Parquet file

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4470:

Fix Version/s: (was: 0.14.0)
   0.15.0

> [Python] Pyarrow using considerable more memory when reading partitioned 
> Parquet file
> -
>
> Key: ARROW-4470
> URL: https://issues.apache.org/jira/browse/ARROW-4470
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.12.0
>Reporter: Ivan SPM
>Priority: Major
>  Labels: datasets, parquet
> Fix For: 0.15.0
>
>
> Hi,
> I have a partitioned Parquet table in Impala in HDFS, using Hive metastore, 
> with the following structure:
> {{/data/myparquettable/year=2016}}{{/data/myparquettable/year=2016/myfile_1.prt}}
> {{/data/myparquettable/year=2016/myfile_2.prt}}
> {{/data/myparquettable/year=2016/myfile_3.prt}}
> {{/data/myparquettable/year=2017}}
> {{/data/myparquettable/year=2017/myfile_1.prt}}
> {{/data/myparquettable/year=2017/myfile_2.prt}}
> {{/data/myparquettable/year=2017/myfile_3.prt}}
> and so on. I need to work with one partition, so I copied one partition to a 
> local filesystem:
> {{hdfs fs -get /data/myparquettable/year=2017 /local/}}
> so now I have some data on the local disk:
> {{/local/year=2017/myfile_1.prt }}{{/local/year=2017/myfile_2.prt }}
> etc. I tried to read it using Pyarrow:
> {{import pyarrow.parquet as pq}}{{pq.read_parquet('/local/year=2017')}}
> and it starts reading. The problem is that the local Parquet files are around 
> 15GB total, and I blew up my machine memory a couple of times because when 
> reading these files, Pyarrow is using more than 60GB of RAM, and I'm not sure 
> how much it will take because it never finishes. Is this expected? Is there a 
> workaround?
>  
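A possible workaround sketch while this is investigated: stream one row group
at a time instead of materializing the whole partition (paths illustrative):

{code}
import glob
import pyarrow.parquet as pq

for path in glob.glob('/local/year=2017/*.prt'):
    pf = pq.ParquetFile(path)
    for i in range(pf.num_row_groups):
        batch = pf.read_row_group(i)  # only one row group resident at a time
        ...                           # process and release before the next
{code}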



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4479) [Plasma] Add S3 as external store for Plasma

2019-05-30 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852589#comment-16852589
 ] 

Wes McKinney commented on ARROW-4479:
-

What is the status of this project?

> [Plasma] Add S3 as external store for Plasma
> 
>
> Key: ARROW-4479
> URL: https://issues.apache.org/jira/browse/ARROW-4479
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++ - Plasma
>Affects Versions: 0.12.0
>Reporter: Anurag Khandelwal
>Assignee: Anurag Khandelwal
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Adding S3 as an external store will allow objects to be evicted to S3 when 
> Plasma runs out of memory capacity.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4482) [Website] Add blog archive page

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4482:

Fix Version/s: (was: 0.14.0)
   0.15.0

> [Website] Add blog archive page
> ---
>
> Key: ARROW-4482
> URL: https://issues.apache.org/jira/browse/ARROW-4482
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Website
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.15.0
>
>
> There's no easy way to get a bulleted list of all blog posts on the Arrow 
> website. See example archive on my personal blog 
> http://wesmckinney.com/archives.html
> It would be great to have such a generated archive on our website



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4465) [Rust] [DataFusion] Add support for ORDER BY

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4465:

Fix Version/s: (was: 0.14.0)

> [Rust] [DataFusion] Add support for ORDER BY
> 
>
> Key: ARROW-4465
> URL: https://issues.apache.org/jira/browse/ARROW-4465
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Priority: Major
>
> As a user, I would like to be able to specify an ORDER BY clause on my query.
> Work involved:
>  * Add OrderBy to LogicalPlan enum
>  * Write query planner code to translate SQL AST to OrderBy (SQL parser that 
> we use already supports parsing ORDER BY)
>  * Implement SortRelation
> My high level thoughts on implementing the SortRelation:
>  * Create Arrow array of uint32 same size as batch and populate such that 
> each element contains its own index i.e. array will be 0, 1, 2, 3
>  * Find a Rust crate for sorting that allows us to provide our own comparison 
> lambda
>  * Implement the comparison logic (probably can reuse existing execution code 
> - see filter.rs for how it implements comparison expressions)
>  * Use index array to store the result of the sort i.e. no need to rewrite 
> the whole batch, just the index
>  * Rewrite the batch after the sort has completed
> It would also be good to see how Gandiva has implemented this
>  
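The index-array approach from the list above, sketched with numpy for
illustration (the Rust implementation would use Arrow arrays):

{code}
import numpy as np

c1 = np.array([3, 1, 2])             # ORDER BY c1
c2 = np.array(['c', 'a', 'b'])
idx = np.argsort(c1, kind='stable')  # the uint32-style index array: [1, 2, 0]
print(c1[idx])                       # [1 2 3]
print(c2[idx])                       # ['a' 'b' 'c']; the batch is gathered once
{code}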



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4473) [Website] Add instructions to do a test-deploy of Arrow website and fix bugs

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4473:

Fix Version/s: (was: 0.14.0)
   0.15.0

> [Website] Add instructions to do a test-deploy of Arrow website and fix bugs
> 
>
> Key: ARROW-4473
> URL: https://issues.apache.org/jira/browse/ARROW-4473
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Website
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.15.0
>
>
> This will help with testing and proofing the website.
> I have noticed that there are bugs in the website when the baseurl is not a 
> foo.bar.baz, e.g. if you deploy at root foo.bar.baz/test-site many images and 
> links are broken



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4470) [Python] Pyarrow using considerable more memory when reading partitioned Parquet file

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4470:

Labels: datasets parquet  (was: parquet)

> [Python] Pyarrow using considerable more memory when reading partitioned 
> Parquet file
> -
>
> Key: ARROW-4470
> URL: https://issues.apache.org/jira/browse/ARROW-4470
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.12.0
>Reporter: Ivan SPM
>Priority: Major
>  Labels: datasets, parquet
> Fix For: 0.14.0
>
>
> Hi,
> I have a partitioned Parquet table in Impala in HDFS, using Hive metastore, 
> with the following structure:
> {{/data/myparquettable/year=2016}}{{/data/myparquettable/year=2016/myfile_1.prt}}
> {{/data/myparquettable/year=2016/myfile_2.prt}}
> {{/data/myparquettable/year=2016/myfile_3.prt}}
> {{/data/myparquettable/year=2017}}
> {{/data/myparquettable/year=2017/myfile_1.prt}}
> {{/data/myparquettable/year=2017/myfile_2.prt}}
> {{/data/myparquettable/year=2017/myfile_3.prt}}
> and so on. I need to work with one partition, so I copied one partition to a 
> local filesystem:
> {{hdfs fs -get /data/myparquettable/year=2017 /local/}}
> so now I have some data on the local disk:
> {{/local/year=2017/myfile_1.prt }}{{/local/year=2017/myfile_2.prt }}
> etc. I tried to read it using Pyarrow:
> {{import pyarrow.parquet as pq}}{{pq.read_parquet('/local/year=2017')}}
> and it starts reading. The problem is that the local Parquet files are around 
> 15GB total, and I blew up my machine memory a couple of times because when 
> reading these files, Pyarrow is using more than 60GB of RAM, and I'm not sure 
> how much it will take because it never finishes. Is this expected? Is there a 
> workaround?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-4447) [C++] Investigate dynamic linking for libthrift

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-4447.
-
Resolution: Fixed
  Assignee: Uwe L. Korn

Thrift is now dynamically linked

> [C++] Investigate dynamic linking for libthrift
> --
>
> Key: ARROW-4447
> URL: https://issues.apache.org/jira/browse/ARROW-4447
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
> Fix For: 0.14.0
>
>
> We're currently only linking statically against {{libthrift}}. Distributions 
> often prefer dynamic linkage to libraries where possible.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4453) [Python] Create Cython wrappers for SparseTensor

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4453:

Fix Version/s: (was: 0.14.0)

> [Python] Create Cython wrappers for SparseTensor
> 
>
> Key: ARROW-4453
> URL: https://issues.apache.org/jira/browse/ARROW-4453
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Philipp Moritz
>Assignee: Rok Mihevc
>Priority: Minor
>
> We should have cython wrappers for [https://github.com/apache/arrow/pull/2546]
> This is related to support for 
> https://issues.apache.org/jira/browse/ARROW-4223 and 
> https://issues.apache.org/jira/browse/ARROW-4224
> I imagine the code would be similar to 
> https://github.com/apache/arrow/blob/5a502d281545402240e818d5fd97a9aaf36363f2/python/pyarrow/array.pxi#L748



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4439) [C++] Improve FindBrotli.cmake

2019-05-30 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852586#comment-16852586
 ] 

Wes McKinney commented on ARROW-4439:
-

[~rip@gmail.com] is this OK in master now?

> [C++] Improve FindBrotli.cmake
> --
>
> Key: ARROW-4439
> URL: https://issues.apache.org/jira/browse/ARROW-4439
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Renat Valiullin
>Assignee: Renat Valiullin
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4759) [Rust] [DataFusion] It should be possible to share an execution context between threads

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4759:

Fix Version/s: (was: 0.14.0)

> [Rust] [DataFusion] It should be possible to share an execution context 
> between threads
> ---
>
> Key: ARROW-4759
> URL: https://issues.apache.org/jira/browse/ARROW-4759
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust, Rust - DataFusion
>Affects Versions: 0.12.0
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>
> I am working on a PR for this now.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4429) Add git rebase tips to the 'Contributing' page in the developer docs

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4429:

Fix Version/s: (was: 0.14.0)

> Add git rebase tips to the 'Contributing' page in the developer docs
> 
>
> Key: ARROW-4429
> URL: https://issues.apache.org/jira/browse/ARROW-4429
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Documentation
>Reporter: Tanya Schlusser
>Priority: Major
>
> A recent discussion on the listserv (link below) asked about how contributors 
> should handle rebasing. It would be helpful if the tips made it into the 
> developer documentation somehow. I suggest in the ["Contributing to Apache 
> Arrow"|https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow]
>  page—currently a wiki, but hopefully eventually part of the Sphinx docs 
> ARROW-4427.
> Here is the relevant thread:
> [https://lists.apache.org/thread.html/c74d8027184550b8d9041e3f2414b517ffb76ccbc1d5aa4563d364b6@%3Cdev.arrow.apache.org%3E]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-5453) [C++] Just-released cmake-format 0.5.2 breaks the build

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-5453.
-
Resolution: Fixed

Issue resolved by pull request 4423
[https://github.com/apache/arrow/pull/4423]

> [C++] Just-released cmake-format 0.5.2 breaks the build
> ---
>
> Key: ARROW-5453
> URL: https://issues.apache.org/jira/browse/ARROW-5453
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> It seems we should always pin the cmake-format version until the developers 
> stop changing the formatting algorithm



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5455) [Rust] Build broken by 2019-05-30 Rust nightly

2019-05-30 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-5455:
---

 Summary: [Rust] Build broken by 2019-05-30 Rust nightly
 Key: ARROW-5455
 URL: https://issues.apache.org/jira/browse/ARROW-5455
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Reporter: Wes McKinney
 Fix For: 0.14.0


See this failed build:

https://travis-ci.org/apache/arrow/jobs/539477452



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5455) [Rust] Build broken by 2019-05-30 Rust nightly

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5455:

Priority: Blocker  (was: Major)

> [Rust] Build broken by 2019-05-30 Rust nightly
> --
>
> Key: ARROW-5455
> URL: https://issues.apache.org/jira/browse/ARROW-5455
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Wes McKinney
>Priority: Blocker
> Fix For: 0.14.0
>
>
> See this failed build:
> https://travis-ci.org/apache/arrow/jobs/539477452



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4419) [Flight] Deal with body buffers in FlightData

2019-05-30 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852580#comment-16852580
 ] 

Wes McKinney commented on ARROW-4419:
-

[~lidavidm] where does this issue stand?

> [Flight] Deal with body buffers in FlightData
> -
>
> Key: ARROW-4419
> URL: https://issues.apache.org/jira/browse/ARROW-4419
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: FlightRPC
>Reporter: David Li
>Priority: Minor
>  Labels: flight
> Fix For: 0.14.0
>
>
> The Java implementation will fail to decode a schema message if the message 
> also contains (empty) body buffers (see ArrowMessage.asSchema's precondition 
> checks). However, clients using default Protobuf serialization will likely 
> write an empty body buffer by default.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4398) [Python] Add benchmarks for Arrow<>Parquet BYTE_ARRAY serialization (read and write)

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4398:

Fix Version/s: (was: 0.14.0)
   0.15.0

> [Python] Add benchmarks for Arrow<>Parquet BYTE_ARRAY serialization (read and 
> write)
> 
>
> Key: ARROW-4398
> URL: https://issues.apache.org/jira/browse/ARROW-4398
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: parquet
> Fix For: 0.15.0
>
>
> This is follow-on work to PARQUET-1508, so we can monitor the performance of 
> this operation over time



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4369) [Packaging] Release verification script should test linux packages via docker

2019-05-30 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852579#comment-16852579
 ] 

Wes McKinney commented on ARROW-4369:
-

[~kszucs] any thoughts about this for 0.14? We can also postpone

> [Packaging] Release verification script should test linux packages via docker
> -
>
> Key: ARROW-4369
> URL: https://issues.apache.org/jira/browse/ARROW-4369
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging
>Reporter: Krisztian Szucs
>Priority: Major
> Fix For: 0.14.0
>
>
> It shouldn't be too hard to create a verification script which checks the 
> linux packages. This could prevent issues like [ARROW-4368] / 
> [https://github.com/apache/arrow/issues/3476]
> I suggest separating the current verification script into one which verifies 
> the source release artifact and another which verifies the binaries:
>  * checksum and signatures as is right now
>  * install linux packages on multiple distros via docker
> We could test wheels and conda packages as well, but in follow-up PRs.
>  
> cc [~kou]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4409) [C++] Enable arrow::ipc internal JSON reader to read from a file path

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4409:

Fix Version/s: (was: 0.14.0)

> [C++] Enable arrow::ipc internal JSON reader to read from a file path
> -
>
> Key: ARROW-4409
> URL: https://issues.apache.org/jira/browse/ARROW-4409
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Minor
>
> This may make tests easier to write. Currently an input buffer is required, 
> so reading from a file requires some boilerplate



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4343) [C++] Add as complete as possible Ubuntu Trusty / 14.04 build to docker-compose setup

2019-05-30 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852576#comment-16852576
 ] 

Wes McKinney commented on ARROW-4343:
-

What does it mean now that Ubuntu Trusty is no longer an LTS release?

> [C++] Add as complete as possible Ubuntu Trusty / 14.04 build to 
> docker-compose setup
> -
>
> Key: ARROW-4343
> URL: https://issues.apache.org/jira/browse/ARROW-4343
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.14.0
>
>
> Until we formally stop supporting Trusty it would be useful to be able to 
> verify in Docker that builds work there. I still have an Ubuntu 14.04 machine 
> that I use (and I've been filing bugs that I find on it) but not sure for how 
> much longer



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4350) [Python] nested numpy arrays

2019-05-30 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852577#comment-16852577
 ] 

Wes McKinney commented on ARROW-4350:
-

[~jorisvandenbossche] could you take a look and maybe clarify the issue title 
etc.?

> [Python] nested numpy arrays
> 
>
> Key: ARROW-4350
> URL: https://issues.apache.org/jira/browse/ARROW-4350
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.11.1, 0.12.0
>Reporter: yu peng
>Priority: Major
> Fix For: 0.14.0
>
>
> {code:java}
> In [19]: df = pd.DataFrame({'a': [[[1], [2]], [[2], [3]]], 'b': [1, 2]})
> In [20]: df.iloc[0].to_dict()
> Out[20]: {'a': [[1], [2]], 'b': 1}
> In [21]: pa.Table.from_pandas(df).to_pandas().iloc[0].to_dict()
> Out[21]: {'a': array([array([1]), array([2])], dtype=object), 'b': 1}
> In [24]: np.array(df.iloc[0].to_dict()['a']).shape
> Out[24]: (2, 1)
> In [25]: pa.Table.from_pandas(df).to_pandas().iloc[0].to_dict()['a'].shape
> Out[25]: (2,)
> {code}
> The extra level of array nesting does not round-trip as expected. 
>  
> More importantly, this would fail
>  
> {code:java}
> In [108]: df = pd.DataFrame({'a': [[[1, 2],[2, 3]], [[1,2], [2, 3]]], 'b': 
> [[1, 2],[2, 3]]})
> In [109]: df
> Out[109]:
> a b
> 0 [[1, 2], [2, 3]] [1, 2]
> 1 [[1, 2], [2, 3]] [2, 3]
> In [110]: pa.Table.from_pandas(pa.Table.from_pandas(df).to_pandas())
> ---
> ArrowTypeError Traceback (most recent call last)
>  in ()
> > 1 pa.Table.from_pandas(pa.Table.from_pandas(df).to_pandas())
> /Users/pengyu/.pyenv/virtualenvs/starscream/2.7.11/lib/python2.7/site-packages/pyarrow/table.pxi
>  in pyarrow.lib.Table.from_pandas()
> 1215 
> 1216 """
> -> 1217 names, arrays, metadata = pdcompat.dataframe_to_arrays(
> 1218 df,
> 1219 schema=schema,
> /Users/pengyu/.pyenv/virtualenvs/starscream/2.7.11/lib/python2.7/site-packages/pyarrow/pandas_compat.pyc
>  in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
> 379 arrays = [convert_column(c, t)
> 380 for c, t in zip(columns_to_convert,
> --> 381 convert_types)]
> 382 else:
> 383 from concurrent import futures
> /Users/pengyu/.pyenv/virtualenvs/starscream/2.7.11/lib/python2.7/site-packages/pyarrow/pandas_compat.pyc
>  in convert_column(col, ty)
> 374 e.args += ("Conversion failed for column {0!s} with type {1!s}"
> 375 .format(col.name, col.dtype),)
> --> 376 raise e
> 377
> 378 if nthreads == 1:
> ArrowTypeError: ('only size-1 arrays can be converted to Python scalars', 
> 'Conversion failed for column a with type object')
> {code}
>  
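
A possible interim workaround (a sketch, not a confirmed fix; whether it fully 
restores the nested type is an assumption) is to normalize the ndarray cells 
back to plain Python lists before converting again:

{code:python}
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'a': [[[1, 2], [2, 3]], [[1, 2], [2, 3]]],
                   'b': [[1, 2], [2, 3]]})
df_rt = pa.Table.from_pandas(df).to_pandas()

# to_pandas() hands the nested column back as object arrays of ndarrays;
# turning each cell back into a list of lists lets type inference see
# list<list<int64>> again on the second conversion.
df_rt['a'] = df_rt['a'].apply(lambda cell: [list(inner) for inner in cell])
tbl2 = pa.Table.from_pandas(df_rt)
{code}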



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4324) [Python] Array dtype inference incorrect when created from list of mixed numpy scalars

2019-05-30 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852574#comment-16852574
 ] 

Wes McKinney commented on ARROW-4324:
-

[~jorisvandenbossche] could you take a look?

> [Python] Array dtype inference incorrect when created from list of mixed 
> numpy scalars
> --
>
> Key: ARROW-4324
> URL: https://issues.apache.org/jira/browse/ARROW-4324
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.11.1
>Reporter: Keith Kraus
>Priority: Minor
> Fix For: 0.14.0
>
>
> Minimal reproducer:
> {code:python}
> import pyarrow as pa
> import numpy as np
> test_list = [np.dtype('int32').type(10), np.dtype('float32').type(0.5)]
> test_array = pa.array(test_list)
> # Expected
> # test_array
> # 
> # [
> #   10,
> #   0.5
> # ]
> # Got
> # test_array
> # 
> # [
> #   10,
> #   0
> # ]
> {code}
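
Until the inference is fixed, passing an explicit type appears to sidestep the 
problem (a sketch under that assumption):

{code:python}
import numpy as np
import pyarrow as pa

test_list = [np.dtype('int32').type(10), np.dtype('float32').type(0.5)]

# With an explicit target type, both scalars are cast to float64,
# so 0.5 is no longer truncated to 0.
test_array = pa.array(test_list, type=pa.float64())
print(test_array)  # contains 10 and 0.5
{code}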



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4333) [C++] Sketch out design for kernels and "query" execution in compute layer

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4333:

Fix Version/s: (was: 0.14.0)

> [C++] Sketch out design for kernels and "query" execution in compute layer
> --
>
> Key: ARROW-4333
> URL: https://issues.apache.org/jira/browse/ARROW-4333
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Micah Kornfield
>Priority: Major
>  Labels: analytics
>
> It would be good to formalize the design of kernels and the controlling query 
> execution layer (e.g. volcano batch model?) to understand the following:
> Contracts for kernels:
>  * Thread safety of kernels?
>  * When should kernels allocate memory vs. expect preallocated memory?  How to 
> communicate requirements for a kernel's memory allocation?
>  * How to communicate whether a kernel's execution is parallelizable 
> across a ChunkedArray?  How to determine if the order of execution across a 
> ChunkedArray is important?
>  * How to communicate when it is safe to re-use the same buffer as input 
> and output to the same kernel?
> What does the threading model look like for the higher level of control?  
> Where should synchronization happen?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4337) [C#] Array / RecordBatch Builder Fluent API

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4337:

Fix Version/s: (was: 0.14.0)

> [C#] Array / RecordBatch Builder Fluent API
> ---
>
> Key: ARROW-4337
> URL: https://issues.apache.org/jira/browse/ARROW-4337
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C#
>Reporter: Chris Hutchinson
>Assignee: Chris Hutchinson
>Priority: Major
>  Labels: c#, pull-request-available
>   Original Estimate: 12h
>  Time Spent: 5h 10m
>  Remaining Estimate: 6h 50m
>
> Implement a fluent API for building arrays and record batches from Arrow 
> buffers, flat arrays, spans, enumerables, etc.
> A future implementation could extend this API with support for ADO.NET 
> DataTables.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4309) [Release] gen_apidocs docker-compose task is out of date

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4309:

Fix Version/s: (was: 0.14.0)

> [Release] gen_apidocs docker-compose task is out of date
> 
>
> Key: ARROW-4309
> URL: https://issues.apache.org/jira/browse/ARROW-4309
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Developer Tools, Documentation
>Reporter: Wes McKinney
>Priority: Major
>  Labels: docker
>
> This needs to be updated to build with CUDA support (which in turn will 
> require the host machine to have nvidia-docker), among other things



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-4302) [C++] Add OpenSSL to C++ build toolchain

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-4302.
-
Resolution: Fixed

> [C++] Add OpenSSL to C++ build toolchain
> 
>
> Key: ARROW-4302
> URL: https://issues.apache.org/jira/browse/ARROW-4302
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Deepak Majeti
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> This is needed for encryption support for Parquet, among other things.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4301) [Java][Gandiva] Maven snapshot version update does not seem to update Gandiva submodule

2019-05-30 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852571#comment-16852571
 ] 

Wes McKinney commented on ARROW-4301:
-

[~pravindra] any ideas about this? This will get us again in 0.14 if it is not 
fixed

> [Java][Gandiva] Maven snapshot version update does not seem to update Gandiva 
> submodule
> ---
>
> Key: ARROW-4301
> URL: https://issues.apache.org/jira/browse/ARROW-4301
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Gandiva, Java
>Reporter: Wes McKinney
>Assignee: Praveen Kumar Desabandu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> See 
> https://github.com/apache/arrow/commit/a486db8c1476be1165981c4fe22996639da8e550.
>  This is breaking the build so I'm going to patch manually



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4286) [C++/R] Namespace vendored Boost

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4286:

Fix Version/s: (was: 0.14.0)

> [C++/R] Namespace vendored Boost
> 
>
> Key: ARROW-4286
> URL: https://issues.apache.org/jira/browse/ARROW-4286
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Packaging, R
>Reporter: Uwe L. Korn
>Priority: Major
>
> For R, we vendor Boost and thus also include the symbols privately in our 
> modules. While they are private, some things like virtual destructors can 
> still interfere with other packages that vendor Boost. We should also 
> namespace the vendored Boost as we do in the manylinux1 packaging: 
> https://github.com/apache/arrow/blob/0f8bd747468dd28c909ef823bed77d8082a5b373/python/manylinux1/scripts/build_boost.sh#L28



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4217) [Plasma] Remove custom object metadata

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4217:

Fix Version/s: (was: 0.14.0)

> [Plasma] Remove custom object metadata
> --
>
> Key: ARROW-4217
> URL: https://issues.apache.org/jira/browse/ARROW-4217
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Plasma
>Affects Versions: 0.11.1
>Reporter: Philipp Moritz
>Assignee: Philipp Moritz
>Priority: Minor
>
> Currently, Plasma supports custom metadata for objects. This doesn't seem to 
> be used at the moment, and removing it will simplify the interface and 
> implementation of plasma. Removing the custom metadata will also make 
> eviction to other blob stores easier (most other stores don't support custom 
> metadata).
> My personal use case was to store arrow schemata in there, but they are now 
> stored as part of the object itself.
> If nobody else is using this, I'd suggest removing it. If people really want 
> metadata, they could always store it as a separate object if desired.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4220) [Python] Add buffered input and output stream ASV benchmarks with simulated high latency IO

2019-05-30 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852570#comment-16852570
 ] 

Wes McKinney commented on ARROW-4220:
-

cc [~jorisvandenbossche]

> [Python] Add buffered input and output stream ASV benchmarks with simulated 
> high latency IO
> ---
>
> Key: ARROW-4220
> URL: https://issues.apache.org/jira/browse/ARROW-4220
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.14.0
>
>
> Follow up to ARROW-3126



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4283) [Python] Should RecordBatchStreamReader/Writer be AsyncIterable?

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4283:

Fix Version/s: (was: 0.14.0)

> [Python] Should RecordBatchStreamReader/Writer be AsyncIterable?
> 
>
> Key: ARROW-4283
> URL: https://issues.apache.org/jira/browse/ARROW-4283
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Paul Taylor
>Priority: Minor
>
> Filing this issue after a discussion today with [~xhochy] about how to 
> implement streaming pyarrow http services. I had attempted to use both Flask 
> and [aiohttp|https://aiohttp.readthedocs.io/en/stable/streams.html]'s 
> streaming interfaces because they seemed familiar, but no dice. I have no 
> idea how hard this would be to add -- supporting all the asynciterable 
> primitives in JS was non-trivial.
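
One stopgap in the meantime (a sketch; the `batches` shim below is 
hypothetical, not part of the pyarrow API) is to adapt the blocking reader 
into an async generator by pushing each read onto the default executor:

{code:python}
import asyncio

async def batches(reader):
    # Hypothetical shim: `reader` is a pyarrow RecordBatchStreamReader.
    # read_next_batch() blocks and raises StopIteration at end of stream,
    # so run it in a thread pool and translate the exception into a return.
    loop = asyncio.get_event_loop()
    while True:
        try:
            batch = await loop.run_in_executor(None, reader.read_next_batch)
        except StopIteration:
            return
        yield batch

# Usage inside a handler: async for batch in batches(reader): ...
{code}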



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4259) [Plasma] CI failure in test_plasma_tf_op

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4259:

Fix Version/s: (was: 0.14.0)

> [Plasma] CI failure in test_plasma_tf_op
> 
>
> Key: ARROW-4259
> URL: https://issues.apache.org/jira/browse/ARROW-4259
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Plasma, Continuous Integration, Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: ci-failure
>
> Recently-appeared failure on master:
> https://travis-ci.org/apache/arrow/jobs/479378188#L7108



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4208) [CI/Python] Have automatized tests for S3

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4208:

Labels: filesystem s3  (was: s3)

> [CI/Python] Have automatized tests for S3
> -
>
> Key: ARROW-4208
> URL: https://issues.apache.org/jira/browse/ARROW-4208
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, Python
>Reporter: Krisztian Szucs
>Priority: Major
>  Labels: filesystem, s3
> Fix For: 0.14.0
>
>
> Currently we don't run S3 integration tests regularly. 
> Possible solutions:
> - mock it within python/pytest
> - simply run the s3 tests with an S3 credential provided
> - create a hdfs-integration like docker-compose setup and run an S3 mock 
> server (e.g.: https://github.com/adobe/S3Mock, 
> https://github.com/jubos/fake-s3, https://github.com/gaul/s3proxy, 
> https://github.com/jserver/mock-s3)
> For more see discussion https://github.com/apache/arrow/pull/3286
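
For the mock-server route, the tests could point an S3 client at the local 
endpoint. A minimal sketch, assuming an S3-compatible mock (e.g. adobe/S3Mock) 
listening on localhost:9090 with a pre-created bucket and object:

{code:python}
import s3fs
import pyarrow.parquet as pq

# Point the client at the mock server instead of AWS.
fs = s3fs.S3FileSystem(
    key='test', secret='test',
    client_kwargs={'endpoint_url': 'http://localhost:9090'})

table = pq.read_table('bucket/data.parquet', filesystem=fs)
{code}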



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4208) [CI/Python] Have automatized tests for S3

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4208:

Fix Version/s: (was: 0.14.0)
   0.15.0

> [CI/Python] Have automatized tests for S3
> -
>
> Key: ARROW-4208
> URL: https://issues.apache.org/jira/browse/ARROW-4208
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, Python
>Reporter: Krisztian Szucs
>Priority: Major
>  Labels: filesystem, s3
> Fix For: 0.15.0
>
>
> Currently we don't run S3 integration tests regularly. 
> Possible solutions:
> - mock it within python/pytest
> - simply run the s3 tests with an S3 credential provided
> - create a hdfs-integration like docker-compose setup and run an S3 mock 
> server (e.g.: https://github.com/adobe/S3Mock, 
> https://github.com/jubos/fake-s3, https://github.com/gaul/s3proxy, 
> https://github.com/jserver/mock-s3)
> For more see discussion https://github.com/apache/arrow/pull/3286



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4202) [Gandiva] use ArrayFromJson in tests

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4202:

Fix Version/s: (was: 0.14.0)

> [Gandiva] use ArrayFromJson in tests
> 
>
> Key: ARROW-4202
> URL: https://issues.apache.org/jira/browse/ARROW-4202
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++ - Gandiva
>Reporter: Pindikura Ravindra
>Priority: Major
>
> Most of the gandiva tests use wrappers over ArrowFromVector. These will 
> become a lot more readable if we switch to ArrayFromJSON.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4146) [C++] Extend visitor functions to include ArrayBuilder and allow callable visitors

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4146:

Fix Version/s: (was: 0.14.0)

> [C++] Extend visitor functions to include ArrayBuilder and allow callable 
> visitors
> --
>
> Key: ARROW-4146
> URL: https://issues.apache.org/jira/browse/ARROW-4146
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Benjamin Kietzman
>Priority: Minor
>
> In addition to accepting objects with Visit methods for the visited type, 
> {{Visit(Array|Type)}} and {{Visit(Array|Type)Inline}} should accept objects 
> with overloaded call operators.
> In addition, for inline visitation, if a visitor can only visit one of the 
> potential unboxings then this can be detected at compile time and the full 
> type_id switch can be avoided (if the unboxed object cannot be visited then 
> do nothing). For example:
> {code}
> VisitTypeInline(some_type, [](const StructType& s) {
>   // only execute this if some_type.id() == Type::STRUCT
> });
> {code}
> Finally, visit functions should be added for visiting ArrayBuilders



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4201) [C++][Gandiva] integrate test utils with arrow

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4201:

Fix Version/s: (was: 0.14.0)

> [C++][Gandiva] integrate test utils with arrow
> --
>
> Key: ARROW-4201
> URL: https://issues.apache.org/jira/browse/ARROW-4201
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++ - Gandiva
>Reporter: Pindikura Ravindra
>Priority: Major
>
> The following tasks to be addressed as part of this Jira :
>  # move (or consolidate) data generators in generate_data.h to arrow
>  # move convenience fns in gandiva/tests/test_util.h to arrow
>  # move (or consolidate) EXPECT_ARROW_* fns to arrow



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4095) [C++] Implement optimizations for dictionary unification where dictionaries are prefixes of the unified dictionary

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4095:

Fix Version/s: (was: 0.14.0)
   0.15.0

> [C++] Implement optimizations for dictionary unification where dictionaries 
> are prefixes of the unified dictionary
> --
>
> Key: ARROW-4095
> URL: https://issues.apache.org/jira/browse/ARROW-4095
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.15.0
>
>
> In the event that the unified dictionary contains other dictionaries as 
> prefixes (e.g. as the result of delta dictionaries), we can avoid memory 
> allocation and index transposition.
> See discussion at 
> https://github.com/apache/arrow/pull/3165#discussion_r243020982



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4133) [C++/Python] ORC adapter should fail gracefully if /etc/timezone is missing instead of aborting

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4133:

Fix Version/s: (was: 0.14.0)

> [C++/Python] ORC adapter should fail gracefully if /etc/timezone is missing 
> instead of aborting
> ---
>
> Key: ARROW-4133
> URL: https://issues.apache.org/jira/browse/ARROW-4133
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Krisztian Szucs
>Priority: Major
>  Labels: orc
>
> The following core dump was generated by a nightly build: 
> https://travis-ci.org/kszucs/crossbow/builds/473397855
> {code}
> Core was generated by `/opt/conda/bin/python /opt/conda/bin/pytest -v 
> --pyargs pyarrow'.
> Program terminated with signal SIGABRT, Aborted.
> #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
> 51  ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
> [Current thread is 1 (Thread 0x7fea61f9e740 (LWP 179))]
> (gdb) bt
> #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
> #1  0x7fea608c8801 in __GI_abort () at abort.c:79
> #2  0x7fea4b3483df in __gnu_cxx::__verbose_terminate_handler ()
> at 
> /opt/conda/conda-bld/compilers_linux-64_1534514838838/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/libsupc++/vterminate.cc:95
> #3  0x7fea4b346b16 in __cxxabiv1::__terminate (handler=)
> at 
> /opt/conda/conda-bld/compilers_linux-64_1534514838838/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:47
> #4  0x7fea4b346b4c in std::terminate ()
> at 
> /opt/conda/conda-bld/compilers_linux-64_1534514838838/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:57
> #5  0x7fea4b346d28 in __cxxabiv1::__cxa_throw (obj=0x2039220,
> tinfo=0x7fea494803d0 ,
> dest=0x7fea49087e52 )
> at 
> /opt/conda/conda-bld/compilers_linux-64_1534514838838/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/libsupc++/eh_throw.cc:95
> #6  0x7fea49086824 in orc::getTimezoneByFilename (filename=...)
> at /build/cpp/orc_ep-prefix/src/orc_ep/c++/src/Timezone.cc:704
> #7  0x7fea490868d2 in orc::getLocalTimezone () at 
> /build/cpp/orc_ep-prefix/src/orc_ep/c++/src/Timezone.cc:713   
>   
> #8  0x7fea49063e59 in 
> orc::RowReaderImpl::RowReaderImpl (this=0x204fe30, _contents=..., opts=...)
> at /build/cpp/orc_ep-prefix/src/orc_ep/c++/src/Reader.cc:185
> #9  0x7fea4906651e in orc::ReaderImpl::createRowReader (this=0x1fb41b0, 
> opts=...)
> at /build/cpp/orc_ep-prefix/src/orc_ep/c++/src/Reader.cc:630
> #10 0x7fea48c2d904 in 
> arrow::adapters::orc::ORCFileReader::Impl::ReadSchema (this=0x1270600, 
> opts=..., 
>
> out=0x7ffe0ccae7b0) at /arrow/cpp/src/arrow/adapters/orc/adapter.cc:264
> #11 0x7fea48c2e18d in arrow::adapters::orc::ORCFileReader::Impl::Read 
> (this=0x1270600, out=0x7ffe0ccaea00)
> at /arrow/cpp/src/arrow/adapters/orc/adapter.cc:302
> #12 0x7fea48c2a8b9 in arrow::adapters::orc::ORCFileReader::Read 
> (this=0x1e14d10, out=0x7ffe0ccaea00)
> at /arrow/cpp/src/arrow/adapters/orc/adapter.cc:697   
>   
>   
> #13 0x7fea48218c9d in __pyx_pf_7pyarrow_4_orc_9ORCReader_12read 
> (__pyx_v_self=0x7fea43de8688,
> __pyx_v_include_indices=0x7fea61d07b70 <_Py_NoneStruct>) at _orc.cpp:3865
> #14 0x7fea48218b31 in __pyx_pw_7pyarrow_4_orc_9ORCReader_13read 
> (__pyx_v_self=0x7fea43de8688,
> __pyx_args=0x7fea61f5e048, __pyx_kwds=0x7fea444f78b8) at _orc.cpp:3813
> #15 0x7fea61910cbd in _PyCFunction_FastCallDict 
> (func_obj=func_obj@entry=0x7fea444b9558,
> args=args@entry=0x7fea44a40fa8, nargs=nargs@entry=0, 
> kwargs=kwargs@entry=0x7fea444f78b8)
> at Objects/methodobject.c:231
> #16 0x7fea61910f16 in _PyCFunction_FastCallKeywords 
> (func=func@entry=0x7fea444b9558,
> stack=stack@entry=0x7fea44a40fa8, nargs=0, 
> kwnames=kwnames@entry=0x7fea47d81d30) at Objects/methodobject.c:294
> #17 0x7fea619aa0da in call_function 
> (pp_stack=pp_stack@entry=0x7ffe0ccaecf0, oparg=,
> kwnames=kwnames@entry=0x7fea47d81d30) at Python/ceval.c:4837
> #18 0x7fea619abb46 in _PyEval_EvalFrameDefault (f=, 
> throwflag=)
> at Python/ceval.c:3351
> #19 0x7fea619a9cde in _PyEval_EvalCodeWithName (_co=0x7fea47d9f6f0, 
> 

[jira] [Updated] (ARROW-4090) [Python] Table.flatten() doesn't work recursively

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4090:

Fix Version/s: (was: 0.14.0)

> [Python] Table.flatten() doesn't work recursively
> -
>
> Key: ARROW-4090
> URL: https://issues.apache.org/jira/browse/ARROW-4090
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Francisco Sanchez
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> It seems that the pyarrow.Table.flatten() function does not work recursively, 
> nor does it provide a parameter to do so.
> {code}
> test1c_data = {'level1-A': 'abc',
>'level1-B': 112233,
>'level1-C': {'x': 123.111, 'y': 123.222, 'z': 123.333}
>   }
> test1c_type = pa.struct([('level1-A', pa.string()),
>  ('level1-B', pa.int32()),
>  ('level1-C', pa.struct([('x', pa.float64()),
>  ('y', pa.float64()),
>  ('z', pa.float64())
> ]))
> ])
> test1c_array = pa.array([test1c_data]*5, type=test1c_type)
> test1c_table = pa.Table.from_arrays([test1c_array], names=['msg']) 
> print('{}\n\n{}\n\n{}'.format(test1c_table.schema,
>   test1c_table.flatten().schema,
>   test1c_table.flatten().flatten().schema))
> {code}
> output:
> {quote}msg: struct<level1-A: string, level1-B: int32, level1-C: struct<x: double, y: double, z: double>>
>  child 0, level1-A: string
>  child 1, level1-B: int32
>  child 2, level1-C: struct<x: double, y: double, z: double>
>  child 0, x: double
>  child 1, y: double
>  child 2, z: double
> msg.level1-A: string
>  msg.level1-B: int32
>  msg.level1-C: struct<x: double, y: double, z: double>
>  child 0, x: double
>  child 1, y: double
>  child 2, z: double
> msg.level1-A: string
>  msg.level1-B: int32
>  msg.level1-C.x: double
>  msg.level1-C.y: double
>  msg.level1-C.z: double
> {quote}
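
Until a recursive mode exists, a small helper can loop until no struct columns 
remain (a sketch; `flatten_all` is a hypothetical name, not a pyarrow API):

{code:python}
import pyarrow as pa

def flatten_all(table):
    # Keep flattening while any top-level column is still a struct.
    while any(pa.types.is_struct(field.type) for field in table.schema):
        table = table.flatten()
    return table

flat = flatten_all(test1c_table)  # yields msg.level1-C.x etc. in one call
{code}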



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4083) [C++] Allowing ChunkedArrays to contain a mix of DictionaryArray and dense Array (of the dictionary type)

2019-05-30 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852569#comment-16852569
 ] 

Wes McKinney commented on ARROW-4083:
-

I think this could be addressed at the dataframe level, removing from any 
milestone for now

> [C++] Allowing ChunkedArrays to contain a mix of DictionaryArray and dense 
> Array (of the dictionary type)
> -
>
> Key: ARROW-4083
> URL: https://issues.apache.org/jira/browse/ARROW-4083
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: dataframe
>
> In some applications we may receive a stream of some dictionary encoded data 
> followed by some non-dictionary encoded data. For example this happens in 
> Parquet files when the dictionary reaches a certain configurable size 
> threshold.
> We should think about how we can model this in our in-memory data structures, 
> and how it can flow through to relevant computational components (i.e. 
> certain data flow observers -- like an Aggregation -- might need to be able 
> to process either a dense or dictionary encoded version of a particular array 
> in the same stream)
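
For context, the chunks of a ChunkedArray must currently share a single type, 
so mixing requires converting one side first. A sketch, assuming a pyarrow 
version where DictionaryArray.dictionary_decode() is available:

{code:python}
import pyarrow as pa

dict_chunk = pa.array(['a', 'b', 'a']).dictionary_encode()
dense_chunk = pa.array(['c', 'a'])

# Decode the dictionary chunk back to dense values so both chunks
# carry the same (string) type.
combined = pa.chunked_array([dict_chunk.dictionary_decode(), dense_chunk])
print(combined.type)  # string
{code}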



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4083) [C++] Allowing ChunkedArrays to contain a mix of DictionaryArray and dense Array (of the dictionary type)

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4083:

Labels: dataframe  (was: )

> [C++] Allowing ChunkedArrays to contain a mix of DictionaryArray and dense 
> Array (of the dictionary type)
> -
>
> Key: ARROW-4083
> URL: https://issues.apache.org/jira/browse/ARROW-4083
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: dataframe
> Fix For: 0.14.0
>
>
> In some applications we may receive a stream of some dictionary encoded data 
> followed by some non-dictionary encoded data. For example this happens in 
> Parquet files when the dictionary reaches a certain configurable size 
> threshold.
> We should think about how we can model this in our in-memory data structures, 
> and how it can flow through to relevant computational components (i.e. 
> certain data flow observers -- like an Aggregation -- might need to be able 
> to process either a dense or dictionary encoded version of a particular array 
> in the same stream)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4083) [C++] Allowing ChunkedArrays to contain a mix of DictionaryArray and dense Array (of the dictionary type)

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4083:

Fix Version/s: (was: 0.14.0)

> [C++] Allowing ChunkedArrays to contain a mix of DictionaryArray and dense 
> Array (of the dictionary type)
> -
>
> Key: ARROW-4083
> URL: https://issues.apache.org/jira/browse/ARROW-4083
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: dataframe
>
> In some applications we may receive a stream of some dictionary encoded data 
> followed by some non-dictionary encoded data. For example this happens in 
> Parquet files when the dictionary reaches a certain configurable size 
> threshold.
> We should think about how we can model this in our in-memory data structures, 
> and how it can flow through to relevant computational components (i.e. 
> certain data flow observers -- like an Aggregation -- might need to be able 
> to process either a dense or dictionary encoded version of a particular array 
> in the same stream)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4076) [Python] schema validation and filters

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4076:

Labels: datasets easyfix parquet pull-request-available  (was: easyfix 
parquet pull-request-available)

> [Python] schema validation and filters
> --
>
> Key: ARROW-4076
> URL: https://issues.apache.org/jira/browse/ARROW-4076
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: George Sakkis
>Priority: Minor
>  Labels: datasets, easyfix, parquet, pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Currently [schema 
> validation|https://github.com/apache/arrow/blob/758bd557584107cb336cbc3422744dacd93978af/python/pyarrow/parquet.py#L900]
>  of {{ParquetDataset}} takes place before filtering. This may raise a 
> {{ValueError}} if the schema is different in some dataset pieces, even if 
> these pieces would be subsequently filtered out. I think validation should 
> happen after filtering to prevent such spurious errors:
> {noformat}
> --- a/pyarrow/parquet.py  
> +++ b/pyarrow/parquet.py  
> @@ -878,13 +878,13 @@
>  if split_row_groups:
>  raise NotImplementedError("split_row_groups not yet implemented")
>  
> -if validate_schema:
> -self.validate_schemas()
> -
>  if filters is not None:
>  filters = _check_filters(filters)
>  self._filter(filters)
>  
> +if validate_schema:
> +self.validate_schemas()
> +
>  def validate_schemas(self):
>  open_file = self._get_open_file_func()
> {noformat}
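
To make the failure mode concrete, here is a sketch (paths and values are 
hypothetical) of a partitioned dataset whose schemas diverge only in a piece 
the filter would drop:

{code:python}
import os
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

os.makedirs('data/key=1', exist_ok=True)
os.makedirs('data/key=2', exist_ok=True)
# The key=1 piece has an extra column, so the piece schemas differ.
pq.write_table(pa.Table.from_pandas(pd.DataFrame({'x': [1], 'extra': ['a']})),
               'data/key=1/part0.parquet')
pq.write_table(pa.Table.from_pandas(pd.DataFrame({'x': [2]})),
               'data/key=2/part0.parquet')

# Validation currently runs before filtering, so this raises ValueError even
# though the mismatching key=1 piece would be excluded by the filter.
dataset = pq.ParquetDataset('data', filters=[('key', '=', '2')])
{code}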



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4057) [Python] Revamp handling of file URIs in pyarrow.parquet

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4057:

Fix Version/s: (was: 0.14.0)

> [Python] Revamp handling of file URIs in pyarrow.parquet
> 
>
> Key: ARROW-4057
> URL: https://issues.apache.org/jira/browse/ARROW-4057
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: parquet
>
> The way this is being handled currently is pretty brittle. If the HDFS 
> cluster being used to run the unit tests does not support writes from 
> {{$USER}} then the tests fail (e.g. the only permissioned user in the 
> docker-compose cluster is "root", so the unit tests cannot be run)
> I'm inserting various hacks to get the tests passing for now, but they are 
> temporary. There is code relating to path and URI handling spread throughout 
> the parquet module; it would be much better to centralize and clean this up



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4067) [C++] RFC: standardize ArrayBuilder subclasses

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4067:

Fix Version/s: (was: 0.14.0)

> [C++] RFC: standardize ArrayBuilder subclasses
> --
>
> Key: ARROW-4067
> URL: https://issues.apache.org/jira/browse/ARROW-4067
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Benjamin Kietzman
>Priority: Minor
>  Labels: usability
>
> Each builder supports different and frequently differently named methods for 
> appending. It should be possible to establish a more consistent convention, 
> which would alleviate dev confusion and simplify generics.
> For example, let all Builders be required to define at minimum:
>  * {{Reserve(int64_t)}}
>  * a nested type named {{Scalar}}, which is the canonical scalar appended to 
> this builder. Appending other types may be supported for convenience.
>  * {{UnsafeAppend(Scalar)}}
>  * {{UnsafeAppendNull()}}
> The other methods described below can be overridden if an optimization is 
> available or left defaulted (a CRTP helper can contain the default 
> implementations, for example {{Append(Scalar)}} would simply be a call to 
> Reserve then UnsafeAppend).
> In addition to their unsafe equivalents, {{Append(Scalar)}} and 
> {{AppendNull()}} should be available for appending without manual capacity 
> maintenance.
> It is not necessary for the rest of this RFC, but it would simplify builders 
> further if scalar append methods always had a single argument. For example, 
> this would mean abolishing {{BinaryBuilder::Append(const uint8_t*, int32_t)}} 
> in favor of {{BinaryBuilder::Append(basic_string_view)}}. There's no 
> runtime overhead involved in this change, and developers who have a pointer 
> and a length instead of a view can just construct one without boilerplate 
> using brace initialization: {code}b->Append({pointer, length});{code}
> Unsafe and safe methods should be provided for appending multiple values as 
> well. The default implementation will be a trivial loop but if optimizations 
> are available then this could be overridden (for example instead of copying 
> bits one by one into a BooleanBuilder, bytes could be memcpy'd). Append 
> methods for multiple values should accept two arguments, the first of which 
> contains values and the second of which defines validity. The canonical 
> multiple append method has signature {{Status(array_view values, 
> const uint8_t* valid_bytes)}}, but other overloads and helpers could be 
> provided as well:
> {code}
> b->Append({{1, 3, 4}}, all_valid); // append values with no nulls
> b->Append({{1, 3, 4}}, bool_vector); // use the elements of a vector 
> for validity
> b->Append({{1, 3, 4}}, bits(ptr)); // interpret ptr as a buffer of valid 
> bits, rather than valid bytes
> {code}
> Builders of nested types currently require developers to write boilerplate 
> wrangling the child builders. This could be alleviated by letting nested 
> builders' append methods return a helper as an output argument:
> {code}
> ListBuilder::List lst;
> RETURN_NOT_OK(list_builder.Append()); // ListBuilder::Scalar == 
> ListBuilder::ListBase*
> RETURN_NOT_OK(lst->Append(3));
> RETURN_NOT_OK(lst->Append(4));
> StructBuilder::Struct strct;
> RETURN_NOT_OK(struct_builder.Append());
> RETURN_NOT_OK(strct.Set(0, "uuid"));
> RETURN_NOT_OK(strct.Set(2, 47));
> RETURN_NOT_OK(strct->Finish()); // appends null to unspecified fields
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4022) [C++] RFC: promote Datum variant out of compute namespace

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4022:

Fix Version/s: (was: 0.14.0)

> [C++] RFC: promote Datum variant out of compute namespace
> -
>
> Key: ARROW-4022
> URL: https://issues.apache.org/jira/browse/ARROW-4022
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> In working on ARROW-3762, I've found it's useful to be able to have functions 
> return either {{Array}} or {{ChunkedArray}}. We might consider promoting the 
> {{arrow::compute::Datum}} variant out of {{arrow/compute/kernel.h}} so it can 
> be used in other places where it's helpful



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4001) [Python] Create Parquet Schema in python

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4001:

Fix Version/s: (was: 0.14.0)

> [Python] Create Parquet Schema in python
> 
>
> Key: ARROW-4001
> URL: https://issues.apache.org/jira/browse/ARROW-4001
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Affects Versions: 0.9.0
>Reporter: David Stauffer
>Priority: Major
>  Labels: parquet
>
> Enable the creation of a Parquet schema in python. For functions like 
> pyarrow.parquet.ParquetDataset, a schema must be a Parquet schema. See: 
> https://stackoverflow.com/questions/53725691/pyarrow-lib-schema-vs-pyarrow-parquet-schema



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4046) [Python/CI] Run nightly large memory tests

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4046:

Fix Version/s: (was: 0.14.0)

> [Python/CI] Run nightly large memory tests
> --
>
> Key: ARROW-4046
> URL: https://issues.apache.org/jira/browse/ARROW-4046
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Continuous Integration, Python
>Reporter: Krisztian Szucs
>Priority: Major
>  Labels: nightly
>
> See comment https://github.com/apache/arrow/pull/3171#issuecomment-447156646



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4046) [Python/CI] Run nightly large memory tests

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4046:

Labels: nightly  (was: )

> [Python/CI] Run nightly large memory tests
> --
>
> Key: ARROW-4046
> URL: https://issues.apache.org/jira/browse/ARROW-4046
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Continuous Integration, Python
>Reporter: Krisztian Szucs
>Priority: Major
>  Labels: nightly
> Fix For: 0.14.0
>
>
> See comment https://github.com/apache/arrow/pull/3171#issuecomment-447156646



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (ARROW-5445) [Website] Remove language that encourages pinning a version

2019-05-30 Thread Kouhei Sutou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou closed ARROW-5445.
---
Resolution: Won't Fix

https://github.com/apache/arrow/pull/4411#discussion_r288957237

{quote}
Version pinning is commonplace in the Python world -- I don't think API 
stability has much to do with it (we will still have some API changes or 
deprecations after 1.0 I would guess)
{quote}

> [Website] Remove language that encourages pinning a version
> ---
>
> Key: ARROW-5445
> URL: https://issues.apache.org/jira/browse/ARROW-5445
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Website
>Reporter: Neal Richardson
>Priority: Minor
> Fix For: 1.0.0
>
>
> See [https://github.com/apache/arrow/pull/4411#discussion_r288804415]. 
> Whenever we decide to stop threatening to break APIs (1.0 release or 
> otherwise), purge any recommendations like this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3896) [MATLAB] Decouple MATLAB-Arrow conversion logic from Feather file specific logic

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3896:

Fix Version/s: (was: 0.14.0)

> [MATLAB] Decouple MATLAB-Arrow conversion logic from Feather file specific 
> logic
> 
>
> Key: ARROW-3896
> URL: https://issues.apache.org/jira/browse/ARROW-3896
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: MATLAB
>Reporter: Kevin Gurney
>Assignee: Kevin Gurney
>Priority: Major
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Currently, the logic for converting between a MATLAB mxArray and various 
> Arrow data structures (arrow::Table, arrow::Array, etc.) is tightly coupled 
> and fairly tangled up with the logic specific to handling Feather files. It 
> would be helpful to factor out these conversions into a more generic 
> "mlarrow" conversion layer component so that it can be reused in the future 
> for use cases other than Feather support. Furthermore, this would be helpful 
> to enforce a cleaner separation of concerns.
> It would be nice to start off with this refactoring work up front before 
> adding support for more datatypes to the MATLAB featherread/featherwrite 
> functions, so that we can start off with a clean base upon which to expand 
> moving forward.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3919) [Python] Support 64 bit indices for pyarrow.serialize and pyarrow.deserialize

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3919:

Fix Version/s: (was: 0.14.0)

> [Python] Support 64 bit indices for pyarrow.serialize and pyarrow.deserialize
> -
>
> Key: ARROW-3919
> URL: https://issues.apache.org/jira/browse/ARROW-3919
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Philipp Moritz
>Assignee: Philipp Moritz
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> see https://github.com/modin-project/modin/issues/266
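
For reference, the API pair in question (a minimal sketch; assumes a pyarrow 
version where pa.serialize/pa.deserialize are available):

{code:python}
import numpy as np
import pyarrow as pa

data = [np.arange(10), {'x': 1}]

# Round-trip through the serialization format whose offsets are at issue;
# inputs over 2 GiB are what require 64 bit indices.
buf = pa.serialize(data).to_buffer()
restored = pa.deserialize(buf)
{code}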



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3901) [Python] Make Schema hashable

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3901:

Fix Version/s: (was: 0.14.0)

> [Python] Make Schema hashable
> -
>
> Key: ARROW-3901
> URL: https://issues.apache.org/jira/browse/ARROW-3901
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>
> Currently pa.Schema is not hashable, even though all of its components are 
> hashable.
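
As an interim workaround, the hashable components can stand in for the schema 
(a sketch; it relies on pa.Field being hashable, as noted above):

{code:python}
import pyarrow as pa

schema = pa.schema([('a', pa.int64()), ('b', pa.string())])

# Iterating a Schema yields its Field objects, which are hashable,
# so a tuple of them can serve as a dict key until Schema itself can.
key = tuple(schema)
cache = {key: 'some value'}
{code}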



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3873) [C++] Build shared libraries consistently with -fvisibility=hidden

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3873:

Fix Version/s: (was: 0.14.0)
   0.15.0

> [C++] Build shared libraries consistently with -fvisibility=hidden
> --
>
> Key: ARROW-3873
> URL: https://issues.apache.org/jira/browse/ARROW-3873
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> See https://github.com/apache/arrow/pull/2437



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3873) [C++] Build shared libraries consistently with -fvisibility=hidden

2019-05-30 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852552#comment-16852552
 ] 

Wes McKinney commented on ARROW-3873:
-

I might take another crack at this to see if it is doable, but after 0.14

> [C++] Build shared libraries consistently with -fvisibility=hidden
> --
>
> Key: ARROW-3873
> URL: https://issues.apache.org/jira/browse/ARROW-3873
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> See https://github.com/apache/arrow/pull/2437



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3801) [Python] Pandas-Arrow roundtrip makes pd categorical index not writeable

2019-05-30 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852549#comment-16852549
 ] 

Wes McKinney commented on ARROW-3801:
-

cc [~jorisvandenbossche]

> [Python] Pandas-Arrow roundtrip makes pd categorical index not writeable
> 
>
> Key: ARROW-3801
> URL: https://issues.apache.org/jira/browse/ARROW-3801
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.10.0
>Reporter: Thomas Buhrmann
>Priority: Major
> Fix For: 0.14.0
>
>
> Serializing and deserializing a pandas series with categorical dtype will 
> make the categorical index non-writeable, which in turn trips up pandas when 
> e.g. reordering the categories, raising "ValueError: buffer source array is 
> read-only" :
> {code}
> import pandas as pd
> import pyarrow as pa
> df = pd.Series([1,2,3], dtype='category', name="c1").to_frame()
> print("DType before:", repr(df.c1.dtype))
> print("Writeable:", df.c1.cat.categories.values.flags.writeable)
> ro = df.c1.cat.reorder_categories([3,2,1])
> print("DType reordered:", repr(ro.dtype), "\n")
> tbl = pa.Table.from_pandas(df)
> df2 = tbl.to_pandas()
> print("DType after:", repr(df2.c1.dtype))
> print("Writeable:", df2.c1.cat.categories.values.flags.writeable)
> ro = df2.c1.cat.reorder_categories([3,2,1])
> print("DType reordered:", repr(ro.dtype), "\n")
> {code}
>  
> Outputs:
>  
> {code:java}
> DType before: CategoricalDtype(categories=[1, 2, 3], ordered=False)
> Writeable: True
> DType reordered: CategoricalDtype(categories=[3, 2, 1], ordered=False)
> DType after: CategoricalDtype(categories=[1, 2, 3], ordered=False)
> Writeable: False
> ---
> ValueError Traceback (most recent call last)
>  in 
>  12 print("DType after:", repr(df2.c1.dtype))
>  13 print("Writeable:", df2.c1.cat.categories.values.flags.writeable)
> ---> 14 ro = df2.c1.cat.reorder_categories([3,2,1])
>  15 print("DType reordered:", repr(ro.dtype), "\n")
> {code}
>  
>  
>  
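
A possible workaround sketch (assumes rebuilding the column is acceptable): 
reconstruct the categorical from plain Python values so its categories index 
gets fresh, writable buffers:

{code:python}
import pandas as pd
import pyarrow as pa

df = pd.Series([1, 2, 3], dtype='category', name='c1').to_frame()
df2 = pa.Table.from_pandas(df).to_pandas()

# Rebuild from a plain list; the new categories array is writable,
# so reorder_categories no longer trips over a read-only buffer.
df2['c1'] = pd.Categorical(df2.c1.tolist())
ro = df2.c1.cat.reorder_categories([3, 2, 1])
{code}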



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3806) [Python] When converting nested types to pandas, use tuples

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3806:

Fix Version/s: (was: 0.14.0)

> [Python] When converting nested types to pandas, use tuples
> ---
>
> Key: ARROW-3806
> URL: https://issues.apache.org/jira/browse/ARROW-3806
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.11.1
> Environment: Fedora 29, pyarrow installed with conda
>Reporter: Suvayu Ali
>Priority: Minor
>  Labels: pandas
>
> When converting to pandas, convert nested types (e.g. list) to tuples.  
> Columns with lists are difficult to query; here are a few unsuccessful 
> attempts (a tuple-based workaround sketch follows the tracebacks below):
> {code}
> >>> mini
>      CHROM    POS           ID            REF    ALTS  QUAL
> 80      20  63521  rs191905748              G     [A]   100
> 81      20  63541  rs117322527              C     [A]   100
> 82      20  63548  rs541129280              G    [GT]   100
> 83      20  63553  rs536661806              T     [C]   100
> 84      20  63555  rs553463231              T     [C]   100
> 85      20  63559  rs138359120              C     [A]   100
> 86      20  63586  rs545178789              T     [G]   100
> 87      20  63636  rs374311122              G     [A]   100
> 88      20  63696  rs149160003              A     [G]   100
> 89      20  63698  rs544072005              A     [C]   100
> 90      20  63729  rs181483669              G     [A]   100
> 91      20  63733   rs75670495              C     [T]   100
> 92      20  63799    rs1418258              C     [T]   100
> 93      20  63808   rs76004960              G     [C]   100
> 94      20  63813  rs532151719              G     [A]   100
> 95      20  63857  rs543686274  CCTGGAAAGGATT     [C]   100
> 96      20  63865  rs551938596              G     [A]   100
> 97      20  63902  rs571779099              A     [T]   100
> 98      20  63963  rs531152674              G     [A]   100
> 99      20  63967  rs116770801              A     [G]   100
> 100     20  63977  rs199703510              C     [G]   100
> 101     20  64016  rs143263863              G     [A]   100
> 102     20  64062  rs148297240              G     [A]   100
> 103     20  64139  rs186497980              G  [A, T]   100
> 104     20  64150    rs7274499              C     [A]   100
> 105     20  64151  rs190945171              C     [T]   100
> 106     20  64154  rs537656456              T     [G]   100
> 107     20  64175  rs116531220              A     [G]   100
> 108     20  64186  rs141793347              C     [G]   100
> 109     20  64210  rs182418654              G     [C]   100
> 110     20  64303  rs559929739              C     [A]   100
> {code}
> # I think this one fails because it tries to broadcast the comparison.
> {code}
> >>> mini[mini.ALTS == ["A", "T"]]
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File 
> "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/ops.py", line 
> 1283, in wrapper
> res = na_op(values, other)
>   File 
> "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/ops.py", line 
> 1143, in na_op
> result = _comp_method_OBJECT_ARRAY(op, x, y)
>   File 
> "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/ops.py", line 
> 1120, in _comp_method_OBJECT_ARRAY
> result = libops.vec_compare(x, y, op)
>   File "pandas/_libs/ops.pyx", line 128, in pandas._libs.ops.vec_compare
> ValueError: Arrays were different lengths: 31 vs 2
> {code}
> # I think this fails due to a similar reason, but the broadcasting is 
> happening at a different place.
> {code}
> >>> mini[mini.ALTS.apply(lambda x: x == ["A", "T"])]
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File 
> "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/frame.py", 
> line 2682, in __getitem__
> return self._getitem_array(key)
>   File 
> "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/frame.py", 
> line 2726, in _getitem_array
> indexer = self.loc._convert_to_indexer(key, axis=1)
>   File 
> "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/indexing.py", 
> line 1314, in _convert_to_indexer
> indexer = check = labels.get_indexer(objarr)
>   File 
> "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py",
>  line 3259, in get_indexer
> indexer = self._engine.get_indexer(target._ndarray_values)
>   File "pandas/_libs/index.pyx", line 301, in 
> pandas._libs.index.IndexEngine.get_indexer
>   File "pandas/_libs/hashtable_class_helper.pxi", line 1544, in 
> pandas._libs.hashtable.PyObjectHashTable.lookup
> TypeError: unhashable type: 'numpy.ndarray'
> >>> mini.ALTS.apply(lambda x: x == ["A", "T"]).head()
> 80 [True, False]
> 81 [True, False]
> 82[False, False]
> 83  
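> {code}
> A minimal sketch of the requested behaviour, done by hand for now (column 
> names as in the example above; with tuples the values become hashable, so 
> ordinary queries work):
> {code}
> # convert the list column to tuples manually until pyarrow does it
> mini = mini.assign(ALTS=mini.ALTS.map(tuple))
> mini[mini.ALTS.apply(lambda x: x == ("A", "T"))]   # selects row 103
> mini[mini.ALTS.isin([("A", "T")])]                 # same result via isin
> {code}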

[jira] [Updated] (ARROW-3827) [Rust] Implement UnionArray

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3827:

Fix Version/s: (was: 0.14.0)

> [Rust] Implement UnionArray
> ---
>
> Key: ARROW-3827
> URL: https://issues.apache.org/jira/browse/ARROW-3827
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: Paddy Horan
>Assignee: Paddy Horan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3789) [Python] Enable calling object in Table.to_pandas to "self-destruct" for improved memory use

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3789:

Fix Version/s: (was: 0.14.0)

> [Python] Enable calling object in Table.to_pandas to "self-destruct" for 
> improved memory use
> 
>
> Key: ARROW-3789
> URL: https://issues.apache.org/jira/browse/ARROW-3789
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>
> One issue with using {{Table.to_pandas}} is that it results in at least a 
> memory doubling (more if a lot of Python objects are created). It would be 
> useful if there were an option to destroy the {{arrow::Column}} references 
> once they've been transferred into the target data frame. This would render 
> the {{pyarrow.Table}} object useless afterward.
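> A hypothetical usage sketch (the {{self_destruct}} keyword below is 
> illustrative only; no such argument exists yet):
> {code}
> import pyarrow as pa
> 
> table = pa.Table.from_arrays([pa.array(list(range(1000000)))], names=['x'])
> # hypothetical: release each Arrow column as soon as it has been
> # converted, so peak memory is ~1x instead of ~2x
> df = table.to_pandas(self_destruct=True)
> # the table would be unusable from this point on
> {code}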



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3764) [C++] Port Python "ParquetDataset" business logic to C++

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3764:

Fix Version/s: (was: 0.14.0)
   0.15.0

> [C++] Port Python "ParquetDataset" business logic to C++
> 
>
> Key: ARROW-3764
> URL: https://issues.apache.org/jira/browse/ARROW-3764
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: datasets, parquet
> Fix For: 0.15.0
>
>
> Along with defining appropriate abstractions for dealing with generic 
> filesystems in C++, we should implement the machinery for reading multiple 
> Parquet files in C++ so that it can be reused in GLib, R, and Ruby. Otherwise 
> these languages will have to reimplement it, which would surely result in 
> inconsistent features and bugs in some implementations but not others.
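> For context, a sketch of the Python-side entry point whose logic would move 
> to C++ (the directory path is illustrative):
> {code}
> import pyarrow.parquet as pq
> 
> # ParquetDataset discovers files, partition keys and schemas under a
> # directory; today this discovery logic is implemented in Python only
> dataset = pq.ParquetDataset('/data/events')  # illustrative path
> table = dataset.read()
> {code}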



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3759) [R][CI] Build and test on Windows in Appveyor

2019-05-30 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852548#comment-16852548
 ] 

Wes McKinney commented on ARROW-3759:
-

cc [~npr]

> [R][CI] Build and test on Windows in Appveyor
> -
>
> Key: ARROW-3759
> URL: https://issues.apache.org/jira/browse/ARROW-3759
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, R
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3730) [Python] Output a representation of pyarrow.Schema that can be used to reconstruct a schema in a script

2019-05-30 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852547#comment-16852547
 ] 

Wes McKinney commented on ARROW-3730:
-

cc [~jorisvandenbossche]

> [Python] Output a representation of pyarrow.Schema that can be used to 
> reconstruct a schema in a script
> ---
>
> Key: ARROW-3730
> URL: https://issues.apache.org/jira/browse/ARROW-3730
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.14.0
>
>
> This would be similar to what {{__repr__}} provides for many built-in Python 
> types; for example, the schema could be rendered as a list of tuples that can 
> be passed back to {{pyarrow.schema}}.
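> A sketch of the desired round trip (built by hand today; the issue asks for 
> a repr that emits something like this directly):
> {code}
> import pyarrow as pa
> 
> schema = pa.schema([('id', pa.int64()), ('name', pa.string())])
> # today the reconstruction must be written manually
> fields = [(f.name, f.type) for f in schema]
> assert pa.schema(fields).equals(schema)
> {code}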



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3758) [R] Build R library on Windows, document build instructions for Windows developers

2019-05-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3758:

Fix Version/s: (was: 0.14.0)
   0.15.0

> [R] Build R library on Windows, document build instructions for Windows 
> developers
> --
>
> Key: ARROW-3758
> URL: https://issues.apache.org/jira/browse/ARROW-3758
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.15.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

