[jira] [Updated] (ARROW-3246) [Python][Parquet] direct reading/writing of pandas categoricals in parquet

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3246:

Summary: [Python][Parquet] direct reading/writing of pandas categoricals in 
parquet  (was: [Python] direct reading/writing of pandas categoricals in 
parquet)

> [Python][Parquet] direct reading/writing of pandas categoricals in parquet
> --
>
> Key: ARROW-3246
> URL: https://issues.apache.org/jira/browse/ARROW-3246
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Martin Durant
>Priority: Minor
>  Labels: parquet
> Fix For: 0.14.0
>
>
> Parquet supports "dictionary encoding" of column data in a manner very 
> similar to the concept of Categoricals in pandas. It is natural to use this 
> encoding for a column which originated as a categorical. Conversely, when 
> loading, if the file metadata says that a given column came from a pandas (or 
> arrow) categorical, then we can trust that the whole of the column is 
> dictionary-encoded and load the data directly into a categorical column, 
> rather than expanding the labels upon load and recategorising later.
> If the data does not have the pandas metadata, then the guarantee cannot 
> hold, and we cannot assume either that the whole column is dictionary encoded 
> or that the labels are the same throughout. In this case, the current 
> behaviour is fine.
>  
> (please forgive that some of this has already been mentioned elsewhere; this 
> is one of the entries in the list at 
> [https://github.com/dask/fastparquet/issues/374] as a feature that is useful 
> in fastparquet)
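The correspondence described above can be sketched in plain Python. This is an illustration of dictionary encoding itself, not pyarrow's implementation; the helper names are made up for this example:

```python
# Illustrative sketch only: Parquet's dictionary encoding stores a column
# as a table of unique values plus integer codes, which is also how a
# pandas Categorical is laid out (categories + codes). Loading such a
# column "directly" means keeping this pair instead of expanding every
# label and re-categorising afterwards.

def dictionary_encode(values):
    categories, codes, index = [], [], {}
    for v in values:
        if v not in index:
            index[v] = len(categories)
            categories.append(v)
        codes.append(index[v])
    return categories, codes

def dictionary_decode(categories, codes):
    # The "expand the labels" path that direct loading would avoid
    return [categories[c] for c in codes]

cats, codes = dictionary_encode(["a", "b", "a", "a", "b"])
print(cats, codes)  # ['a', 'b'] [0, 1, 0, 0, 1]
```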



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (ARROW-3066) [Wiki] Add "How to contribute" to developer wiki

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-3066.
---
   Resolution: Fixed
 Assignee: Wes McKinney
Fix Version/s: (was: 0.14.0)
   0.13.0

This is now part of the main documentation site. I updated 
https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow

> [Wiki] Add "How to contribute" to developer wiki
> 
>
> Key: ARROW-3066
> URL: https://issues.apache.org/jira/browse/ARROW-3066
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Wiki
>Reporter: okkez
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.13.0
>
>
> The [website|https://arrow.apache.org/] describes:
> > Interested in contributing? Join the mailing list or check out the 
> > developer wiki.
> But I could not find "How to contribute" on [the 
> Wiki|https://cwiki.apache.org/confluence/display/ARROW].
> Though I can find it in the repository:
> * https://github.com/apache/arrow#how-to-contribute
> * 
> https://github.com/apache/arrow/blob/master/.github/CONTRIBUTING.md#how-to-contribute-patches
> We could add content to make "How to contribute" easier to find.
> Or, we could consolidate the duplicated content into [the 
> Wiki|https://cwiki.apache.org/confluence/display/ARROW].





[jira] [Updated] (ARROW-3080) [Python] Unify Arrow to Python object conversion paths

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3080:

Fix Version/s: (was: 0.14.0)
   0.15.0

> [Python] Unify Arrow to Python object conversion paths
> --
>
> Key: ARROW-3080
> URL: https://issues.apache.org/jira/browse/ARROW-3080
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.15.0
>
>
> Similar to ARROW-2814, we have inconsistent support for converting Arrow 
> nested types back to object sequences. For example, a list of structs fails 
> when calling {{to_pandas}}
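The kind of conversion the issue wants unified can be illustrated in plain Python (a sketch of the Arrow layout, not pyarrow internals; `offsets` and `fields` are hypothetical names for this example):

```python
# Illustrative sketch: a list<struct> column in Arrow layout is a set of
# flat child arrays (one per struct field) plus list offsets. Converting
# it back to Python objects means slicing the children per row.

def list_of_structs_to_pylist(offsets, fields):
    # fields: dict of field name -> flat child values (all same length)
    out = []
    for i in range(len(offsets) - 1):
        start, stop = offsets[i], offsets[i + 1]
        out.append([
            {name: values[j] for name, values in fields.items()}
            for j in range(start, stop)
        ])
    return out

offsets = [0, 2, 3]
fields = {"x": [1, 2, 3], "y": ["a", "b", "c"]}
print(list_of_structs_to_pylist(offsets, fields))
# [[{'x': 1, 'y': 'a'}, {'x': 2, 'y': 'b'}], [{'x': 3, 'y': 'c'}]]
```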





[jira] [Updated] (ARROW-3066) [Wiki] Add "How to contribute" to developer wiki

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3066:

Fix Version/s: 0.14.0

> [Wiki] Add "How to contribute" to developer wiki
> 
>
> Key: ARROW-3066
> URL: https://issues.apache.org/jira/browse/ARROW-3066
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Wiki
>Reporter: okkez
>Priority: Major
> Fix For: 0.14.0
>
>
> The [website|https://arrow.apache.org/] describes:
> > Interested in contributing? Join the mailing list or check out the 
> > developer wiki.
> But I could not find "How to contribute" on [the 
> Wiki|https://cwiki.apache.org/confluence/display/ARROW].
> Though I can find it in the repository:
> * https://github.com/apache/arrow#how-to-contribute
> * 
> https://github.com/apache/arrow/blob/master/.github/CONTRIBUTING.md#how-to-contribute-patches
> We could add content to make "How to contribute" easier to find.
> Or, we could consolidate the duplicated content into [the 
> Wiki|https://cwiki.apache.org/confluence/display/ARROW].





[jira] [Commented] (ARROW-3052) [C++] Detect ORC system packages

2019-03-30 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16806012#comment-16806012
 ] 

Wes McKinney commented on ARROW-3052:
-

I just repurposed this issue to be about fixing ORC to pull from the system (or 
conda) toolchain since that's the only thing left

> [C++] Detect ORC system packages
> 
>
> Key: ARROW-3052
> URL: https://issues.apache.org/jira/browse/ARROW-3052
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.14.0
>
>
> See 
> https://github.com/apache/arrow/blob/master/cpp/cmake_modules/ThirdpartyToolchain.cmake#L155.
>  After the CMake refactor it is possible to use built ORC packages with 
> {{$ORC_HOME}}, but they are not detected automatically like the other 
> toolchain dependencies.





[jira] [Updated] (ARROW-3052) [C++] Detect ORC system packages

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3052:

Fix Version/s: 0.14.0

> [C++] Detect ORC system packages
> 
>
> Key: ARROW-3052
> URL: https://issues.apache.org/jira/browse/ARROW-3052
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.14.0
>
>
> See 
> https://github.com/apache/arrow/blob/master/cpp/cmake_modules/ThirdpartyToolchain.cmake#L155.
>  After the CMake refactor it is possible to use built ORC packages with 
> {{$ORC_HOME}}, but they are not detected automatically like the other 
> toolchain dependencies.





[jira] [Updated] (ARROW-3052) [C++] Detect ORC system packages

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3052:

Summary: [C++] Detect ORC system packages  (was: [C++] Support ORC, GRPC, 
Thrift, and Protobuf when using $ARROW_BUILD_TOOLCHAIN)

> [C++] Detect ORC system packages
> 
>
> Key: ARROW-3052
> URL: https://issues.apache.org/jira/browse/ARROW-3052
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> It would be good to support these additional toolchain components without 
> having to set extra environment variables





[jira] [Updated] (ARROW-3052) [C++] Detect ORC system packages

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3052:

Description: See 
https://github.com/apache/arrow/blob/master/cpp/cmake_modules/ThirdpartyToolchain.cmake#L155.
 After the CMake refactor it is possible to use built ORC packages with 
{{$ORC_HOME}} but not detected like the other toolchain dependencies  (was: It 
would be good to support these additional toolchain components without having 
to set extra environment variables)

> [C++] Detect ORC system packages
> 
>
> Key: ARROW-3052
> URL: https://issues.apache.org/jira/browse/ARROW-3052
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> See 
> https://github.com/apache/arrow/blob/master/cpp/cmake_modules/ThirdpartyToolchain.cmake#L155.
>  After the CMake refactor it is possible to use built ORC packages with 
> {{$ORC_HOME}}, but they are not detected automatically like the other 
> toolchain dependencies.





[jira] [Resolved] (ARROW-3434) [Packaging] Add Apache ORC C++ library to conda-forge

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-3434.
-
   Resolution: Fixed
 Assignee: Uwe L. Korn
Fix Version/s: (was: 0.14.0)
   0.13.0

This has been completed

> [Packaging] Add Apache ORC C++ library to conda-forge
> -
>
> Key: ARROW-3434
> URL: https://issues.apache.org/jira/browse/ARROW-3434
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: toolchain
> Fix For: 0.13.0
>
>
> In the vein of "toolchain all the things", it would be useful to be able to 
> obtain the ORC static libraries from a conda package rather than building 
> from source every time





[jira] [Updated] (ARROW-3032) [Python] Clean up NumPy-related C++ headers

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3032:

Fix Version/s: 0.15.0

> [Python] Clean up NumPy-related C++ headers
> ---
>
> Key: ARROW-3032
> URL: https://issues.apache.org/jira/browse/ARROW-3032
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.15.0
>
>
> There are 4 different headers. After ARROW-2814, we can probably eliminate 
> numpy_convert.h and combine with numpy_to_arrow.h





[jira] [Updated] (ARROW-3016) [C++] Add ability to enable call stack logging for each memory allocation

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3016:

Fix Version/s: 0.14.0

> [C++] Add ability to enable call stack logging for each memory allocation
> -
>
> Key: ARROW-3016
> URL: https://issues.apache.org/jira/browse/ARROW-3016
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.14.0
>
>
> It is possible to gain programmatic access to the call stack in C/C++, e.g.
> https://eli.thegreenplace.net/2015/programmatic-access-to-the-call-stack-in-c/
> It would be valuable to have a debugging option to log the sizes of memory 
> allocations as well as showing the call stack where that allocation is 
> performed. In complex programs, this could help determine the origin of a 
> memory leak
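As a rough Python-level analogue of the proposed C++ debugging option (not the Arrow feature itself), the standard library's {{tracemalloc}} records a call stack per allocation and can attribute allocated bytes to their origin:

```python
import tracemalloc

# Analogue sketch: record up to 10 stack frames per allocation, then
# group live allocations by traceback to find where the bytes came from.
tracemalloc.start(10)

def allocate_a_lot():
    # Stand-in for a suspected leak site
    return [bytes(10_000) for _ in range(50)]

kept = allocate_a_lot()
snapshot = tracemalloc.take_snapshot()
tracemalloc.stop()

# The top entry is typically the largest allocation site (here, allocate_a_lot)
top = snapshot.statistics("traceback")[0]
print(top.size, "bytes allocated at:")
for line in top.traceback.format():
    print(line)
```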





[jira] [Updated] (ARROW-3399) [Python] Cannot serialize numpy matrix object

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3399:

Fix Version/s: 0.14.0

> [Python] Cannot serialize numpy matrix object
> -
>
> Key: ARROW-3399
> URL: https://issues.apache.org/jira/browse/ARROW-3399
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.10.0
>Reporter: Mitar
>Priority: Major
> Fix For: 0.14.0
>
>
> This is a regression from 0.9.0 and happens with 0.10.0 with Python 3.6.5 on 
> Linux.
> {code:java}
> from pyarrow import plasma
> import numpy
> import time
> import subprocess
> import os
> import signal
> m = numpy.matrix(numpy.array([[1, 2], [3, 4]]))
> process = subprocess.Popen(['plasma_store', '-m', '100', '-s', 
> '/tmp/plasma', '-d', '/dev/shm'], stdout=subprocess.DEVNULL, 
> stderr=subprocess.DEVNULL, encoding='utf8', preexec_fn=os.setpgrp)
> time.sleep(5)
> client = plasma.connect('/tmp/plasma', '', 0)
> try:
>     client.put(m)
> finally:
>     client.disconnect()
>     os.killpg(os.getpgid(process.pid), signal.SIGTERM)
> {code}
> Error:
> {noformat}
>   File "pyarrow/_plasma.pyx", line 397, in pyarrow._plasma.PlasmaClient.put
>   File "pyarrow/serialization.pxi", line 338, in pyarrow.lib.serialize
>   File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: This object exceeds the maximum 
> recursion depth. It may contain itself recursively.{noformat}





[jira] [Updated] (ARROW-2984) [JS] Refactor release verification script to share code with main source release verification script

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2984:

Fix Version/s: (was: JS-0.5.0)
   0.14.0

> [JS] Refactor release verification script to share code with main source 
> release verification script
> 
>
> Key: ARROW-2984
> URL: https://issues.apache.org/jira/browse/ARROW-2984
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.14.0
>
>
> There is some possible code duplication. See discussion in ARROW-2977 
> https://github.com/apache/arrow/pull/2369





[jira] [Commented] (ARROW-2938) [Packaging] Make the source release via crossbow

2019-03-30 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16806005#comment-16806005
 ] 

Wes McKinney commented on ARROW-2938:
-

I'm not sure this is desirable from a security standpoint

> [Packaging] Make the source release via crossbow
> 
>
> Key: ARROW-2938
> URL: https://issues.apache.org/jira/browse/ARROW-2938
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Packaging
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>
> Also make it possible to upload the source distribution (along with its 
> signature and checksums) to GitHub releases. This will make ARROW-2910 testable.





[jira] [Commented] (ARROW-3399) [Python] Cannot serialize numpy matrix object

2019-03-30 Thread Mitar (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16806006#comment-16806006
 ] 

Mitar commented on ARROW-3399:
--

This is still happening in 0.12.1.

I think this should be fixed because it will be quite some time before the 
matrix class falls out of use, even though it is deprecated.

> [Python] Cannot serialize numpy matrix object
> -
>
> Key: ARROW-3399
> URL: https://issues.apache.org/jira/browse/ARROW-3399
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.10.0
>Reporter: Mitar
>Priority: Major
>
> This is a regression from 0.9.0 and happens with 0.10.0 with Python 3.6.5 on 
> Linux.
> {code:java}
> from pyarrow import plasma
> import numpy
> import time
> import subprocess
> import os
> import signal
> m = numpy.matrix(numpy.array([[1, 2], [3, 4]]))
> process = subprocess.Popen(['plasma_store', '-m', '100', '-s', 
> '/tmp/plasma', '-d', '/dev/shm'], stdout=subprocess.DEVNULL, 
> stderr=subprocess.DEVNULL, encoding='utf8', preexec_fn=os.setpgrp)
> time.sleep(5)
> client = plasma.connect('/tmp/plasma', '', 0)
> try:
>     client.put(m)
> finally:
>     client.disconnect()
>     os.killpg(os.getpgid(process.pid), signal.SIGTERM)
> {code}
> Error:
> {noformat}
>   File "pyarrow/_plasma.pyx", line 397, in pyarrow._plasma.PlasmaClient.put
>   File "pyarrow/serialization.pxi", line 338, in pyarrow.lib.serialize
>   File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: This object exceeds the maximum 
> recursion depth. It may contain itself recursively.{noformat}





[jira] [Updated] (ARROW-2967) [Python] Add option to treat invalid PyObject* values as null in pyarrow.array

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2967:

Fix Version/s: 0.14.0

> [Python] Add option to treat invalid PyObject* values as null in pyarrow.array
> --
>
> Key: ARROW-2967
> URL: https://issues.apache.org/jira/browse/ARROW-2967
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.14.0
>
>
> See discussion in ARROW-2966





[jira] [Updated] (ARROW-2939) [Python] API documentation version doesn't match latest on PyPI

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2939:

Fix Version/s: 0.14.0

> [Python] API documentation version doesn't match latest on PyPI
> ---
>
> Key: ARROW-2939
> URL: https://issues.apache.org/jira/browse/ARROW-2939
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Ian Robertson
>Priority: Minor
>  Labels: documentation
> Fix For: 0.14.0
>
>
> Hey folks, apologies if this isn't the right place to raise this.  In poking 
> around the web documentation (for pyarrow specifically), it looks like the 
> auto-generated API docs contain commits past the release of 0.9.0.  For 
> example:
>  * 
> [https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.column]
>  * Contains differences merged here: 
> [https://github.com/apache/arrow/pull/1923]
>  * But latest pypi/conda versions of pyarrow are 0.9.0, which don't include 
> that change.
> Not sure if the docs are auto-built off master somewhere, I couldn't find 
> anything about building docs in the docs itself.  I would guess that you may 
> want some of the usage docs to be published in between releases if they're 
> not about new functionality, but the API reference being out of date can be 
> confusing.  Is it possible to anchor the API docs to the latest released 
> version?  Or even something like how Pandas has a whole bunch of old versions 
> still available? (e.g. [https://pandas.pydata.org/pandas-docs/stable/] vs. 
> old versions like [http://pandas.pydata.org/pandas-docs/version/0.17.0/])





[jira] [Updated] (ARROW-2882) [C++][Python] Support AWS Firehose partition_scheme implementation for Parquet datasets

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2882:

Summary: [C++][Python] Support AWS Firehose partition_scheme implementation 
for Parquet datasets  (was: [Python] Support AWS Firehose partition_scheme 
implementation for Parquet datasets)

> [C++][Python] Support AWS Firehose partition_scheme implementation for 
> Parquet datasets
> ---
>
> Key: ARROW-2882
> URL: https://issues.apache.org/jira/browse/ARROW-2882
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Pablo Javier Takara
>Priority: Major
>  Labels: parquet
>
> I'd like to be able to read a ParquetDataset generated by AWS Firehose.
> The only implementation at the time of writing was the partition scheme 
> created by Hive (year=2018/month=01/day=11).
> The AWS Firehose partition scheme is slightly different (2018/01/11).
>  
> Thanks
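The difference between the two layouts can be sketched in plain Python (illustrative only; `parse_hive` and `parse_firehose` are hypothetical helpers, not pyarrow APIs):

```python
# Hive-style paths carry key=value segments, so partition column names
# are self-describing. Firehose-style paths carry bare values, so the
# column names must be supplied by the caller.

def parse_hive(path):
    parts = {}
    for seg in path.strip("/").split("/"):
        if "=" in seg:
            key, _, value = seg.partition("=")
            parts[key] = value
    return parts

def parse_firehose(path, field_names):
    segs = [s for s in path.strip("/").split("/") if s]
    return dict(zip(field_names, segs))

print(parse_hive("year=2018/month=01/day=11"))
print(parse_firehose("2018/01/11", ["year", "month", "day"]))
# Both print: {'year': '2018', 'month': '01', 'day': '11'}
```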





[jira] [Commented] (ARROW-2882) [C++][Python] Support AWS Firehose partition_scheme implementation for Parquet datasets

2019-03-30 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16806004#comment-16806004
 ] 

Wes McKinney commented on ARROW-2882:
-

I added the C++ component since this will be handled as part of the Datasets 
project

> [C++][Python] Support AWS Firehose partition_scheme implementation for 
> Parquet datasets
> ---
>
> Key: ARROW-2882
> URL: https://issues.apache.org/jira/browse/ARROW-2882
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Pablo Javier Takara
>Priority: Major
>  Labels: dataset, parquet
>
> I'd like to be able to read a ParquetDataset generated by AWS Firehose.
> The only implementation at the time of writing was the partition scheme 
> created by Hive (year=2018/month=01/day=11).
> The AWS Firehose partition scheme is slightly different (2018/01/11).
>  
> Thanks





[jira] [Updated] (ARROW-2882) [C++][Python] Support AWS Firehose partition_scheme implementation for Parquet datasets

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2882:

Labels: dataset parquet  (was: parquet)

> [C++][Python] Support AWS Firehose partition_scheme implementation for 
> Parquet datasets
> ---
>
> Key: ARROW-2882
> URL: https://issues.apache.org/jira/browse/ARROW-2882
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Pablo Javier Takara
>Priority: Major
>  Labels: dataset, parquet
>
> I'd like to be able to read a ParquetDataset generated by AWS Firehose.
> The only implementation at the time of writing was the partition scheme 
> created by Hive (year=2018/month=01/day=11).
> The AWS Firehose partition scheme is slightly different (2018/01/11).
>  
> Thanks





[jira] [Updated] (ARROW-2860) [Python][Parquet] Null values in a single partition of Parquet dataset, results in invalid schema on read

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2860:

Summary: [Python][Parquet] Null values in a single partition of Parquet 
dataset, results in invalid schema on read  (was: [Python] Null values in a 
single partition of Parquet dataset, results in invalid schema on read)

> [Python][Parquet] Null values in a single partition of Parquet dataset, 
> results in invalid schema on read
> -
>
> Key: ARROW-2860
> URL: https://issues.apache.org/jira/browse/ARROW-2860
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Sam Oluwalana
>Assignee: Wes McKinney
>Priority: Major
>  Labels: parquet
> Fix For: 0.14.0
>
>
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> from datetime import datetime, timedelta
> def generate_data(event_type, event_id, offset=0):
>     """Generate data."""
>     now = datetime.utcnow() + timedelta(seconds=offset)
>     obj = {
>         'event_type': event_type,
>         'event_id': event_id,
>         'event_date': now.date(),
>         'foo': None,
>         'bar': u'hello',
>     }
>     if event_type == 2:
>         obj['foo'] = 1
>         obj['bar'] = u'world'
>     if event_type == 3:
>         obj['different'] = u'data'
>         obj['bar'] = u'event type 3'
>     else:
>         obj['different'] = None
>     return obj
> data = [
>     generate_data(1, 1, 1),
>     generate_data(1, 1, 3600 * 72),
>     generate_data(2, 1, 1),
>     generate_data(2, 1, 3600 * 72),
>     generate_data(3, 1, 1),
>     generate_data(3, 1, 3600 * 72),
> ]
> df = pd.DataFrame.from_records(data, index='event_id')
> table = pa.Table.from_pandas(df)
> pq.write_to_dataset(table, root_path='/tmp/events', 
> partition_cols=['event_type', 'event_date'])
> dataset = pq.ParquetDataset('/tmp/events')
> table = dataset.read()
> print(table.num_rows)
> {code}
> Expected output:
> {code:python}
> 6
> {code}
> Actual:
> {code:python}
> python example_failure.py
> Traceback (most recent call last):
>   File "example_failure.py", line 43, in <module>
> dataset = pq.ParquetDataset('/tmp/events')
>   File 
> "/Users/sam/.virtualenvs/test-parquet/lib/python2.7/site-packages/pyarrow/parquet.py",
>  line 745, in __init__
> self.validate_schemas()
>   File 
> "/Users/sam/.virtualenvs/test-parquet/lib/python2.7/site-packages/pyarrow/parquet.py",
>  line 775, in validate_schemas
> dataset_schema))
> ValueError: Schema in partition[event_type=2, event_date=0] 
> /tmp/events/event_type=3/event_date=2018-07-16 
> 00:00:00/be001bf576674d09825539f20e99ebe5.parquet was different.
> bar: string
> different: string
> foo: double
> event_id: int64
> metadata
> 
> {'pandas': '{"pandas_version": "0.23.3", "index_columns": ["event_id"], 
> "columns": [{"metadata": null, "field_name": "bar", "name": "bar", 
> "numpy_type": "object", "pandas_type": "unicode"}, {"metadata": null, 
> "field_name": "different", "name": "different", "numpy_type": "object", 
> "pandas_type": "unicode"}, {"metadata": null, "field_name": "foo", "name": 
> "foo", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, 
> "field_name": "event_id", "name": "event_id", "numpy_type": "int64", 
> "pandas_type": "int64"}], "column_indexes": [{"metadata": null, "field_name": 
> null, "name": null, "numpy_type": "object", "pandas_type": "bytes"}]}'}
> vs
> bar: string
> different: null
> foo: double
> event_id: int64
> metadata
> 
> {'pandas': '{"pandas_version": "0.23.3", "index_columns": ["event_id"], 
> "columns": [{"metadata": null, "field_name": "bar", "name": "bar", 
> "numpy_type": "object", "pandas_type": "unicode"}, {"metadata": null, 
> "field_name": "different", "name": "different", "numpy_type": "object", 
> "pandas_type": "empty"}, {"metadata": null, "field_name": "foo", "name": 
> "foo", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, 
> "field_name": "event_id", "name": "event_id", "numpy_type": "int64", 
> "pandas_type": "int64"}], "column_indexes": [{"metadata": null, "field_name": 
> null, "name": null, "numpy_type": "object", "pandas_type": "bytes"}]}'}
> {code}
> Apparently what is happening is that pyarrow is interpreting the schema from 
> each of the partitions individually and the partitions for `event_type=3 / 
> event_date=*`  both have values for the column `different` whereas the other 
> columns do not. The discrepancy causes the `None` values of the other 
> partitions to be labeled as `pandas_type` `empty` instead of `unicode`.
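The conflict boils down to schema unification: a column that is all-None in one partition infers as a null/"empty" type there. A tolerant comparison would promote null to the concrete type seen elsewhere instead of failing. A minimal sketch (hypothetical `unify_field` helper, not pyarrow's `validate_schemas`):

```python
# Sketch of null-type promotion during schema unification: a null-typed
# field is compatible with any concrete type; two different concrete
# types remain an error.

def unify_field(type_a, type_b):
    if type_a == "null":
        return type_b
    if type_b == "null":
        return type_a
    if type_a != type_b:
        raise ValueError(f"incompatible types: {type_a} vs {type_b}")
    return type_a

schema_a = {"bar": "string", "different": "string", "foo": "double"}
schema_b = {"bar": "string", "different": "null", "foo": "double"}
unified = {k: unify_field(schema_a[k], schema_b[k]) for k in schema_a}
print(unified)  # {'bar': 'string', 'different': 'string', 'foo': 'double'}
```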





[jira] [Commented] (ARROW-465) [C++] Investigate usage of madvise

2019-03-30 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16806002#comment-16806002
 ] 

Wes McKinney commented on ARROW-465:


We would need a benchmark that exhibits the page-faulting behavior. 
Since this issue was opened over 2 years ago, it is a bit stale

> [C++] Investigate usage of madvise 
> ---
>
> Key: ARROW-465
> URL: https://issues.apache.org/jira/browse/ARROW-465
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Uwe L. Korn
>Priority: Major
> Fix For: 0.14.0
>
>
> In some use cases (e.g. Pandas->Arrow conversion) the main constraint is 
> page-faulting pages that have not yet been accessed. 
> With {{madvise}} we can indicate our planned access pattern to the OS and may 
> improve the performance a bit in these cases.
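A small sketch of the idea using Python's {{mmap}} wrapper around madvise (an illustration of the hint, not Arrow code; `mmap.madvise` and the `MADV_*` constants exist only on platforms that expose madvise(), hence the guards):

```python
import mmap

# Advise the OS that an anonymous mapping will be accessed sequentially,
# so it can prefetch pages ahead of the first touch instead of taking a
# cold fault on every page.
def touch_with_advice(num_bytes):
    buf = mmap.mmap(-1, num_bytes)  # anonymous mapping, pages not yet faulted in
    if hasattr(buf, "madvise") and hasattr(mmap, "MADV_SEQUENTIAL"):
        buf.madvise(mmap.MADV_SEQUENTIAL)  # hint: sequential access pattern
    # Writing one byte per page forces the faults; with the hint the
    # kernel may batch them readahead-style.
    for off in range(0, num_bytes, mmap.PAGESIZE):
        buf[off] = 1
    return buf

buf = touch_with_advice(16 * mmap.PAGESIZE)
print(len(buf))
```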





[jira] [Closed] (ARROW-2743) [Java] Travis CI test scripts did not catch POM file bug fixed in ARROW-2727

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-2743.
---
Resolution: Cannot Reproduce

Closing for now until the issue recurs

> [Java] Travis CI test scripts did not catch POM file bug fixed in ARROW-2727
> 
>
> Key: ARROW-2743
> URL: https://issues.apache.org/jira/browse/ARROW-2743
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Wes McKinney
>Assignee: Bryan Cutler
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> This bug was introduced in ARROW-1780. It is unclear why the bug was not 
> triggered in Travis CI; we should see about fixing that





[jira] [Commented] (ARROW-4301) [Java][Gandiva] Maven snapshot version update does not seem to update Gandiva submodule

2019-03-30 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16805996#comment-16805996
 ] 

Wes McKinney commented on ARROW-4301:
-

[~kou] FYI -- master build will be broken after you rebase once the release 
vote closes. See https://github.com/apache/arrow/pull/3435 for the past fix

> [Java][Gandiva] Maven snapshot version update does not seem to update Gandiva 
> submodule
> ---
>
> Key: ARROW-4301
> URL: https://issues.apache.org/jira/browse/ARROW-4301
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Gandiva, Java
>Reporter: Wes McKinney
>Assignee: Praveen Kumar Desabandu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> See 
> https://github.com/apache/arrow/commit/a486db8c1476be1165981c4fe22996639da8e550.
>  This is breaking the build so I'm going to patch manually





[jira] [Reopened] (ARROW-4301) [Java][Gandiva] Maven snapshot version update does not seem to update Gandiva submodule

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reopened ARROW-4301:
-

> [Java][Gandiva] Maven snapshot version update does not seem to update Gandiva 
> submodule
> ---
>
> Key: ARROW-4301
> URL: https://issues.apache.org/jira/browse/ARROW-4301
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Gandiva, Java
>Reporter: Wes McKinney
>Assignee: Praveen Kumar Desabandu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> See 
> https://github.com/apache/arrow/commit/a486db8c1476be1165981c4fe22996639da8e550.
>  This is breaking the build so I'm going to patch manually





[jira] [Commented] (ARROW-4301) [Java][Gandiva] Maven snapshot version update does not seem to update Gandiva submodule

2019-03-30 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16805995#comment-16805995
 ] 

Wes McKinney commented on ARROW-4301:
-

Reopening, this did not work for the 0.13 release either

https://github.com/apache/arrow/tree/apache-arrow-0.13.0/java

https://github.com/apache/arrow/commit/dfb9e7af3cd92722893a3819b6676dfdef08f896

> [Java][Gandiva] Maven snapshot version update does not seem to update Gandiva 
> submodule
> ---
>
> Key: ARROW-4301
> URL: https://issues.apache.org/jira/browse/ARROW-4301
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Gandiva, Java
>Reporter: Wes McKinney
>Assignee: Praveen Kumar Desabandu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> See 
> https://github.com/apache/arrow/commit/a486db8c1476be1165981c4fe22996639da8e550.
>  This is breaking the build so I'm going to patch manually





[jira] [Updated] (ARROW-4301) [Java][Gandiva] Maven snapshot version update does not seem to update Gandiva submodule

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4301:

Fix Version/s: (was: 0.13.0)
   0.14.0

> [Java][Gandiva] Maven snapshot version update does not seem to update Gandiva 
> submodule
> ---
>
> Key: ARROW-4301
> URL: https://issues.apache.org/jira/browse/ARROW-4301
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Gandiva, Java
>Reporter: Wes McKinney
>Assignee: Praveen Kumar Desabandu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> See 
> https://github.com/apache/arrow/commit/a486db8c1476be1165981c4fe22996639da8e550.
>  This is breaking the build so I'm going to patch manually





[jira] [Updated] (ARROW-2572) [Python] Add factory function to create a Table from Columns and Schema.

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2572:

Fix Version/s: 0.14.0

> [Python] Add factory function to create a Table from Columns and Schema.
> 
>
> Key: ARROW-2572
> URL: https://issues.apache.org/jira/browse/ARROW-2572
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.9.0
>Reporter: Thomas Buhrmann
>Priority: Minor
>  Labels: beginner
> Fix For: 0.14.0
>
>
> At the moment it seems to be impossible in Python to add custom metadata to a 
> Table or Column. The closest I've come is to create a list of new Fields (by 
> "appending" metadata to existing Fields), and then creating a new Schema from 
> these Fields using the Schema factory function. But I can't see how to create 
> a new table from the existing Columns and my new Schema, which I understand 
> would be the way to do it in C++?
> Essentially, wrappers for the Table's Make(...) functions seem to be missing.





[jira] [Commented] (ARROW-2572) [Python] Add factory function to create a Table from Columns and Schema.

2019-03-30 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16805981#comment-16805981
 ] 

Wes McKinney commented on ARROW-2572:
-

Do you want to try contributing a patch?

> [Python] Add factory function to create a Table from Columns and Schema.
> 
>
> Key: ARROW-2572
> URL: https://issues.apache.org/jira/browse/ARROW-2572
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.9.0
>Reporter: Thomas Buhrmann
>Priority: Minor
>  Labels: beginner
> Fix For: 0.14.0
>
>
> At the moment it seems to be impossible in Python to add custom metadata to a 
> Table or Column. The closest I've come is to create a list of new Fields (by 
> "appending" metadata to existing Fields), and then creating a new Schema from 
> these Fields using the Schema factory function. But I can't see how to create 
> a new table from the existing Columns and my new Schema, which I understand 
> would be the way to do it in C++?
> Essentially, wrappers for the Table's Make(...) functions seem to be missing.





[jira] [Updated] (ARROW-2512) [Python] Enable direct interaction of GPU Objects in Python

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2512:

Summary: [Python] Enable direct interaction of GPU Objects in Python  (was: 
[Python ]Enable direct interaction of GPU Objects in Python)

> [Python] Enable direct interaction of GPU Objects in Python
> ---
>
> Key: ARROW-2512
> URL: https://issues.apache.org/jira/browse/ARROW-2512
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Plasma, GPU, Python
>Reporter: William Paul
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Plasma can now manage objects on the GPU, but in order to use this 
> functionality in Python, there needs to be some way to represent these GPU 
> objects in Python that allows computation on the GPU.
> The easiest way to enable this is to rely on a third party library, such as 
> Pytorch, which will allow us to use all of its existing functionality.





[jira] [Updated] (ARROW-2358) [C++][Python] API for Writing to Multiple Feather Files

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2358:

Summary: [C++][Python] API for Writing to Multiple Feather Files  (was: API 
for Writing to Multiple Feather Files)

> [C++][Python] API for Writing to Multiple Feather Files
> ---
>
> Key: ARROW-2358
> URL: https://issues.apache.org/jira/browse/ARROW-2358
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C, C++, Python
>Affects Versions: 0.9.0
>Reporter: Dhruv Madeka
>Priority: Minor
>
> It would be really great to have an API which can write a Table to a 
> `FeatherDataset`. Essentially, taking a name for a file - it would split the 
> table into N-equal parts (which could be determined by the user or the code) 
> and then write the data to N files with a suffix (which is `_part` by default 
> but could be user-specified).
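The chunking described above can be sketched in plain Python; `write_partitioned` and the `_part` suffix handling are illustrative names (CSV stands in for the Feather format here), not a proposed pyarrow API:

```python
import csv
import math
import os
import tempfile

def write_partitioned(rows, header, basename, n_parts, suffix="_part"):
    """Split rows into n_parts roughly equal contiguous chunks and write each
    chunk to its own file named <basename><suffix><i>.csv; return the paths."""
    chunk_size = math.ceil(len(rows) / n_parts)
    paths = []
    for i in range(n_parts):
        chunk = rows[i * chunk_size:(i + 1) * chunk_size]
        path = "%s%s%d.csv" % (basename, suffix, i)
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(chunk)
        paths.append(path)
    return paths

tmpdir = tempfile.mkdtemp()
paths = write_partitioned([[i, i * i] for i in range(10)], ["x", "x_squared"],
                          os.path.join(tmpdir, "table"), n_parts=3)
print([os.path.basename(p) for p in paths])
# ['table_part0.csv', 'table_part1.csv', 'table_part2.csv']
```

With 10 rows and 3 parts the chunk size is ceil(10/3) = 4, so the last file holds the 2 remaining rows.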





[jira] [Updated] (ARROW-2186) [C++] Clean up architecture specific compiler flags

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2186:

Fix Version/s: 0.14.0

> [C++] Clean up architecture specific compiler flags
> ---
>
> Key: ARROW-2186
> URL: https://issues.apache.org/jira/browse/ARROW-2186
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.14.0
>
>
> I noticed that {{-maltivec}} is being passed to the compiler on Linux, with 
> an x86_64 processor. That seemed odd to me. It prompted me to look more 
> generally at our compiler flags related to hardware optimizations. We have 
> the ability to pass {{-msse3}}, but there is a {{ARROW_USE_SSE}} which is 
> only used as a define in some headers. There is {{ARROW_ALTIVEC}}, but no 
> option to pass {{-march}}. Nothing related to AVX/AVX2/AVX512. I think this 
> could do with an overhaul.





[jira] [Updated] (ARROW-2164) [C++] Clean up unnecessary decimal module refs

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2164:

Fix Version/s: 0.14.0

> [C++] Clean up unnecessary decimal module refs
> --
>
> Key: ARROW-2164
> URL: https://issues.apache.org/jira/browse/ARROW-2164
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
> Fix For: 0.14.0
>
>
> See this comment: 
> https://github.com/apache/arrow/pull/1610#discussion_r168533239





[jira] [Commented] (ARROW-2186) [C++] Clean up architecture specific compiler flags

2019-03-30 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16805979#comment-16805979
 ] 

Wes McKinney commented on ARROW-2186:
-

Is this fixed now maybe?

> [C++] Clean up architecture specific compiler flags
> ---
>
> Key: ARROW-2186
> URL: https://issues.apache.org/jira/browse/ARROW-2186
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.14.0
>
>
> I noticed that {{-maltivec}} is being passed to the compiler on Linux, with 
> an x86_64 processor. That seemed odd to me. It prompted me to look more 
> generally at our compiler flags related to hardware optimizations. We have 
> the ability to pass {{-msse3}}, but there is a {{ARROW_USE_SSE}} which is 
> only used as a define in some headers. There is {{ARROW_ALTIVEC}}, but no 
> option to pass {{-march}}. Nothing related to AVX/AVX2/AVX512. I think this 
> could do with an overhaul.





[jira] [Closed] (ARROW-1880) [Python] Plasma test flakiness in Travis CI

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-1880.
---
Resolution: Cannot Reproduce

These issues seem to not be occurring lately

> [Python] Plasma test flakiness in Travis CI
> ---
>
> Key: ARROW-1880
> URL: https://issues.apache.org/jira/browse/ARROW-1880
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>
> We've been seeing intermittent flakiness of the variety:
> {code}
>  ERRORS 
> 
> __ ERROR at setup of TestPlasmaClient.test_use_one_memory_mapped_file 
> __
> self = 
> test_method =  of >
> def setup_method(self, test_method):
>     use_one_memory_mapped_file = (test_method ==
>                                   self.test_use_one_memory_mapped_file)
> 
>     import pyarrow.plasma as plasma
>     # Start Plasma store.
>     plasma_store_name, self.p = start_plasma_store(
>         use_valgrind=os.getenv("PLASMA_VALGRIND") == "1",
>         use_one_memory_mapped_file=use_one_memory_mapped_file)
>     # Connect to Plasma.
> >   self.plasma_client = plasma.connect(plasma_store_name, "", 64)
> pyarrow-test-3.6/lib/python3.6/site-packages/pyarrow/tests/test_plasma.py:164:
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> plasma.pyx:672: in pyarrow.plasma.connect
> ???
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> >   ???
> E   pyarrow.lib.ArrowIOError: Could not connect to socket /tmp/plasma_store43998835
> {code}





[jira] [Updated] (ARROW-1846) [C++] Implement "any" reduction kernel for boolean data, with the ability to short circuit when applying on chunked data

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1846:

Labels: analytics  (was: )

> [C++] Implement "any" reduction kernel for boolean data, with the ability to 
> short circuit when applying on chunked data
> 
>
> Key: ARROW-1846
> URL: https://issues.apache.org/jira/browse/ARROW-1846
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: analytics
> Fix For: 0.14.0
>
>






[jira] [Updated] (ARROW-1798) [C++] Implement x86 SIMD-accelerated binary arithmetic kernels

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1798:

Labels: analytics  (was: )

> [C++] Implement x86 SIMD-accelerated binary arithmetic kernels
> --
>
> Key: ARROW-1798
> URL: https://issues.apache.org/jira/browse/ARROW-1798
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: analytics
>






[jira] [Updated] (ARROW-1797) [C++] Implement binary arithmetic kernels for numeric arrays

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1797:

Labels: analytics  (was: )

> [C++] Implement binary arithmetic kernels for numeric arrays
> 
>
> Key: ARROW-1797
> URL: https://issues.apache.org/jira/browse/ARROW-1797
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: analytics
> Fix For: 0.14.0
>
>






[jira] [Closed] (ARROW-1843) Merge tool occasionally leaves JIRAs in an invalid state

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-1843.
---
Resolution: Not A Problem

Haven't seen this in a long time. I'm going to assume it was a hiccup with the 
ASF JIRA instance

> Merge tool occasionally leaves JIRAs in an invalid state
> 
>
> Key: ARROW-1843
> URL: https://issues.apache.org/jira/browse/ARROW-1843
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Wes McKinney
>Priority: Major
>
> I have been noticing some patches are getting left in "In Progress" or "Open" 
> state but in the web UI, the JIRA appears to be resolved. I have been having 
> to reopen these issues, then press "Resolve" in the web UI. This isn't 
> happening 100% of the time, but has happened several times today





[jira] [Updated] (ARROW-1761) [C++] Multi argument operator kernel behavior for decimal columns

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1761:

Summary: [C++] Multi argument operator kernel behavior for decimal columns  
(was: Multi argument operator kernel behavior for decimal columns)

> [C++] Multi argument operator kernel behavior for decimal columns
> -
>
> Key: ARROW-1761
> URL: https://issues.apache.org/jira/browse/ARROW-1761
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Java
>Affects Versions: 0.7.1
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>
> This is a JIRA to discuss the behavior of operator kernels that require more 
> than one decimal column input where the column types have a different 
> {{scale}} parameter.
> For example:
> {code}
> a: decimal(12, 2)
> b: decimal(10, 3)
> c = a + b
> {code}
> Arithmetic is the primary use case, but anything that needs to efficiently 
> operate on decimal columns with different scales would require this 
> functionality.
> I imagine that [~jnadeau] and folks at Dremio have thought about and solved 
> the problem in Java. If so, we should consider implementing this behavior in 
> C++. Otherwise, I'll do a bit of reading and digging to see how existing 
> systems efficiently handle this problem.
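For concreteness, Python's stdlib `decimal` module shows the usual scale rule for the example above: exact addition of operands with scale 2 and scale 3 yields a result of scale max(2, 3) = 3, the convention most SQL engines adopt for decimal addition. This only illustrates the semantics under discussion, not a proposed kernel:

```python
from decimal import Decimal

# Two "columns" with the scales from the example in the issue:
# a: decimal(12, 2), b: decimal(10, 3)
a = [Decimal("123456.78"), Decimal("0.10")]
b = [Decimal("9.123"), Decimal("-4.500")]

# Exact addition aligns the operands to the larger scale, so every result
# carries scale max(2, 3) = 3; Python's Decimal preserves that exponent.
c = [x + y for x, y in zip(a, b)]
print(c)  # [Decimal('123465.903'), Decimal('-4.400')]
assert all(v.as_tuple().exponent == -3 for v in c)
```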





[jira] [Updated] (ARROW-1699) [C++] Forward, backward fill kernel functions

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1699:

Labels: analytics  (was: )

> [C++] Forward, backward fill kernel functions
> -
>
> Key: ARROW-1699
> URL: https://issues.apache.org/jira/browse/ARROW-1699
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: analytics
>
> Like ffill / bfill in pandas (with limit)
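The requested semantics ("Like ffill / bfill in pandas (with limit)") can be sketched in plain Python; this is illustrative pseudologic for the kernel, not the eventual C++ implementation (null slots are modelled as `None`):

```python
def ffill(values, limit=None):
    """Forward-fill None entries with the last valid value, propagating it
    at most `limit` positions (as pandas ffill does with a limit)."""
    out, last, run = [], None, 0
    for v in values:
        if v is None:
            run += 1
            out.append(last if limit is None or run <= limit else None)
        else:
            last, run = v, 0
            out.append(v)
    return out

def bfill(values, limit=None):
    """Backward-fill is forward-fill applied to the reversed sequence."""
    return ffill(values[::-1], limit)[::-1]

print(ffill([1, None, None, None, 5], limit=2))  # [1, 1, 1, None, 5]
print(bfill([None, 2, None, None, None]))        # [2, 2, None, None, None]
```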





[jira] [Updated] (ARROW-1644) [C++][Parquet] Read and write nested Parquet data with a mix of struct and list nesting levels

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1644:

Component/s: C++

> [C++][Parquet] Read and write nested Parquet data with a mix of struct and 
> list nesting levels
> --
>
> Key: ARROW-1644
> URL: https://issues.apache.org/jira/browse/ARROW-1644
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Affects Versions: 0.8.0
>Reporter: DB Tsai
>Assignee: Joshua Storck
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 0.14.0
>
>
> We have many nested parquet files generated from Apache Spark for ranking 
> problems, and we would like to load them in python for other programs to 
> consume. 
> The schema looks like 
> {code:java}
> root
>  |-- profile_id: long (nullable = true)
>  |-- country_iso_code: string (nullable = true)
>  |-- items: array (nullable = false)
>  ||-- element: struct (containsNull = false)
>  |||-- show_title_id: integer (nullable = true)
>  |||-- duration: double (nullable = true)
> {code}
> And when I tried to load it with nightly build pyarrow on Oct 4, 2017, I got 
> the following error.
> {code:python}
> Python 3.6.2 |Anaconda, Inc.| (default, Sep 30 2017, 18:42:57) 
> [GCC 7.2.0] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import numpy as np
> >>> import pandas as pd
> >>> import pyarrow as pa
> >>> import pyarrow.parquet as pq
> >>> table2 = pq.read_table('part-0')
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", 
> line 823, in read_table
> use_pandas_metadata=use_pandas_metadata)
>   File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", 
> line 119, in read
> nthreads=nthreads)
>   File "_parquet.pyx", line 466, in pyarrow._parquet.ParquetReader.read_all
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: lists with structs are not supported.
> {code}
> I somehow get the impression that after 
> https://issues.apache.org/jira/browse/PARQUET-911 is merged, we should be 
> able to load the nested parquet in pyarrow. 
> Any insight about this? 
> Thanks.





[jira] [Updated] (ARROW-1644) [C++][Parquet] Read and write nested Parquet data with a mix of struct and list nesting levels

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1644:

Summary: [C++][Parquet] Read and write nested Parquet data with a mix of 
struct and list nesting levels  (was: [Python] Read and write nested Parquet 
data with a mix of struct and list nesting levels)

> [C++][Parquet] Read and write nested Parquet data with a mix of struct and 
> list nesting levels
> --
>
> Key: ARROW-1644
> URL: https://issues.apache.org/jira/browse/ARROW-1644
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: DB Tsai
>Assignee: Joshua Storck
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 0.14.0
>
>
> We have many nested parquet files generated from Apache Spark for ranking 
> problems, and we would like to load them in python for other programs to 
> consume. 
> The schema looks like 
> {code:java}
> root
>  |-- profile_id: long (nullable = true)
>  |-- country_iso_code: string (nullable = true)
>  |-- items: array (nullable = false)
>  ||-- element: struct (containsNull = false)
>  |||-- show_title_id: integer (nullable = true)
>  |||-- duration: double (nullable = true)
> {code}
> And when I tried to load it with nightly build pyarrow on Oct 4, 2017, I got 
> the following error.
> {code:python}
> Python 3.6.2 |Anaconda, Inc.| (default, Sep 30 2017, 18:42:57) 
> [GCC 7.2.0] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import numpy as np
> >>> import pandas as pd
> >>> import pyarrow as pa
> >>> import pyarrow.parquet as pq
> >>> table2 = pq.read_table('part-0')
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", 
> line 823, in read_table
> use_pandas_metadata=use_pandas_metadata)
>   File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", 
> line 119, in read
> nthreads=nthreads)
>   File "_parquet.pyx", line 466, in pyarrow._parquet.ParquetReader.read_all
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: lists with structs are not supported.
> {code}
> I somehow get the impression that after 
> https://issues.apache.org/jira/browse/PARQUET-911 is merged, we should be 
> able to load the nested parquet in pyarrow. 
> Any insight about this? 
> Thanks.





[jira] [Updated] (ARROW-1599) [C++][Parquet] Unable to read Parquet files with list inside struct

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1599:

Summary: [C++][Parquet] Unable to read Parquet files with list inside 
struct  (was: [Python] Unable to read Parquet files with list inside struct)

> [C++][Parquet] Unable to read Parquet files with list inside struct
> ---
>
> Key: ARROW-1599
> URL: https://issues.apache.org/jira/browse/ARROW-1599
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.7.0
> Environment: Ubuntu
>Reporter: Jovann Kung
>Assignee: Joshua Storck
>Priority: Major
>  Labels: parquet
> Fix For: 0.14.0
>
>
> Is PyArrow currently unable to read in Parquet files with a vector as a 
> column? For example, the schema of such a file is below:
> {{
> mbc: FLOAT
> deltae: FLOAT
> labels: FLOAT
> features.type: INT32 INT_8
> features.size: INT32
> features.indices.list.element: INT32
> features.values.list.element: DOUBLE}}
> Using either pq.read_table() or pq.ParquetDataset('/path/to/parquet').read() 
> yields the following error: ArrowNotImplementedError: Currently only nesting 
> with Lists is supported.
> From the error I assume that this may be implemented in a future release?





[jira] [Updated] (ARROW-1599) [C++][Parquet] Unable to read Parquet files with list inside struct

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1599:

Component/s: C++

> [C++][Parquet] Unable to read Parquet files with list inside struct
> ---
>
> Key: ARROW-1599
> URL: https://issues.apache.org/jira/browse/ARROW-1599
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.7.0
> Environment: Ubuntu
>Reporter: Jovann Kung
>Assignee: Joshua Storck
>Priority: Major
>  Labels: parquet
> Fix For: 0.14.0
>
>
> Is PyArrow currently unable to read in Parquet files with a vector as a 
> column? For example, the schema of such a file is below:
> {{
> mbc: FLOAT
> deltae: FLOAT
> labels: FLOAT
> features.type: INT32 INT_8
> features.size: INT32
> features.indices.list.element: INT32
> features.values.list.element: DOUBLE}}
> Using either pq.read_table() or pq.ParquetDataset('/path/to/parquet').read() 
> yields the following error: ArrowNotImplementedError: Currently only nesting 
> with Lists is supported.
> From the error I assume that this may be implemented in a future release?





[jira] [Updated] (ARROW-1894) [Python] Treat CPython memoryview or buffer objects equivalently to pyarrow.Buffer in pyarrow.serialize

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1894:

Labels:   (was: beginner)

> [Python] Treat CPython memoryview or buffer objects equivalently to 
> pyarrow.Buffer in pyarrow.serialize
> ---
>
> Key: ARROW-1894
> URL: https://issues.apache.org/jira/browse/ARROW-1894
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.14.0
>
>
> These should be treated as Buffer-like on serialize. We should consider how 
> to "box" the buffers as the appropriate kind of object (Buffer, memoryview, 
> etc.) when being deserialized





[jira] [Closed] (ARROW-1552) [C++] Enable Arrow production builds on Linux / macOS without Boost dependency

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-1552.
---
Resolution: Won't Fix

I'm not sure this is the best use of our time. If someone comes along and wants 
to work on this, please feel free

> [C++] Enable Arrow production builds on Linux / macOS without Boost dependency
> --
>
> Key: ARROW-1552
> URL: https://issues.apache.org/jira/browse/ARROW-1552
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> Currently, we use Boost on a very limited basis. We should consider making 
> Boost a non-dependency on POSIX-based systems (i.e. we can continue to use 
> boost::filesystem on Windows), and still use Boost where useful in the test 
> suite. 





[jira] [Updated] (ARROW-1389) [Python] Support arbitrary precision integers

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1389:

Component/s: Python

> [Python] Support arbitrary precision integers
> -
>
> Key: ARROW-1389
> URL: https://issues.apache.org/jira/browse/ARROW-1389
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Philipp Moritz
>Priority: Minor
>
> For Python serialization it would be great if we had Arrow support for 
> arbitrary precision integers, see the comment in
> https://github.com/apache/arrow/blob/de7c6715ba244e119913bfa31b8de46dbbd450bf/python/pyarrow/tests/test_serialization.py#L183
> Long integers are for example used in the uuid python module and having this 
> would increase serialization performance for uuids and also make the code 
> cleaner.
> I wonder if this is more generally useful too, any thoughts?
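One way arbitrary-precision ints can be flattened to bytes for serialization — a sketch using a hypothetical sign-byte plus little-endian-magnitude layout, not Arrow's actual representation:

```python
def serialize_bigint(n):
    """Encode an arbitrary-precision int as one sign byte followed by
    little-endian magnitude bytes. Purely illustrative wire format."""
    mag = abs(n)
    nbytes = max(1, (mag.bit_length() + 7) // 8)
    return bytes([1 if n < 0 else 0]) + mag.to_bytes(nbytes, "little")

def deserialize_bigint(buf):
    mag = int.from_bytes(buf[1:], "little")
    return -mag if buf[0] else mag

u = 0x12345678123456781234567812345678  # a 128-bit value, e.g. uuid.UUID.int
assert deserialize_bigint(serialize_bigint(u)) == u
assert deserialize_bigint(serialize_bigint(-u)) == -u
```

Because Python ints are unbounded, the same round trip works for values far wider than 128 bits.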





[jira] [Updated] (ARROW-1389) [Python] Support arbitrary precision integers

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1389:

Summary: [Python] Support arbitrary precision integers  (was: Support 
arbitrary precision integers)

> [Python] Support arbitrary precision integers
> -
>
> Key: ARROW-1389
> URL: https://issues.apache.org/jira/browse/ARROW-1389
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Philipp Moritz
>Priority: Minor
>
> For Python serialization it would be great if we had Arrow support for 
> arbitrary precision integers, see the comment in
> https://github.com/apache/arrow/blob/de7c6715ba244e119913bfa31b8de46dbbd450bf/python/pyarrow/tests/test_serialization.py#L183
> Long integers are for example used in the uuid python module and having this 
> would increase serialization performance for uuids and also make the code 
> cleaner.
> I wonder if this is more generally useful too, any thoughts?





[jira] [Updated] (ARROW-1059) [C++] Define API for embedding user-defined metadata / Flatbuffer message types in Arrow IPC machinery

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1059:

Fix Version/s: 0.15.0

> [C++] Define API for embedding user-defined metadata / Flatbuffer message 
> types in Arrow IPC machinery
> --
>
> Key: ARROW-1059
> URL: https://issues.apache.org/jira/browse/ARROW-1059
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.15.0
>
>
> Currently, the {{MessageHeader}} Flatbuffer union must be modified to 
> serialize new kinds of metadata:
> https://github.com/apache/arrow/blob/master/format/Message.fbs#L85
> It would be interesting if user metadata could be embedded within a 
> particular application that wishes to use the Arrow C++ libraries' zero-copy 
> IPC machinery for serialization of other kinds of data structures. 
> As one approach, the message metadata could be an application-dependent 
> unique identifier for the user defined type, which would internally dispatch 
> to an implementation of an abstract deserializer interface. So in addition to 
> describing the serialized representation of the user type, we also will have 
> to create the abstract API for the user to implement so that the code in 
> {{arrow/ipc}} can be configured to dispatch appropriately. 
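The registry-dispatch idea described above can be sketched in a few lines: applications register a deserializer under an application-chosen identifier, and the IPC layer looks the identifier up when it encounters a user-defined message type. The names `register_type` and `dispatch` are illustrative, not a proposed API:

```python
# Maps application-defined type identifiers to deserializer callables.
_DESERIALIZERS = {}

def register_type(type_id, deserializer):
    _DESERIALIZERS[type_id] = deserializer

def dispatch(type_id, payload):
    """Look up the registered deserializer for type_id and apply it."""
    if type_id not in _DESERIALIZERS:
        raise ValueError("no deserializer registered for %r" % type_id)
    return _DESERIALIZERS[type_id](payload)

register_type("myapp.point",
              lambda buf: tuple(int(x) for x in buf.split(b",")))
print(dispatch("myapp.point", b"3,4"))  # (3, 4)
```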





[jira] [Commented] (ARROW-843) [C++] Parquet merging unequal but equivalent schemas

2019-03-30 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16805967#comment-16805967
 ] 

Wes McKinney commented on ARROW-843:


I changed this to a C++ issue since a lot of the datasets logic will be 
migrated from Python to C++.

> [C++] Parquet merging unequal but equivalent schemas
> 
>
> Key: ARROW-843
> URL: https://issues.apache.org/jira/browse/ARROW-843
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: dataset, parquet
> Fix For: 0.14.0
>
>
> Some Parquet datasets may contain schemas with mixed REQUIRED/OPTIONAL 
> repetition types. While such schemas aren't strictly equal, we will need to 
> consider them equivalent on the read path
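A sketch of the "equal up to repetition" check in Python (the `Field` and `schemas_equivalent` names are illustrative; the real read path lives in C++):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Field:
    name: str
    type: str
    required: bool  # Parquet REQUIRED vs. OPTIONAL repetition

def schemas_equivalent(left, right):
    """Equal up to nullability: same field names and types, in order,
    ignoring the REQUIRED/OPTIONAL flag. Illustrative only."""
    return len(left) == len(right) and all(
        a.name == b.name and a.type == b.type for a, b in zip(left, right))

s1 = [Field("id", "int64", True), Field("name", "utf8", False)]
s2 = [Field("id", "int64", False), Field("name", "utf8", False)]
assert s1 != s2                    # not strictly equal...
assert schemas_equivalent(s1, s2)  # ...but equivalent for reading
```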





[jira] [Updated] (ARROW-843) [C++] Parquet merging unequal but equivalent schemas

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-843:
---
Labels: dataset parquet  (was: parquet)

> [C++] Parquet merging unequal but equivalent schemas
> 
>
> Key: ARROW-843
> URL: https://issues.apache.org/jira/browse/ARROW-843
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: dataset, parquet
> Fix For: 0.14.0
>
>
> Some Parquet datasets may contain schemas with mixed REQUIRED/OPTIONAL 
> repetition types. While such schemas aren't strictly equal, we will need to 
> consider them equivalent on the read path





[jira] [Updated] (ARROW-843) [C++] Parquet merging unequal but equivalent schemas

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-843:
---
Component/s: (was: Python)
 C++

> [C++] Parquet merging unequal but equivalent schemas
> 
>
> Key: ARROW-843
> URL: https://issues.apache.org/jira/browse/ARROW-843
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: dataset, parquet
> Fix For: 0.14.0
>
>
> Some Parquet datasets may contain schemas with mixed REQUIRED/OPTIONAL 
> repetition types. While such schemas aren't strictly equal, we will need to 
> consider them equivalent on the read path





[jira] [Updated] (ARROW-843) [C++] Parquet merging unequal but equivalent schemas

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-843:
---
Summary: [C++] Parquet merging unequal but equivalent schemas  (was: 
[Python] Parquet merging unequal but equivalent schemas)

> [C++] Parquet merging unequal but equivalent schemas
> 
>
> Key: ARROW-843
> URL: https://issues.apache.org/jira/browse/ARROW-843
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: parquet
> Fix For: 0.14.0
>
>
> Some Parquet datasets may contain schemas with mixed REQUIRED/OPTIONAL 
> repetition types. While such schemas aren't strictly equal, we will need to 
> consider them equivalent on the read path





[jira] [Updated] (ARROW-840) [Python] Provide Python API for creating user-defined data types that can survive Arrow IPC

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-840:
---
Fix Version/s: 0.14.0

> [Python] Provide Python API for creating user-defined data types that can 
> survive Arrow IPC
> ---
>
> Key: ARROW-840
> URL: https://issues.apache.org/jira/browse/ARROW-840
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.14.0
>
>
> The user will provide:
> * Data type subclass that can indicate the physical storage type
> * "get state" and "set state" functions for serializing custom metadata to 
> bytes
> * An optional function for "boxing" scalar values from the physical array 
> storage
> Internally, this will build on an analogous C++ API for defining user data 
> types





[jira] [Commented] (ARROW-840) [Python] Provide Python API for creating user-defined data types that can survive Arrow IPC

2019-03-30 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16805966#comment-16805966
 ] 

Wes McKinney commented on ARROW-840:


This is within reach now because of the C++ extension types

> [Python] Provide Python API for creating user-defined data types that can 
> survive Arrow IPC
> ---
>
> Key: ARROW-840
> URL: https://issues.apache.org/jira/browse/ARROW-840
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.14.0
>
>
> The user will provide:
> * Data type subclass that can indicate the physical storage type
> * "get state" and "set state" functions for serializing custom metadata to 
> bytes
> * An optional function for "boxing" scalar values from the physical array 
> storage
> Internally, this will build on an analogous C++ API for defining user data 
> types





[jira] [Commented] (ARROW-823) [Python] Devise a means to serialize arrays of arbitrary Python objects in Arrow IPC messages

2019-03-30 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16805964#comment-16805964
 ] 

Wes McKinney commented on ARROW-823:


This can be implemented now with ExtensionType

> [Python] Devise a means to serialize arrays of arbitrary Python objects in 
> Arrow IPC messages
> -
>
> Key: ARROW-823
> URL: https://issues.apache.org/jira/browse/ARROW-823
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>
> Practically speaking, this would involve a "custom" logical type that is 
> "pyobject", represented physically as an array of 64-bit pointers. On 
> serialization, this would need to be converted to a BinaryArray containing 
> pickled objects as binary values
> At the moment, we don't yet have the machinery to deal with "custom" types 
> where the in-memory representation is different from the on-wire 
> representation. This would be a useful use case to work through the design 
> issues
> Interestingly, if done properly, this would enable other Arrow 
> implementations to manipulate (filter, etc.) serialized Python objects as 
> binary blobs. 
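
The pickle-to-binary roundtrip sketched in the description can be prototyped with the standard library alone; the helper names here are illustrative, not part of pyarrow:

```python
import pickle

def to_binary_column(objects):
    """Serialize arbitrary Python objects to a list of binary blobs --
    the on-wire form a BinaryArray of pickled values would carry."""
    return [pickle.dumps(obj) for obj in objects]

def from_binary_column(blobs):
    """Inverse: unpickle each blob back into a Python object."""
    return [pickle.loads(b) for b in blobs]

data = [{"a": 1}, (2, 3), "four"]
roundtripped = from_binary_column(to_binary_column(data))
```

Because the serialized column is just binary values, another Arrow implementation could filter or slice it without ever unpickling.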





[jira] [Updated] (ARROW-823) [Python] Devise a means to serialize arrays of arbitrary Python objects in Arrow IPC messages

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-823:
---
Fix Version/s: 0.14.0

> [Python] Devise a means to serialize arrays of arbitrary Python objects in 
> Arrow IPC messages
> -
>
> Key: ARROW-823
> URL: https://issues.apache.org/jira/browse/ARROW-823
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.14.0
>
>
> Practically speaking, this would involve a "custom" logical type that is 
> "pyobject", represented physically as an array of 64-bit pointers. On 
> serialization, this would need to be converted to a BinaryArray containing 
> pickled objects as binary values
> At the moment, we don't yet have the machinery to deal with "custom" types 
> where the in-memory representation is different from the on-wire 
> representation. This would be a useful use case to work through the design 
> issues
> Interestingly, if done properly, this would enable other Arrow 
> implementations to manipulate (filter, etc.) serialized Python objects as 
> binary blobs. 





[jira] [Updated] (ARROW-792) [Java] Allow loading/unloading vectors without using FieldNodes

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-792:
---
Component/s: Java

> [Java] Allow loading/unloading vectors without using FieldNodes
> ---
>
> Key: ARROW-792
> URL: https://issues.apache.org/jira/browse/ARROW-792
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Steven Phillips
>Assignee: Steven Phillips
>Priority: Major
>
> The information stored in the FieldNode structure is not strictly necessary for 
> serializing/deserializing vectors. We should allow loading/unloading of 
> vectors without it.





[jira] [Updated] (ARROW-799) [Java] Provide guidance in documentation for using Arrow in an uberjar setting

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-799:
---
Component/s: Java

> [Java] Provide guidance in documentation for using Arrow in an uberjar 
> setting 
> ---
>
> Key: ARROW-799
> URL: https://issues.apache.org/jira/browse/ARROW-799
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Jingyuan Wang
>Assignee: Li Jin
>Priority: Major
>  Labels: beginner
>
> Currently, the ArrowBuf class directly accesses the package-private fields of 
> the AbstractByteBuf class, which makes shading Apache Arrow problematic. If we 
> relocate the io.netty namespace excluding io.netty.buffer.ArrowBuf, it would 
> throw an IllegalAccessException.





[jira] [Updated] (ARROW-730) [Format] Define Flatbuffers metadata for random-access compressed block memory format

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-730:
---
Fix Version/s: 0.15.0

> [Format] Define Flatbuffers metadata for random-access compressed block 
> memory format
> -
>
> Key: ARROW-730
> URL: https://issues.apache.org/jira/browse/ARROW-730
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Format
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.15.0
>
>
> It would be useful to be able to create a compressed buffer stream as a 
> series of fixed-size blocks, with metadata written at the footer of the 
> stream, so that random access is possible.
> {code}
> compressed_block[0]
> ...
> compressed_block[N-1]
> compression metadata
> metadata_size (int32)
> {code}
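
The layout quoted above can be prototyped in standard-library Python; zlib, the little-endian int32 footer, and the helper names are assumptions for illustration, not the proposed Flatbuffers metadata:

```python
import struct
import zlib

def write_stream(payload, block_size=4):
    """Compress fixed-size blocks, then append per-block sizes and a
    trailing int32 metadata size, mirroring the layout sketched above."""
    blocks = [zlib.compress(payload[i:i + block_size])
              for i in range(0, len(payload), block_size)]
    body = b"".join(blocks)
    # Metadata: block count followed by each compressed block's length.
    meta = struct.pack(f"<i{len(blocks)}i", len(blocks), *map(len, blocks))
    return body + meta + struct.pack("<i", len(meta))

def read_block(stream, index):
    """Random access: consult the footer to locate and decompress one block."""
    meta_size, = struct.unpack_from("<i", stream, len(stream) - 4)
    meta = stream[len(stream) - 4 - meta_size:len(stream) - 4]
    nblocks, = struct.unpack_from("<i", meta, 0)
    sizes = struct.unpack_from(f"<{nblocks}i", meta, 4)
    offset = sum(sizes[:index])
    return zlib.decompress(stream[offset:offset + sizes[index]])

stream = write_stream(b"abcdefgh", block_size=4)
```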





[jira] [Updated] (ARROW-791) [Java] Check if ArrowBuf is empty buffer in getActualConsumedMemory() and getPossibleConsumedMemory()

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-791:
---
Component/s: Java

> [Java] Check if ArrowBuf is empty buffer in getActualConsumedMemory() and 
> getPossibleConsumedMemory()
> -
>
> Key: ARROW-791
> URL: https://issues.apache.org/jira/browse/ARROW-791
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Steven Phillips
>Assignee: Steven Phillips
>Priority: Major
>
> Most of the methods related to memory accounting in ArrowBuf have special 
> handling for the case when the Buffer is the empty buffer instance. This 
> check is missing in these two methods.





[jira] [Updated] (ARROW-790) [Java] Fix getField() for NullableMapVector

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-790:
---
Component/s: Java

> [Java] Fix getField() for NullableMapVector
> ---
>
> Key: ARROW-790
> URL: https://issues.apache.org/jira/browse/ARROW-790
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Steven Phillips
>Assignee: Steven Phillips
>Priority: Major
>
> Needs to call super.getField() and return a nullable version of that field.





[jira] [Updated] (ARROW-473) [C++/Python] Add public API for retrieving block locations for a particular HDFS file

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-473:
---
Labels: hdfs pull-request-available  (was: pull-request-available)

> [C++/Python] Add public API for retrieving block locations for a particular 
> HDFS file
> -
>
> Key: ARROW-473
> URL: https://issues.apache.org/jira/browse/ARROW-473
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: hdfs, pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This is necessary for applications looking to schedule data-local work. 
> libhdfs does not have APIs to request the block locations directly, so we 
> need to see if the {{hdfsGetHosts}} function will do what we need. For 
> libhdfs3 there is a public API function 





[jira] [Commented] (ARROW-412) [Format] Handling of buffer padding in the IPC metadata

2019-03-30 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16805962#comment-16805962
 ] 

Wes McKinney commented on ARROW-412:


My inclination on this would be that the {{Buffer}} Flatbuffers struct reflects 
the intent of the materialized Buffer object in the client language. So if a 
sender of the protocol intends for the receiver to have a 64-byte padded 
buffer, then this padding should be included in the Buffer struct. 

I can propose some language in the Format documentation to make this clear.

> [Format] Handling of buffer padding in the IPC metadata
> ---
>
> Key: ARROW-412
> URL: https://issues.apache.org/jira/browse/ARROW-412
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Format
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.14.0
>
>
> See discussion in ARROW-399. Do we include padding bytes in the metadata or 
> set the actual used bytes? In the latter case, the padding would be a part of 
> the format (any buffers continue to be expected to be 64-byte padded, to 
> permit AVX512 instructions)
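
The 64-byte padding mentioned above amounts to rounding a buffer length up to the next multiple of 64; a one-line sketch (the function name is illustrative):

```python
def padded_length(nbytes, alignment=64):
    """Round a buffer length up to the next multiple of `alignment`.
    With alignment=64 this is the padding the format discussion refers to."""
    return ((nbytes + alignment - 1) // alignment) * alignment
```

The question in the issue is whether the metadata records this padded length or the actual used bytes, with padding implied by the format.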





[jira] [Updated] (ARROW-300) [Format] Add buffer compression option to IPC file format

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-300:
---
Fix Version/s: (was: 0.14.0)
   0.15.0

> [Format] Add buffer compression option to IPC file format
> -
>
> Key: ARROW-300
> URL: https://issues.apache.org/jira/browse/ARROW-300
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Format
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.15.0
>
>
> It may be useful if data is to be sent over the wire to compress the data 
> buffers themselves as they're being written in the file layout.
> I would propose that we keep this extremely simple with a global buffer 
> compression setting in the file Footer. Probably only two compressors worth 
> supporting out of the box would be zlib (higher compression ratios) and lz4 
> (better performance).
> What does everyone think?





[jira] [Updated] (ARROW-114) Bring in java-unsafe-tools as utility library for Arrow

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-114:
---
Component/s: Java

> Bring in java-unsafe-tools as utility library for Arrow
> ---
>
> Key: ARROW-114
> URL: https://issues.apache.org/jira/browse/ARROW-114
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Jacques Nadeau
>Priority: Minor
>
> Originally here:
> https://github.com/alexkasko/unsafe-tools 
> SGA signed off and received by Secretary.





[jira] [Updated] (ARROW-258) [Format] clarify definition of Buffer in context of RPC, IPC, File

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-258:
---
Fix Version/s: (was: 1.0.0)
   0.14.0

> [Format] clarify definition of Buffer in context of RPC, IPC, File
> --
>
> Key: ARROW-258
> URL: https://issues.apache.org/jira/browse/ARROW-258
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Format
>Reporter: Julien Le Dem
>Priority: Major
> Fix For: 0.14.0
>
>
> currently Buffer has a loosely defined page field used for shared memory only.
> https://github.com/apache/arrow/blob/34e7f48cb71428c4d78cf00d8fdf0045532d6607/format/Message.fbs#L109





[jira] [Updated] (ARROW-114) [Java] Bring in java-unsafe-tools as utility library for Arrow

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-114:
---
Summary: [Java] Bring in java-unsafe-tools as utility library for Arrow  
(was: [JBring in java-unsafe-tools as utility library for Arrow)

> [Java] Bring in java-unsafe-tools as utility library for Arrow
> --
>
> Key: ARROW-114
> URL: https://issues.apache.org/jira/browse/ARROW-114
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Jacques Nadeau
>Priority: Minor
>
> Originally here:
> https://github.com/alexkasko/unsafe-tools 
> SGA signed off and received by Secretary.





[jira] [Updated] (ARROW-241) [Java] Implement splitAndTransfer for UnionVector

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-241:
---
Summary: [Java] Implement splitAndTransfer for UnionVector  (was: Implement 
splitAndTransfer for UnionVector)

> [Java] Implement splitAndTransfer for UnionVector
> -
>
> Key: ARROW-241
> URL: https://issues.apache.org/jira/browse/ARROW-241
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Steven Phillips
>Priority: Major
>
> This method was never implemented, and currently is a no op. We should at 
> least do the naive "copy" version of the method.





[jira] [Updated] (ARROW-114) [JBring in java-unsafe-tools as utility library for Arrow

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-114:
---
Summary: [JBring in java-unsafe-tools as utility library for Arrow  (was: 
Bring in java-unsafe-tools as utility library for Arrow)

> [JBring in java-unsafe-tools as utility library for Arrow
> -
>
> Key: ARROW-114
> URL: https://issues.apache.org/jira/browse/ARROW-114
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Jacques Nadeau
>Priority: Minor
>
> Originally here:
> https://github.com/alexkasko/unsafe-tools 
> SGA signed off and received by Secretary.





[jira] [Updated] (ARROW-258) [Format] clarify definition of Buffer in context of RPC, IPC, File

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-258:
---
Summary: [Format] clarify definition of Buffer in context of RPC, IPC, File 
 (was: clarify definition of Buffer in context of RPC, IPC, File)

> [Format] clarify definition of Buffer in context of RPC, IPC, File
> --
>
> Key: ARROW-258
> URL: https://issues.apache.org/jira/browse/ARROW-258
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Format
>Reporter: Julien Le Dem
>Priority: Major
> Fix For: 1.0.0
>
>
> currently Buffer has a loosely defined page field used for shared memory only.
> https://github.com/apache/arrow/blob/34e7f48cb71428c4d78cf00d8fdf0045532d6607/format/Message.fbs#L109





[jira] [Closed] (ARROW-110) [C++] Decide on optimal growth factor when appending to buffers/arrays

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-110.
--
Resolution: Done

It seems we'll settle on 2x for now until it can be demonstrated to be a 
performance problem

> [C++] Decide on optimal growth factor when appending to buffers/arrays
> --
>
> Key: ARROW-110
> URL: https://issues.apache.org/jira/browse/ARROW-110
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Micah Kornfield
>Priority: Major
>
> There is some evidence that powers of 2 might not be optimal (the facebook 
> folly library suggests this in their explanation of why they have their own 
> vector type).  They use 1.5 (as do other implementations that don't use 2).
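
The trade-off can be seen by counting reallocations under each factor; this standalone sketch uses illustrative names and a hypothetical starting capacity:

```python
def capacities(factor, start=64, target=1_000_000):
    """Capacity sequence produced by growing a buffer by `factor`
    until it can hold `target` elements."""
    caps = [start]
    while caps[-1] < target:
        caps.append(int(caps[-1] * factor))
    return caps

# A 1.5x factor performs more (smaller) reallocations than 2x,
# but wastes less memory at each step and can reuse freed blocks.
growth_2x = capacities(2.0)
growth_15x = capacities(1.5)
```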





[jira] [Closed] (ARROW-41) C++: Convert RecordBatch to StructArray, and back

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-41?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-41.
-
Resolution: Won't Fix

Closing this, like ARROW-40, until there is a clearly articulated need

> C++: Convert RecordBatch to StructArray, and back
> -
>
> Key: ARROW-41
> URL: https://issues.apache.org/jira/browse/ARROW-41
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> With {{arrow::TableBatchReader}}, we can turn a Table into a sequence of one 
> or more RecordBatches. It would be useful to be able to easily convert 
> between RecordBatch and a StructArray (which can be semantically equivalent 
> in some contexts)





[jira] [Commented] (ARROW-61) [Java] Method can return the value bigger than long MAX_VALUE

2019-03-30 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16805954#comment-16805954
 ] 

Wes McKinney commented on ARROW-61:
---

Is this still an issue?

> [Java] Method can return the value bigger than long MAX_VALUE
> -
>
> Key: ARROW-61
> URL: https://issues.apache.org/jira/browse/ARROW-61
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
> Environment: Apache Drill, Apache Arrow
>Reporter: Vitalii Diravka
>Priority: Major
>  Labels: adjustScale, arrow, decimal
>
> Method org.apache.drill.exec.util.DecimalUtility.adjustScaleMultiply(long 
> input, int factor) can return a value bigger than the long max value.
> For example, when comparing the two decimal18 values 9223372036854775807 and 
> 0.001, adjusting the first value's scale requires this method to return 
> 9223372036854775807 * 1000, which is bigger than the long max value.
> Class DecimalUtility.java will be a part of org.apache.arrow after renaming 
> described in [DRILL-4455 Depend on Apache Arrow for Vector and Memory| 
> https://issues.apache.org/jira/browse/DRILL-4455]
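
The overflow described above is easy to reproduce; this Python sketch (with a hypothetical `adjust_scale_multiply` mirroring the Drill method's name) checks the 64-bit range explicitly instead of silently wrapping the way Java's `long` arithmetic would:

```python
INT64_MAX = 2**63 - 1
INT64_MIN = -2**63

def adjust_scale_multiply(value, factor):
    """Multiply with an explicit 64-bit overflow check. A fixed version of
    the Java method would need to widen to a larger type or raise, as this
    sketch does, rather than wrap."""
    result = value * factor  # Python ints are arbitrary precision
    if not (INT64_MIN <= result <= INT64_MAX):
        raise OverflowError(f"{value} * {factor} exceeds the 64-bit range")
    return result
```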





[jira] [Closed] (ARROW-40) C++: Reinterpret Struct arrays as tables

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-40?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-40.
-
Resolution: Won't Fix

I suspect that if this is ever needed it will be implemented as part of some 
other patch

> C++: Reinterpret Struct arrays as tables
> 
>
> Key: ARROW-40
> URL: https://issues.apache.org/jira/browse/ARROW-40
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> This is mostly a question of layering container types, but will be provided 
> as an API convenience. 





[jira] [Updated] (ARROW-61) [Java] Method can return the value bigger than long MAX_VALUE

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-61?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-61:
--
Summary: [Java] Method can return the value bigger than long MAX_VALUE  
(was: Method can return the value bigger than long MAX_VALUE)

> [Java] Method can return the value bigger than long MAX_VALUE
> -
>
> Key: ARROW-61
> URL: https://issues.apache.org/jira/browse/ARROW-61
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
> Environment: Apache Drill, Apache Arrow
>Reporter: Vitalii Diravka
>Priority: Major
>  Labels: adjustScale, arrow, decimal
>
> Method org.apache.drill.exec.util.DecimalUtility.adjustScaleMultiply(long 
> input, int factor) can return a value bigger than the long max value.
> For example, when comparing the two decimal18 values 9223372036854775807 and 
> 0.001, adjusting the first value's scale requires this method to return 
> 9223372036854775807 * 1000, which is bigger than the long max value.
> Class DecimalUtility.java will be a part of org.apache.arrow after renaming 
> described in [DRILL-4455 Depend on Apache Arrow for Vector and Memory| 
> https://issues.apache.org/jira/browse/DRILL-4455]





[jira] [Closed] (ARROW-4985) [C++] arrow/testing headers are not installed

2019-03-30 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-4985.
---
   Resolution: Duplicate
Fix Version/s: (was: 0.14.0)

this was resolved in ARROW-5012

> [C++] arrow/testing headers are not installed
> -
>
> Key: ARROW-4985
> URL: https://issues.apache.org/jira/browse/ARROW-4985
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
>






[jira] [Commented] (ARROW-5068) [Gandiva][Packaging] Fix gandiva nightly builds after the CMake refactor

2019-03-30 Thread Praveen Kumar Desabandu (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16805862#comment-16805862
 ] 

Praveen Kumar Desabandu commented on ARROW-5068:


[~kszucs] I am fixing it as part of ARROW-4959

Could you please review and let me know if you would want me to handle anything 
more as part of the other JIRA.

> [Gandiva][Packaging] Fix gandiva nightly builds after the CMake refactor
> 
>
> Key: ARROW-5068
> URL: https://issues.apache.org/jira/browse/ARROW-5068
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Gandiva, Packaging
>Reporter: Krisztian Szucs
>Priority: Major
>
> Currently this is the only failing nightly build: 
> [https://travis-ci.org/kszucs/crossbow/builds/512474452]
>  
> cc [~Pindikura Ravindra]





[jira] [Comment Edited] (ARROW-4844) Static libarrow is missing vendored libdouble-conversion

2019-03-30 Thread Jeroen (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16805787#comment-16805787
 ] 

Jeroen edited comment on ARROW-4844 at 3/30/19 12:10 PM:
-

For example opencv ships the vendored static libs in a special dir 
lib/opencv4/3rdparty. Thereby we can statically link to the library, and the 
vendored libs won't conflict with anything else on the system. I think that is 
more user friendly than restricting static builds by "refusing to vendor 
anything at all".
 
{code}
[MSYS2 CI] mingw-w64-opencv: Checking Binaries
./pkg/mingw-w64-i686-opencv/mingw32/bin/opencv_annotation.exe
./pkg/mingw-w64-i686-opencv/mingw32/bin/opencv_interactive-calibration.exe
./pkg/mingw-w64-i686-opencv/mingw32/bin/opencv_version.exe
./pkg/mingw-w64-i686-opencv/mingw32/bin/opencv_version_win32.exe
./pkg/mingw-w64-i686-opencv/mingw32/bin/opencv_visualisation.exe
./pkg/mingw-w64-i686-opencv/mingw32/lib/libopencv_calib3d.a
./pkg/mingw-w64-i686-opencv/mingw32/lib/libopencv_core.a
./pkg/mingw-w64-i686-opencv/mingw32/lib/libopencv_features2d.a
./pkg/mingw-w64-i686-opencv/mingw32/lib/libopencv_flann.a
./pkg/mingw-w64-i686-opencv/mingw32/lib/libopencv_gapi.a
./pkg/mingw-w64-i686-opencv/mingw32/lib/libopencv_highgui.a
./pkg/mingw-w64-i686-opencv/mingw32/lib/libopencv_imgcodecs.a
./pkg/mingw-w64-i686-opencv/mingw32/lib/libopencv_imgproc.a
./pkg/mingw-w64-i686-opencv/mingw32/lib/libopencv_ml.a
./pkg/mingw-w64-i686-opencv/mingw32/lib/libopencv_objdetect.a
./pkg/mingw-w64-i686-opencv/mingw32/lib/libopencv_photo.a
./pkg/mingw-w64-i686-opencv/mingw32/lib/libopencv_stitching.a
./pkg/mingw-w64-i686-opencv/mingw32/lib/libopencv_video.a
./pkg/mingw-w64-i686-opencv/mingw32/lib/libopencv_videoio.a
./pkg/mingw-w64-i686-opencv/mingw32/lib/opencv4/3rdparty/libade.a
./pkg/mingw-w64-i686-opencv/mingw32/lib/opencv4/3rdparty/libquirc.a
./pkg/mingw-w64-i686-opencv/mingw32/lib/pkgconfig/opencv4.pc
{code}


was (Author: jeroenooms):
For example opencv ships the vendored static libs in a special dir 
lib/opencv4/3rdparty. I think that is more user friendly than restricting 
static builds by "refusing to vendor anything at all".
 
{code}
[MSYS2 CI] mingw-w64-opencv: Checking Binaries
./pkg/mingw-w64-i686-opencv/mingw32/bin/opencv_annotation.exe
./pkg/mingw-w64-i686-opencv/mingw32/bin/opencv_interactive-calibration.exe
./pkg/mingw-w64-i686-opencv/mingw32/bin/opencv_version.exe
[jira] [Commented] (ARROW-4844) Static libarrow is missing vendored libdouble-conversion

2019-03-30 Thread Jeroen (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16805787#comment-16805787
 ] 

Jeroen commented on ARROW-4844:
---

For example, OpenCV ships the vendored static libs in a special dir, 
lib/opencv4/3rdparty. I think that is more user-friendly than restricting 
static builds by "refusing to vendor anything at all".
 
{code}
[MSYS2 CI] mingw-w64-opencv: Checking Binaries
./pkg/mingw-w64-i686-opencv/mingw32/bin/opencv_annotation.exe
./pkg/mingw-w64-i686-opencv/mingw32/bin/opencv_interactive-calibration.exe
./pkg/mingw-w64-i686-opencv/mingw32/bin/opencv_version.exe
./pkg/mingw-w64-i686-opencv/mingw32/bin/opencv_version_win32.exe
./pkg/mingw-w64-i686-opencv/mingw32/bin/opencv_visualisation.exe
./pkg/mingw-w64-i686-opencv/mingw32/lib/libopencv_calib3d.a
./pkg/mingw-w64-i686-opencv/mingw32/lib/libopencv_core.a
./pkg/mingw-w64-i686-opencv/mingw32/lib/libopencv_features2d.a
./pkg/mingw-w64-i686-opencv/mingw32/lib/libopencv_flann.a
./pkg/mingw-w64-i686-opencv/mingw32/lib/libopencv_gapi.a
./pkg/mingw-w64-i686-opencv/mingw32/lib/libopencv_highgui.a
./pkg/mingw-w64-i686-opencv/mingw32/lib/libopencv_imgcodecs.a
./pkg/mingw-w64-i686-opencv/mingw32/lib/libopencv_imgproc.a
./pkg/mingw-w64-i686-opencv/mingw32/lib/libopencv_ml.a
./pkg/mingw-w64-i686-opencv/mingw32/lib/libopencv_objdetect.a
./pkg/mingw-w64-i686-opencv/mingw32/lib/libopencv_photo.a
./pkg/mingw-w64-i686-opencv/mingw32/lib/libopencv_stitching.a
./pkg/mingw-w64-i686-opencv/mingw32/lib/libopencv_video.a
./pkg/mingw-w64-i686-opencv/mingw32/lib/libopencv_videoio.a
./pkg/mingw-w64-i686-opencv/mingw32/lib/opencv4/3rdparty/libade.a
./pkg/mingw-w64-i686-opencv/mingw32/lib/opencv4/3rdparty/libquirc.a
./pkg/mingw-w64-i686-opencv/mingw32/lib/pkgconfig/opencv4.pc
{code}

> Static libarrow is missing vendored libdouble-conversion
> 
>
> Key: ARROW-4844
> URL: https://issues.apache.org/jira/browse/ARROW-4844
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.12.1
>Reporter: Jeroen
>Assignee: Uwe L. Korn
>Priority: Major
>
> When trying to statically link the R bindings to libarrow.a, I get linking 
> errors which suggest that libdouble-conversion.a was not properly embedded in 
> libarrow.a. This problem happens on both macOS and Windows.
> Here is the arrow build log: 
> https://ci.appveyor.com/project/jeroen/rtools-packages/builds/23015303/job/mtgl6rvfde502iu7
> {code}
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
>  
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../lib/libarrow.a(cast.cc.obj):(.text+0x1c77c):
>  undefined reference to 
> `double_conversion::StringToDoubleConverter::StringToDouble(char const*, int, 
> int*) const'
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
>  
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../lib/libarrow.a(converter.cc.obj):(.text+0x5fda):
>  undefined reference to 
> `double_conversion::StringToDoubleConverter::StringToDouble(char const*, int, 
> int*) const'
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
>  
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../lib/libarrow.a(converter.cc.obj):(.text+0x6097):
>  undefined reference to 
> `double_conversion::StringToDoubleConverter::StringToDouble(char const*, int, 
> int*) const'
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
>  
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../lib/libarrow.a(converter.cc.obj):(.text+0x6589):
>  undefined reference to 
> `double_conversion::StringToDoubleConverter::StringToFloat(char const*, int, 
> int*) const'
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
>  
> C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../lib/libarrow.a(converter.cc.obj):(.text+0x6647):
>  undefined reference to 
> `double_conversion::StringToDoubleConverter::StringToFloat(char const*, int, 
> int*) const'
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4959) [Gandiva][Crossbow] Builds broken

2019-03-30 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-4959:
--
Labels: pull-request-available  (was: )

> [Gandiva][Crossbow] Builds broken
> -
>
> Key: ARROW-4959
> URL: https://issues.apache.org/jira/browse/ARROW-4959
> Project: Apache Arrow
>  Issue Type: Task
>Reporter: Praveen Kumar Desabandu
>Assignee: Praveen Kumar Desabandu
>Priority: Major
>  Labels: pull-request-available
>
> Looks like Crossbow builds for Gandiva have been broken for the last few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)