[jira] [Comment Edited] (ARROW-6060) [Python] too large memory cost using pyarrow.parquet.read_table with use_threads=True

2019-08-02 Thread JIRA


[ 
https://issues.apache.org/jira/browse/ARROW-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16899250#comment-16899250
 ] 

Robin Kåveland edited comment on ARROW-6060 at 8/2/19 10:26 PM:


I've had to downgrade our VMs to 0.13.0 today; I was observing Parquet files 
that we could load just fine with 16GB of RAM earlier fail to load on VMs 
with 28GB of RAM. Unfortunately, I can't disclose any of the data either. We 
are using {{parquet.ParquetDataset.read()}}, but we observe the problem even 
if we read single pieces of the parquet data sets (the pieces are between 
100MB and 200MB). Most of our columns are unicode and would probably be 
friendly to dictionary encoding. The files have been written by Spark. 
Normally, these datasets would take a while to load, so memory consumption 
would grow steadily for ~10 seconds, but now it seems like we invoke the 
OOM-killer in only a few seconds, so allocation seems very spiky.
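
For reference, a minimal sketch of the kind of read that triggers this for us 
(paths are illustrative, not our real data):

{code:python}
import pyarrow.parquet as pq

# Dataset directory written by Spark, made up of 100-200MB pieces.
dataset = pq.ParquetDataset('/data/events.parquet')
table = dataset.read()  # allocation spikes and the OOM-killer fires on 0.14.x

# The use_threads=False workaround from this issue's description:
table = pq.read_table('/data/events.parquet', use_threads=False)
{code}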


was (Author: kaaveland):
I've had to downgrade our VMs to 0.13.0 today; I was observing Parquet files 
that we could load just fine with 16GB of RAM fail to load on VMs with 28GB 
of RAM. Unfortunately, I can't disclose any of the data either. We are using 
{{parquet.ParquetDataset.read()}}, but we observe the problem even if we read 
single pieces of the parquet data sets (the pieces are between 100MB and 
200MB). Most of our columns are unicode and would probably be friendly to 
dictionary encoding. The files have been written by Spark. Normally, these 
datasets would take a while to load, so memory consumption would grow steadily 
for ~10 seconds, but now it seems like we invoke the OOM-killer in only a few 
seconds, so allocation seems very spiky.

> [Python] too large memory cost using pyarrow.parquet.read_table with 
> use_threads=True
> -
>
> Key: ARROW-6060
> URL: https://issues.apache.org/jira/browse/ARROW-6060
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.1
>Reporter: Kun Liu
>Priority: Major
>
>  I tried to load a Parquet file of about 1.8 GB using the following code. It 
> crashed due to an out-of-memory issue.
> {code:python}
> import pyarrow.parquet as pq
> pq.read_table('/tmp/test.parquet'){code}
>  However, it worked well with use_threads=False, as follows:
> {code:python}
> pq.read_table('/tmp/test.parquet', use_threads=False){code}
> If pyarrow is downgraded to 0.12.1, there is no such problem.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-6060) [Python] too large memory cost using pyarrow.parquet.read_table with use_threads=True

2019-08-02 Thread JIRA


[ 
https://issues.apache.org/jira/browse/ARROW-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16899250#comment-16899250
 ] 

Robin Kåveland commented on ARROW-6060:
---

I've had to downgrade our VMs to 0.13.0 today; I was observing Parquet files 
that we could load just fine with 16GB of RAM fail to load on VMs with 28GB 
of RAM. Unfortunately, I can't disclose any of the data either. We are using 
{{parquet.ParquetDataset.read()}}, but we observe the problem even if we read 
single pieces of the parquet data sets (the pieces are between 100MB and 
200MB). Most of our columns are unicode and would probably be friendly to 
dictionary encoding. The files have been written by Spark. Normally, these 
datasets would take a while to load, so memory consumption would grow steadily 
for ~10 seconds, but now it seems like we invoke the OOM-killer in only a few 
seconds, so allocation seems very spiky.

> [Python] too large memory cost using pyarrow.parquet.read_table with 
> use_threads=True
> -
>
> Key: ARROW-6060
> URL: https://issues.apache.org/jira/browse/ARROW-6060
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.1
>Reporter: Kun Liu
>Priority: Major
>
>  I tried to load a Parquet file of about 1.8 GB using the following code. It 
> crashed due to an out-of-memory issue.
> {code:python}
> import pyarrow.parquet as pq
> pq.read_table('/tmp/test.parquet'){code}
>  However, it worked well with use_threads=False, as follows:
> {code:python}
> pq.read_table('/tmp/test.parquet', use_threads=False){code}
> If pyarrow is downgraded to 0.12.1, there is no such problem.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-6118) [Java] Replace google Preconditions with Arrow Preconditions

2019-08-02 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-6118.

   Resolution: Fixed
Fix Version/s: 0.15.0

Issue resolved by pull request 4996
[https://github.com/apache/arrow/pull/4996]

> [Java] Replace google Preconditions with Arrow Preconditions
> 
>
> Key: ARROW-6118
> URL: https://issues.apache.org/jira/browse/ARROW-6118
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Now in the Java code, most places use {{org.apache.arrow.util.Preconditions}}, 
> but some places still use {{com.google.common.base.Preconditions}}.
> Remove the Google Preconditions usages and remove duplicated checks at the 
> same time.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-6118) [Java] Replace google Preconditions with Arrow Preconditions

2019-08-02 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield updated ARROW-6118:
---
Component/s: Java

> [Java] Replace google Preconditions with Arrow Preconditions
> 
>
> Key: ARROW-6118
> URL: https://issues.apache.org/jira/browse/ARROW-6118
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Now in the Java code, most places use {{org.apache.arrow.util.Preconditions}}, 
> but some places still use {{com.google.common.base.Preconditions}}.
> Remove the Google Preconditions usages and remove duplicated checks at the 
> same time.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-5527) [C++] HashTable/MemoTable should use Buffer(s)/Builder(s) for heap data

2019-08-02 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-5527.
-
   Resolution: Fixed
Fix Version/s: 0.15.0

Issue resolved by pull request 4867
[https://github.com/apache/arrow/pull/4867]

> [C++] HashTable/MemoTable should use Buffer(s)/Builder(s) for heap data
> ---
>
> Key: ARROW-5527
> URL: https://issues.apache.org/jira/browse/ARROW-5527
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 9.5h
>  Remaining Estimate: 0h
>
> The current implementation uses `std::vector` and `std::string` with 
> unbounded size. The refactor would take a memory pool in the constructor for 
> buffer management and would get rid of the vectors. This will have the side 
> effect of propagating Status to some calls (notably insert, due to Upsize 
> failing to resize).
> * The MemoTable constructor needs to take a MemoryPool as input
> * GetOrInsert must return Status/Result
> * MemoTable should use a TypedBufferBuilder instead of std::vector
> * BinaryMemoTable should use a BinaryBuilder instead of a 
> (std::vector, std::string) pair.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6127) [Website] Refresh website theme

2019-08-02 Thread Neal Richardson (JIRA)
Neal Richardson created ARROW-6127:
--

 Summary: [Website] Refresh website theme
 Key: ARROW-6127
 URL: https://issues.apache.org/jira/browse/ARROW-6127
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Website
Reporter: Neal Richardson
Assignee: Neal Richardson


Among the things I noticed recently that should be easy to clean up:
 * We should supply a favicon
 * The <title> is the same for every page and it always says "Apache Arrow 
Homepage"
 * There are no opengraph or twitter card meta tags, so there's no link preview
 * The version of bootstrap used is not current and has been flagged as a 
possible security vulnerability

Much of this could just be fixed by porting to a modern Hugo template, which 
I'll explore.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-6125) [Python] Remove any APIs deprecated prior to 0.14.x

2019-08-02 Thread Neal Richardson (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16899179#comment-16899179
 ] 

Neal Richardson commented on ARROW-6125:


See also https://issues.apache.org/jira/browse/ARROW-5244

> [Python] Remove any APIs deprecated prior to 0.14.x
> ---
>
> Key: ARROW-6125
> URL: https://issues.apache.org/jira/browse/ARROW-6125
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.15.0
>
>
> A number of deprecated APIs, like {{pyarrow.open_stream}}, are still available



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6126) [C++] IPC stream reader handling of empty streams potentially not robust

2019-08-02 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-6126:
---

 Summary: [C++] IPC stream reader handling of empty streams 
potentially not robust
 Key: ARROW-6126
 URL: https://issues.apache.org/jira/browse/ARROW-6126
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.15.0


If dictionaries are expected in a stream, but the stream terminates, then 
"empty stream" logic is triggered to suppress errors; see

https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/reader.cc#L482

It's probably esoteric, but this "empty stream" logic will also trigger if the 
stream terminates in the middle of the dictionary messages, which is a 
legitimate error. So we should only bail out early (concluding that we have an 
empty stream) if the first dictionary message is null.
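
For clarity, here is pseudo-logic for the proposed check (illustrative Python, 
not the actual reader.cc code; the helper names are made up):

{code:python}
def read_dictionaries(stream, num_dictionaries_expected):
    for i in range(num_dictionaries_expected):
        message = stream.read_next_message()
        if message is None:  # stream terminated
            if i == 0:
                # No dictionary message at all: a benign empty stream.
                return 'empty stream'
            # Terminating between dictionary messages is a real error.
            raise IOError('stream ended in the middle of dictionary messages')
        process_dictionary(message)
{code}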



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6125) [Python] Remove any APIs deprecated prior to 0.14.x

2019-08-02 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-6125:
---

 Summary: [Python] Remove any APIs deprecated prior to 0.14.x
 Key: ARROW-6125
 URL: https://issues.apache.org/jira/browse/ARROW-6125
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney
 Fix For: 0.15.0


A number of deprecated APIs, like `pyarrow.open_stream`, are still available



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-6125) [Python] Remove any APIs deprecated prior to 0.14.x

2019-08-02 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6125:

Description: A number of deprecated APIs, like {{pyarrow.open_stream}}, are 
still available  (was: A number of deprecated APIs, like `pyarrow.open_stream`, 
are still available)

> [Python] Remove any APIs deprecated prior to 0.14.x
> ---
>
> Key: ARROW-6125
> URL: https://issues.apache.org/jira/browse/ARROW-6125
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.15.0
>
>
> A number of deprecated APIs, like {{pyarrow.open_stream}}, are still available



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Assigned] (ARROW-5746) [Website] Move website source out of apache/arrow

2019-08-02 Thread Neal Richardson (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-5746:
--

Assignee: Neal Richardson

> [Website] Move website source out of apache/arrow
> -
>
> Key: ARROW-5746
> URL: https://issues.apache.org/jira/browse/ARROW-5746
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Website
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Minor
>
> Possibly to apache/arrow-site, which already exists for hosting the static 
> built site.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Assigned] (ARROW-5205) [Python][C++] Improved error messages when user erroneously uses a non-local resource URI to open a file

2019-08-02 Thread Neal Richardson (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-5205:
--

Assignee: (was: Neal Richardson)

> [Python][C++] Improved error messages when user erroneously uses a non-local 
> resource URI to open a file
> 
>
> Key: ARROW-5205
> URL: https://issues.apache.org/jira/browse/ARROW-5205
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Wes McKinney
>Priority: Major
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> In a number of places, if a string filepath is passed, it is assumed to be a 
> local file. Since we are developing better support for file URIs, we may be 
> able to detect that the user has passed an unsupported URI (e.g. something 
> starting with "s3:" or "hdfs:") and return a better error message than "local 
> file not found". See:
> https://stackoverflow.com/questions/55704943/what-could-be-the-explanation-of-this-pyarrow-lib-arrowioerror/55707311#55707311
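> 
> A rough sketch of such a scheme check (illustrative only, not the Arrow code; 
> it uses only the Python standard library):
> {code:python}
> from urllib.parse import urlparse
> 
> def check_local_path(path):
>     scheme = urlparse(path).scheme
>     # Windows drive letters like "C:\data" also parse as a one-letter
>     # scheme, so only flag longer, non-file schemes.
>     if len(scheme) > 1 and scheme != 'file':
>         raise ValueError("unsupported URI scheme '%s:'; expected a local "
>                          "file path" % scheme)
>     return path
> 
> check_local_path('/tmp/data.parquet')         # ok
> check_local_path('s3://bucket/data.parquet')  # raises a clear error
> {code}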



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-5932) undefined reference to `__cxa_init_primary_exception@CXXABI_1.3.11'

2019-08-02 Thread Francois Saint-Jacques (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16899167#comment-16899167
 ] 

Francois Saint-Jacques commented on ARROW-5932:
---

How did you install Arrow, from sources?

> undefined reference to `__cxa_init_primary_exception@CXXABI_1.3.11'
> ---
>
> Key: ARROW-5932
> URL: https://issues.apache.org/jira/browse/ARROW-5932
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.14.0
> Environment: Linux Mint 19.1 Tessa
> g++-6
>Reporter: Cong Ding
>Priority: Critical
>
> I was installing Apache Arrow on my Linux Mint 19.1 Tessa server. I followed 
> the instructions on the official Arrow website (using the Ubuntu 18.04 
> method). However, when I was trying to compile the examples, the g++ compiler 
> threw some errors.
> I have updated g++ to g++-6, updated my libstdc++ library, and used the 
> -lstdc++ flag, but it still didn't work.
>  
> {code}
> g++-6 -std=c++11 -larrow -lparquet main.cpp -lstdc++ 
> {code}
> The error message:
> /usr/lib/x86_64-linux-gnu/libarrow.so: undefined reference to 
> `__cxa_init_primary_exception@CXXABI_1.3.11'
> /usr/lib/x86_64-linux-gnu/libarrow.so: undefined reference to 
> `std::__exception_ptr::exception_ptr::exception_ptr(void*)@CXXABI_1.3.11'
> collect2: error: ld returned 1 exit status.
>  
> I do not know what to do at this moment. Can anyone help me?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-6122) [C++] ArgSort kernel must support FixedSizeBinary

2019-08-02 Thread Francois Saint-Jacques (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-6122:
--
Summary: [C++] ArgSort kernel must support FixedSizeBinary  (was: [C++] 
IsIn kernel must support FixedSizeBinary)

> [C++] ArgSort kernel must support FixedSizeBinary
> -
>
> Key: ARROW-6122
> URL: https://issues.apache.org/jira/browse/ARROW-6122
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.15.0
>Reporter: Francois Saint-Jacques
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-6123) [C++] ArgSort kernel should not materialize the output internal

2019-08-02 Thread Francois Saint-Jacques (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-6123:
--
Summary: [C++] ArgSort kernel should not materialize the output internal  
(was: [C++] IsIn kernel should not materialize the output internal)

> [C++] ArgSort kernel should not materialize the output internal
> ---
>
> Key: ARROW-6123
> URL: https://issues.apache.org/jira/browse/ARROW-6123
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.15.0
>Reporter: Francois Saint-Jacques
>Priority: Major
>
> It should use the helpers since the output size is known.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6124) [C++] IsIn kernel should sort in a single pass (with nulls)

2019-08-02 Thread Francois Saint-Jacques (JIRA)
Francois Saint-Jacques created ARROW-6124:
-

 Summary: [C++] IsIn kernel should sort in a single pass (with 
nulls)
 Key: ARROW-6124
 URL: https://issues.apache.org/jira/browse/ARROW-6124
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.15.0
Reporter: Francois Saint-Jacques


There's a good chance that merge sort must be implemented (spill to disk, 
ChunkedArray, ...)



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-6123) [C++] IsIn kernel should not materialize the output internal

2019-08-02 Thread Francois Saint-Jacques (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-6123:
--
Labels:   (was: ana)

> [C++] IsIn kernel should not materialize the output internal
> 
>
> Key: ARROW-6123
> URL: https://issues.apache.org/jira/browse/ARROW-6123
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.15.0
>Reporter: Francois Saint-Jacques
>Priority: Major
>
> It should use the helpers since the output size is known.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6123) [C++] IsIn kernel should not materialize the output internal

2019-08-02 Thread Francois Saint-Jacques (JIRA)
Francois Saint-Jacques created ARROW-6123:
-

 Summary: [C++] IsIn kernel should not materialize the output 
internal
 Key: ARROW-6123
 URL: https://issues.apache.org/jira/browse/ARROW-6123
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Francois Saint-Jacques


It should use the helpers since the output size is known.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-6123) [C++] IsIn kernel should not materialize the output internal

2019-08-02 Thread Francois Saint-Jacques (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-6123:
--
Affects Version/s: 0.15.0

> [C++] IsIn kernel should not materialize the output internal
> 
>
> Key: ARROW-6123
> URL: https://issues.apache.org/jira/browse/ARROW-6123
> Project: Apache Arrow
>  Issue Type: Improvement
>Affects Versions: 0.15.0
>Reporter: Francois Saint-Jacques
>Priority: Major
>
> It should use the helpers since the output size is known.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-6123) [C++] IsIn kernel should not materialize the output internal

2019-08-02 Thread Francois Saint-Jacques (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-6123:
--
Labels: ana  (was: )

> [C++] IsIn kernel should not materialize the output internal
> 
>
> Key: ARROW-6123
> URL: https://issues.apache.org/jira/browse/ARROW-6123
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.15.0
>Reporter: Francois Saint-Jacques
>Priority: Major
>  Labels: ana
>
> It should use the helpers since the output size is known.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-6123) [C++] IsIn kernel should not materialize the output internal

2019-08-02 Thread Francois Saint-Jacques (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-6123:
--
Component/s: C++

> [C++] IsIn kernel should not materialize the output internal
> 
>
> Key: ARROW-6123
> URL: https://issues.apache.org/jira/browse/ARROW-6123
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.15.0
>Reporter: Francois Saint-Jacques
>Priority: Major
>
> It should use the helpers since the output size is known.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6122) [C++] IsIn kernel must support FixedSizeBinary

2019-08-02 Thread Francois Saint-Jacques (JIRA)
Francois Saint-Jacques created ARROW-6122:
-

 Summary: [C++] IsIn kernel must support FixedSizeBinary
 Key: ARROW-6122
 URL: https://issues.apache.org/jira/browse/ARROW-6122
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.15.0
Reporter: Francois Saint-Jacques






--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-6121) [Tools] Improve merge tool cli ergonomic

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6121:
--
Labels: pull-request-available  (was: )

> [Tools] Improve merge tool cli ergonomic
> 
>
> Key: ARROW-6121
> URL: https://issues.apache.org/jira/browse/ARROW-6121
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Trivial
>  Labels: pull-request-available
>
> * Accepts the pull-request number as an optional (first) parameter to the 
> script
> * Supports reading the jira username/password from a file



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6121) [Tools] Improve merge tool cli ergonomic

2019-08-02 Thread Francois Saint-Jacques (JIRA)
Francois Saint-Jacques created ARROW-6121:
-

 Summary: [Tools] Improve merge tool cli ergonomic
 Key: ARROW-6121
 URL: https://issues.apache.org/jira/browse/ARROW-6121
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Developer Tools
Reporter: Francois Saint-Jacques
Assignee: Francois Saint-Jacques


* Accepts the pull-request number as an optional (first) parameter to the script
* Supports reading the jira username/password from a file



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Assigned] (ARROW-1566) [C++] Implement non-materializing sort kernels

2019-08-02 Thread Francois Saint-Jacques (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques reassigned ARROW-1566:
-

Assignee: Artem Alekseev

> [C++] Implement non-materializing sort kernels
> --
>
> Key: ARROW-1566
> URL: https://issues.apache.org/jira/browse/ARROW-1566
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Artem Alekseev
>Priority: Major
>  Labels: Analytics, pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 5h 50m
>  Remaining Estimate: 0h
>
> The output of such an operator would be a permutation vector that, if applied 
> to a column, would result in the data being sorted as requested. This is 
> similar to numpy's argsort functionality.
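> 
> For intuition, a tiny numpy sketch of the permutation-vector semantics:
> {code:python}
> import numpy as np
> 
> values = np.array([30, 10, 20])
> perm = np.argsort(values)     # permutation vector: [1, 2, 0]
> sorted_values = values[perm]  # [10, 20, 30]; values itself is untouched
> {code}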



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-1566) [C++] Implement non-materializing sort kernels

2019-08-02 Thread Francois Saint-Jacques (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-1566.
---
   Resolution: Fixed
Fix Version/s: 0.15.0

Issue resolved by pull request 4861
[https://github.com/apache/arrow/pull/4861]

> [C++] Implement non-materializing sort kernels
> --
>
> Key: ARROW-1566
> URL: https://issues.apache.org/jira/browse/ARROW-1566
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: Analytics, pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 5h 40m
>  Remaining Estimate: 0h
>
> The output of such an operator would be a permutation vector that, if applied 
> to a column, would result in the data being sorted as requested. This is 
> similar to numpy's argsort functionality.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6120) [C++][Gandiva] including some headers causes decimal_test to fail

2019-08-02 Thread Benjamin Kietzman (JIRA)
Benjamin Kietzman created ARROW-6120:


 Summary: [C++][Gandiva] including some headers causes decimal_test 
to fail
 Key: ARROW-6120
 URL: https://issues.apache.org/jira/browse/ARROW-6120
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++ - Gandiva
Reporter: Benjamin Kietzman


It seems this is due to precompiled code being contaminated with undesired 
headers.

For example, {{#include }} in {{arrow/compare.h}} causes:

{code}
[ RUN  ] TestDecimal.TestCastFunctions
../../src/gandiva/tests/decimal_test.cc:478: Failure
Value of: (array_dec)->Equals(outputs[2], 
arrow::EqualOptions().nans_equal(true))
  Actual: false
Expected: true
expected array: [
  1.23,
  1.58,
  -1.23,
  -1.58
] actual array: [
  0.00,
  0.00,
  0.00,
  0.00
]
../../src/gandiva/tests/decimal_test.cc:481: Failure
Value of: (array_dec)->Equals(outputs[2], 
arrow::EqualOptions().nans_equal(true))
  Actual: false
Expected: true
expected array: [
  1.23,
  1.58,
  -1.23,
  -1.58
] actual array: [
  0.00,
  0.00,
  0.00,
  0.00
]
../../src/gandiva/tests/decimal_test.cc:484: Failure
Value of: (array_dec)->Equals(outputs[3], 
arrow::EqualOptions().nans_equal(true))
  Actual: false
Expected: true
expected array: [
  1.23,
  1.58,
  -1.23,
  -1.58
] actual array: [
  0.00,
  0.00,
  0.00,
  0.00
]
../../src/gandiva/tests/decimal_test.cc:497: Failure
Value of: (array_float64)->Equals(outputs[6], 
arrow::EqualOptions().nans_equal(true))
  Actual: false
Expected: true
expected array: [
  1.23,
  1.58,
  -1.23,
  -1.58
] actual array: [
  inf,
  inf,
  -inf,
  -inf
]
[  FAILED  ] TestDecimal.TestCastFunctions (134 ms)
{code}




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-6119) [Python] PyArrow import fails on Windows Python 3.7

2019-08-02 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16899106#comment-16899106
 ] 

Wes McKinney commented on ARROW-6119:
-

cc [~kszucs]

> [Python] PyArrow import fails on Windows Python 3.7
> ---
>
> Key: ARROW-6119
> URL: https://issues.apache.org/jira/browse/ARROW-6119
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.0
> Environment: Windows, Python 3.7
>Reporter: Paul Suganthan
>Priority: Major
>
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "C:\Python37\lib\site-packages\pyarrow\__init__.py", line 49, in 
> 
> from pyarrow.lib import cpu_count, set_cpu_count
> ImportError: DLL load failed: The specified procedure could not be found.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-6119) [Python] PyArrow import fails on Windows Python 3.7

2019-08-02 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16899105#comment-16899105
 ] 

Wes McKinney commented on ARROW-6119:
-

We pulled the 0.14.1 wheels because there was a different DLL load issue. I 
had thought the 0.14.0 wheels were working, but I guess not. I hope someone 
can fix them before 0.15.0.

> [Python] PyArrow import fails on Windows Python 3.7
> ---
>
> Key: ARROW-6119
> URL: https://issues.apache.org/jira/browse/ARROW-6119
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.0
> Environment: Windows, Python 3.7
>Reporter: Paul Suganthan
>Priority: Major
>
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "C:\Python37\lib\site-packages\pyarrow\__init__.py", line 49, in 
> 
> from pyarrow.lib import cpu_count, set_cpu_count
> ImportError: DLL load failed: The specified procedure could not be found.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-6119) [Python] PyArrow import fails on Windows Python 3.7

2019-08-02 Thread Paul Suganthan (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16899096#comment-16899096
 ] 

Paul Suganthan commented on ARROW-6119:
---

Installed using pip

> [Python] PyArrow import fails on Windows Python 3.7
> ---
>
> Key: ARROW-6119
> URL: https://issues.apache.org/jira/browse/ARROW-6119
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.0
> Environment: Windows, Python 3.7
>Reporter: Paul Suganthan
>Priority: Major
>
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "C:\Python37\lib\site-packages\pyarrow\__init__.py", line 49, in 
> 
> from pyarrow.lib import cpu_count, set_cpu_count
> ImportError: DLL load failed: The specified procedure could not be found.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-6119) [Python] PyArrow import fails on Windows Python 3.7

2019-08-02 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16899091#comment-16899091
 ] 

Uwe L. Korn commented on ARROW-6119:


How did you install this? Did you use conda (preferred) or pip, or did you 
compile it yourself?

> [Python] PyArrow import fails on Windows Python 3.7
> ---
>
> Key: ARROW-6119
> URL: https://issues.apache.org/jira/browse/ARROW-6119
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.0
> Environment: Windows, Python 3.7
>Reporter: Paul Suganthan
>Priority: Major
>
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "C:\Python37\lib\site-packages\pyarrow\__init__.py", line 49, in 
> 
> from pyarrow.lib import cpu_count, set_cpu_count
> ImportError: DLL load failed: The specified procedure could not be found.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-6119) [Python] PyArrow import fails on Windows Python 3.7

2019-08-02 Thread Paul Suganthan (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Suganthan updated ARROW-6119:
--
Summary: [Python] PyArrow import fails on Windows Python 3.7  (was: PyArrow 
import fails on Windows Python 3.7)

> [Python] PyArrow import fails on Windows Python 3.7
> ---
>
> Key: ARROW-6119
> URL: https://issues.apache.org/jira/browse/ARROW-6119
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.0
> Environment: Windows, Python 3.7
>Reporter: Paul Suganthan
>Priority: Major
>
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "C:\Python37\lib\site-packages\pyarrow\__init__.py", line 49, in 
> 
> from pyarrow.lib import cpu_count, set_cpu_count
> ImportError: DLL load failed: The specified procedure could not be found.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6119) PyArrow import fails on Windows Python 3.7

2019-08-02 Thread Paul Suganthan (JIRA)
Paul Suganthan created ARROW-6119:
-

 Summary: PyArrow import fails on Windows Python 3.7
 Key: ARROW-6119
 URL: https://issues.apache.org/jira/browse/ARROW-6119
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.14.0
 Environment: Windows, Python 3.7
Reporter: Paul Suganthan


Traceback (most recent call last):
  File "", line 1, in 
  File "C:\Python37\lib\site-packages\pyarrow\__init__.py", line 49, in 
from pyarrow.lib import cpu_count, set_cpu_count
ImportError: DLL load failed: The specified procedure could not be found.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-3325) [Python] Support reading Parquet binary/string columns directly as DictionaryArray

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-3325:
--
Labels: parquet pull-request-available  (was: parquet)

> [Python] Support reading Parquet binary/string columns directly as 
> DictionaryArray
> --
>
> Key: ARROW-3325
> URL: https://issues.apache.org/jira/browse/ARROW-3325
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 1.0.0
>
>
>  Requires PARQUET-1324 and probably quite a bit of extra work.
> Properly implementing this will require dictionary normalization across row 
> groups. When reading a new row group, a fast path that compares the current 
> dictionary with the prior dictionary should be used. This also needs to 
> handle the case where a column chunk "fell back" to PLAIN encoding mid-stream.
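> 
> For intuition, a small pyarrow sketch (illustrative, not the reader 
> implementation) of why normalization is needed when row groups carry 
> different dictionaries for the same column:
> {code:python}
> import pyarrow as pa
> 
> # Each row group may produce its own dictionary for the same column.
> chunk1 = pa.DictionaryArray.from_arrays(pa.array([0, 1, 0]),
>                                         pa.array(['a', 'b']))
> chunk2 = pa.DictionaryArray.from_arrays(pa.array([0, 1]),
>                                         pa.array(['b', 'c']))
> 
> # Index 0 means 'a' in chunk1 but 'b' in chunk2, so indices cannot simply
> # be concatenated; they must be remapped against one unified dictionary.
> # The fast path is noticing that consecutive dictionaries are identical.
> {code}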



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-5776) [Gandiva][Crossbow] Revert template to have commit ids.

2019-08-02 Thread Pindikura Ravindra (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pindikura Ravindra resolved ARROW-5776.
---
   Resolution: Fixed
Fix Version/s: 0.15.0

Issue resolved by pull request 4738
[https://github.com/apache/arrow/pull/4738]

> [Gandiva][Crossbow] Revert template to have commit ids.
> ---
>
> Key: ARROW-5776
> URL: https://issues.apache.org/jira/browse/ARROW-5776
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Praveen Kumar Desabandu
>Assignee: Praveen Kumar Desabandu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> We are dependent on the commit ids being present in the cross bow travis 
> templates so that we can sync our builds against the same commit id that was 
> used to create the artifacts.
> So reverting back fetch-head to give back arrow-head.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-5610) [Python] Define extension type API in Python to "receive" or "send" a foreign extension type

2019-08-02 Thread lidavidm (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16898956#comment-16898956
 ] 

lidavidm commented on ARROW-5610:
-

My apologies, I ended up being too busy to look at this.

Thanks for the issue pointers.

> [Python] Define extension type API in Python to "receive" or "send" a foreign 
> extension type
> 
>
> Key: ARROW-5610
> URL: https://issues.apache.org/jira/browse/ARROW-5610
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> In work in ARROW-840, a static {{arrow.py_extension_type}} name is used. 
> There will be cases where an extension type is coming from another 
> programming language (e.g. Java), so it would be useful to be able to "plug 
> in" a Python extension type subclass that will be used to deserialize the 
> extension type coming over the wire. This has some different API requirements 
> since the serialized representation of the type will not have knowledge of 
> Python pickling, etc. 
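> 
> For illustration, a minimal sketch of what "plugging in" a deserializer for a 
> named extension type could look like, assuming an API along the lines of 
> pyarrow's {{ExtensionType}} / {{register_extension_type}} (a design sketch, 
> not a settled interface):
> {code:python}
> import pyarrow as pa
> 
> class UuidType(pa.ExtensionType):
>     def __init__(self):
>         # The name identifies the type coming over the wire, e.g. from Java.
>         pa.ExtensionType.__init__(self, pa.binary(16), "example.uuid")
> 
>     def __arrow_ext_serialize__(self):
>         return b""  # this type carries no parameters
> 
>     @classmethod
>     def __arrow_ext_deserialize__(cls, storage_type, serialized):
>         return UuidType()
> 
> # Registration lets IPC deserialization resolve "example.uuid" to UuidType.
> pa.register_extension_type(UuidType())
> {code}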



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-5610) [Python] Define extension type API in Python to "receive" or "send" a foreign extension type

2019-08-02 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16898927#comment-16898927
 ] 

Joris Van den Bossche commented on ARROW-5610:
--

{quote}I'll try to take a pass this week, if time permits; we would like this 
functionality{quote}

Did you further look at this?

{quote}
By the way, is there a Jira explicitly for being able to hook into to_pandas, 
or a suggested way to efficiently do a custom Pandas conversion?)
{quote}

There is ARROW-2428 about a hook into {{to_pandas}} to specify a custom 
conversion (there is also ARROW-5271 for the other way around: being able to 
specify the final Arrow array in the pandas -> Arrow conversion).

> [Python] Define extension type API in Python to "receive" or "send" a foreign 
> extension type
> 
>
> Key: ARROW-5610
> URL: https://issues.apache.org/jira/browse/ARROW-5610
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> In work in ARROW-840, a static {{arrow.py_extension_type}} name is used. 
> There will be cases where an extension type is coming from another 
> programming language (e.g. Java), so it would be useful to be able to "plug 
> in" a Python extension type subclass that will be used to deserialize the 
> extension type coming over the wire. This has some different API requirements 
> since the serialized representation of the type will not have knowledge of 
> Python pickling, etc. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Comment Edited] (ARROW-5610) [Python] Define extension type API in Python to "receive" or "send" a foreign extension type

2019-08-02 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16898927#comment-16898927
 ] 

Joris Van den Bossche edited comment on ARROW-5610 at 8/2/19 2:48 PM:
--

{quote}I'll try to take a pass this week, if time permits; we would like this 
functionality{quote}

[~lidavidm] Did you further look at this?

{quote}
By the way, is there a Jira explicitly for being able to hook into to_pandas, 
or a suggested way to efficiently do a custom Pandas conversion?)
{quote}

There is ARROW-2428 about a hook into {{to_pandas}} to specify a custom 
conversion (there is also ARROW-5271 for the other way around: being able to 
specify the final Arrow array in the pandas -> Arrow conversion).


was (Author: jorisvandenbossche):
{quote}I'll try to take a pass this week, if time permits; we would like this 
functionality{quote}

Did you further look at this?

{quote}
By the way, is there a Jira explicitly for being able to hook into to_pandas, 
or a suggested way to efficiently do a custom Pandas conversion?)
{quote}

There is ARROW-2428 about a hook into {{to_pandas}} to specify a custom 
conversion (there is also ARROW-5271 for the other way around: being able to 
specify the final Arrow array in the pandas -> Arrow conversion).

> [Python] Define extension type API in Python to "receive" or "send" a foreign 
> extension type
> 
>
> Key: ARROW-5610
> URL: https://issues.apache.org/jira/browse/ARROW-5610
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> In work in ARROW-840, a static {{arrow.py_extension_type}} name is used. 
> There will be cases where an extension type is coming from another 
> programming language (e.g. Java), so it would be useful to be able to "plug 
> in" a Python extension type subclass that will be used to deserialize the 
> extension type coming over the wire. This has some different API requirements 
> since the serialized representation of the type will not have knowledge of 
> Python pickling, etc. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Assigned] (ARROW-5876) [FlightRPC] Implement basic auth across all languages

2019-08-02 Thread Ryan Murray (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Murray reassigned ARROW-5876:
--

Assignee: Ryan Murray

> [FlightRPC] Implement basic auth across all languages
> -
>
> Key: ARROW-5876
> URL: https://issues.apache.org/jira/browse/ARROW-5876
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: FlightRPC
>Affects Versions: 0.14.0
>Reporter: lidavidm
>Assignee: Ryan Murray
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> We should implement a set of common auth methods in Flight itself to have 
> standardized ways to do things like basic auth.
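> 
> For a sense of what would be standardized, a hedged client-side sketch using 
> pyarrow.flight's auth-handler hooks (the handshake payloads here are 
> illustrative, not an agreed-upon wire format):
> {code:python}
> import pyarrow.flight as flight
> 
> class BasicClientAuthHandler(flight.ClientAuthHandler):
>     def __init__(self, username, password):
>         super().__init__()
>         self.token = b''
>         self.credentials = (username + ':' + password).encode('utf8')
> 
>     def authenticate(self, outgoing, incoming):
>         # Send credentials once; keep whatever token the server returns.
>         outgoing.write(self.credentials)
>         self.token = incoming.read()
> 
>     def get_token(self):
>         return self.token
> {code}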



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-6107) [Go] ipc.Writer Option to skip appending data buffers

2019-08-02 Thread Nick Poorman (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16898898#comment-16898898
 ] 

Nick Poorman commented on ARROW-6107:
-

https://issues.apache.org/jira/browse/ARROW-4852 is the same use case I'm 
thinking of.

If you have an Arrow Table in C (or Python) and you want to access the data in 
Go, you can pass a pointer back from C to the underlying data buffers. However, 
you still have to collect all the metadata to utilize the buffers. Making CGO 
calls is slow, so being able to pass a pointer to the data buffers and a 
pointer to the serialized metadata would keep the cost of crossing the 
language boundary roughly constant.

I did a simple POC to demonstrate what it would take to collect all the 
information from Python and re-materialize it in Go: 
[https://github.com/nickpoorman/go-py-arrow-bridge]. The bottleneck is the 
number of CGO calls required to fetch all the metadata.

> [Go] ipc.Writer Option to skip appending data buffers
> -
>
> Key: ARROW-6107
> URL: https://issues.apache.org/jira/browse/ARROW-6107
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Go
>Reporter: Nick Poorman
>Priority: Minor
>
> For cases where we have a known shared memory region, it would be great if 
> the ipc.Writer (and by extension ipc.Reader?) had the ability to write out 
> everything but the actual buffers holding the data. That way we can still 
> utilize the ipc mechanisms to communicate without having to serialize all the 
> underlying data across the wire.
>  
> This seems like it should be possible since the `RecordBatch` flatbuffers 
> only contain the metadata and the underlying data buffers are appended later. 
> We just need to skip appending the underlying data buffers.
>  
> [~sbinet] thoughts?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-6107) [Go] ipc.Writer Option to skip appending data buffers

2019-08-02 Thread Sebastien Binet (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16898880#comment-16898880
 ] 

Sebastien Binet commented on ARROW-6107:


Not saying it wouldn't be advisable or doable, but: if it's already in a shmem 
region, why not just use that directly?

(And I guess it's kind of implementing 
https://issues.apache.org/jira/browse/ARROW-4852.)

> [Go] ipc.Writer Option to skip appending data buffers
> -
>
> Key: ARROW-6107
> URL: https://issues.apache.org/jira/browse/ARROW-6107
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Go
>Reporter: Nick Poorman
>Priority: Minor
>
> For cases where we have a known shared memory region, it would be great if 
> the ipc.Writer (and by extension ipc.Reader?) had the ability to write out 
> everything but the actual buffers holding the data. That way we can still 
> utilize the ipc mechanisms to communicate without having to serialize all the 
> underlying data across the wire.
>  
> This seems like it should be possible since the `RecordBatch` flatbuffers 
> only contain the metadata and the underlying data buffers are appended later. 
> We just need to skip appending the underlying data buffers.
>  
> [~sbinet] thoughts?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5876) [FlightRPC] Implement basic auth across all languages

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-5876:
--
Labels: pull-request-available  (was: )

> [FlightRPC] Implement basic auth across all languages
> -
>
> Key: ARROW-5876
> URL: https://issues.apache.org/jira/browse/ARROW-5876
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: FlightRPC
>Affects Versions: 0.14.0
>Reporter: lidavidm
>Priority: Major
>  Labels: pull-request-available
>
> We should implement a set of common auth methods in Flight itself to have 
> standardized ways to do things like basic auth.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-6106) Scala lang support

2019-08-02 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16898878#comment-16898878
 ] 

Wes McKinney commented on ARROW-6106:
-

You might want to discuss this on the mailing list.

> Scala lang support
> --
>
> Key: ARROW-6106
> URL: https://issues.apache.org/jira/browse/ARROW-6106
> Project: Apache Arrow
>  Issue Type: Wish
>Reporter: Boris V.Kuznetsov
>Priority: Major
>
> I ported testArrowStream.java to Scala Specs2 and added it to the PR.
> Please see more details in my [PR|https://github.com/apache/arrow/pull/4989].
> I'm ready to port other tests as well and add an SBT file.
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-6069) [Rust] [Parquet] Implement Converter to convert record reader to arrow primitive array.

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6069:
--
Labels: pull-request-available  (was: )

> [Rust] [Parquet] Implement Converter to convert record reader to arrow 
> primitive array.
> ---
>
> Key: ARROW-6069
> URL: https://issues.apache.org/jira/browse/ARROW-6069
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Renjie Liu
>Assignee: Renjie Liu
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-6118) [Java] Replace google Preconditions with Arrow Preconditions

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6118:
--
Labels: pull-request-available  (was: )

> [Java] Replace google Preconditions with Arrow Preconditions
> 
>
> Key: ARROW-6118
> URL: https://issues.apache.org/jira/browse/ARROW-6118
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Critical
>  Labels: pull-request-available
>
> Now in the Java code, most places use {{org.apache.arrow.util.Preconditions}}, 
> but some places still use {{com.google.common.base.Preconditions}}.
> Remove the Google Preconditions usages and remove duplicated checks at the 
> same time.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6118) [Java] Replace google Preconditions with Arrow Preconditions

2019-08-02 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6118:
-

 Summary: [Java] Replace google Preconditions with Arrow 
Preconditions
 Key: ARROW-6118
 URL: https://issues.apache.org/jira/browse/ARROW-6118
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Ji Liu
Assignee: Ji Liu


Now in the Java code, most places use {{org.apache.arrow.util.Preconditions}}, 
but some places still use {{com.google.common.base.Preconditions}}.

Remove the Google Preconditions usages and remove duplicated checks at the 
same time.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-5682) [Python] from_pandas conversion casts values to string inconsistently

2019-08-02 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16898787#comment-16898787
 ] 

Joris Van den Bossche commented on ARROW-5682:
--

This seems to be specific to the code paths dealing with numpy arrays; with 
built-in Python objects, you get a proper error:

{code}
In [9]: pa.array([1, 2, 3], pa.string())
...
ArrowTypeError: Expected a string or bytes object, got a 'int' object

In [10]: pa.array(np.array([1, 2, 3]), pa.string()) 
Out[10]: 
<pyarrow.lib.StringArray object at 0x...>
[
  "",   # <-- this is actually not an empty string but '\x01'
  "",
  ""
]
{code}

I agree that at least an error should be raised instead of those incorrect 
values.
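
In the meantime, a sketch of an explicit conversion that avoids the bogus 
values (casting on the pandas side before handing the data to pyarrow):

{code:python}
import numpy as np
import pandas as pd
import pyarrow as pa

s = pd.Series([1, 2, 3], dtype=np.uint32)
# Cast to str first, so from_pandas sees actual string objects:
arr = pa.Array.from_pandas(s.astype(str), type=pa.string())
# -> ["1", "2", "3"]
{code}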

In numpy you can cast ints to their string representation by doing an 
equivalent call:

{code}
In [13]: np.array(np.array([1, 2, 3], dtype=int), dtype=str)
Out[13]: array(['1', '2', '3'], dtype='<U21')
{code}

> [Python] from_pandas conversion casts values to string inconsistently
> -
>
> Key: ARROW-5682
> URL: https://issues.apache.org/jira/browse/ARROW-5682
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0
>Reporter: Bryan Cutler
>Priority: Minor
>
> When calling {{pa.Array.from_pandas}} with primitive data as input, and 
> casting to string with "type=pa.string()", the resulting pyarrow Array can 
> have inconsistent values. For most input the result is an empty string, but 
> for some types (int32, int64) the values are '\x01' etc.
> {noformat}
> In [8]: s = pd.Series([1, 2, 3], dtype=np.uint8)
> In [9]: pa.Array.from_pandas(s, type=pa.string()) 
>
> Out[9]: 
> <pyarrow.lib.StringArray object at 0x...>
> [
>   "",
>   "",
>   ""
> ]
> In [10]: s = pd.Series([1, 2, 3], dtype=np.uint32)
>
> In [11]: pa.Array.from_pandas(s, type=pa.string())
>
> Out[11]: 
> <pyarrow.lib.StringArray object at 0x...>
> [
>   "",
>   "",
>   ""
> ]
> {noformat}
> This came from the Spark discussion 
> https://github.com/apache/spark/pull/24930/files#r296187903. Type casting 
> this way in Spark is not supported, but it would be good to get the behavior 
> consistent. Would it be better to raise an UnsupportedOperation error?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5682) [Python] from_pandas conversion casts values to string inconsistently

2019-08-02 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-5682:
-
Issue Type: Bug  (was: Improvement)

> [Python] from_pandas conversion casts values to string inconsistently
> -
>
> Key: ARROW-5682
> URL: https://issues.apache.org/jira/browse/ARROW-5682
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0
>Reporter: Bryan Cutler
>Priority: Minor
>
> When calling {{pa.Array.from_pandas}} with primitive data as input, and 
> casting to string with "type=pa.string()", the resulting pyarrow Array can 
> have inconsistent values. For most input the result is an empty string, but 
> for some types (int32, int64) the values are '\x01' etc.
> {noformat}
> In [8]: s = pd.Series([1, 2, 3], dtype=np.uint8)
> In [9]: pa.Array.from_pandas(s, type=pa.string()) 
>
> Out[9]: 
> <pyarrow.lib.StringArray object at 0x...>
> [
>   "",
>   "",
>   ""
> ]
> In [10]: s = pd.Series([1, 2, 3], dtype=np.uint32)
>
> In [11]: pa.Array.from_pandas(s, type=pa.string())
>
> Out[11]: 
> <pyarrow.lib.StringArray object at 0x...>
> [
>   "",
>   "",
>   ""
> ]
> {noformat}
> This came from the Spark discussion 
> https://github.com/apache/spark/pull/24930/files#r296187903. Type casting 
> this way in Spark is not supported, but it would be good to get the behavior 
> consistent. Would it be better to raise an UnsupportedOperation error?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-6117) [Java] Fix the set method of FixedSizeBinaryVector

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6117:
--
Labels: pull-request-available  (was: )

> [Java] Fix the set method of FixedSizeBinaryVector
> --
>
> Key: ARROW-6117
> URL: https://issues.apache.org/jira/browse/ARROW-6117
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Minor
>  Labels: pull-request-available
>
> For the set method, if the parameter is null, it should clear the validity 
> bit. However, the current implementation throws a NullPointerException.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-6025) [Gandiva][Test] Error handling for missing timezone in castTIMESTAMP_utf8 tests

2019-08-02 Thread Pindikura Ravindra (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16898778#comment-16898778
 ] 

Pindikura Ravindra commented on ARROW-6025:
---

Thanks [~kszucs] - we'll use this Jira to handle missing timezones. I believe 
we already hit this on Windows too, and disabled the tests there.

> [Gandiva][Test] Error handling for missing timezone in castTIMESTAMP_utf8 
> tests
> ---
>
> Key: ARROW-6025
> URL: https://issues.apache.org/jira/browse/ARROW-6025
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Gandiva
>Reporter: Krisztian Szucs
>Assignee: Prudhvi Porandla
>Priority: Major
>
> I've recently enabled Gandiva in the conda C++ ursabot builders. The 
> container doesn't contain the required timezones, so the tests are failing:
> {code}
> ../src/gandiva/precompiled/time_test.cc:103: Failure
> Expected equality of these values:
>   castTIMESTAMP_utf8(context_ptr, "2000-09-23 9:45:30.920 Canada/Pacific", 37)
> Which is: 0
>   969727530920
> ../src/gandiva/precompiled/time_test.cc:105: Failure
> Expected equality of these values:
>   castTIMESTAMP_utf8(context_ptr, "2012-02-28 23:30:59 Asia/Kolkata", 32)
> Which is: 0
>   1330452059000
> ../src/gandiva/precompiled/time_test.cc:107: Failure
> Expected equality of these values:
>   castTIMESTAMP_utf8(context_ptr, "1923-10-07 03:03:03 America/New_York", 36)
> Which is: 0
>   -1459094217000
> {code}
> See build: 
> https://ci.ursalabs.org/#/builders/66/builds/3046/steps/8/logs/stdio



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Assigned] (ARROW-6025) [Gandiva][Test] Error handling for missing timezone in castTIMESTAMP_utf8 tests

2019-08-02 Thread Pindikura Ravindra (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pindikura Ravindra reassigned ARROW-6025:
-

Assignee: Prudhvi Porandla

> [Gandiva][Test] Error handling for missing timezone in castTIMESTAMP_utf8 
> tests
> ---
>
> Key: ARROW-6025
> URL: https://issues.apache.org/jira/browse/ARROW-6025
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Gandiva
>Reporter: Krisztian Szucs
>Assignee: Prudhvi Porandla
>Priority: Major
>
> I've recently enabled Gandiva in the conda C++ ursabot builders. The 
> container doesn't contain the required timezones, so the tests are failing:
> {code}
> ../src/gandiva/precompiled/time_test.cc:103: Failure
> Expected equality of these values:
>   castTIMESTAMP_utf8(context_ptr, "2000-09-23 9:45:30.920 Canada/Pacific", 37)
> Which is: 0
>   969727530920
> ../src/gandiva/precompiled/time_test.cc:105: Failure
> Expected equality of these values:
>   castTIMESTAMP_utf8(context_ptr, "2012-02-28 23:30:59 Asia/Kolkata", 32)
> Which is: 0
>   1330452059000
> ../src/gandiva/precompiled/time_test.cc:107: Failure
> Expected equality of these values:
>   castTIMESTAMP_utf8(context_ptr, "1923-10-07 03:03:03 America/New_York", 36)
> Which is: 0
>   -1459094217000
> {code}
> See build: 
> https://ci.ursalabs.org/#/builders/66/builds/3046/steps/8/logs/stdio



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6117) [Java] Fix the set method of FixedSizeBinaryVector

2019-08-02 Thread Liya Fan (JIRA)
Liya Fan created ARROW-6117:
---

 Summary: [Java] Fix the set method of FixedSizeBinaryVector
 Key: ARROW-6117
 URL: https://issues.apache.org/jira/browse/ARROW-6117
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


For the set method, if the parameter is null, it should clear the validity bit. 
However, the current implementation throws a NullPointerException.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-6116) [C++][Gandiva] Fix bug in TimedTestFilterAdd2

2019-08-02 Thread Pindikura Ravindra (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pindikura Ravindra resolved ARROW-6116.
---
   Resolution: Fixed
Fix Version/s: 0.15.0

> [C++][Gandiva] Fix bug in TimedTestFilterAdd2
> -
>
> Key: ARROW-6116
> URL: https://issues.apache.org/jira/browse/ARROW-6116
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Gandiva
>Reporter: Pindikura Ravindra
>Priority: Major
> Fix For: 0.15.0
>
>
> The test should be: f0 + f1 < f2; instead it's doing f1 + f2 < f2. This was 
> reported via a PR:
>  
> [https://github.com/apache/arrow/pull/4976]



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Comment Edited] (ARROW-6112) [Java] Update APIs to support 64-bit address space

2019-08-02 Thread Pindikura Ravindra (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16898754#comment-16898754
 ] 

Pindikura Ravindra edited comment on ARROW-6112 at 8/2/19 10:02 AM:


Sorry, I mistakenly put this Jira ID for an [unrelated 
PR|https://github.com/apache/arrow/pull/4976] - fixed now.


was (Author: pravindra):
sorry, i mistakenly put this Jira ID for an [unrelated PR 
|[https://github.com/apache/arrow/pull/4976]]- fixed now.

> [Java] Update APIs to support 64-bit address space
> --
>
> Key: ARROW-6112
> URL: https://issues.apache.org/jira/browse/ARROW-6112
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The Arrow spec allows a 64-bit address range for buffers (and arrays); we 
> should support this at the API level in Java even if the current Netty 
> backing buffers don't support it.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-6112) [Java] Update APIs to support 64-bit address space

2019-08-02 Thread Pindikura Ravindra (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16898754#comment-16898754
 ] 

Pindikura Ravindra commented on ARROW-6112:
---

Sorry, I mistakenly put this Jira ID for an [unrelated 
PR|https://github.com/apache/arrow/pull/4976] - fixed now.

> [Java] Update APIs to support 64-bit address space
> --
>
> Key: ARROW-6112
> URL: https://issues.apache.org/jira/browse/ARROW-6112
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The Arrow spec allows a 64-bit address range for buffers (and arrays); we 
> should support this at the API level in Java even if the current Netty 
> backing buffers don't support it.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Reopened] (ARROW-6112) [Java] Update APIs to support 64-bit address space

2019-08-02 Thread Pindikura Ravindra (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pindikura Ravindra reopened ARROW-6112:
---

> [Java] Update APIs to support 64-bit address space
> --
>
> Key: ARROW-6112
> URL: https://issues.apache.org/jira/browse/ARROW-6112
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The Arrow spec allows a 64-bit address range for buffers (and arrays); we 
> should support this at the API level in Java even if the current Netty 
> backing buffers don't support it.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-6112) [Java] Update APIs to support 64-bit address space

2019-08-02 Thread Pindikura Ravindra (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pindikura Ravindra updated ARROW-6112:
--
Fix Version/s: (was: 0.15.0)

> [Java] Update APIs to support 64-bit address space
> --
>
> Key: ARROW-6112
> URL: https://issues.apache.org/jira/browse/ARROW-6112
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The Arrow spec allows a 64-bit address range for buffers (and arrays); we 
> should support this at the API level in Java even if the current Netty 
> backing buffers don't support it.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Issue Comment Deleted] (ARROW-6112) [Java] Update APIs to support 64-bit address space

2019-08-02 Thread Pindikura Ravindra (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pindikura Ravindra updated ARROW-6112:
--
Comment: was deleted

(was: Issue resolved by pull request 4976
[https://github.com/apache/arrow/pull/4976])

> [Java] Update APIs to support 64-bit address space
> --
>
> Key: ARROW-6112
> URL: https://issues.apache.org/jira/browse/ARROW-6112
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The Arrow spec allows a 64-bit address range for buffers (and arrays); we 
> should support this at the API level in Java even if the current Netty 
> backing buffers don't support it.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-6112) [Java] Update APIs to support 64-bit address space

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6112:
--
Labels: pull-request-available  (was: )

> [Java] Update APIs to support 64-bit address space
> --
>
> Key: ARROW-6112
> URL: https://issues.apache.org/jira/browse/ARROW-6112
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>
> The Arrow spec allows a 64-bit address range for buffers (and arrays); we 
> should support this at the API level in Java even if the current Netty 
> backing buffers don't support it.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-6112) [Java] Update APIs to support 64-bit address space

2019-08-02 Thread Pindikura Ravindra (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pindikura Ravindra resolved ARROW-6112.
---
   Resolution: Fixed
Fix Version/s: 0.15.0

Issue resolved by pull request 4976
[https://github.com/apache/arrow/pull/4976]

> [Java] Update APIs to support 64-bit address space
> --
>
> Key: ARROW-6112
> URL: https://issues.apache.org/jira/browse/ARROW-6112
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
> Fix For: 0.15.0
>
>
> The Arrow spec allows a 64-bit address range for buffers (and arrays); we 
> should support this at the API level in Java even if the current Netty 
> backing buffers don't support it.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5917) [Java] Redesign the dictionary encoder

2019-08-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-5917:
--
Labels: pull-request-available  (was: )

> [Java] Redesign the dictionary encoder
> --
>
> Key: ARROW-5917
> URL: https://issues.apache.org/jira/browse/ARROW-5917
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
>
> The current dictionary encoder implementation 
> (org.apache.arrow.vector.dictionary.DictionaryEncoder) has heavy performance 
> overhead, which prevents it from being useful in practice:
>  # There are repeated conversions between Java objects and bytes (e.g. 
> vector.getObject(i)).
>  # Unnecessary memory copy (the vector data must be copied to the hash table).
>  # The hash table cannot be reused for encoding multiple vectors (other data 
> structure & results cannot be reused either).
> # The output vector should not be created/managed by the encoder (just like 
> in the out-of-place sorter).
>  # The hash table requires that the hashCode & equals methods be implemented 
> appropriately, but this is not guaranteed.
> We plan to implement a new one in the algorithm module, and gradually 
> deprecate the current one.
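For readers unfamiliar with the operation being redesigned: dictionary encoding 
replaces repeated values with integer indices into a deduplicated dictionary. 
A minimal illustration using the Python binding's one-shot API (shown only to 
clarify the operation; this is not the Java encoder under discussion):

{code:python}
import pyarrow as pa

arr = pa.array(['foo', 'bar', 'foo', 'foo'])

# Dictionary-encode: values become indices into a deduplicated dictionary.
encoded = arr.dictionary_encode()
print(encoded.dictionary)  # ["foo", "bar"]
print(encoded.indices)     # [0, 1, 0, 0]
{code}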



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Assigned] (ARROW-6002) [C++][Gandiva] TestCastFunctions does not test int64 casting

2019-08-02 Thread Pindikura Ravindra (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pindikura Ravindra reassigned ARROW-6002:
-

Assignee: Benjamin Kietzman

> [C++][Gandiva] TestCastFunctions does not test int64 casting
> -
>
> Key: ARROW-6002
> URL: https://issues.apache.org/jira/browse/ARROW-6002
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Gandiva
>Reporter: Benjamin Kietzman
>Assignee: Benjamin Kietzman
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> {{outputs[2]}} (corresponding to the cast from float32) is checked twice 
> (https://github.com/apache/arrow/pull/4817/files#diff-2e911c4dcae01ea2d3ce200892a0179aR478), 
> while {{outputs[1]}} (corresponding to the cast from int64) is not checked.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-6002) [C++][Gandiva] TestCastFunctions does not test int64 casting

2019-08-02 Thread Pindikura Ravindra (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pindikura Ravindra resolved ARROW-6002.
---
   Resolution: Fixed
Fix Version/s: 0.15.0

Issue resolved by pull request 4991
[https://github.com/apache/arrow/pull/4991]

> [C++][Gandiva] TestCastFunctions does not test int64 casting
> -
>
> Key: ARROW-6002
> URL: https://issues.apache.org/jira/browse/ARROW-6002
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Gandiva
>Reporter: Benjamin Kietzman
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> {{outputs[2]}} (corresponding to the cast from float32) is checked twice 
> (https://github.com/apache/arrow/pull/4817/files#diff-2e911c4dcae01ea2d3ce200892a0179aR478), 
> while {{outputs[1]}} (corresponding to the cast from int64) is not checked.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-6114) [Python] Datatypes are not preserved when a pandas dataframe partitioned and saved as parquet file using pyarrow

2019-08-02 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-6114:
-
Summary: [Python] Datatypes are not preserved when a pandas dataframe 
partitioned and saved as parquet file using pyarrow  (was: Datatypes are not 
preserved when a pandas dataframe partitioned and saved as parquet file using 
pyarrow)

> [Python] Datatypes are not preserved when a pandas dataframe partitioned and 
> saved as parquet file using pyarrow
> 
>
> Key: ARROW-6114
> URL: https://issues.apache.org/jira/browse/ARROW-6114
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.1
> Environment: Python 3.7.3
> pyarrow 0.14.1
>Reporter: Naga
>Priority: Major
>  Labels: dataset, parquet
>
> h3. Datatypes are not preserved when a pandas data frame is *partitioned* and 
> saved as a parquet file using pyarrow, but that's not the case when the data 
> frame is not partitioned.
> *Case 1: Saving a partitioned dataset - Data Types are NOT preserved*
> {code:python}
> # Saving a pandas DataFrame to local as a partitioned parquet file using pyarrow
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> df = pd.DataFrame({'age': [77, 32, 234], 'name': ['agan', 'bbobby', 'test']})
> path = 'test'
> partition_cols = ['age']
> print('Datatypes before saving the dataset')
> print(df.dtypes)
> table = pa.Table.from_pandas(df)
> pq.write_to_dataset(table, path, partition_cols=partition_cols,
>                     preserve_index=False)
> # Loading a partitioned parquet dataset from local
> df = pq.ParquetDataset(path, filesystem=None).read_pandas().to_pandas()
> print('\nDatatypes after loading the dataset')
> print(df.dtypes)
> {code}
> *Output:*
> {code}
> Datatypes before saving the dataset
> age int64
> name object
> dtype: object
> Datatypes after loading the dataset
> name object
> age category
> dtype: object
> {code}
> h5. {color:#d04437}From the above output, we can see that the data type of 
> age is int64 in the original pandas data frame, but it changed to category 
> after being saved to local and loaded back.{color}
> *Case 2: Non-partitioned dataset - Data types are preserved*
> {code:python}
> # Saving a pandas DataFrame to local as a parquet file without partitioning
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> df = pd.DataFrame({'age': [77, 32, 234], 'name': ['agan', 'bbobby', 'test']})
> path = 'test_without_partition'
> print('Datatypes before saving the dataset')
> print(df.dtypes)
> table = pa.Table.from_pandas(df)
> pq.write_to_dataset(table, path, preserve_index=False)
> # Loading a non-partitioned parquet file from local
> df = pq.ParquetDataset(path, filesystem=None).read_pandas().to_pandas()
> print('\nDatatypes after loading the dataset')
> print(df.dtypes)
> {code}
> *Output:*
> {code}
> Datatypes before saving the dataset
> age int64
> name object
> dtype: object
> Datatypes after loading the dataset
> age int64
> name object
> dtype: object
> {code}
> *Versions*
>  * Python 3.7.3
>  * pyarrow 0.14.1



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-6114) Datatypes are not preserved when a pandas dataframe partitioned and saved as parquet file using pyarrow

2019-08-02 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-6114:
-
Labels: dataset parquet  (was: parquet)

> Datatypes are not preserved when a pandas dataframe partitioned and saved as 
> parquet file using pyarrow
> ---
>
> Key: ARROW-6114
> URL: https://issues.apache.org/jira/browse/ARROW-6114
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.1
> Environment: Python 3.7.3
> pyarrow 0.14.1
>Reporter: Naga
>Priority: Major
>  Labels: dataset, parquet
>
> h3. Datatypes are not preserved when a pandas data frame is *partitioned* and 
> saved as a parquet file using pyarrow, but that's not the case when the data 
> frame is not partitioned.
> *Case 1: Saving a partitioned dataset - Data Types are NOT preserved*
> {code:python}
> # Saving a pandas DataFrame to local as a partitioned parquet file using pyarrow
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> df = pd.DataFrame({'age': [77, 32, 234], 'name': ['agan', 'bbobby', 'test']})
> path = 'test'
> partition_cols = ['age']
> print('Datatypes before saving the dataset')
> print(df.dtypes)
> table = pa.Table.from_pandas(df)
> pq.write_to_dataset(table, path, partition_cols=partition_cols,
>                     preserve_index=False)
> # Loading a partitioned parquet dataset from local
> df = pq.ParquetDataset(path, filesystem=None).read_pandas().to_pandas()
> print('\nDatatypes after loading the dataset')
> print(df.dtypes)
> {code}
> *Output:*
> {code}
> Datatypes before saving the dataset
> age int64
> name object
> dtype: object
> Datatypes after loading the dataset
> name object
> age category
> dtype: object
> {code}
> h5. {color:#d04437}From the above output, we can see that the data type of 
> age is int64 in the original pandas data frame, but it changed to category 
> after being saved to local and loaded back.{color}
> *Case 2: Non-partitioned dataset - Data types are preserved*
> {code:python}
> # Saving a pandas DataFrame to local as a parquet file without partitioning
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> df = pd.DataFrame({'age': [77, 32, 234], 'name': ['agan', 'bbobby', 'test']})
> path = 'test_without_partition'
> print('Datatypes before saving the dataset')
> print(df.dtypes)
> table = pa.Table.from_pandas(df)
> pq.write_to_dataset(table, path, preserve_index=False)
> # Loading a non-partitioned parquet file from local
> df = pq.ParquetDataset(path, filesystem=None).read_pandas().to_pandas()
> print('\nDatatypes after loading the dataset')
> print(df.dtypes)
> {code}
> *Output:*
> {code}
> Datatypes before saving the dataset
> age int64
> name object
> dtype: object
> Datatypes after loading the dataset
> age int64
> name object
> dtype: object
> {code}
> *Versions*
>  * Python 3.7.3
>  * pyarrow 0.14.1



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-6114) Datatypes are not preserved when a pandas dataframe partitioned and saved as parquet file using pyarrow

2019-08-02 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16898737#comment-16898737
 ] 

Joris Van den Bossche commented on ARROW-6114:
--

[~bnriiitb] thanks for opening the issue. 

So when a partitioned dataset is written, the partition columns are not stored 
in the actual data, but are part of the directory schema (in your case you 
would have "age=77", "age=32", etc. sub-folders). 

Currently, we don't save any metadata about the columns used to partition, 
and since they are also not stored in the actual parquet files (where a schema 
of the data is stored), we don't have that information from there either.

So when reading a partitioned dataset, (py)arrow does not have much information 
about the type of the partition column. Currently, the logic is to try to 
convert the values to ints and otherwise leave them as strings; those values 
are then converted to a dictionary type (corresponding to the categorical type 
in pandas). This logic is here: 
https://github.com/apache/arrow/blob/06fd2da5e8e71b660e6eea4b7702ca175e31f3f5/python/pyarrow/parquet.py#L585-L609

There is currently no option to change this. So right now, the workaround is to 
convert the categorical back to an integer column in pandas (see the sketch 
below). But longer term, we should maybe think about storing the type of the 
partition keys as metadata, and adding an option to restore them as dictionary 
columns or not.
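For reference, a minimal sketch of that workaround (using the 'test' dataset 
path and 'age' partition column from the example in this issue):

{code:python}
import pyarrow.parquet as pq

# The partition column comes back as a categorical (dictionary) column.
df = pq.ParquetDataset('test', filesystem=None).read_pandas().to_pandas()

# Workaround: cast the categorical partition column back to its original
# integer dtype (works because the categories themselves are ints).
df['age'] = df['age'].astype('int64')
print(df.dtypes)
{code}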

Related issues about the type of the partition column: ARROW-3388 (booleans as 
strings), ARROW-5666 (strings with underscores interpreted as int)

> Datatypes are not preserved when a pandas dataframe partitioned and saved as 
> parquet file using pyarrow
> ---
>
> Key: ARROW-6114
> URL: https://issues.apache.org/jira/browse/ARROW-6114
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.1
> Environment: Python 3.7.3
> pyarrow 0.14.1
>Reporter: Naga
>Priority: Major
>  Labels: parquet
>
> h3. Datatypes are not preserved when a pandas data frame is *partitioned* and 
> saved as a parquet file using pyarrow, but that's not the case when the data 
> frame is not partitioned.
> *Case 1: Saving a partitioned dataset - Data Types are NOT preserved*
> {code:python}
> # Saving a pandas DataFrame to local as a partitioned parquet file using pyarrow
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> df = pd.DataFrame({'age': [77, 32, 234], 'name': ['agan', 'bbobby', 'test']})
> path = 'test'
> partition_cols = ['age']
> print('Datatypes before saving the dataset')
> print(df.dtypes)
> table = pa.Table.from_pandas(df)
> pq.write_to_dataset(table, path, partition_cols=partition_cols,
>                     preserve_index=False)
> # Loading a partitioned parquet dataset from local
> df = pq.ParquetDataset(path, filesystem=None).read_pandas().to_pandas()
> print('\nDatatypes after loading the dataset')
> print(df.dtypes)
> {code}
> *Output:*
> {code}
> Datatypes before saving the dataset
> age int64
> name object
> dtype: object
> Datatypes after loading the dataset
> name object
> age category
> dtype: object
> {code}
> h5. {color:#d04437}From the above output, we can see that the data type of 
> age is int64 in the original pandas data frame, but it changed to category 
> after being saved to local and loaded back.{color}
> *Case 2: Non-partitioned dataset - Data types are preserved*
> {code:python}
> # Saving a pandas DataFrame to local as a parquet file without partitioning
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> df = pd.DataFrame({'age': [77, 32, 234], 'name': ['agan', 'bbobby', 'test']})
> path = 'test_without_partition'
> print('Datatypes before saving the dataset')
> print(df.dtypes)
> table = pa.Table.from_pandas(df)
> pq.write_to_dataset(table, path, preserve_index=False)
> # Loading a non-partitioned parquet file from local
> df = pq.ParquetDataset(path, filesystem=None).read_pandas().to_pandas()
> print('\nDatatypes after loading the dataset')
> print(df.dtypes)
> {code}
> *Output:*
> {code}
> Datatypes before saving the dataset
> age int64
> name object
> dtype: object
> Datatypes after loading the dataset
> age int64
> name object
> dtype: object
> {code}
> *Versions*
>  * Python 3.7.3
>  * pyarrow 0.14.1



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-5480) [Python] Pandas categorical type doesn't survive a round-trip through parquet

2019-08-02 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16898725#comment-16898725
 ] 

Joris Van den Bossche commented on ARROW-5480:
--

{quote}One slightly higher level issue is the extent to which we store Arrow 
schema information in the Parquet metadata.
{quote}

Possibly related to ARROW-5888, where we also need to store arrow-specific 
metadata for a faithful roundtrip (in that case the timezone). 

Spark stores all the column types (and optional column metadata) in the 
key_value_metadata of the FileMetadata:

For example, for a file with a single int column:

{code}
>>> meta = pq.read_metadata('test_pyspark_dataset/_metadata')
>>> meta.metadata
{b'org.apache.spark.sql.parquet.row.metadata': 
b'{"type":"struct","fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}]}'}
{code}
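For comparison, a hedged sketch of writing and reading such key/value metadata 
from pyarrow (the file name 'example.parquet' and the key 'my.custom.key' are 
made up for illustration, not a proposed convention):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'a': pa.array([1, 2, 3], pa.int32())})

# Attach custom key/value metadata to the schema; it is written into the
# Parquet FileMetaData just like the Spark entry shown above.
table = table.replace_schema_metadata({'my.custom.key': '{"a": "int32"}'})
pq.write_table(table, 'example.parquet')

# Read it back from the file footer without loading any data.
print(pq.read_metadata('example.parquet').metadata)
{code}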




> [Python] Pandas categorical type doesn't survive a round-trip through parquet
> -
>
> Key: ARROW-5480
> URL: https://issues.apache.org/jira/browse/ARROW-5480
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.11.1, 0.13.0
> Environment: python: 3.7.3.final.0
> python-bits: 64
> OS: Linux
> OS-release: 5.0.0-15-generic
> machine: x86_64
> processor: x86_64
> byteorder: little
> pandas: 0.24.2
> numpy: 1.16.4
> pyarrow: 0.13.0
>Reporter: Karl Dunkle Werner
>Priority: Minor
>
> Writing a string categorical variable from pandas to parquet and reading it 
> back yields a string (object dtype) column. I expected it to be read back as 
> category.
> The same thing happens if the category is numeric -- a numeric category is 
> read back as int64.
> In the code below, I tried out an in-memory arrow Table, which successfully 
> translates categories back to pandas. However, when I go through a parquet 
> file, the category type is not preserved.
> In the scheme of things, this isn't a big deal, but it's a small surprise.
> {code:python}
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame({'x': pd.Categorical(['a', 'a', 'b', 'b'])})
> df.dtypes  # category
> # This works:
> pa.Table.from_pandas(df).to_pandas().dtypes  # category
> df.to_parquet("categories.parquet")
> # This reads back object, but I expected category
> pd.read_parquet("categories.parquet").dtypes  # object
> # Numeric categories have the same issue:
> df_num = pd.DataFrame({'x': pd.Categorical([1, 1, 2, 2])})
> df_num.dtypes # category
> pa.Table.from_pandas(df_num).to_pandas().dtypes  # category
> df_num.to_parquet("categories_num.parquet")
> # This reads back int64, but I expected category
> pd.read_parquet("categories_num.parquet").dtypes  # int64
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-6114) Datatypes are not preserved when a pandas dataframe partitioned and saved as parquet file using pyarrow

2019-08-02 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-6114:
-
Labels: parquet  (was: )

> Datatypes are not preserved when a pandas dataframe partitioned and saved as 
> parquet file using pyarrow
> ---
>
> Key: ARROW-6114
> URL: https://issues.apache.org/jira/browse/ARROW-6114
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.1
> Environment: Python 3.7.3
> pyarrow 0.14.1
>Reporter: Naga
>Priority: Major
>  Labels: parquet
>
> h3. Datatypes are not preserved when a pandas data frame is *partitioned* and 
> saved as a parquet file using pyarrow, but that's not the case when the data 
> frame is not partitioned.
> *Case 1: Saving a partitioned dataset - Data Types are NOT preserved*
> {code:python}
> # Saving a pandas DataFrame to local as a partitioned parquet file using pyarrow
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> df = pd.DataFrame({'age': [77, 32, 234], 'name': ['agan', 'bbobby', 'test']})
> path = 'test'
> partition_cols = ['age']
> print('Datatypes before saving the dataset')
> print(df.dtypes)
> table = pa.Table.from_pandas(df)
> pq.write_to_dataset(table, path, partition_cols=partition_cols,
>                     preserve_index=False)
> # Loading a partitioned parquet dataset from local
> df = pq.ParquetDataset(path, filesystem=None).read_pandas().to_pandas()
> print('\nDatatypes after loading the dataset')
> print(df.dtypes)
> {code}
> *Output:*
> {code}
> Datatypes before saving the dataset
> age int64
> name object
> dtype: object
> Datatypes after loading the dataset
> name object
> age category
> dtype: object
> {code}
> h5. {color:#d04437}From the above output, we can see that the data type of 
> age is int64 in the original pandas data frame, but it changed to category 
> after being saved to local and loaded back.{color}
> *Case 2: Non-partitioned dataset - Data types are preserved*
> {code:python}
> # Saving a pandas DataFrame to local as a parquet file without partitioning
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> df = pd.DataFrame({'age': [77, 32, 234], 'name': ['agan', 'bbobby', 'test']})
> path = 'test_without_partition'
> print('Datatypes before saving the dataset')
> print(df.dtypes)
> table = pa.Table.from_pandas(df)
> pq.write_to_dataset(table, path, preserve_index=False)
> # Loading a non-partitioned parquet file from local
> df = pq.ParquetDataset(path, filesystem=None).read_pandas().to_pandas()
> print('\nDatatypes after loading the dataset')
> print(df.dtypes)
> {code}
> *Output:*
> {code}
> Datatypes before saving the dataset
> age int64
> name object
> dtype: object
> Datatypes after loading the dataset
> age int64
> name object
> dtype: object
> {code}
> *Versions*
>  * Python 3.7.3
>  * pyarrow 0.14.1



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6115) [Python] support LargeList, LargeString, LargeBinary in conversion to pandas

2019-08-02 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-6115:


 Summary: [Python] support LargeList, LargeString, LargeBinary in 
conversion to pandas
 Key: ARROW-6115
 URL: https://issues.apache.org/jira/browse/ARROW-6115
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


General Python support for these 3 new types has been added: ARROW-6000, 
ARROW-6084.

However, one aspect that is not yet implemented is the conversion to pandas (or 
to a numpy array):

{code}
In [67]: a = pa.array(['a', 'b', 'c'], pa.large_string()) 

In [68]: a.to_pandas() 
...
ArrowNotImplementedError: large_utf8

In [69]: pa.table({'a': a}).to_pandas()
...
ArrowNotImplementedError: No known equivalent Pandas block for Arrow data of 
type large_string is known.
{code}
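Until this is implemented, a possible interim workaround is to cast down to the 
regular 32-bit-offset type first. This is only a sketch: it assumes a cast 
kernel from the large type is available and that the data fits 32-bit offsets:

{code:python}
import pyarrow as pa

a = pa.array(['a', 'b', 'c'], pa.large_string())

# Cast down to the regular string type, which to_pandas() already supports.
s = a.cast(pa.string())
print(s.to_pandas())
{code}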



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-6106) Scala lang support

2019-08-02 Thread Boris V.Kuznetsov (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16898673#comment-16898673
 ] 

Boris V.Kuznetsov commented on ARROW-6106:
--

You may run those tests from my integration project: 
{{https://github.com/Neurodyne/apache-arrow-parquet}}

> Scala lang support
> --
>
> Key: ARROW-6106
> URL: https://issues.apache.org/jira/browse/ARROW-6106
> Project: Apache Arrow
>  Issue Type: Wish
>Reporter: Boris V.Kuznetsov
>Priority: Major
>
> I ported the testArrowStream.java to Scala Specs2 and added it to the PR.
> Please see more details in my [PR|https://github.com/apache/arrow/pull/4989].
> I'm ready to port other tests as well and add an SBT file.
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)