[jira] [Created] (ARROW-5484) [Java] remove FieldReader from ValueVector

2019-06-02 Thread Pindikura Ravindra (JIRA)
Pindikura Ravindra created ARROW-5484:
-

 Summary: [Java] remove FieldReader from ValueVector
 Key: ARROW-5484
 URL: https://issues.apache.org/jira/browse/ARROW-5484
 Project: Apache Arrow
  Issue Type: Task
  Components: Java
Reporter: Pindikura Ravindra
Assignee: Pindikura Ravindra


Every implementation of ValueVector holds an instance of FieldReader, which adds 
28 bytes of heap overhead per vector. This can be avoided by instantiating the 
reader only when it is required.
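
A minimal sketch of the lazy-instantiation idea, written in Python for brevity rather than Java. The names `ReaderSketch` and `LazyReaderVector` are hypothetical and do not correspond to Arrow classes; the only point is that the reader is allocated on first access instead of in the constructor.

```python
class ReaderSketch:
    """Hypothetical reader; stands in for a FieldReader-like helper."""

    def __init__(self, values):
        self._values = values

    def read(self, index):
        return self._values[index]


class LazyReaderVector:
    """Hypothetical vector that defers creating its reader until first use."""

    def __init__(self, values):
        self._values = list(values)
        self._reader = None  # no reader object is allocated up front

    def get_reader(self):
        # Allocate the reader the first time it is requested, then reuse it.
        if self._reader is None:
            self._reader = ReaderSketch(self._values)
        return self._reader


vector = LazyReaderVector([10, 20, 30])
reader = vector.get_reader()  # the reader is created here, not in __init__
```

Vectors that are never read through a reader would then never pay for one.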



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5483) [Java] add ValueVector constructors that take a Field object

2019-06-02 Thread Pindikura Ravindra (JIRA)
Pindikura Ravindra created ARROW-5483:
-

 Summary: [Java] add ValueVector constructors that take a Field 
object
 Key: ARROW-5483
 URL: https://issues.apache.org/jira/browse/ARROW-5483
 Project: Apache Arrow
  Issue Type: Task
  Components: Java
Reporter: Pindikura Ravindra
Assignee: Pindikura Ravindra


Each instance of a ValueVector instantiates Field and FieldType objects, which 
together consume 81 bytes of heap space. This duplication can be avoided in cases 
where all the ValueVectors belong to the same set of columns/schema.
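
One way to picture the proposal, as a Python sketch rather than the actual Java change: build the field description once per column and pass the same object to every vector of that column. `FieldSketch` and `VectorSketch` are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSketch:
    """Hypothetical immutable column description (name, type, nullability)."""
    name: str
    type_name: str
    nullable: bool = True


class VectorSketch:
    """Hypothetical vector whose constructor takes an existing field object."""

    def __init__(self, field):
        self.field = field  # keeps a reference; no per-vector copy is made
        self.values = []


# One field object per column, reused by every vector of that column.
shared_field = FieldSketch(name="x", type_name="int32")
vectors = [VectorSketch(shared_field) for _ in range(3)]
```

Making the shared object immutable is what keeps the sharing safe: no vector can change the description seen by the others.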





[jira] [Updated] (ARROW-5482) [Java] reduce heap footprint of ValueVectors

2019-06-02 Thread Pindikura Ravindra (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pindikura Ravindra updated ARROW-5482:
--
Summary: [Java] reduce heap footprint of ValueVectors  (was: reduce heap 
footprint of ValueVectors)

> [Java] reduce heap footprint of ValueVectors
> 
>
> Key: ARROW-5482
> URL: https://issues.apache.org/jira/browse/ARROW-5482
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Java
>Reporter: Pindikura Ravindra
>Assignee: Pindikura Ravindra
>Priority: Major
>
> In some scenarios, we hold lots of value vectors in memory, e.g. during joins 
> and aggregations. Heap analysis (using VisualVM on a Mac) shows the following 
> costs for a simple IntVector:
>  
> IntVector : 80 bytes
> vector.types.pojo.FieldType : 41 bytes
> vector.types.pojo.Field : 40 bytes
> IntReaderImpl : 28 bytes
>  
> I'll use this Jira to track ways to reduce the heap usage.
>  





[jira] [Created] (ARROW-5482) reduce heap footprint of ValueVectors

2019-06-02 Thread Pindikura Ravindra (JIRA)
Pindikura Ravindra created ARROW-5482:
-

 Summary: reduce heap footprint of ValueVectors
 Key: ARROW-5482
 URL: https://issues.apache.org/jira/browse/ARROW-5482
 Project: Apache Arrow
  Issue Type: Task
  Components: Java
Reporter: Pindikura Ravindra
Assignee: Pindikura Ravindra


In some scenarios, we hold lots of value vectors in memory, e.g. during joins 
and aggregations. Heap analysis (using VisualVM on a Mac) shows the following 
costs for a simple IntVector:

IntVector : 80 bytes
vector.types.pojo.FieldType : 41 bytes
vector.types.pojo.Field : 40 bytes
IntReaderImpl : 28 bytes

I'll use this Jira to track ways to reduce the heap usage.
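
As a rough worked example, the four figures listed above can be added up to see what is at stake per vector. This sketch assumes the sizes are independent of one another (that the 80 bytes for IntVector does not already include the other three objects), which the issue does not state.

```python
# Per-object heap costs in bytes, as reported in this issue.
COSTS = {
    "IntVector": 80,
    "FieldType": 41,
    "Field": 40,
    "IntReaderImpl": 28,
}

# Objects that could be shared across vectors or created lazily.
# Treating all three as fully avoidable is an optimistic assumption.
AVOIDABLE = ["FieldType", "Field", "IntReaderImpl"]

total_per_vector = sum(COSTS.values())
avoidable_per_vector = sum(COSTS[name] for name in AVOIDABLE)
remaining_per_vector = total_per_vector - avoidable_per_vector

print(total_per_vector, avoidable_per_vector, remaining_per_vector)
```

Under that assumption the total is 189 bytes per vector, of which 109 bytes (a little over half) come from the Field, FieldType and reader objects.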

 





[jira] [Resolved] (ARROW-5461) [Java] Add micro-benchmarks for Float8Vector and allocators

2019-06-02 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-5461.

   Resolution: Fixed
Fix Version/s: 0.14.0

Issue resolved by pull request 4430
[https://github.com/apache/arrow/pull/4430]

> [Java] Add micro-benchmarks for Float8Vector and allocators
> ---
>
> Key: ARROW-5461
> URL: https://issues.apache.org/jira/browse/ARROW-5461
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Over the past few days, we have been working on some performance-related 
> issues. In the process, we created some benchmarks to help us verify 
> performance results.
> Now we want to add these micro-benchmarks to the code base, in the hope that 
> they will help with performance-related decisions and help avoid 
> performance regressions. 





[jira] [Updated] (ARROW-5256) [Packaging][deb] Failed to build with LLVM 7.1.0

2019-06-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-5256:
--
Labels: pull-request-available  (was: )

> [Packaging][deb] Failed to build with LLVM 7.1.0
> 
>
> Key: ARROW-5256
> URL: https://issues.apache.org/jira/browse/ARROW-5256
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Gandiva, Packaging
>Reporter: Sutou Kouhei
>Assignee: Sutou Kouhei
>Priority: Major
>  Labels: pull-request-available
>
> https://travis-ci.org/ursa-labs/crossbow/builds/527710714#L6144-L6157
> {noformat}
> CMake Error at cmake_modules/FindLLVM.cmake:33 (find_package):
>   Could not find a configuration file for package "LLVM" that is compatible
>   with requested version "7.0".
>   The following configuration files were considered but not accepted:
> /usr/lib/llvm-7/cmake/LLVMConfig.cmake, version: 7.1.0
> /usr/lib/llvm-7/lib/cmake/llvm/LLVMConfig.cmake, version: 7.1.0
> /usr/lib/llvm-7/share/llvm/cmake/LLVMConfig.cmake, version: 7.1.0
> /usr/lib/llvm-3.8/share/llvm/cmake/LLVMConfig.cmake, version: 3.8.1
> /usr/share/llvm-3.8/cmake/LLVMConfig.cmake, version: 3.8.1
> Call Stack (most recent call first):
>   src/gandiva/CMakeLists.txt:31 (find_package)
> {noformat}
> Can we use "7" instead of "7.0" for {{ARROW_LLVM_VERSION}}?





[jira] [Assigned] (ARROW-5256) [Packaging][deb] Failed to build with LLVM 7.1.0

2019-06-02 Thread Sutou Kouhei (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sutou Kouhei reassigned ARROW-5256:
---

Assignee: Sutou Kouhei

> [Packaging][deb] Failed to build with LLVM 7.1.0
> 
>
> Key: ARROW-5256
> URL: https://issues.apache.org/jira/browse/ARROW-5256
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Gandiva, Packaging
>Reporter: Sutou Kouhei
>Assignee: Sutou Kouhei
>Priority: Major
>
> https://travis-ci.org/ursa-labs/crossbow/builds/527710714#L6144-L6157
> {noformat}
> CMake Error at cmake_modules/FindLLVM.cmake:33 (find_package):
>   Could not find a configuration file for package "LLVM" that is compatible
>   with requested version "7.0".
>   The following configuration files were considered but not accepted:
> /usr/lib/llvm-7/cmake/LLVMConfig.cmake, version: 7.1.0
> /usr/lib/llvm-7/lib/cmake/llvm/LLVMConfig.cmake, version: 7.1.0
> /usr/lib/llvm-7/share/llvm/cmake/LLVMConfig.cmake, version: 7.1.0
> /usr/lib/llvm-3.8/share/llvm/cmake/LLVMConfig.cmake, version: 3.8.1
> /usr/share/llvm-3.8/cmake/LLVMConfig.cmake, version: 3.8.1
> Call Stack (most recent call first):
>   src/gandiva/CMakeLists.txt:31 (find_package)
> {noformat}
> Can we use "7" instead of "7.0" for {{ARROW_LLVM_VERSION}}?





[jira] [Commented] (ARROW-5480) [Python] Pandas categorical type doesn't survive a round-trip through parquet

2019-06-02 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16854163#comment-16854163
 ] 

Wes McKinney commented on ARROW-5480:
-

Parquet has dictionary encoding as a compression strategy but does not have 
a Categorical type per se. As part of ARROW-3246 we should eventually be able to 
preserve Categorical through Parquet round trips, but there are some tricky 
issues to sort out.
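
To illustrate the distinction, here is a simplified pure-Python sketch of dictionary encoding. It is not Parquet's actual on-disk format, and the function names are hypothetical. Decoding yields the plain values again, and nothing in the encoded form records that the column was categorical on the pandas side.

```python
def dictionary_encode(values):
    """Return (dictionary, indices): unique values in first-seen order,
    plus one index into that dictionary per input value."""
    dictionary = []
    positions = {}
    indices = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    """Expand the indices back into plain values."""
    return [dictionary[i] for i in indices]


column = ['a', 'a', 'b', 'b']
dictionary, indices = dictionary_encode(column)
decoded = dictionary_decode(dictionary, indices)
```

The encoding is purely a storage detail: `decoded` is an ordinary list of strings, so a reader needs extra type information to know it should rebuild a categorical column.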

> [Python] Pandas categorical type doesn't survive a round-trip through parquet
> -
>
> Key: ARROW-5480
> URL: https://issues.apache.org/jira/browse/ARROW-5480
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.11.1, 0.13.0
> Environment: python: 3.7.3.final.0
> python-bits: 64
> OS: Linux
> OS-release: 5.0.0-15-generic
> machine: x86_64
> processor: x86_64
> byteorder: little
> pandas: 0.24.2
> numpy: 1.16.4
> pyarrow: 0.13.0
>Reporter: Karl Dunkle Werner
>Priority: Minor
>
> A string categorical variable written from pandas to Parquet is read back as 
> string (object dtype). I expected it to be read as category.
> The same thing happens if the category is numeric: a numeric category is 
> read back as int64.
> In the code below, I tried out an in-memory Arrow Table, which successfully 
> translates categories back to pandas. However, when I write to a Parquet 
> file, the categories are not preserved.
> In the scheme of things, this isn't a big deal, but it's a small surprise.
> {code:python}
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame({'x': pd.Categorical(['a', 'a', 'b', 'b'])})
> df.dtypes  # category
> # This works:
> pa.Table.from_pandas(df).to_pandas().dtypes  # category
> df.to_parquet("categories.parquet")
> # This reads back object, but I expected category
> pd.read_parquet("categories.parquet").dtypes  # object
> # Numeric categories have the same issue:
> df_num = pd.DataFrame({'x': pd.Categorical([1, 1, 2, 2])})
> df_num.dtypes # category
> pa.Table.from_pandas(df_num).to_pandas().dtypes  # category
> df_num.to_parquet("categories_num.parquet")
> # This reads back int64, but I expected category
> pd.read_parquet("categories_num.parquet").dtypes  # int64
> {code}





[jira] [Commented] (ARROW-5474) [C++] What version of Boost do we require now?

2019-06-02 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16854159#comment-16854159
 ] 

Wes McKinney commented on ARROW-5474:
-

I'm fine with requiring a recent version, since we offer a vendored build option.

> [C++] What version of Boost do we require now?
> --
>
> Key: ARROW-5474
> URL: https://issues.apache.org/jira/browse/ARROW-5474
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Neal Richardson
>Assignee: Antoine Pitrou
>Priority: Major
> Fix For: 0.14.0
>
>
> See debugging on https://issues.apache.org/jira/browse/ARROW-5470. One 
> possible cause for that error is that the local filesystem patch increased 
> the version of boost that we actually require. The boost version (1.54 vs 
> 1.58) was one difference between failure and success. 
> Another point of confusion was that CMake reported two different versions of 
> boost at different times. 
> If we require a minimum version of boost, can we document that better, check 
> for it more accurately in the build scripts, and fail with a useful message 
> if that minimum isn't met? Or something else helpful.
> If the actual cause of the failure was something else (e.g. compiler 
> version), we should figure that out too.
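
The kind of check requested above can be sketched as follows, in Python rather than CMake. The minimum of 1.58 used here is an assumption taken only from the observation that 1.54 failed and 1.58 succeeded; the real requirement has not been established. The function names are hypothetical.

```python
def parse_version(text, width=3):
    """Turn '1.58' into (1, 58, 0) so versions compare component-wise."""
    parts = [int(part) for part in text.split(".")]
    parts += [0] * (width - len(parts))  # pad missing components with zeros
    return tuple(parts)


def check_minimum(found, minimum, name="Boost"):
    """Raise an error with a descriptive message if `found` is too old."""
    if parse_version(found) < parse_version(minimum):
        raise RuntimeError(
            f"{name} {found} was found, but version {minimum} or newer is "
            f"required. Install a newer {name} or use the vendored build."
        )


ASSUMED_MINIMUM = "1.58"  # assumption for illustration, not a confirmed minimum
check_minimum("1.58.0", ASSUMED_MINIMUM)  # passes silently
```

Padding to a fixed width matters: without it, `(1, 58)` would compare as older than `(1, 58, 0)` and a valid installation would be rejected.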





[jira] [Commented] (ARROW-5481) [GLib] garrow_seekable_input_stream_peek() misses "error" parameter document

2019-06-02 Thread Sutou Kouhei (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16854158#comment-16854158
 ] 

Sutou Kouhei commented on ARROW-5481:
-

[~shiro615] Could you work on this?

> [GLib] garrow_seekable_input_stream_peek() misses "error" parameter document
> 
>
> Key: ARROW-5481
> URL: https://issues.apache.org/jira/browse/ARROW-5481
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: GLib
>Reporter: Sutou Kouhei
>Assignee: Yosuke Shiro
>Priority: Minor
>
> https://github.com/apache/arrow/blob/master/c_glib/arrow-glib/input-stream.cpp#L402
> This is follow-up work of 
> https://github.com/apache/arrow/commit/ff2ee42092c09d13e38205fedd3acbdf375199f0





[jira] [Created] (ARROW-5481) [GLib] garrow_seekable_input_stream_peek() misses "error" parameter document

2019-06-02 Thread Sutou Kouhei (JIRA)
Sutou Kouhei created ARROW-5481:
---

 Summary: [GLib] garrow_seekable_input_stream_peek() misses "error" 
parameter document
 Key: ARROW-5481
 URL: https://issues.apache.org/jira/browse/ARROW-5481
 Project: Apache Arrow
  Issue Type: Improvement
  Components: GLib
Reporter: Sutou Kouhei
Assignee: Yosuke Shiro


https://github.com/apache/arrow/blob/master/c_glib/arrow-glib/input-stream.cpp#L402

This is follow-up work of 
https://github.com/apache/arrow/commit/ff2ee42092c09d13e38205fedd3acbdf375199f0





[jira] [Updated] (ARROW-1261) [Java] Add container type for Map logical type

2019-06-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-1261:
--
Labels: pull-request-available  (was: )

> [Java] Add container type for Map logical type
> --
>
> Key: ARROW-1261
> URL: https://issues.apache.org/jira/browse/ARROW-1261
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Wes McKinney
>Assignee: Bryan Cutler
>Priority: Major
>  Labels: pull-request-available
>
> As follow up to ARROW-1246





[jira] [Updated] (ARROW-5476) [Java][Memory] Fix Netty ArrowBuf Slice

2019-06-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-5476:
--
Labels: pull-request-available  (was: )

> [Java][Memory] Fix Netty ArrowBuf Slice
> ---
>
> Key: ARROW-5476
> URL: https://issues.apache.org/jira/browse/ARROW-5476
> Project: Apache Arrow
>  Issue Type: Task
>Affects Versions: 0.14.0
>Reporter: Praveen Kumar Desabandu
>Assignee: Praveen Kumar Desabandu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Slicing a Netty arrow buf currently depends on the arrow buf reader and writer 
> indexes, but arrow buf is supposed to track only a memory address and a 
> length, and there are places where its indexes are out of sync with Netty.
> So slice should use the indexes in the Netty arrow buf instead.





[jira] [Created] (ARROW-5480) [Python] Pandas categorical type doesn't survive a round-trip through parquet

2019-06-02 Thread Karl Dunkle Werner (JIRA)
Karl Dunkle Werner created ARROW-5480:
-

 Summary: [Python] Pandas categorical type doesn't survive a 
round-trip through parquet
 Key: ARROW-5480
 URL: https://issues.apache.org/jira/browse/ARROW-5480
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 0.13.0, 0.11.1
 Environment: python: 3.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 5.0.0-15-generic
machine: x86_64
processor: x86_64
byteorder: little
pandas: 0.24.2
numpy: 1.16.4
pyarrow: 0.13.0

Reporter: Karl Dunkle Werner


A string categorical variable written from pandas to Parquet is read back as 
string (object dtype). I expected it to be read as category.
The same thing happens if the category is numeric: a numeric category is read 
back as int64.

In the code below, I tried out an in-memory Arrow Table, which successfully 
translates categories back to pandas. However, when I write to a Parquet file, 
the categories are not preserved.

In the scheme of things, this isn't a big deal, but it's a small surprise.


{code:python}
import pandas as pd
import pyarrow as pa


df = pd.DataFrame({'x': pd.Categorical(['a', 'a', 'b', 'b'])})
df.dtypes  # category

# This works:
pa.Table.from_pandas(df).to_pandas().dtypes  # category

df.to_parquet("categories.parquet")
# This reads back object, but I expected category
pd.read_parquet("categories.parquet").dtypes  # object


# Numeric categories have the same issue:
df_num = pd.DataFrame({'x': pd.Categorical([1, 1, 2, 2])})
df_num.dtypes # category

pa.Table.from_pandas(df_num).to_pandas().dtypes  # category

df_num.to_parquet("categories_num.parquet")
# This reads back int64, but I expected category
pd.read_parquet("categories_num.parquet").dtypes  # int64
{code}









[jira] [Resolved] (ARROW-5478) [Packaging] Drop Ubuntu 14.04 support

2019-06-02 Thread Krisztian Szucs (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs resolved ARROW-5478.

   Resolution: Fixed
Fix Version/s: 0.14.0

Issue resolved by pull request 4448
[https://github.com/apache/arrow/pull/4448]

> [Packaging] Drop Ubuntu 14.04 support
> -
>
> Key: ARROW-5478
> URL: https://issues.apache.org/jira/browse/ARROW-5478
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging
>Reporter: Sutou Kouhei
>Assignee: Sutou Kouhei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-5463) [Rust] Implement AsRef for Buffer

2019-06-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-5463:
--
Labels: pull-request-available  (was: )

> [Rust] Implement AsRef for Buffer
> -
>
> Key: ARROW-5463
> URL: https://issues.apache.org/jira/browse/ARROW-5463
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: Renjie Liu
>Assignee: Renjie Liu
>Priority: Major
>  Labels: pull-request-available
>
> Implement AsRef ArrowNativeType for Buffer





[jira] [Updated] (ARROW-5479) [Rust] [DataFusion] Use ARROW_TEST_DATA instead of relative path for testing

2019-06-02 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-5479:
--
Labels: pull-request-available  (was: )

> [Rust] [DataFusion] Use ARROW_TEST_DATA instead of relative path for testing
> 
>
> Key: ARROW-5479
> URL: https://issues.apache.org/jira/browse/ARROW-5479
> Project: Apache Arrow
>  Issue Type: Test
>  Components: Rust - DataFusion
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>






[jira] [Resolved] (ARROW-5479) [Rust] [DataFusion] Use ARROW_TEST_DATA instead of relative path for testing

2019-06-02 Thread Sutou Kouhei (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sutou Kouhei resolved ARROW-5479.
-
   Resolution: Fixed
Fix Version/s: 0.14.0

Issue resolved by pull request 4449
[https://github.com/apache/arrow/pull/4449]

> [Rust] [DataFusion] Use ARROW_TEST_DATA instead of relative path for testing
> 
>
> Key: ARROW-5479
> URL: https://issues.apache.org/jira/browse/ARROW-5479
> Project: Apache Arrow
>  Issue Type: Test
>  Components: Rust - DataFusion
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Trivial
> Fix For: 0.14.0
>
>






[jira] [Created] (ARROW-5479) [Rust] [DataFusion] Use ARROW_TEST_DATA instead of relative path for testing

2019-06-02 Thread Chao Sun (JIRA)
Chao Sun created ARROW-5479:
---

 Summary: [Rust] [DataFusion] Use ARROW_TEST_DATA instead of 
relative path for testing
 Key: ARROW-5479
 URL: https://issues.apache.org/jira/browse/ARROW-5479
 Project: Apache Arrow
  Issue Type: Test
  Components: Rust - DataFusion
Reporter: Chao Sun
Assignee: Chao Sun





