[jira] [Created] (ARROW-5484) [Java] remove FieldReader from ValueVector
Pindikura Ravindra created ARROW-5484: - Summary: [Java] remove FieldReader from ValueVector Key: ARROW-5484 URL: https://issues.apache.org/jira/browse/ARROW-5484 Project: Apache Arrow Issue Type: Task Components: Java Reporter: Pindikura Ravindra Assignee: Pindikura Ravindra Every implementation of ValueVector has an instance of .FieldReader, which has an overhead of 28 bytes on the heap. This can be avoided by instantiating the object only when required. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
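The "instantiate only when required" idea in ARROW-5484 can be sketched as below. This is a minimal illustration in Python, not Arrow's Java code: `Reader`, `LazyVector`, and `get_reader` are hypothetical names chosen for the example, and the 28-byte figure from the issue applies to the JVM, not to this sketch.

```python
class Reader:
    """Hypothetical stand-in for a per-vector reader object."""

    def __init__(self, vector):
        self.vector = vector

    def read(self, index):
        return self.vector.values[index]


class LazyVector:
    """Hypothetical vector that allocates its reader only on first use."""

    def __init__(self, values):
        self.values = list(values)
        self._reader = None  # nothing is allocated until a caller asks for it

    def get_reader(self):
        # Create the reader on first access and reuse it afterwards.
        if self._reader is None:
            self._reader = Reader(self)
        return self._reader


vec = LazyVector([10, 20, 30])
reader_allocated_before = vec._reader is not None
first_value = vec.get_reader().read(0)
reader_allocated_after = vec._reader is not None
```

Vectors that are only held in memory and never read through a reader would, under this pattern, never pay for the reader object.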
[jira] [Created] (ARROW-5483) [Java] add ValueVector constructors that take a Field object
Pindikura Ravindra created ARROW-5483: - Summary: [Java] add ValueVector constructors that take a Field object Key: ARROW-5483 URL: https://issues.apache.org/jira/browse/ARROW-5483 Project: Apache Arrow Issue Type: Task Components: Java Reporter: Pindikura Ravindra Assignee: Pindikura Ravindra Each instance of a ValueVector instantiates its own Field and FieldType objects, which together consume 81 bytes of heap space. This duplication can be avoided in cases where all the ValueVectors belong to the same set of columns/schema. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-5482) [Java] reduce heap footprint of ValueVectors
[ https://issues.apache.org/jira/browse/ARROW-5482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pindikura Ravindra updated ARROW-5482:
--------------------------------------
    Summary: [Java] reduce heap footprint of ValueVectors  (was: reduce heap footprint of ValueVectors)

> [Java] reduce heap footprint of ValueVectors
> --------------------------------------------
>
> Key: ARROW-5482
> URL: https://issues.apache.org/jira/browse/ARROW-5482
> Project: Apache Arrow
> Issue Type: Task
> Components: Java
> Reporter: Pindikura Ravindra
> Assignee: Pindikura Ravindra
> Priority: Major
>
> In some scenarios, we hold lots of value vectors in memory, e.g. during joins
> and aggregations. Heap analysis (using VisualVM on a Mac) shows the following
> costs for a simple IntVector:
>
> IntVector : 80 bytes
> vector.types.pojo.FieldType : 41 bytes
> vector.types.pojo.Field : 40 bytes
> IntReaderImpl : 28 bytes
>
> I'll use this Jira to track ways to reduce the heap usage.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5482) reduce heap footprint of ValueVectors
Pindikura Ravindra created ARROW-5482:
-----------------------------------------
    Summary: reduce heap footprint of ValueVectors
    Key: ARROW-5482
    URL: https://issues.apache.org/jira/browse/ARROW-5482
    Project: Apache Arrow
    Issue Type: Task
    Components: Java
    Reporter: Pindikura Ravindra
    Assignee: Pindikura Ravindra

In some scenarios, we hold lots of value vectors in memory, e.g. during joins and aggregations. Heap analysis (using VisualVM on a Mac) shows the following costs for a simple IntVector:

IntVector : 80 bytes
vector.types.pojo.FieldType : 41 bytes
vector.types.pojo.Field : 40 bytes
IntReaderImpl : 28 bytes

I'll use this Jira to track ways to reduce the heap usage.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
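To put the per-object figures in perspective, the arithmetic can be worked through as below. This is a rough estimate only: the byte counts are the ones reported in the issue, while the vector count of one million, the assumption that every vector shares a single Field/FieldType pair, and the assumption that no reader is ever requested are all hypothetical. Reference fields and JVM alignment are ignored.

```python
# Per-object heap costs reported in the issue, in bytes.
COSTS = {"IntVector": 80, "FieldType": 41, "Field": 40, "IntReaderImpl": 28}

# Total heap cost of one IntVector with all of its companion objects.
per_vector_total = sum(COSTS.values())


def estimated_heap_bytes(num_vectors, share_field=False, lazy_reader=False):
    """Rough heap estimate for num_vectors IntVectors.

    share_field: count one Field/FieldType pair for all vectors (assumption).
    lazy_reader: assume the reader is never instantiated (assumption).
    """
    total = num_vectors * COSTS["IntVector"]
    field_bytes = COSTS["Field"] + COSTS["FieldType"]
    total += field_bytes if share_field else num_vectors * field_bytes
    if not lazy_reader:
        total += num_vectors * COSTS["IntReaderImpl"]
    return total


baseline = estimated_heap_bytes(1_000_000)
optimized = estimated_heap_bytes(1_000_000, share_field=True, lazy_reader=True)
```

Under these assumptions the companion objects account for 109 of the 189 bytes per vector, which is why ARROW-5483 and ARROW-5484 target them.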
[jira] [Resolved] (ARROW-5461) [Java] Add micro-benchmarks for Float8Vector and allocators
[ https://issues.apache.org/jira/browse/ARROW-5461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield resolved ARROW-5461. Resolution: Fixed Fix Version/s: 0.14.0 Issue resolved by pull request 4430 [https://github.com/apache/arrow/pull/4430] > [Java] Add micro-benchmarks for Float8Vector and allocators > --- > > Key: ARROW-5461 > URL: https://issues.apache.org/jira/browse/ARROW-5461 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Reporter: Liya Fan >Assignee: Liya Fan >Priority: Minor > Labels: pull-request-available > Fix For: 0.14.0 > > Time Spent: 1h 40m > Remaining Estimate: 0h > > For the past days, we have been involved in some performance related issues. > In this process, we have created some performance benchmarks, to help us > verify performance results. > Now we want to add such micro-benchmarks to the code base, in the hope that > they will be helpful for making performance-related decisions and avoid > performance degradation. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-5256) [Packaging][deb] Failed to build with LLVM 7.1.0
[ https://issues.apache.org/jira/browse/ARROW-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-5256:
----------------------------------
    Labels: pull-request-available  (was: )

> [Packaging][deb] Failed to build with LLVM 7.1.0
> ------------------------------------------------
>
> Key: ARROW-5256
> URL: https://issues.apache.org/jira/browse/ARROW-5256
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++ - Gandiva, Packaging
> Reporter: Sutou Kouhei
> Assignee: Sutou Kouhei
> Priority: Major
> Labels: pull-request-available
>
> https://travis-ci.org/ursa-labs/crossbow/builds/527710714#L6144-L6157
> {noformat}
> CMake Error at cmake_modules/FindLLVM.cmake:33 (find_package):
>   Could not find a configuration file for package "LLVM" that is compatible
>   with requested version "7.0".
>   The following configuration files were considered but not accepted:
>     /usr/lib/llvm-7/cmake/LLVMConfig.cmake, version: 7.1.0
>     /usr/lib/llvm-7/lib/cmake/llvm/LLVMConfig.cmake, version: 7.1.0
>     /usr/lib/llvm-7/share/llvm/cmake/LLVMConfig.cmake, version: 7.1.0
>     /usr/lib/llvm-3.8/share/llvm/cmake/LLVMConfig.cmake, version: 3.8.1
>     /usr/share/llvm-3.8/cmake/LLVMConfig.cmake, version: 3.8.1
> Call Stack (most recent call first):
>   src/gandiva/CMakeLists.txt:31 (find_package)
> {noformat}
> Can we use "7" instead of "7.0" for {{ARROW_LLVM_VERSION}}?

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
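The question of requesting "7" rather than "7.0" can be illustrated with a simplified model of version matching. The sketch below assumes that a found version is accepted only when each requested component equals the corresponding found component. That rule is an assumption made to mirror the error above; the real decision is made by the package's own CMake version file, and `is_compatible` is a hypothetical helper, not a CMake API.

```python
def is_compatible(requested, found):
    """Simplified model: every requested component must match exactly."""
    req = requested.split(".")
    got = found.split(".")
    # Compare only as many components as were requested.
    return got[:len(req)] == req


rejected = is_compatible("7.0", "7.1.0")    # minor version 0 != 1
accepted = is_compatible("7", "7.1.0")      # only the major version is pinned
old_llvm = is_compatible("7", "3.8.1")      # a different major is still rejected
```

In this model, loosening the request to the major version alone admits 7.1.0 while still excluding the 3.8 installation listed in the error.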
[jira] [Assigned] (ARROW-5256) [Packaging][deb] Failed to build with LLVM 7.1.0
[ https://issues.apache.org/jira/browse/ARROW-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sutou Kouhei reassigned ARROW-5256:
-----------------------------------
    Assignee: Sutou Kouhei

> [Packaging][deb] Failed to build with LLVM 7.1.0
> ------------------------------------------------
>
> Key: ARROW-5256
> URL: https://issues.apache.org/jira/browse/ARROW-5256
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++ - Gandiva, Packaging
> Reporter: Sutou Kouhei
> Assignee: Sutou Kouhei
> Priority: Major
>
> https://travis-ci.org/ursa-labs/crossbow/builds/527710714#L6144-L6157
> {noformat}
> CMake Error at cmake_modules/FindLLVM.cmake:33 (find_package):
>   Could not find a configuration file for package "LLVM" that is compatible
>   with requested version "7.0".
>   The following configuration files were considered but not accepted:
>     /usr/lib/llvm-7/cmake/LLVMConfig.cmake, version: 7.1.0
>     /usr/lib/llvm-7/lib/cmake/llvm/LLVMConfig.cmake, version: 7.1.0
>     /usr/lib/llvm-7/share/llvm/cmake/LLVMConfig.cmake, version: 7.1.0
>     /usr/lib/llvm-3.8/share/llvm/cmake/LLVMConfig.cmake, version: 3.8.1
>     /usr/share/llvm-3.8/cmake/LLVMConfig.cmake, version: 3.8.1
> Call Stack (most recent call first):
>   src/gandiva/CMakeLists.txt:31 (find_package)
> {noformat}
> Can we use "7" instead of "7.0" for {{ARROW_LLVM_VERSION}}?

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-5480) [Python] Pandas categorical type doesn't survive a round-trip through parquet
[ https://issues.apache.org/jira/browse/ARROW-5480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16854163#comment-16854163 ]

Wes McKinney commented on ARROW-5480:
-------------------------------------

Parquet has dictionary-encoding as a compression strategy but does not have Categorical per se. As part of ARROW-3246 we should eventually be able to preserve Categorical through Parquet round trips, but there are some tricky issues to sort out.

> [Python] Pandas categorical type doesn't survive a round-trip through parquet
> ------------------------------------------------------------------------------
>
> Key: ARROW-5480
> URL: https://issues.apache.org/jira/browse/ARROW-5480
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Affects Versions: 0.11.1, 0.13.0
> Environment: python: 3.7.3.final.0
> python-bits: 64
> OS: Linux
> OS-release: 5.0.0-15-generic
> machine: x86_64
> processor: x86_64
> byteorder: little
> pandas: 0.24.2
> numpy: 1.16.4
> pyarrow: 0.13.0
> Reporter: Karl Dunkle Werner
> Priority: Minor
>
> A string categorical variable written from pandas to parquet is read back as
> a string (object dtype). I expected it to be read as category.
> The same thing happens if the category is numeric -- a numeric category is
> read back as int64.
> In the code below, I tried out an in-memory arrow Table, which successfully
> translates categories back to pandas. However, when I write to a parquet
> file, the category is not preserved.
> In the scheme of things, this isn't a big deal, but it's a small surprise.
> {code:python}
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame({'x': pd.Categorical(['a', 'a', 'b', 'b'])})
> df.dtypes # category
> # This works:
> pa.Table.from_pandas(df).to_pandas().dtypes # category
> df.to_parquet("categories.parquet")
> # This reads back object, but I expected category
> pd.read_parquet("categories.parquet").dtypes # object
> # Numeric categories have the same issue:
> df_num = pd.DataFrame({'x': pd.Categorical([1, 1, 2, 2])})
> df_num.dtypes # category
> pa.Table.from_pandas(df_num).to_pandas().dtypes # category
> df_num.to_parquet("categories_num.parquet")
> # This reads back int64, but I expected category
> pd.read_parquet("categories_num.parquet").dtypes # int64
> {code}

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
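The distinction drawn in the comment, dictionary encoding as a compression strategy versus a categorical type, can be illustrated with a small pure-Python model. This is a conceptual sketch only and not Parquet's actual on-disk format; `dictionary_encode` and `dictionary_decode` are hypothetical helpers written for this example.

```python
def dictionary_encode(values):
    """Replace each value with an index into a list of distinct values."""
    dictionary = []   # distinct values in order of first appearance
    positions = {}    # value -> index into dictionary
    indices = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    """Expand the indices back into plain values."""
    return [dictionary[i] for i in indices]


dictionary, indices = dictionary_encode(["a", "a", "b", "b"])
decoded = dictionary_decode(dictionary, indices)
```

Decoding returns ordinary strings; nothing in the encoded form records that the source column was categorical, which is consistent with the object and int64 dtypes the reporter sees after the round trip.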
[jira] [Commented] (ARROW-5474) [C++] What version of Boost do we require now?
[ https://issues.apache.org/jira/browse/ARROW-5474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16854159#comment-16854159 ] Wes McKinney commented on ARROW-5474: - I'm fine with requiring a recent version since we offer a vendored build option > [C++] What version of Boost do we require now? > -- > > Key: ARROW-5474 > URL: https://issues.apache.org/jira/browse/ARROW-5474 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Neal Richardson >Assignee: Antoine Pitrou >Priority: Major > Fix For: 0.14.0 > > > See debugging on https://issues.apache.org/jira/browse/ARROW-5470. One > possible cause for that error is that the local filesystem patch increased > the version of boost that we actually require. The boost version (1.54 vs > 1.58) was one difference between failure and success. > Another point of confusion was that CMake reported two different versions of > boost at different times. > If we require a minimum version of boost, can we document that better, check > for it more accurately in the build scripts, and fail with a useful message > if that minimum isn't met? Or something else helpful. > If the actual cause of the failure was something else (e.g. compiler > version), we should figure that out too. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
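The request to check the minimum version and fail with a useful message could look roughly like the sketch below. It relies on the encoding of Boost's `BOOST_VERSION` macro (major * 100000 + minor * 100 + patch). The minimum of 1.58.0 is a hypothetical value chosen only because 1.58 was the working version mentioned in the issue; the real requirement was still undetermined, and `check_boost` is an illustrative helper, not part of Arrow's build scripts.

```python
# Hypothetical minimum; the issue had not settled the actual requirement.
HYPOTHETICAL_MINIMUM = (1, 58, 0)


def decode_boost_version(boost_version):
    """Split a BOOST_VERSION integer into (major, minor, patch)."""
    major = boost_version // 100000
    minor = boost_version // 100 % 1000
    patch = boost_version % 100
    return (major, minor, patch)


def check_boost(boost_version, minimum=HYPOTHETICAL_MINIMUM):
    """Return the decoded version, or raise with an explicit message."""
    found = decode_boost_version(boost_version)
    if found < minimum:
        raise RuntimeError(
            "Boost {}.{}.{} found, but at least {}.{}.{} is required".format(
                *found, *minimum
            )
        )
    return found


ok_version = check_boost(105800)  # Boost 1.58.0
try:
    check_boost(105400)           # Boost 1.54.0
    error_message = None
except RuntimeError as exc:
    error_message = str(exc)
```

Comparing decoded tuples keeps the check independent of how the version string is printed, which may help with the confusion about CMake reporting two different versions.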
[jira] [Commented] (ARROW-5481) [GLib] garrow_seekable_input_stream_peek() misses "error" parameter document
[ https://issues.apache.org/jira/browse/ARROW-5481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16854158#comment-16854158 ] Sutou Kouhei commented on ARROW-5481: - [~shiro615] Could you work on this? > [GLib] garrow_seekable_input_stream_peek() misses "error" parameter document > > > Key: ARROW-5481 > URL: https://issues.apache.org/jira/browse/ARROW-5481 > Project: Apache Arrow > Issue Type: Improvement > Components: GLib >Reporter: Sutou Kouhei >Assignee: Yosuke Shiro >Priority: Minor > > https://github.com/apache/arrow/blob/master/c_glib/arrow-glib/input-stream.cpp#L402 > This is follow-up work of > https://github.com/apache/arrow/commit/ff2ee42092c09d13e38205fedd3acbdf375199f0 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5481) [GLib] garrow_seekable_input_stream_peek() misses "error" parameter document
Sutou Kouhei created ARROW-5481: --- Summary: [GLib] garrow_seekable_input_stream_peek() misses "error" parameter document Key: ARROW-5481 URL: https://issues.apache.org/jira/browse/ARROW-5481 Project: Apache Arrow Issue Type: Improvement Components: GLib Reporter: Sutou Kouhei Assignee: Yosuke Shiro https://github.com/apache/arrow/blob/master/c_glib/arrow-glib/input-stream.cpp#L402 This is follow-up work of https://github.com/apache/arrow/commit/ff2ee42092c09d13e38205fedd3acbdf375199f0 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-1261) [Java] Add container type for Map logical type
[ https://issues.apache.org/jira/browse/ARROW-1261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-1261: -- Labels: pull-request-available (was: ) > [Java] Add container type for Map logical type > -- > > Key: ARROW-1261 > URL: https://issues.apache.org/jira/browse/ARROW-1261 > Project: Apache Arrow > Issue Type: New Feature > Components: Java >Reporter: Wes McKinney >Assignee: Bryan Cutler >Priority: Major > Labels: pull-request-available > > As follow up to ARROW-1246 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-5476) [Java][Memory] Fix Netty ArrowBuf Slice
[ https://issues.apache.org/jira/browse/ARROW-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-5476: -- Labels: pull-request-available (was: ) > [Java][Memory] Fix Netty ArrowBuf Slice > --- > > Key: ARROW-5476 > URL: https://issues.apache.org/jira/browse/ARROW-5476 > Project: Apache Arrow > Issue Type: Task >Affects Versions: 0.14.0 >Reporter: Praveen Kumar Desabandu >Assignee: Praveen Kumar Desabandu >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > > The slice of netty arrow buf depends on arrow buf reader and writer indexes, > but arrow buf is supposed to only track memory addr + length and there are > places where the arrow buf indexes are not in sync with netty. > So slice should use the indexes in Netty Arrow Buf instead. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5480) [Python] Pandas categorical type doesn't survive a round-trip through parquet
Karl Dunkle Werner created ARROW-5480:
-----------------------------------------
    Summary: [Python] Pandas categorical type doesn't survive a round-trip through parquet
    Key: ARROW-5480
    URL: https://issues.apache.org/jira/browse/ARROW-5480
    Project: Apache Arrow
    Issue Type: Improvement
    Components: Python
    Affects Versions: 0.13.0, 0.11.1
    Environment: python: 3.7.3.final.0
        python-bits: 64
        OS: Linux
        OS-release: 5.0.0-15-generic
        machine: x86_64
        processor: x86_64
        byteorder: little
        pandas: 0.24.2
        numpy: 1.16.4
        pyarrow: 0.13.0
    Reporter: Karl Dunkle Werner

A string categorical variable written from pandas to parquet is read back as a string (object dtype). I expected it to be read as category.

The same thing happens if the category is numeric -- a numeric category is read back as int64.

In the code below, I tried out an in-memory arrow Table, which successfully translates categories back to pandas. However, when I write to a parquet file, the category is not preserved.

In the scheme of things, this isn't a big deal, but it's a small surprise.

{code:python}
import pandas as pd
import pyarrow as pa
df = pd.DataFrame({'x': pd.Categorical(['a', 'a', 'b', 'b'])})
df.dtypes # category
# This works:
pa.Table.from_pandas(df).to_pandas().dtypes # category
df.to_parquet("categories.parquet")
# This reads back object, but I expected category
pd.read_parquet("categories.parquet").dtypes # object
# Numeric categories have the same issue:
df_num = pd.DataFrame({'x': pd.Categorical([1, 1, 2, 2])})
df_num.dtypes # category
pa.Table.from_pandas(df_num).to_pandas().dtypes # category
df_num.to_parquet("categories_num.parquet")
# This reads back int64, but I expected category
pd.read_parquet("categories_num.parquet").dtypes # int64
{code}

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-5478) [Packaging] Drop Ubuntu 14.04 support
[ https://issues.apache.org/jira/browse/ARROW-5478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs resolved ARROW-5478. Resolution: Fixed Fix Version/s: 0.14.0 Issue resolved by pull request 4448 [https://github.com/apache/arrow/pull/4448] > [Packaging] Drop Ubuntu 14.04 support > - > > Key: ARROW-5478 > URL: https://issues.apache.org/jira/browse/ARROW-5478 > Project: Apache Arrow > Issue Type: Improvement > Components: Packaging >Reporter: Sutou Kouhei >Assignee: Sutou Kouhei >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-5463) [Rust] Implement AsRef for Buffer
[ https://issues.apache.org/jira/browse/ARROW-5463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-5463: -- Labels: pull-request-available (was: ) > [Rust] Implement AsRef for Buffer > - > > Key: ARROW-5463 > URL: https://issues.apache.org/jira/browse/ARROW-5463 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust >Reporter: Renjie Liu >Assignee: Renjie Liu >Priority: Major > Labels: pull-request-available > > Implement AsRef ArrowNativeType for Buffer -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-5479) [Rust] [DataFusion] Use ARROW_TEST_DATA instead of relative path for testing
[ https://issues.apache.org/jira/browse/ARROW-5479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-5479: -- Labels: pull-request-available (was: ) > [Rust] [DataFusion] Use ARROW_TEST_DATA instead of relative path for testing > > > Key: ARROW-5479 > URL: https://issues.apache.org/jira/browse/ARROW-5479 > Project: Apache Arrow > Issue Type: Test > Components: Rust - DataFusion >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Trivial > Labels: pull-request-available > Fix For: 0.14.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-5479) [Rust] [DataFusion] Use ARROW_TEST_DATA instead of relative path for testing
[ https://issues.apache.org/jira/browse/ARROW-5479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sutou Kouhei resolved ARROW-5479. - Resolution: Fixed Fix Version/s: 0.14.0 Issue resolved by pull request 4449 [https://github.com/apache/arrow/pull/4449] > [Rust] [DataFusion] Use ARROW_TEST_DATA instead of relative path for testing > > > Key: ARROW-5479 > URL: https://issues.apache.org/jira/browse/ARROW-5479 > Project: Apache Arrow > Issue Type: Test > Components: Rust - DataFusion >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Trivial > Fix For: 0.14.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5479) [Rust] [DataFusion] Use ARROW_TEST_DATA instead of relative path for testing
Chao Sun created ARROW-5479: --- Summary: [Rust] [DataFusion] Use ARROW_TEST_DATA instead of relative path for testing Key: ARROW-5479 URL: https://issues.apache.org/jira/browse/ARROW-5479 Project: Apache Arrow Issue Type: Test Components: Rust - DataFusion Reporter: Chao Sun Assignee: Chao Sun -- This message was sent by Atlassian JIRA (v7.6.3#76005)