[jira] [Comment Edited] (ARROW-6060) [Python] too large memory cost using pyarrow.parquet.read_table with use_threads=True
[ https://issues.apache.org/jira/browse/ARROW-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899250#comment-16899250 ] Robin Kåveland edited comment on ARROW-6060 at 8/2/19 10:26 PM: I've had to downgrade our VMs to 0.13.0 today, I was observing parquet files that we could load just fine with 16GB of RAM earlier fail to load using VMs with 28GB of RAM. Unfortunately, I can't disclose any of the data either. We are using {{parquet.ParquetDataset.read()}}, but observe the problem even if we read single pieces of the parquet data sets (the pieces are between 100MB and 200MB). Most of our columns are unicode and probably would be friendly to dictionary encoding. The files have been written by Spark. Normally, these datasets would take a while to load, so memory consumption would grow steadily for ~10 seconds, but now it seems like we invoke the OOM-killer in only a few seconds, so allocation seems very spiky. was (Author: kaaveland): I've had to downgrade our VMs to 0.13.0 today, I was observing parquet files that we could load just fine with 16GB of RAM fail to load using VMs with 28GB of RAM. Unfortunately, I can't disclose any of the data either. We are using {{parquet.ParquetDataset.read()}}, but observe the problem even if we read single pieces of the parquet data sets (the pieces are between 100MB and 200MB). Most of our columns are unicode and probably would be friendly to dictionary encoding. The files have been written by Spark. Normally, these datasets would take a while to load, so memory consumption would grow steadily for ~10 seconds, but now it seems like we invoke the OOM-killer in only a few seconds, so allocation seems very spiky. > [Python] too large memory cost using pyarrow.parquet.read_table with > use_threads=True > - > > Key: ARROW-6060 > URL: https://issues.apache.org/jira/browse/ARROW-6060 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.14.1 >Reporter: Kun Liu >Priority: Major > > I tried to load a parquet file of about 1.8 GB using the following code. It > crashed due to an out-of-memory issue. > {code:java} > import pyarrow.parquet as pq > pq.read_table('/tmp/test.parquet'){code} > However, it worked well with use_threads=False as follows: > {code:java} > pq.read_table('/tmp/test.parquet', use_threads=False){code} > If pyarrow is downgraded to 0.12.1, there is no such problem. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-6060) [Python] too large memory cost using pyarrow.parquet.read_table with use_threads=True
[ https://issues.apache.org/jira/browse/ARROW-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899250#comment-16899250 ] Robin Kåveland commented on ARROW-6060: --- I've had to downgrade our VMs to 0.13.0 today, I was observing parquet files that we could load just fine with 16GB of RAM fail to load using VMs with 28GB of RAM. Unfortunately, I can't disclose any of the data either. We are using {{parquet.ParquetDataset.read()}}, but observe the problem even if we read single pieces of the parquet data sets (the pieces are between 100MB and 200MB). Most of our columns are unicode and probably would be friendly to dictionary encoding. The files have been written by Spark. Normally, these datasets would take a while to load, so memory consumption would grow steadily for ~10 seconds, but now it seems like we invoke the OOM-killer in only a few seconds, so allocation seems very spiky. > [Python] too large memory cost using pyarrow.parquet.read_table with > use_threads=True > - > > Key: ARROW-6060 > URL: https://issues.apache.org/jira/browse/ARROW-6060 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.14.1 >Reporter: Kun Liu >Priority: Major > > I tried to load a parquet file of about 1.8 GB using the following code. It > crashed due to an out-of-memory issue. > {code:java} > import pyarrow.parquet as pq > pq.read_table('/tmp/test.parquet'){code} > However, it worked well with use_threads=False as follows: > {code:java} > pq.read_table('/tmp/test.parquet', use_threads=False){code} > If pyarrow is downgraded to 0.12.1, there is no such problem. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
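As a stopgap for reports like the above, reading the file one row group at a time with use_threads=False bounds peak memory at roughly one row group's worth of decoded data. A minimal sketch, assuming the '/tmp/test.parquet' path from the report; this works around the symptom rather than fixing the spiky allocation:
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# '/tmp/test.parquet' is the stand-in path from the report above.
pf = pq.ParquetFile('/tmp/test.parquet')

# Read row groups one at a time, single-threaded, so only one batch of
# column buffers is being decoded at any given moment.
pieces = [pf.read_row_group(i, use_threads=False)
          for i in range(pf.num_row_groups)]
table = pa.concat_tables(pieces)
{code}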
[jira] [Resolved] (ARROW-6118) [Java] Replace google Preconditions with Arrow Preconditions
[ https://issues.apache.org/jira/browse/ARROW-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield resolved ARROW-6118. Resolution: Fixed Fix Version/s: 0.15.0 Issue resolved by pull request 4996 [https://github.com/apache/arrow/pull/4996] > [Java] Replace google Preconditions with Arrow Preconditions > > > Key: ARROW-6118 > URL: https://issues.apache.org/jira/browse/ARROW-6118 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Ji Liu >Assignee: Ji Liu >Priority: Critical > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 40m > Remaining Estimate: 0h > > Now in Java code, most places use {{org.apache.arrow.util.Preconditions}}, > but some places still use {{com.google.common.base.Preconditions}}. > Remove google Preconditions and, at the same time, remove duplicated checks. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6118) [Java] Replace google Preconditions with Arrow Preconditions
[ https://issues.apache.org/jira/browse/ARROW-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield updated ARROW-6118: --- Component/s: Java > [Java] Replace google Preconditions with Arrow Preconditions > > > Key: ARROW-6118 > URL: https://issues.apache.org/jira/browse/ARROW-6118 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Reporter: Ji Liu >Assignee: Ji Liu >Priority: Critical > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 40m > Remaining Estimate: 0h > > Now in Java code, most places use {{org.apache.arrow.util.Preconditions}}, > but some places still use {{com.google.common.base.Preconditions}}. > Remove google Preconditions and, at the same time, remove duplicated checks. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Resolved] (ARROW-5527) [C++] HashTable/MemoTable should use Buffer(s)/Builder(s) for heap data
[ https://issues.apache.org/jira/browse/ARROW-5527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-5527. - Resolution: Fixed Fix Version/s: 0.15.0 Issue resolved by pull request 4867 [https://github.com/apache/arrow/pull/4867] > [C++] HashTable/MemoTable should use Buffer(s)/Builder(s) for heap data > --- > > Key: ARROW-5527 > URL: https://issues.apache.org/jira/browse/ARROW-5527 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Francois Saint-Jacques >Assignee: Francois Saint-Jacques >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 9.5h > Remaining Estimate: 0h > > The current implementation uses `std::vector` and `std::string` with > unbounded size. The refactor would take a memory pool in the constructor for > buffer management and would get rid of vectors. This will have the side > effect of propagating Status to some calls (notably insert due to Upsize > failing to resize). > * MemoTable constructor needs to take a MemoryPool as input > * GetOrInsert must return Status/Result > * MemoTable should use a TypeBufferBuilder instead of std::vector > * BinaryMemoTable should use a BinaryBuilder instead of > (std::vector, std::string) pair. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6127) [Website] Refresh website theme
Neal Richardson created ARROW-6127: -- Summary: [Website] Refresh website theme Key: ARROW-6127 URL: https://issues.apache.org/jira/browse/ARROW-6127 Project: Apache Arrow Issue Type: Improvement Components: Website Reporter: Neal Richardson Assignee: Neal Richardson Among the things I noticed recently that should be easy to clean up: * We should supply a favicon * The <title> is the same for every page and it always says "Apache Arrow Homepage" * There are no opengraph or twitter card meta tags, so there's no link preview * The version of bootstrap used is not current and has been flagged as a possible security vulnerability Much of this could just be fixed by porting to a modern Hugo template, which I'll explore. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-6125) [Python] Remove any APIs deprecated prior to 0.14.x
[ https://issues.apache.org/jira/browse/ARROW-6125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899179#comment-16899179 ] Neal Richardson commented on ARROW-6125: See also https://issues.apache.org/jira/browse/ARROW-5244 > [Python] Remove any APIs deprecated prior to 0.14.x > --- > > Key: ARROW-6125 > URL: https://issues.apache.org/jira/browse/ARROW-6125 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > Fix For: 0.15.0 > > > A number of deprecated APIs, like {{pyarrow.open_stream}}, are still available -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6126) [C++] IPC stream reader handling of empty streams potentially not robust
Wes McKinney created ARROW-6126: --- Summary: [C++] IPC stream reader handling of empty streams potentially not robust Key: ARROW-6126 URL: https://issues.apache.org/jira/browse/ARROW-6126 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Wes McKinney Fix For: 0.15.0 If dictionaries are expected in a stream, but the stream terminates, then "empty stream" logic is triggered to suppress errors (see https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/reader.cc#L482). It's probably esoteric, but this "empty stream" logic will trigger if the stream terminates in the middle of the dictionary messages, which is a legitimate error. So we should only bail out early (concluding that we have an empty stream) if the first dictionary message is null. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
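A minimal Python sketch of the truncation scenario described above; the in-memory stream is illustrative. Per the issue, a reader hitting the truncated stream should raise rather than conclude the stream is empty:
{code:python}
import pyarrow as pa

# A batch with a dictionary-encoded column, so the stream carries
# dictionary messages between the schema and the record batch.
batch = pa.RecordBatch.from_arrays(
    [pa.array(["a", "b", "a"]).dictionary_encode()], names=["col"])

sink = pa.BufferOutputStream()
writer = pa.RecordBatchStreamWriter(sink, batch.schema)
writer.write_batch(batch)
writer.close()
buf = sink.getvalue()

# Cut the stream off mid-way: schema present, dictionaries incomplete.
# This should surface as an error, not as an empty stream.
truncated = buf.slice(0, buf.size // 2)
reader = pa.ipc.open_stream(truncated)
reader.read_all()  # expected to raise
{code}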
[jira] [Created] (ARROW-6125) [Python] Remove any APIs deprecated prior to 0.14.x
Wes McKinney created ARROW-6125: --- Summary: [Python] Remove any APIs deprecated prior to 0.14.x Key: ARROW-6125 URL: https://issues.apache.org/jira/browse/ARROW-6125 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Wes McKinney Fix For: 0.15.0 A number of deprecated APIs, like `pyarrow.open_stream`, are still available -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6125) [Python] Remove any APIs deprecated prior to 0.14.x
[ https://issues.apache.org/jira/browse/ARROW-6125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-6125: Description: A number of deprecated APIs, like {{pyarrow.open_stream}}, are still available (was: A number of deprecated APIs, like `pyarrow.open_stream`, are still available) > [Python] Remove any APIs deprecated prior to 0.14.x > --- > > Key: ARROW-6125 > URL: https://issues.apache.org/jira/browse/ARROW-6125 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > Fix For: 0.15.0 > > > A number of deprecated APIs, like {{pyarrow.open_stream}}, are still available -- This message was sent by Atlassian JIRA (v7.6.14#76016)
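For reference, the migration for the API named in the description is a one-line change; a short sketch (the namespaced {{pyarrow.ipc.open_stream}} is the current spelling):
{code:python}
import pyarrow as pa

batch = pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], names=["x"])
sink = pa.BufferOutputStream()
writer = pa.RecordBatchStreamWriter(sink, batch.schema)
writer.write_batch(batch)
writer.close()

# Deprecated: reader = pa.open_stream(sink.getvalue())
reader = pa.ipc.open_stream(sink.getvalue())
table = reader.read_all()
{code}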
[jira] [Assigned] (ARROW-5746) [Website] Move website source out of apache/arrow
[ https://issues.apache.org/jira/browse/ARROW-5746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-5746: -- Assignee: Neal Richardson > [Website] Move website source out of apache/arrow > - > > Key: ARROW-5746 > URL: https://issues.apache.org/jira/browse/ARROW-5746 > Project: Apache Arrow > Issue Type: Improvement > Components: Website >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Minor > > Possibly to apache/arrow-site, which already exists for hosting the static > built site. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Assigned] (ARROW-5205) [Python][C++] Improved error messages when user erroneously uses a non-local resource URI to open a file
[ https://issues.apache.org/jira/browse/ARROW-5205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-5205: -- Assignee: (was: Neal Richardson) > [Python][C++] Improved error messages when user erroneously uses a non-local > resource URI to open a file > > > Key: ARROW-5205 > URL: https://issues.apache.org/jira/browse/ARROW-5205 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Reporter: Wes McKinney >Priority: Major > Time Spent: 1h 40m > Remaining Estimate: 0h > > In a number of places if a string filepath is passed, it is assumed to be a > local file. Since we are developing better support for file URIs, we may be > able to detect that the user has passed an unsupported URI (e.g. something > starting with "s3:" or "hdfs:") and return a better error message than "local > file not found" > see > https://stackoverflow.com/questions/55704943/what-could-be-the-explanation-of-this-pyarrow-lib-arrowioerror/55707311#55707311 -- This message was sent by Atlassian JIRA (v7.6.14#76016)
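A sketch of the kind of scheme check being proposed, in Python; {{check_local_path}} is a hypothetical helper for illustration, not an existing pyarrow API:
{code:python}
from urllib.parse import urlparse

def check_local_path(path):
    # Hypothetical helper: reject URI schemes such as "s3:" or "hdfs:"
    # before assuming the string names a local file.
    scheme = urlparse(path).scheme
    # len(scheme) > 1 avoids misreading Windows drive letters ("C:\...")
    # as URI schemes.
    if len(scheme) > 1 and scheme != "file":
        raise ValueError(
            "Path %r has scheme %r; it is not a local file. "
            "Open it with the matching filesystem instead." % (path, scheme))
    return path

check_local_path("/tmp/data.parquet")         # ok
check_local_path("s3://bucket/data.parquet")  # raises a clear error
{code}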
[jira] [Commented] (ARROW-5932) undefined reference to `__cxa_init_primary_exception@CXXABI_1.3.11'
[ https://issues.apache.org/jira/browse/ARROW-5932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899167#comment-16899167 ] Francois Saint-Jacques commented on ARROW-5932: --- How did you install arrow, from sources? > undefined reference to `__cxa_init_primary_exception@CXXABI_1.3.11' > --- > > Key: ARROW-5932 > URL: https://issues.apache.org/jira/browse/ARROW-5932 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.14.0 > Environment: Linux Mint 19.1 Tessa > g++-6 >Reporter: Cong Ding >Priority: Critical > > I was installing Apache Arrow on my Linux Mint 19.1 Tessa server. I followed > the instructions on the official arrow website (using the Ubuntu 18.04 > method). However, when I was trying to compile the examples, the g++ compiler > threw some errors. > I have updated my g++ to g++-6, updated my libstdc++ library, and used the > -lstdc++ flag, but it still didn't work. > > {code:java} > // code placeholder > g++-6 -std=c++11 -larrow -lparquet main.cpp -lstdc++ > {code} > The error message: > /usr/lib/x86_64-linux-gnu/libarrow.so: undefined reference to > `__cxa_init_primary_exception@CXXABI_1.3.11' > /usr/lib/x86_64-linux-gnu/libarrow.so: undefined reference to > `std::__exception_ptr::exception_ptr::exception_ptr(void*)@CXXABI_1.3.11' > collect2: error: ld returned 1 exit status. > > I do not know what to do at this moment. Can anyone help me? -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6122) [C++] ArgSort kernel must support FixedSizeBinary
[ https://issues.apache.org/jira/browse/ARROW-6122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques updated ARROW-6122: -- Summary: [C++] ArgSort kernel must support FixedSizeBinary (was: [C++] IsIn kernel must support FixedSizeBinary) > [C++] ArgSort kernel must support FixedSizeBinary > - > > Key: ARROW-6122 > URL: https://issues.apache.org/jira/browse/ARROW-6122 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.15.0 >Reporter: Francois Saint-Jacques >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6123) [C++] ArgSort kernel should not materialize the output internal
[ https://issues.apache.org/jira/browse/ARROW-6123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques updated ARROW-6123: -- Summary: [C++] ArgSort kernel should not materialize the output internal (was: [C++] IsIn kernel should not materialize the output internal) > [C++] ArgSort kernel should not materialize the output internal > --- > > Key: ARROW-6123 > URL: https://issues.apache.org/jira/browse/ARROW-6123 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.15.0 >Reporter: Francois Saint-Jacques >Priority: Major > > It should use the helpers since the output size is known. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6124) [C++] IsIn kernel should sort in a single pass (with nulls)
Francois Saint-Jacques created ARROW-6124: - Summary: [C++] IsIn kernel should sort in a single pass (with nulls) Key: ARROW-6124 URL: https://issues.apache.org/jira/browse/ARROW-6124 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 0.15.0 Reporter: Francois Saint-Jacques There's a good chance that merge sort must be implemented (spill to disk, ChunkedArray, ...) -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6123) [C++] IsIn kernel should not materialize the output internal
[ https://issues.apache.org/jira/browse/ARROW-6123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques updated ARROW-6123: -- Labels: (was: ana) > [C++] IsIn kernel should not materialize the output internal > > > Key: ARROW-6123 > URL: https://issues.apache.org/jira/browse/ARROW-6123 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.15.0 >Reporter: Francois Saint-Jacques >Priority: Major > > It should use the helpers since the output size is known. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6123) [C++] IsIn kernel should not materialize the output internal
Francois Saint-Jacques created ARROW-6123: - Summary: [C++] IsIn kernel should not materialize the output internal Key: ARROW-6123 URL: https://issues.apache.org/jira/browse/ARROW-6123 Project: Apache Arrow Issue Type: Improvement Reporter: Francois Saint-Jacques It should use the helpers since the output size is known. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6123) [C++] IsIn kernel should not materialize the output internal
[ https://issues.apache.org/jira/browse/ARROW-6123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques updated ARROW-6123: -- Affects Version/s: 0.15.0 > [C++] IsIn kernel should not materialize the output internal > > > Key: ARROW-6123 > URL: https://issues.apache.org/jira/browse/ARROW-6123 > Project: Apache Arrow > Issue Type: Improvement >Affects Versions: 0.15.0 >Reporter: Francois Saint-Jacques >Priority: Major > > It should use the helpers since the output size is known. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6123) [C++] IsIn kernel should not materialize the output internal
[ https://issues.apache.org/jira/browse/ARROW-6123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques updated ARROW-6123: -- Labels: ana (was: ) > [C++] IsIn kernel should not materialize the output internal > > > Key: ARROW-6123 > URL: https://issues.apache.org/jira/browse/ARROW-6123 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.15.0 >Reporter: Francois Saint-Jacques >Priority: Major > Labels: ana > > It should use the helpers since the output size is known. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6123) [C++] IsIn kernel should not materialize the output internal
[ https://issues.apache.org/jira/browse/ARROW-6123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques updated ARROW-6123: -- Component/s: C++ > [C++] IsIn kernel should not materialize the output internal > > > Key: ARROW-6123 > URL: https://issues.apache.org/jira/browse/ARROW-6123 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.15.0 >Reporter: Francois Saint-Jacques >Priority: Major > > It should use the helpers since the output size is known. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6122) [C++] IsIn kernel must support FixedSizeBinary
Francois Saint-Jacques created ARROW-6122: - Summary: [C++] IsIn kernel must support FixedSizeBinary Key: ARROW-6122 URL: https://issues.apache.org/jira/browse/ARROW-6122 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 0.15.0 Reporter: Francois Saint-Jacques -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6121) [Tools] Improve merge tool cli ergonomic
[ https://issues.apache.org/jira/browse/ARROW-6121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6121: -- Labels: pull-request-available (was: ) > [Tools] Improve merge tool cli ergonomic > > > Key: ARROW-6121 > URL: https://issues.apache.org/jira/browse/ARROW-6121 > Project: Apache Arrow > Issue Type: Improvement > Components: Developer Tools >Reporter: Francois Saint-Jacques >Assignee: Francois Saint-Jacques >Priority: Trivial > Labels: pull-request-available > > * Accepts the pull-request number as an optional (first) parameter to the > script > * Supports reading the jira username/password from a file -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6121) [Tools] Improve merge tool cli ergonomic
Francois Saint-Jacques created ARROW-6121: - Summary: [Tools] Improve merge tool cli ergonomic Key: ARROW-6121 URL: https://issues.apache.org/jira/browse/ARROW-6121 Project: Apache Arrow Issue Type: Improvement Components: Developer Tools Reporter: Francois Saint-Jacques Assignee: Francois Saint-Jacques * Accepts the pull-request number as an optional (first) parameter to the script * Supports reading the jira username/password from a file -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Assigned] (ARROW-1566) [C++] Implement non-materializing sort kernels
[ https://issues.apache.org/jira/browse/ARROW-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques reassigned ARROW-1566: - Assignee: Artem Alekseev > [C++] Implement non-materializing sort kernels > -- > > Key: ARROW-1566 > URL: https://issues.apache.org/jira/browse/ARROW-1566 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Artem Alekseev >Priority: Major > Labels: Analytics, pull-request-available > Fix For: 0.15.0 > > Time Spent: 5h 50m > Remaining Estimate: 0h > > The output of such an operator would be a permutation vector that, if applied to > a column, would result in the data being sorted as requested. This is > similar to numpy's argsort functionality. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Resolved] (ARROW-1566) [C++] Implement non-materializing sort kernels
[ https://issues.apache.org/jira/browse/ARROW-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques resolved ARROW-1566. --- Resolution: Fixed Fix Version/s: 0.15.0 Issue resolved by pull request 4861 [https://github.com/apache/arrow/pull/4861] > [C++] Implement non-materializing sort kernels > -- > > Key: ARROW-1566 > URL: https://issues.apache.org/jira/browse/ARROW-1566 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: Analytics, pull-request-available > Fix For: 0.15.0 > > Time Spent: 5h 40m > Remaining Estimate: 0h > > The output of such an operator would be a permutation vector that, if applied to > a column, would result in the data being sorted as requested. This is > similar to numpy's argsort functionality. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6120) [C++][Gandiva] including some headers causes decimal_test to fail
Benjamin Kietzman created ARROW-6120: Summary: [C++][Gandiva] including some headers causes decimal_test to fail Key: ARROW-6120 URL: https://issues.apache.org/jira/browse/ARROW-6120 Project: Apache Arrow Issue Type: Bug Components: C++ - Gandiva Reporter: Benjamin Kietzman It seems this is due to precompiled code being contaminated with undesired headers. For example, {{#include }} in {{arrow/compare.h}} causes: {code} [ RUN ] TestDecimal.TestCastFunctions ../../src/gandiva/tests/decimal_test.cc:478: Failure Value of: (array_dec)->Equals(outputs[2], arrow::EqualOptions().nans_equal(true)) Actual: false Expected: true expected array: [ 1.23, 1.58, -1.23, -1.58 ] actual array: [ 0.00, 0.00, 0.00, 0.00 ] ../../src/gandiva/tests/decimal_test.cc:481: Failure Value of: (array_dec)->Equals(outputs[2], arrow::EqualOptions().nans_equal(true)) Actual: false Expected: true expected array: [ 1.23, 1.58, -1.23, -1.58 ] actual array: [ 0.00, 0.00, 0.00, 0.00 ] ../../src/gandiva/tests/decimal_test.cc:484: Failure Value of: (array_dec)->Equals(outputs[3], arrow::EqualOptions().nans_equal(true)) Actual: false Expected: true expected array: [ 1.23, 1.58, -1.23, -1.58 ] actual array: [ 0.00, 0.00, 0.00, 0.00 ] ../../src/gandiva/tests/decimal_test.cc:497: Failure Value of: (array_float64)->Equals(outputs[6], arrow::EqualOptions().nans_equal(true)) Actual: false Expected: true expected array: [ 1.23, 1.58, -1.23, -1.58 ] actual array: [ inf, inf, -inf, -inf ] [ FAILED ] TestDecimal.TestCastFunctions (134 ms) {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-6119) [Python] PyArrow import fails on Windows Python 3.7
[ https://issues.apache.org/jira/browse/ARROW-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899106#comment-16899106 ] Wes McKinney commented on ARROW-6119: - cc [~kszucs] > [Python] PyArrow import fails on Windows Python 3.7 > --- > > Key: ARROW-6119 > URL: https://issues.apache.org/jira/browse/ARROW-6119 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.14.0 > Environment: Windows, Python 3.7 >Reporter: Paul Suganthan >Priority: Major > > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > File "C:\Python37\lib\site-packages\pyarrow\__init__.py", line 49, in <module> > from pyarrow.lib import cpu_count, set_cpu_count > ImportError: DLL load failed: The specified procedure could not be found. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-6119) [Python] PyArrow import fails on Windows Python 3.7
[ https://issues.apache.org/jira/browse/ARROW-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899105#comment-16899105 ] Wes McKinney commented on ARROW-6119: - We pulled the 0.14.1 wheels because there was a different DLL load issue. I had thought the 0.14.0 wheels were working but I guess not. I hope someone can fix them before 0.15.0 > [Python] PyArrow import fails on Windows Python 3.7 > --- > > Key: ARROW-6119 > URL: https://issues.apache.org/jira/browse/ARROW-6119 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.14.0 > Environment: Windows, Python 3.7 >Reporter: Paul Suganthan >Priority: Major > > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > File "C:\Python37\lib\site-packages\pyarrow\__init__.py", line 49, in <module> > from pyarrow.lib import cpu_count, set_cpu_count > ImportError: DLL load failed: The specified procedure could not be found. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-6119) [Python] PyArrow import fails on Windows Python 3.7
[ https://issues.apache.org/jira/browse/ARROW-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899096#comment-16899096 ] Paul Suganthan commented on ARROW-6119: --- Installed using pip > [Python] PyArrow import fails on Windows Python 3.7 > --- > > Key: ARROW-6119 > URL: https://issues.apache.org/jira/browse/ARROW-6119 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.14.0 > Environment: Windows, Python 3.7 >Reporter: Paul Suganthan >Priority: Major > > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > File "C:\Python37\lib\site-packages\pyarrow\__init__.py", line 49, in <module> > from pyarrow.lib import cpu_count, set_cpu_count > ImportError: DLL load failed: The specified procedure could not be found. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-6119) [Python] PyArrow import fails on Windows Python 3.7
[ https://issues.apache.org/jira/browse/ARROW-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899091#comment-16899091 ] Uwe L. Korn commented on ARROW-6119: How did you install this? Did you use conda (preferred) or pip or did you compile it yourself? > [Python] PyArrow import fails on Windows Python 3.7 > --- > > Key: ARROW-6119 > URL: https://issues.apache.org/jira/browse/ARROW-6119 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.14.0 > Environment: Windows, Python 3.7 >Reporter: Paul Suganthan >Priority: Major > > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > File "C:\Python37\lib\site-packages\pyarrow\__init__.py", line 49, in <module> > from pyarrow.lib import cpu_count, set_cpu_count > ImportError: DLL load failed: The specified procedure could not be found. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6119) [Python] PyArrow import fails on Windows Python 3.7
[ https://issues.apache.org/jira/browse/ARROW-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Suganthan updated ARROW-6119: -- Summary: [Python] PyArrow import fails on Windows Python 3.7 (was: PyArrow import fails on Windows Python 3.7) > [Python] PyArrow import fails on Windows Python 3.7 > --- > > Key: ARROW-6119 > URL: https://issues.apache.org/jira/browse/ARROW-6119 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.14.0 > Environment: Windows, Python 3.7 >Reporter: Paul Suganthan >Priority: Major > > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > File "C:\Python37\lib\site-packages\pyarrow\__init__.py", line 49, in <module> > from pyarrow.lib import cpu_count, set_cpu_count > ImportError: DLL load failed: The specified procedure could not be found. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6119) PyArrow import fails on Windows Python 3.7
Paul Suganthan created ARROW-6119: - Summary: PyArrow import fails on Windows Python 3.7 Key: ARROW-6119 URL: https://issues.apache.org/jira/browse/ARROW-6119 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.14.0 Environment: Windows, Python 3.7 Reporter: Paul Suganthan Traceback (most recent call last): File "<stdin>", line 1, in <module> File "C:\Python37\lib\site-packages\pyarrow\__init__.py", line 49, in <module> from pyarrow.lib import cpu_count, set_cpu_count ImportError: DLL load failed: The specified procedure could not be found. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-3325) [Python] Support reading Parquet binary/string columns directly as DictionaryArray
[ https://issues.apache.org/jira/browse/ARROW-3325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-3325: -- Labels: parquet pull-request-available (was: parquet) > [Python] Support reading Parquet binary/string columns directly as > DictionaryArray > -- > > Key: ARROW-3325 > URL: https://issues.apache.org/jira/browse/ARROW-3325 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Labels: parquet, pull-request-available > Fix For: 1.0.0 > > > Requires PARQUET-1324 and probably quite a bit of extra work > Properly implementing this will require dictionary normalization across row > groups. When reading a new row group, a fast path that compares the current > dictionary with the prior dictionary should be used. This also needs to > handle the case where a column chunk "fell back" to PLAIN encoding mid-stream -- This message was sent by Atlassian JIRA (v7.6.14#76016)
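On the Python side this work surfaces as a {{read_dictionary}} option on the Parquet read functions; a sketch under that assumption, with illustrative file and column names:
{code:python}
import pyarrow.parquet as pq

# 'strings.parquet' and 'col' are illustrative names. Columns listed in
# read_dictionary are decoded directly into DictionaryArray instead of
# dense strings, with dictionaries normalized across row groups inside
# the reader.
table = pq.read_table('strings.parquet', read_dictionary=['col'])
print(table.schema.field('col').type)  # dictionary<values=string, ...>
{code}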
[jira] [Resolved] (ARROW-5776) [Gandiva][Crossbow] Revert template to have commit ids.
[ https://issues.apache.org/jira/browse/ARROW-5776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pindikura Ravindra resolved ARROW-5776. --- Resolution: Fixed Fix Version/s: 0.15.0 Issue resolved by pull request 4738 [https://github.com/apache/arrow/pull/4738] > [Gandiva][Crossbow] Revert template to have commit ids. > --- > > Key: ARROW-5776 > URL: https://issues.apache.org/jira/browse/ARROW-5776 > Project: Apache Arrow > Issue Type: Bug >Reporter: Praveen Kumar Desabandu >Assignee: Praveen Kumar Desabandu >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 4.5h > Remaining Estimate: 0h > > We are dependent on the commit ids being present in the crossbow Travis > templates so that we can sync our builds against the same commit id that was > used to create the artifacts. > So reverting fetch-head back to arrow-head. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-5610) [Python] Define extension type API in Python to "receive" or "send" a foreign extension type
[ https://issues.apache.org/jira/browse/ARROW-5610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16898956#comment-16898956 ] lidavidm commented on ARROW-5610: - My apologies, I ended up being too busy to look at this. Thanks for the issue pointers. > [Python] Define extension type API in Python to "receive" or "send" a foreign > extension type > > > Key: ARROW-5610 > URL: https://issues.apache.org/jira/browse/ARROW-5610 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > In work in ARROW-840, a static {{arrow.py_extension_type}} name is used. > There will be cases where an extension type is coming from another > programming language (e.g. Java), so it would be useful to be able to "plug > in" a Python extension type subclass that will be used to deserialize the > extension type coming over the wire. This has some different API requirements > since the serialized representation of the type will not have knowledge of > Python pickling, etc. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-5610) [Python] Define extension type API in Python to "receive" or "send" a foreign extension type
[ https://issues.apache.org/jira/browse/ARROW-5610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16898927#comment-16898927 ] Joris Van den Bossche commented on ARROW-5610: -- {quote}I'll try to take a pass this week, if time permits; we would like this functionality{quote} Did you further look at this? {quote} By the way, is there a Jira explicitly for being able to hook into to_pandas, or a suggested way to efficiently do a custom Pandas conversion?) {quote} There is ARROW-2428 for this about a hook into {{to_pandas}} to specify a custom conversion (there is also ARROW-5271 for the other way around: be able to specify the final arrow array in pandas -> arrow conversion). > [Python] Define extension type API in Python to "receive" or "send" a foreign > extension type > > > Key: ARROW-5610 > URL: https://issues.apache.org/jira/browse/ARROW-5610 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > In work in ARROW-840, a static {{arrow.py_extension_type}} name is used. > There will be cases where an extension type is coming from another > programming language (e.g. Java), so it would be useful to be able to "plug > in" a Python extension type subclass that will be used to deserialize the > extension type coming over the wire. This has some different API requirements > since the serialized representation of the type will not have knowledge of > Python pickling, etc. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Comment Edited] (ARROW-5610) [Python] Define extension type API in Python to "receive" or "send" a foreign extension type
[ https://issues.apache.org/jira/browse/ARROW-5610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16898927#comment-16898927 ] Joris Van den Bossche edited comment on ARROW-5610 at 8/2/19 2:48 PM: -- {quote}I'll try to take a pass this week, if time permits; we would like this functionality{quote} [~lidavidm] Did you further look at this? {quote} By the way, is there a Jira explicitly for being able to hook into to_pandas, or a suggested way to efficiently do a custom Pandas conversion?) {quote} There is ARROW-2428 for this about a hook into {{to_pandas}} to specify a custom conversion (there is also ARROW-5271 for the other way around: be able to specify the final arrow array in pandas -> arrow conversion). was (Author: jorisvandenbossche): {quote}I'll try to take a pass this week, if time permits; we would like this functionality{quote} Did you further look at this? {quote} By the way, is there a Jira explicitly for being able to hook into to_pandas, or a suggested way to efficiently do a custom Pandas conversion?) {quote} There is ARROW-2428 for this about a hook into {{to_pandas}} to specify a custom conversion (there is also ARROW-5271 for the other way around: be able to specify the final arrow array in pandas -> arrow conversion). > [Python] Define extension type API in Python to "receive" or "send" a foreign > extension type > > > Key: ARROW-5610 > URL: https://issues.apache.org/jira/browse/ARROW-5610 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > In work in ARROW-840, a static {{arrow.py_extension_type}} name is used. > There will be cases where an extension type is coming from another > programming language (e.g. Java), so it would be useful to be able to "plug > in" a Python extension type subclass that will be used to deserialize the > extension type coming over the wire. This has some different API requirements > since the serialized representation of the type will not have knowledge of > Python pickling, etc. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
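For context, the register-by-name mechanism discussed here looks roughly like the following in Python; a sketch with an illustrative "example.uuid" type:
{code:python}
import pyarrow as pa

class UuidType(pa.ExtensionType):
    """Illustrative extension type over 16-byte fixed-size binary."""

    def __init__(self):
        pa.ExtensionType.__init__(self, pa.binary(16), "example.uuid")

    def __arrow_ext_serialize__(self):
        # Language-neutral metadata bytes (no Python pickling), so a
        # Java or C++ producer can emit the same representation.
        return b""

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return UuidType()

# Registering under the name lets pyarrow re-attach this class when an
# "example.uuid" extension type arrives over the wire.
pa.register_extension_type(UuidType())
{code}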
[jira] [Assigned] (ARROW-5876) [FlightRPC] Implement basic auth across all languages
[ https://issues.apache.org/jira/browse/ARROW-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Murray reassigned ARROW-5876: -- Assignee: Ryan Murray > [FlightRPC] Implement basic auth across all languages > - > > Key: ARROW-5876 > URL: https://issues.apache.org/jira/browse/ARROW-5876 > Project: Apache Arrow > Issue Type: Improvement > Components: FlightRPC >Affects Versions: 0.14.0 >Reporter: lidavidm >Assignee: Ryan Murray >Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > We should implement a set of common auth methods in Flight itself to have > standardized ways to do things like basic auth. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-6107) [Go] ipc.Writer Option to skip appending data buffers
[ https://issues.apache.org/jira/browse/ARROW-6107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16898898#comment-16898898 ] Nick Poorman commented on ARROW-6107: - https://issues.apache.org/jira/browse/ARROW-4852 is the same use case I'm thinking of. If you have an Arrow Table in C (or Python) and you want to access the data in Go, you can pass a pointer back from C to the underlying data buffers. However, you still have to collect all the metadata to utilize the buffers. Making CGO calls is slow, so being able to pass a pointer to the data buffers and a pointer to the serialized metadata would ensure a more constant time when crossing the language boundary. I did a simple POC to demonstrate what it would take to collect all the information from Python and re-materialize it in Go. [https://github.com/nickpoorman/go-py-arrow-bridge] The bottleneck is the number of CGO calls required to fetch all the metadata. > [Go] ipc.Writer Option to skip appending data buffers > - > > Key: ARROW-6107 > URL: https://issues.apache.org/jira/browse/ARROW-6107 > Project: Apache Arrow > Issue Type: Improvement > Components: Go >Reporter: Nick Poorman >Priority: Minor > > For cases where we have a known shared memory region, it would be great if > the ipc.Writer (and by extension ipc.Reader?) had the ability to write out > everything but the actual buffers holding the data. That way we can still > utilize the ipc mechanisms to communicate without having to serialize all the > underlying data across the wire. > > This seems like it should be possible since the `RecordBatch` flatbuffers > only contain the metadata and the underlying data buffers are appended later. > We just need to skip appending the underlying data buffers. > > [~sbinet] thoughts? -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-6107) [Go] ipc.Writer Option to skip appending data buffers
[ https://issues.apache.org/jira/browse/ARROW-6107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16898880#comment-16898880 ] Sebastien Binet commented on ARROW-6107: not saying it wouldn't be advisable nor doable, but: if it's already in a shmem region, why not just use that already? (and I guess it's kind of implementing: https://issues.apache.org/jira/browse/ARROW-4852) > [Go] ipc.Writer Option to skip appending data buffers > - > > Key: ARROW-6107 > URL: https://issues.apache.org/jira/browse/ARROW-6107 > Project: Apache Arrow > Issue Type: Improvement > Components: Go >Reporter: Nick Poorman >Priority: Minor > > For cases where we have a known shared memory region, it would be great if > the ipc.Writer (and by extension ipc.Reader?) had the ability to write out > everything but the actual buffers holding the data. That way we can still > utilize the ipc mechanisms to communicate without having to serialize all the > underlying data across the wire. > > This seems like it should be possible since the `RecordBatch` flatbuffers > only contain the metadata and the underlying data buffers are appended later. > We just need to skip appending the underlying data buffers. > > [~sbinet] thoughts? -- This message was sent by Atlassian JIRA (v7.6.14#76016)
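To illustrate the metadata-only idea from the Python side: an Arrow schema already serializes to a single flatbuffer-backed buffer, so one pointer/length pair can cross the language boundary instead of one cgo call per field. A small sketch:
{code:python}
import pyarrow as pa

schema = pa.schema([("x", pa.int64()), ("y", pa.float64())])

# One contiguous buffer of serialized metadata; its address and length
# are all that would need to cross the C boundary.
buf = schema.serialize()

# The receiving side rebuilds the schema from the same bytes.
assert pa.ipc.read_schema(buf).equals(schema)
{code}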
[jira] [Updated] (ARROW-5876) [FlightRPC] Implement basic auth across all languages
[ https://issues.apache.org/jira/browse/ARROW-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-5876: -- Labels: pull-request-available (was: ) > [FlightRPC] Implement basic auth across all languages > - > > Key: ARROW-5876 > URL: https://issues.apache.org/jira/browse/ARROW-5876 > Project: Apache Arrow > Issue Type: Improvement > Components: FlightRPC >Affects Versions: 0.14.0 >Reporter: lidavidm >Priority: Major > Labels: pull-request-available > > We should implement a set of common auth methods in Flight itself to have > standardized ways to do things like basic auth. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-6106) Scala lang support
[ https://issues.apache.org/jira/browse/ARROW-6106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16898878#comment-16898878 ] Wes McKinney commented on ARROW-6106: - You might want to discuss this on the mailing list > Scala lang support > -- > > Key: ARROW-6106 > URL: https://issues.apache.org/jira/browse/ARROW-6106 > Project: Apache Arrow > Issue Type: Wish >Reporter: Boris V.Kuznetsov >Priority: Major > > I ported testArrowStream.java to Scala Specs2 and added it to the PR. > Please see more details in my [PR|https://github.com/apache/arrow/pull/4989]. > I'm ready to port other tests as well and add an SBT file > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6069) [Rust] [Parquet] Implement Converter to convert record reader to arrow primitive array.
[ https://issues.apache.org/jira/browse/ARROW-6069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6069: -- Labels: pull-request-available (was: ) > [Rust] [Parquet] Implement Converter to convert record reader to arrow > primitive array. > --- > > Key: ARROW-6069 > URL: https://issues.apache.org/jira/browse/ARROW-6069 > Project: Apache Arrow > Issue Type: Sub-task >Reporter: Renjie Liu >Assignee: Renjie Liu >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6118) [Java] Replace google Preconditions with Arrow Preconditions
[ https://issues.apache.org/jira/browse/ARROW-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6118: -- Labels: pull-request-available (was: ) > [Java] Replace google Preconditions with Arrow Preconditions > > > Key: ARROW-6118 > URL: https://issues.apache.org/jira/browse/ARROW-6118 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Ji Liu >Assignee: Ji Liu >Priority: Critical > Labels: pull-request-available > > Now in Java code, most places use {{org.apache.arrow.util.Preconditions}}, > but some places still use {{com.google.common.base.Preconditions}}. > Remove google Preconditions and, at the same time, remove duplicated checks. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6118) [Java] Replace google Preconditions with Arrow Preconditions
Ji Liu created ARROW-6118: - Summary: [Java] Replace google Preconditions with Arrow Preconditions Key: ARROW-6118 URL: https://issues.apache.org/jira/browse/ARROW-6118 Project: Apache Arrow Issue Type: Improvement Reporter: Ji Liu Assignee: Ji Liu Now in Java code, most places use {{org.apache.arrow.util.Preconditions}}, but some places still use {{com.google.common.base.Preconditions}}. Remove google Preconditions and, at the same time, remove duplicated checks. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-5682) [Python] from_pandas conversion casts values to string inconsistently
[ https://issues.apache.org/jira/browse/ARROW-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16898787#comment-16898787 ] Joris Van den Bossche commented on ARROW-5682: -- This seems to be specific to the code paths dealing with numpy arrays, as from built-in python objects, you get a logical error: {code} In [9]: pa.array([1, 2, 3], pa.string()) ... ArrowTypeError: Expected a string or bytes object, got a 'int' object In [10]: pa.array(np.array([1, 2, 3]), pa.string()) Out[10]: [ "", # <-- this is actually not an empty string but '\x01' "", "" ] {code} I agree that at least an error should be raised instead of those incorrect values. In numpy you can cast ints to their string representation by doing an equivalent call: {code} In [13]: np.array(np.array([1, 2, 3], dtype=int), dtype=str) Out[13]: array(['1', '2', '3'], dtype='<U21') {code} > [Python] from_pandas conversion casts values to string inconsistently > - > > Key: ARROW-5682 > URL: https://issues.apache.org/jira/browse/ARROW-5682 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.13.0 >Reporter: Bryan Cutler >Priority: Minor > > When calling {{pa.Array.from_pandas}} with primitive data as input, and casting to > string with "type=pa.string()", the resulting pyarrow Array can have > inconsistent values. For most input, the result is an empty string; however, > for some types (int32, int64) the values are '\x01' etc. > {noformat} > In [8]: s = pd.Series([1, 2, 3], dtype=np.uint8) > In [9]: pa.Array.from_pandas(s, type=pa.string()) > > Out[9]: > > [ > "", > "", > "" > ] > In [10]: s = pd.Series([1, 2, 3], dtype=np.uint32) > > In [11]: pa.Array.from_pandas(s, type=pa.string()) > > Out[11]: > > [ > "", > "", > "" > ] > {noformat} > This came from the Spark discussion > https://github.com/apache/spark/pull/24930/files#r296187903. Type casting > this way in Spark is not supported, but it would be good to get the behavior > consistent. Would it be better to raise an UnsupportedOperation error? -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-5682) [Python] from_pandas conversion casts values to string inconsistently
[ https://issues.apache.org/jira/browse/ARROW-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-5682: - Issue Type: Bug (was: Improvement) > [Python] from_pandas conversion casts values to string inconsistently > - > > Key: ARROW-5682 > URL: https://issues.apache.org/jira/browse/ARROW-5682 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.13.0 >Reporter: Bryan Cutler >Priority: Minor > > When calling {{pa.Array.from_pandas}} with primitive data as input, and casting to > string with "type=pa.string()", the resulting pyarrow Array can have > inconsistent values. For most input, the result is an empty string; however, > for some types (int32, int64) the values are '\x01' etc. > {noformat} > In [8]: s = pd.Series([1, 2, 3], dtype=np.uint8) > In [9]: pa.Array.from_pandas(s, type=pa.string()) > > Out[9]: > > [ > "", > "", > "" > ] > In [10]: s = pd.Series([1, 2, 3], dtype=np.uint32) > > In [11]: pa.Array.from_pandas(s, type=pa.string()) > > Out[11]: > > [ > "", > "", > "" > ] > {noformat} > This came from the Spark discussion > https://github.com/apache/spark/pull/24930/files#r296187903. Type casting > this way in Spark is not supported, but it would be good to get the behavior > consistent. Would it be better to raise an UnsupportedOperation error? -- This message was sent by Atlassian JIRA (v7.6.14#76016)
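Until the conversion is made consistent (or raises), the safe spelling of the intended cast is to do it in numpy first; a small workaround sketch:
{code:python}
import numpy as np
import pyarrow as pa

s = np.array([1, 2, 3], dtype=np.uint32)

# Asking pyarrow to reinterpret integer storage as pa.string() yields
# the empty/'\x01' values shown above; casting to str in numpy first
# gives the intended "1", "2", "3".
arr = pa.array(s.astype(str), type=pa.string())
{code}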
[jira] [Updated] (ARROW-6117) [Java] Fix the set method of FixedSizeBinaryVector
[ https://issues.apache.org/jira/browse/ARROW-6117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6117: -- Labels: pull-request-available (was: ) > [Java] Fix the set method of FixedSizeBinaryVector > -- > > Key: ARROW-6117 > URL: https://issues.apache.org/jira/browse/ARROW-6117 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Liya Fan >Assignee: Liya Fan >Priority: Minor > Labels: pull-request-available > > For the set method, if the parameter is null, it should clear the validity > bit. However, the current implementation throws a NullPointerException. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-6025) [Gandiva][Test] Error handling for missing timezone in castTIMESTAMP_utf8 tests
[ https://issues.apache.org/jira/browse/ARROW-6025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16898778#comment-16898778 ] Pindikura Ravindra commented on ARROW-6025: --- thanks [~kszucs] - we'll use this Jira to handle missing timezones. I believe we already hit this on Windows too, and disabled the tests there. > [Gandiva][Test] Error handling for missing timezone in castTIMESTAMP_utf8 > tests > --- > > Key: ARROW-6025 > URL: https://issues.apache.org/jira/browse/ARROW-6025 > Project: Apache Arrow > Issue Type: Bug > Components: C++ - Gandiva >Reporter: Krisztian Szucs >Assignee: Prudhvi Porandla >Priority: Major > > I've recently enabled gandiva in the conda c++ ursabot builders. The > container doesn't contain the required timezones, so the tests are failing: > {code} > ../src/gandiva/precompiled/time_test.cc:103: Failure > Expected equality of these values: > castTIMESTAMP_utf8(context_ptr, "2000-09-23 9:45:30.920 Canada/Pacific", 37) > Which is: 0 > 969727530920 > ../src/gandiva/precompiled/time_test.cc:105: Failure > Expected equality of these values: > castTIMESTAMP_utf8(context_ptr, "2012-02-28 23:30:59 Asia/Kolkata", 32) > Which is: 0 > 1330452059000 > ../src/gandiva/precompiled/time_test.cc:107: Failure > Expected equality of these values: > castTIMESTAMP_utf8(context_ptr, "1923-10-07 03:03:03 America/New_York", 36) > Which is: 0 > -1459094217000 > {code} > See build: > https://ci.ursalabs.org/#/builders/66/builds/3046/steps/8/logs/stdio -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Assigned] (ARROW-6025) [Gandiva][Test] Error handling for missing timezone in castTIMESTAMP_utf8 tests
[ https://issues.apache.org/jira/browse/ARROW-6025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pindikura Ravindra reassigned ARROW-6025: - Assignee: Prudhvi Porandla > [Gandiva][Test] Error handling for missing timezone in castTIMESTAMP_utf8 > tests > --- > > Key: ARROW-6025 > URL: https://issues.apache.org/jira/browse/ARROW-6025 > Project: Apache Arrow > Issue Type: Bug > Components: C++ - Gandiva >Reporter: Krisztian Szucs >Assignee: Prudhvi Porandla >Priority: Major > > I've recently enabled gandiva in the conda c++ ursabot builders. The > container doesn't contain the required timezones, so the tests are failing: > {code} > ../src/gandiva/precompiled/time_test.cc:103: Failure > Expected equality of these values: > castTIMESTAMP_utf8(context_ptr, "2000-09-23 9:45:30.920 Canada/Pacific", 37) > Which is: 0 > 969727530920 > ../src/gandiva/precompiled/time_test.cc:105: Failure > Expected equality of these values: > castTIMESTAMP_utf8(context_ptr, "2012-02-28 23:30:59 Asia/Kolkata", 32) > Which is: 0 > 1330452059000 > ../src/gandiva/precompiled/time_test.cc:107: Failure > Expected equality of these values: > castTIMESTAMP_utf8(context_ptr, "1923-10-07 03:03:03 America/New_York", 36) > Which is: 0 > -1459094217000 > {code} > See build: > https://ci.ursalabs.org/#/builders/66/builds/3046/steps/8/logs/stdio -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6117) [Java] Fix the set method of FixedSizeBinaryVector
Liya Fan created ARROW-6117: --- Summary: [Java] Fix the set method of FixedSizeBinaryVector Key: ARROW-6117 URL: https://issues.apache.org/jira/browse/ARROW-6117 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: Liya Fan Assignee: Liya Fan For the set method, if the parameter is null, it should clear the validity bit. However, the current implementation throws a NullPointerException. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Resolved] (ARROW-6116) [C++][Gandiva] Fix bug in TimedTestFilterAdd2
[ https://issues.apache.org/jira/browse/ARROW-6116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pindikura Ravindra resolved ARROW-6116. --- Resolution: Fixed Fix Version/s: 0.15.0 > [C++][Gandiva] Fix bug in TimedTestFilterAdd2 > - > > Key: ARROW-6116 > URL: https://issues.apache.org/jira/browse/ARROW-6116 > Project: Apache Arrow > Issue Type: Bug > Components: C++ - Gandiva >Reporter: Pindikura Ravindra >Priority: Major > Fix For: 0.15.0 > > The test should be f0 + f1 < f2; instead it's doing f1 + f2 < f2. This was > reported via a PR > > [https://github.com/apache/arrow/pull/4976] -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Comment Edited] (ARROW-6112) [Java] Update APIs to support 64-bit address space
[ https://issues.apache.org/jira/browse/ARROW-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16898754#comment-16898754 ] Pindikura Ravindra edited comment on ARROW-6112 at 8/2/19 10:02 AM: sorry, i mistakenly put this Jira ID for an [unrelated PR|https://github.com/apache/arrow/pull/4976] - fixed now. was (Author: pravindra): sorry, i mistakenly put this Jira ID for an [unrelated PR |[https://github.com/apache/arrow/pull/4976]]- fixed now. > [Java] Update APIs to support 64-bit address space > -- > > Key: ARROW-6112 > URL: https://issues.apache.org/jira/browse/ARROW-6112 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Micah Kornfield >Assignee: Micah Kornfield >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > The arrow spec allows for 64 bit address range for buffers (and arrays) we > should support this at the API level in Java even if the current Netty > backing buffers don't support it. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-6112) [Java] Update APIs to support 64-bit address space
[ https://issues.apache.org/jira/browse/ARROW-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16898754#comment-16898754 ] Pindikura Ravindra commented on ARROW-6112: --- sorry, i mistakenly put this Jira ID for an [unrelated PR |[https://github.com/apache/arrow/pull/4976]]- fixed now. > [Java] Update APIs to support 64-bit address space > -- > > Key: ARROW-6112 > URL: https://issues.apache.org/jira/browse/ARROW-6112 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Micah Kornfield >Assignee: Micah Kornfield >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > The arrow spec allows for 64 bit address range for buffers (and arrays) we > should support this at the API level in Java even if the current Netty > backing buffers don't support it. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Reopened] (ARROW-6112) [Java] Update APIs to support 64-bit address space
[ https://issues.apache.org/jira/browse/ARROW-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pindikura Ravindra reopened ARROW-6112: --- > [Java] Update APIs to support 64-bit address space > -- > > Key: ARROW-6112 > URL: https://issues.apache.org/jira/browse/ARROW-6112 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Micah Kornfield >Assignee: Micah Kornfield >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 10m > Remaining Estimate: 0h > > The arrow spec allows for 64 bit address range for buffers (and arrays) we > should support this at the API level in Java even if the current Netty > backing buffers don't support it. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6112) [Java] Update APIs to support 64-bit address space
[ https://issues.apache.org/jira/browse/ARROW-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pindikura Ravindra updated ARROW-6112: -- Fix Version/s: (was: 0.15.0) > [Java] Update APIs to support 64-bit address space > -- > > Key: ARROW-6112 > URL: https://issues.apache.org/jira/browse/ARROW-6112 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Micah Kornfield >Assignee: Micah Kornfield >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > The arrow spec allows for 64 bit address range for buffers (and arrays) we > should support this at the API level in Java even if the current Netty > backing buffers don't support it. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Issue Comment Deleted] (ARROW-6112) [Java] Update APIs to support 64-bit address space
[ https://issues.apache.org/jira/browse/ARROW-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pindikura Ravindra updated ARROW-6112: -- Comment: was deleted (was: Issue resolved by pull request 4976 [https://github.com/apache/arrow/pull/4976]) > [Java] Update APIs to support 64-bit address space > -- > > Key: ARROW-6112 > URL: https://issues.apache.org/jira/browse/ARROW-6112 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Micah Kornfield >Assignee: Micah Kornfield >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 10m > Remaining Estimate: 0h > > The arrow spec allows for 64 bit address range for buffers (and arrays) we > should support this at the API level in Java even if the current Netty > backing buffers don't support it. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6112) [Java] Update APIs to support 64-bit address space
[ https://issues.apache.org/jira/browse/ARROW-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6112: -- Labels: pull-request-available (was: ) > [Java] Update APIs to support 64-bit address space > -- > > Key: ARROW-6112 > URL: https://issues.apache.org/jira/browse/ARROW-6112 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Micah Kornfield >Assignee: Micah Kornfield >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > > The arrow spec allows for 64 bit address range for buffers (and arrays) we > should support this at the API level in Java even if the current Netty > backing buffers don't support it. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Resolved] (ARROW-6112) [Java] Update APIs to support 64-bit address space
[ https://issues.apache.org/jira/browse/ARROW-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pindikura Ravindra resolved ARROW-6112. --- Resolution: Fixed Fix Version/s: 0.15.0 Issue resolved by pull request 4976 [https://github.com/apache/arrow/pull/4976] > [Java] Update APIs to support 64-bit address space > -- > > Key: ARROW-6112 > URL: https://issues.apache.org/jira/browse/ARROW-6112 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Micah Kornfield >Assignee: Micah Kornfield >Priority: Major > Fix For: 0.15.0 > > > The arrow spec allows for 64 bit address range for buffers (and arrays) we > should support this at the API level in Java even if the current Netty > backing buffers don't support it. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-5917) [Java] Redesign the dictionary encoder
[ https://issues.apache.org/jira/browse/ARROW-5917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-5917: -- Labels: pull-request-available (was: ) > [Java] Redesign the dictionary encoder > -- > > Key: ARROW-5917 > URL: https://issues.apache.org/jira/browse/ARROW-5917 > Project: Apache Arrow > Issue Type: New Feature > Components: Java >Reporter: Liya Fan >Assignee: Liya Fan >Priority: Major > Labels: pull-request-available > > The current dictionary encoder implementation > (org.apache.arrow.vector.dictionary.DictionaryEncoder) has heavy performance > overhead, which prevents it from being useful in practice: > # There are repeated conversions between Java objects and bytes (e.g. > vector.getObject(i)). > # Unnecessary memory copy (the vector data must be copied to the hash table). > # The hash table cannot be reused for encoding multiple vectors (other data > structure & results cannot be reused either). > # The output vector should not be created/managed by the encoder (just like > in the out-of-place sorter) > # The hash table requires that the hashCode & equals methods be implemented > appropriately, but this is not guaranteed. > We plan to implement a new one in the algorithm module, and gradually > deprecate the current one. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
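For readers unfamiliar with the operation this encoder performs, here is dictionary encoding illustrated with pyarrow's existing (C++-backed) kernel. This is only an illustration of the concept; the JIRA above concerns the Java implementation, not this API.
{code:python}
import pyarrow as pa

# Dictionary encoding replaces repeated values with small integer indices
# into a dictionary of the distinct values -- the operation the Java
# DictionaryEncoder performs on value vectors.
arr = pa.array(['foo', 'bar', 'foo', 'foo'])
encoded = arr.dictionary_encode()

print(encoded.indices)     # [0, 1, 0, 0]
print(encoded.dictionary)  # ["foo", "bar"]
{code}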
[jira] [Assigned] (ARROW-6002) [C++][Gandiva] TestCastFunctions does not test int64 casting
[ https://issues.apache.org/jira/browse/ARROW-6002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pindikura Ravindra reassigned ARROW-6002: - Assignee: Benjamin Kietzman > [C++][Gandiva] TestCastFunctions does not test int64 casting > - > > Key: ARROW-6002 > URL: https://issues.apache.org/jira/browse/ARROW-6002 > Project: Apache Arrow > Issue Type: Bug > Components: C++ - Gandiva >Reporter: Benjamin Kietzman >Assignee: Benjamin Kietzman >Priority: Minor > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > {{outputs[2]}} (corresponds to cast from float32) is checked twice > https://github.com/apache/arrow/pull/4817/files#diff-2e911c4dcae01ea2d3ce200892a0179aR478 > while {{outputs[1]}} is not checked (corresponds to cast from int64) -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Resolved] (ARROW-6002) [C++][Gandiva] TestCastFunctions does not test int64 casting
[ https://issues.apache.org/jira/browse/ARROW-6002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pindikura Ravindra resolved ARROW-6002. --- Resolution: Fixed Fix Version/s: 0.15.0 Issue resolved by pull request 4991 [https://github.com/apache/arrow/pull/4991] > [C++][Gandiva] TestCastFunctions does not test int64 casting > - > > Key: ARROW-6002 > URL: https://issues.apache.org/jira/browse/ARROW-6002 > Project: Apache Arrow > Issue Type: Bug > Components: C++ - Gandiva >Reporter: Benjamin Kietzman >Priority: Minor > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 20m > Remaining Estimate: 0h > > {{outputs[2]}} (corresponds to cast from float32) is checked twice > https://github.com/apache/arrow/pull/4817/files#diff-2e911c4dcae01ea2d3ce200892a0179aR478 > while {{outputs[1]}} is not checked (corresponds to cast from int64) -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6114) [Python] Datatypes are not preserved when a pandas dataframe partitioned and saved as parquet file using pyarrow
[ https://issues.apache.org/jira/browse/ARROW-6114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-6114: - Summary: [Python] Datatypes are not preserved when a pandas dataframe partitioned and saved as parquet file using pyarrow (was: Datatypes are not preserved when a pandas dataframe partitioned and saved as parquet file using pyarrow) > [Python] Datatypes are not preserved when a pandas dataframe partitioned and > saved as parquet file using pyarrow > > > Key: ARROW-6114 > URL: https://issues.apache.org/jira/browse/ARROW-6114 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.14.1 > Environment: Python 3.7.3 > pyarrow 0.14.1 >Reporter: Naga >Priority: Major > Labels: dataset, parquet > > h3. Datatypes are not preserved when a pandas data frame is *partitioned* and > saved as parquet file using pyarrow but that's not the case when the data > frame is not partitioned. > *Case 1: Saving a partitioned dataset - Data Types are NOT preserved* > {code:java} > # Saving a Pandas Dataframe to Local as a partioned parquet file using pyarrow > import pandas as pd > df = pd.DataFrame( {'age': [77,32,234],'name':['agan','bbobby','test'] } > ) > path = 'test' > partition_cols=['age'] > print('Datatypes before saving the dataset') > print(df.dtypes) > table = pa.Table.from_pandas(df) > pq.write_to_dataset(table, path, partition_cols=partition_cols, > preserve_index=False) > # Loading a dataset partioned parquet dataset from local > df = pq.ParquetDataset(path, filesystem=None).read_pandas().to_pandas() > print('\nDatatypes after loading the dataset') > print(df.dtypes) > {code} > *Output:* > {code:java} > Datatypes before saving the dataset > age int64 > name object > dtype: object > Datatypes after loading the dataset > name object > age category > dtype: object > {code} > h5. {color:#d04437}From the above output, we could see that the data type for > age is int64 in the original pandas data frame but it got changed to category > when we saved to local and loaded back.{color} > *Case 2: Non-partitioned dataset - Data types are preserved* > {code:java} > import pandas as pd > print('Saving a Pandas Dataframe to Local as a parquet file without > partitioning using pyarrow') > df = pd.DataFrame( > {'age': [77,32,234],'name':['agan','bbobby','test'] } > ) > path = 'test_without_partition' > print('Datatypes before saving the dataset') > print(df.dtypes) > table = pa.Table.from_pandas(df) > pq.write_to_dataset(table, path, preserve_index=False) > # Loading a non-partioned parquet file from local > df = pq.ParquetDataset(path, filesystem=None).read_pandas().to_pandas() > print('\nDatatypes after loading the dataset') > print(df.dtypes) > {code} > *Output:* > {code:java} > Saving a Pandas Dataframe to Local as a parquet file without partitioning > using pyarrow > Datatypes before saving the dataset > age int64 > name object > dtype: object > Datatypes after loading the dataset > age int64 > name object > dtype: object > {code} > *Versions* > * Python 3.7.3 > * pyarrow 0.14.1 -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6114) Datatypes are not preserved when a pandas dataframe partitioned and saved as parquet file using pyarrow
[ https://issues.apache.org/jira/browse/ARROW-6114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-6114: - Labels: dataset parquet (was: parquet) > Datatypes are not preserved when a pandas dataframe partitioned and saved as > parquet file using pyarrow > --- > > Key: ARROW-6114 > URL: https://issues.apache.org/jira/browse/ARROW-6114 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.14.1 > Environment: Python 3.7.3 > pyarrow 0.14.1 >Reporter: Naga >Priority: Major > Labels: dataset, parquet > > h3. Datatypes are not preserved when a pandas data frame is *partitioned* and > saved as parquet file using pyarrow but that's not the case when the data > frame is not partitioned. > *Case 1: Saving a partitioned dataset - Data Types are NOT preserved* > {code:java} > # Saving a Pandas Dataframe to Local as a partioned parquet file using pyarrow > import pandas as pd > df = pd.DataFrame( {'age': [77,32,234],'name':['agan','bbobby','test'] } > ) > path = 'test' > partition_cols=['age'] > print('Datatypes before saving the dataset') > print(df.dtypes) > table = pa.Table.from_pandas(df) > pq.write_to_dataset(table, path, partition_cols=partition_cols, > preserve_index=False) > # Loading a dataset partioned parquet dataset from local > df = pq.ParquetDataset(path, filesystem=None).read_pandas().to_pandas() > print('\nDatatypes after loading the dataset') > print(df.dtypes) > {code} > *Output:* > {code:java} > Datatypes before saving the dataset > age int64 > name object > dtype: object > Datatypes after loading the dataset > name object > age category > dtype: object > {code} > h5. {color:#d04437}From the above output, we could see that the data type for > age is int64 in the original pandas data frame but it got changed to category > when we saved to local and loaded back.{color} > *Case 2: Non-partitioned dataset - Data types are preserved* > {code:java} > import pandas as pd > print('Saving a Pandas Dataframe to Local as a parquet file without > partitioning using pyarrow') > df = pd.DataFrame( > {'age': [77,32,234],'name':['agan','bbobby','test'] } > ) > path = 'test_without_partition' > print('Datatypes before saving the dataset') > print(df.dtypes) > table = pa.Table.from_pandas(df) > pq.write_to_dataset(table, path, preserve_index=False) > # Loading a non-partioned parquet file from local > df = pq.ParquetDataset(path, filesystem=None).read_pandas().to_pandas() > print('\nDatatypes after loading the dataset') > print(df.dtypes) > {code} > *Output:* > {code:java} > Saving a Pandas Dataframe to Local as a parquet file without partitioning > using pyarrow > Datatypes before saving the dataset > age int64 > name object > dtype: object > Datatypes after loading the dataset > age int64 > name object > dtype: object > {code} > *Versions* > * Python 3.7.3 > * pyarrow 0.14.1 -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-6114) Datatypes are not preserved when a pandas dataframe partitioned and saved as parquet file using pyarrow
[ https://issues.apache.org/jira/browse/ARROW-6114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16898737#comment-16898737 ] Joris Van den Bossche commented on ARROW-6114: -- [~bnriiitb] thanks for opening the issue. So when a partitioned dataset is written, the partition columns are not stored in the actual data, but are part of the directory structure (in your case you would have "age=77", "age=32", etc. sub-folders). Currently, we don't save any metadata about the columns used to partition, and since they are also not stored in the actual parquet files (where a schema of the data is stored), we don't have that information from there either. So when reading a partitioned dataset, (py)arrow doesn't have much information about the type of this partition column. Currently, the logic is to try to convert the values to ints and otherwise leave them as strings, and then those values are converted to a Dictionary type (corresponding to the categorical type in pandas). This logic is here: https://github.com/apache/arrow/blob/06fd2da5e8e71b660e6eea4b7702ca175e31f3f5/python/pyarrow/parquet.py#L585-L609 There is currently no option to change this. So right now, the workaround is to convert the categorical back to an integer column in pandas. But longer term, we should maybe think about storing the type of the partition keys as metadata, and an option to restore it as a dictionary column or not. Related issues about the type of the partition column: ARROW-3388 (booleans as strings), ARROW-5666 (strings with underscores interpreted as int) > Datatypes are not preserved when a pandas dataframe partitioned and saved as > parquet file using pyarrow > --- > > Key: ARROW-6114 > URL: https://issues.apache.org/jira/browse/ARROW-6114 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.14.1 > Environment: Python 3.7.3 > pyarrow 0.14.1 >Reporter: Naga >Priority: Major > Labels: parquet > > h3. Datatypes are not preserved when a pandas data frame is *partitioned* and > saved as parquet file using pyarrow but that's not the case when the data > frame is not partitioned. > *Case 1: Saving a partitioned dataset - Data Types are NOT preserved* > {code:java} > # Saving a Pandas Dataframe to Local as a partioned parquet file using pyarrow > import pandas as pd > df = pd.DataFrame( {'age': [77,32,234],'name':['agan','bbobby','test'] } > ) > path = 'test' > partition_cols=['age'] > print('Datatypes before saving the dataset') > print(df.dtypes) > table = pa.Table.from_pandas(df) > pq.write_to_dataset(table, path, partition_cols=partition_cols, > preserve_index=False) > # Loading a dataset partioned parquet dataset from local > df = pq.ParquetDataset(path, filesystem=None).read_pandas().to_pandas() > print('\nDatatypes after loading the dataset') > print(df.dtypes) > {code} > *Output:* > {code:java} > Datatypes before saving the dataset > age int64 > name object > dtype: object > Datatypes after loading the dataset > name object > age category > dtype: object > {code} > h5. 
{color:#d04437}From the above output, we could see that the data type for > age is int64 in the original pandas data frame but it got changed to category > when we saved to local and loaded back.{color} > *Case 2: Non-partitioned dataset - Data types are preserved* > {code:java} > import pandas as pd > print('Saving a Pandas Dataframe to Local as a parquet file without > partitioning using pyarrow') > df = pd.DataFrame( > {'age': [77,32,234],'name':['agan','bbobby','test'] } > ) > path = 'test_without_partition' > print('Datatypes before saving the dataset') > print(df.dtypes) > table = pa.Table.from_pandas(df) > pq.write_to_dataset(table, path, preserve_index=False) > # Loading a non-partioned parquet file from local > df = pq.ParquetDataset(path, filesystem=None).read_pandas().to_pandas() > print('\nDatatypes after loading the dataset') > print(df.dtypes) > {code} > *Output:* > {code:java} > Saving a Pandas Dataframe to Local as a parquet file without partitioning > using pyarrow > Datatypes before saving the dataset > age int64 > name object > dtype: object > Datatypes after loading the dataset > age int64 > name object > dtype: object > {code} > *Versions* > * Python 3.7.3 > * pyarrow 0.14.1 -- This message was sent by Atlassian JIRA (v7.6.14#76016)
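A minimal sketch of the workaround described in the comment above, using the 'test' dataset path and 'age' column from the reporter's example; it assumes every partition value is a valid integer.
{code:python}
import pyarrow.parquet as pq

# Read the partitioned dataset; the partition column comes back as a
# pandas Categorical built from the directory names.
df = pq.ParquetDataset('test', filesystem=None).read_pandas().to_pandas()

# Workaround: cast the partition column back to its original integer dtype.
df['age'] = df['age'].astype('int64')
print(df.dtypes)  # age is int64 again
{code}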
[jira] [Commented] (ARROW-5480) [Python] Pandas categorical type doesn't survive a round-trip through parquet
[ https://issues.apache.org/jira/browse/ARROW-5480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16898725#comment-16898725 ] Joris Van den Bossche commented on ARROW-5480: -- {quote}One slightly higher level issue is the extent to which we store Arrow schema information in the Parquet metadata. {quote} Possibly related to ARROW-5888, where we also need to store arrow-specific metadata for a faithful roundtrip (in that case the timezone). Spark stores all column types (and optional column metadata) in the key_value_metadata of the FileMetadata. For example, for a file with a single int column: {code} >>> meta = pq.read_metadata('test_pyspark_dataset/_metadata') >>> meta.metadata {b'org.apache.spark.sql.parquet.row.metadata': b'{"type":"struct","fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}]}'} {code} > [Python] Pandas categorical type doesn't survive a round-trip through parquet > - > > Key: ARROW-5480 > URL: https://issues.apache.org/jira/browse/ARROW-5480 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Affects Versions: 0.11.1, 0.13.0 > Environment: python: 3.7.3.final.0 > python-bits: 64 > OS: Linux > OS-release: 5.0.0-15-generic > machine: x86_64 > processor: x86_64 > byteorder: little > pandas: 0.24.2 > numpy: 1.16.4 > pyarrow: 0.13.0 >Reporter: Karl Dunkle Werner >Priority: Minor > > Writing a string categorical variable from pandas to parquet is read back as > string (object dtype). I expected it to be read as category. > The same thing happens if the category is numeric -- a numeric category is > read back as int64. > In the code below, I tried out an in-memory arrow Table, which successfully > translates categories back to pandas. However, when I write to a parquet > file, it's not. > In the scheme of things, this isn't a big deal, but it's a small surprise. > {code:python} > import pandas as pd > import pyarrow as pa > df = pd.DataFrame({'x': pd.Categorical(['a', 'a', 'b', 'b'])}) > df.dtypes # category > # This works: > pa.Table.from_pandas(df).to_pandas().dtypes # category > df.to_parquet("categories.parquet") > # This reads back object, but I expected category > pd.read_parquet("categories.parquet").dtypes # object > # Numeric categories have the same issue: > df_num = pd.DataFrame({'x': pd.Categorical([1, 1, 2, 2])}) > df_num.dtypes # category > pa.Table.from_pandas(df_num).to_pandas().dtypes # category > df_num.to_parquet("categories_num.parquet") > # This reads back int64, but I expected category > pd.read_parquet("categories_num.parquet").dtypes # int64 > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
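Until something like the Spark approach above is implemented, a sketch of the usual pandas-side workaround for the issue quoted above; note it rebuilds the categories from the observed values, so categories unobserved in the data are not recovered.
{code:python}
import pandas as pd

df = pd.read_parquet("categories.parquet")

# The category dtype was lost in the parquet round-trip; restore it manually.
df['x'] = df['x'].astype('category')
print(df.dtypes)  # x is category again
{code}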
[jira] [Updated] (ARROW-6114) Datatypes are not preserved when a pandas dataframe partitioned and saved as parquet file using pyarrow
[ https://issues.apache.org/jira/browse/ARROW-6114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-6114: - Labels: parquet (was: ) > Datatypes are not preserved when a pandas dataframe partitioned and saved as > parquet file using pyarrow > --- > > Key: ARROW-6114 > URL: https://issues.apache.org/jira/browse/ARROW-6114 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.14.1 > Environment: Python 3.7.3 > pyarrow 0.14.1 >Reporter: Naga >Priority: Major > Labels: parquet > > h3. Datatypes are not preserved when a pandas data frame is *partitioned* and > saved as parquet file using pyarrow but that's not the case when the data > frame is not partitioned. > *Case 1: Saving a partitioned dataset - Data Types are NOT preserved* > {code:java} > # Saving a Pandas Dataframe to Local as a partioned parquet file using pyarrow > import pandas as pd > df = pd.DataFrame( {'age': [77,32,234],'name':['agan','bbobby','test'] } > ) > path = 'test' > partition_cols=['age'] > print('Datatypes before saving the dataset') > print(df.dtypes) > table = pa.Table.from_pandas(df) > pq.write_to_dataset(table, path, partition_cols=partition_cols, > preserve_index=False) > # Loading a dataset partioned parquet dataset from local > df = pq.ParquetDataset(path, filesystem=None).read_pandas().to_pandas() > print('\nDatatypes after loading the dataset') > print(df.dtypes) > {code} > *Output:* > {code:java} > Datatypes before saving the dataset > age int64 > name object > dtype: object > Datatypes after loading the dataset > name object > age category > dtype: object > {code} > h5. {color:#d04437}From the above output, we could see that the data type for > age is int64 in the original pandas data frame but it got changed to category > when we saved to local and loaded back.{color} > *Case 2: Non-partitioned dataset - Data types are preserved* > {code:java} > import pandas as pd > print('Saving a Pandas Dataframe to Local as a parquet file without > partitioning using pyarrow') > df = pd.DataFrame( > {'age': [77,32,234],'name':['agan','bbobby','test'] } > ) > path = 'test_without_partition' > print('Datatypes before saving the dataset') > print(df.dtypes) > table = pa.Table.from_pandas(df) > pq.write_to_dataset(table, path, preserve_index=False) > # Loading a non-partioned parquet file from local > df = pq.ParquetDataset(path, filesystem=None).read_pandas().to_pandas() > print('\nDatatypes after loading the dataset') > print(df.dtypes) > {code} > *Output:* > {code:java} > Saving a Pandas Dataframe to Local as a parquet file without partitioning > using pyarrow > Datatypes before saving the dataset > age int64 > name object > dtype: object > Datatypes after loading the dataset > age int64 > name object > dtype: object > {code} > *Versions* > * Python 3.7.3 > * pyarrow 0.14.1 -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6115) [Python] support LargeList, LargeString, LargeBinary in conversion to pandas
Joris Van den Bossche created ARROW-6115: Summary: [Python] support LargeList, LargeString, LargeBinary in conversion to pandas Key: ARROW-6115 URL: https://issues.apache.org/jira/browse/ARROW-6115 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche General Python support for these 3 new types has been added (ARROW-6000, ARROW-6084). However, one aspect that is not yet implemented is conversion to pandas (or numpy array): {code} In [67]: a = pa.array(['a', 'b', 'c'], pa.large_string()) In [68]: a.to_pandas() ... ArrowNotImplementedError: large_utf8 In [69]: pa.table({'a': a}).to_pandas() ... ArrowNotImplementedError: No known equivalent Pandas block for Arrow data of type large_string is known. {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
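Until that conversion is implemented, a slow but working fallback (assuming the Python-object conversion added in ARROW-6000/ARROW-6084 covers {{to_pylist}}) is to materialize the values as Python objects first:
{code:python}
import pandas as pd
import pyarrow as pa

a = pa.array(['a', 'b', 'c'], pa.large_string())

# a.to_pandas() raises ArrowNotImplementedError for large_string, so
# go through a Python list instead (copies everything, but works).
s = pd.Series(a.to_pylist())
{code}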
[jira] [Commented] (ARROW-6106) Scala lang support
[ https://issues.apache.org/jira/browse/ARROW-6106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16898673#comment-16898673 ] Boris V.Kuznetsov commented on ARROW-6106: -- You may run those tests from my integration project: {{https://github.com/Neurodyne/apache-arrow-parquet}} > Scala lang support > -- > > Key: ARROW-6106 > URL: https://issues.apache.org/jira/browse/ARROW-6106 > Project: Apache Arrow > Issue Type: Wish >Reporter: Boris V.Kuznetsov >Priority: Major > > I ported testArrowStream.java to Scala Specs2 and added it to the PR. > Please see more details in my [PR|https://github.com/apache/arrow/pull/4989] > I'm ready to port other tests as well and add an SBT file. > -- This message was sent by Atlassian JIRA (v7.6.14#76016)