[jira] [Commented] (ARROW-6282) Support lossy compression
[ https://issues.apache.org/jira/browse/ARROW-6282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16909870#comment-16909870 ]

Micah Kornfield commented on ARROW-6282:

[~domoritz] it is definitely worth discussing your implementation plans/design on the ML before getting too far, especially if this will require changes to the IPC specification.

> Support lossy compression
>
> Key: ARROW-6282
> URL: https://issues.apache.org/jira/browse/ARROW-6282
> Project: Apache Arrow
> Issue Type: New Feature
> Reporter: Dominik Moritz
> Priority: Major
>
> Arrow dataframes with large columns of integers or floats can be compressed
> using gzip or brotli. However, in some cases it will be acceptable to compress
> the data lossily to achieve even higher compression ratios. The main use case
> for this is visualization, where small inaccuracies matter less.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
[jira] [Resolved] (ARROW-6267) [Ruby] Add Arrow::Time for Arrow::Time{32,64}DataType value
[ https://issues.apache.org/jira/browse/ARROW-6267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yosuke Shiro resolved ARROW-6267.
Resolution: Fixed
Fix Version/s: 0.15.0

Issue resolved by pull request 5102
[https://github.com/apache/arrow/pull/5102]

> [Ruby] Add Arrow::Time for Arrow::Time{32,64}DataType value
>
> Key: ARROW-6267
> URL: https://issues.apache.org/jira/browse/ARROW-6267
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Ruby
> Reporter: Sutou Kouhei
> Assignee: Sutou Kouhei
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.15.0
> Time Spent: 40m
> Remaining Estimate: 0h
[jira] [Commented] (ARROW-6282) Support lossy compression
[ https://issues.apache.org/jira/browse/ARROW-6282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16909851#comment-16909851 ]

Brian Hulette commented on ARROW-6282:

Great idea! I think right now we only support compressing entire record batches; to make this work we would need buffer-level compression, so that we could compress just the floating-point buffers. [~emkornfi...@gmail.com] did write up a proposal that included buffer-level compression, among other things: [strawman PR|https://github.com/apache/arrow/pull/4815], [ML discussion|https://lists.apache.org/thread.html/a99124e57c14c3c9ef9d98f3c80cfe1dd25496bf3ff7046778add937@%3Cdev.arrow.apache.org%3E]

> Support lossy compression
>
> Key: ARROW-6282
> URL: https://issues.apache.org/jira/browse/ARROW-6282
> Project: Apache Arrow
> Issue Type: New Feature
> Reporter: Dominik Moritz
> Priority: Major
[jira] [Resolved] (ARROW-6270) [C++][Fuzzing] IPC reads do not check buffer indices
[ https://issues.apache.org/jira/browse/ARROW-6270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney resolved ARROW-6270.
Resolution: Fixed
Fix Version/s: 0.15.0

Issue resolved by pull request 5105
[https://github.com/apache/arrow/pull/5105]

> [C++][Fuzzing] IPC reads do not check buffer indices
>
> Key: ARROW-6270
> URL: https://issues.apache.org/jira/browse/ARROW-6270
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Reporter: Marco Neumann
> Assignee: Marco Neumann
> Priority: Major
> Labels: fuzzer, pull-request-available
> Fix For: 0.15.0
> Attachments: crash-bd7e00178af2d236fdf041fcc1fb30975bf8fbca
> Time Spent: 40m
> Remaining Estimate: 0h
>
> The attached crash was found by {{arrow-ipc-fuzzing-test}} and indicates that
> the IPC reader does not check the flatbuffer-encoded buffers for length and
> can produce out-of-bounds reads.
[jira] [Resolved] (ARROW-5085) [Python/C++] Conversion of dict encoded null column fails in parquet writing when using RowGroups
[ https://issues.apache.org/jira/browse/ARROW-5085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney resolved ARROW-5085.
Resolution: Fixed
Fix Version/s: 0.15.0

Issue resolved by pull request 5107
[https://github.com/apache/arrow/pull/5107]

> [Python/C++] Conversion of dict encoded null column fails in parquet writing
> when using RowGroups
>
> Key: ARROW-5085
> URL: https://issues.apache.org/jira/browse/ARROW-5085
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Affects Versions: 0.13.0
> Reporter: Florian Jetter
> Assignee: Wes McKinney
> Priority: Minor
> Labels: parquet, pull-request-available
> Fix For: 0.15.0
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> Conversion of a dict-encoded null column fails in Parquet writing when using
> row groups:
> {code:python}
> import pyarrow.parquet as pq
> import pandas as pd
> import pyarrow as pa
>
> df = pd.DataFrame({"col": [None] * 100, "int": [1.0] * 100})
> df = df.astype({"col": "category"})
> table = pa.Table.from_pandas(df)
> buf = pa.BufferOutputStream()
> pq.write_table(
>     table,
>     buf,
>     version="2.0",
>     chunk_size=10,
> )
> {code}
> fails with
> {{pyarrow.lib.ArrowIOError: Column 2 had 100 while previous column had 10}}
[jira] [Resolved] (ARROW-5028) [Python][C++] Creating list with pyarrow.array can overflow child builder
[ https://issues.apache.org/jira/browse/ARROW-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney resolved ARROW-5028.
Resolution: Fixed

Issue resolved by pull request 5108
[https://github.com/apache/arrow/pull/5108]

> [Python][C++] Creating list with pyarrow.array can overflow child builder
>
> Key: ARROW-5028
> URL: https://issues.apache.org/jira/browse/ARROW-5028
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.11.1, 0.13.0
> Environment: python 3.6
> Reporter: Marco Neumann
> Assignee: Wes McKinney
> Priority: Major
> Labels: parquet, pull-request-available
> Fix For: 0.15.0
> Attachments: dct.json.gz, dct.pickle.gz
> Time Spent: 40m
> Remaining Estimate: 0h
>
> I am sorry if this bug report feels rather long and the reproduction data is
> large, but I was not able to reduce the data any further while still
> triggering the problem. I was able to trigger this behavior on master and on
> {{0.11.1}}.
> {code:python}
> import io
> import os.path
> import pickle
>
> import numpy as np
> import pyarrow as pa
> import pyarrow.parquet as pq
>
>
> def dct_to_table(index_dct):
>     labeled_array = pa.array(np.array(list(index_dct.keys())))
>     partition_array = pa.array(np.array(list(index_dct.values())))
>     return pa.Table.from_arrays(
>         [labeled_array, partition_array], names=['a', 'b']
>     )
>
>
> def check_pq_nulls(data):
>     fp = io.BytesIO(data)
>     pfile = pq.ParquetFile(fp)
>     assert pfile.num_row_groups == 1
>     md = pfile.metadata.row_group(0)
>     col = md.column(1)
>     assert col.path_in_schema == 'b.list.item'
>     assert col.statistics.null_count == 0  # fails
>
>
> def roundtrip(table):
>     buf = pa.BufferOutputStream()
>     pq.write_table(table, buf)
>     data = buf.getvalue().to_pybytes()
>     # this fails:
>     # check_pq_nulls(data)
>     reader = pa.BufferReader(data)
>     return pq.read_table(reader)
>
>
> with open(os.path.join(os.path.dirname(__file__), 'dct.pickle'), 'rb') as fp:
>     dct = pickle.load(fp)
>
> # this does NOT help:
> # pa.set_cpu_count(1)
> # import gc; gc.disable()
>
> table = dct_to_table(dct)
>
> # this fixes the issue:
> # table = pa.Table.from_pandas(table.to_pandas())
>
> table2 = roundtrip(table)
>
> assert table.column('b').null_count == 0
> assert table2.column('b').null_count == 0  # fails
> {code}
> If table2 is converted to pandas, you can also observe that some values at
> the end of column b are {{['']}}, which is clearly not present in the
> original data.
> I would also be thankful for any pointers on where the bug comes from or on
> how to reduce the test case.
[jira] [Updated] (ARROW-6287) [Rust] [DataFusion] Refactor TableProvider to return thread-safe BatchIterator
[ https://issues.apache.org/jira/browse/ARROW-6287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-6287:
Labels: pull-request-available (was: )

> [Rust] [DataFusion] Refactor TableProvider to return thread-safe BatchIterator
>
> Key: ARROW-6287
> URL: https://issues.apache.org/jira/browse/ARROW-6287
> Project: Apache Arrow
> Issue Type: Sub-task
> Components: Rust, Rust - DataFusion
> Reporter: Andy Grove
> Assignee: Andy Grove
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.15.0
>
> This refactor is a step towards implementing the new query execution that
> supports partitions and parallel execution.
[jira] [Created] (ARROW-6287) [Rust] [DataFusion] Refactor TableProvider to return thread-safe BatchIterator
Andy Grove created ARROW-6287:

Summary: [Rust] [DataFusion] Refactor TableProvider to return thread-safe BatchIterator
Key: ARROW-6287
URL: https://issues.apache.org/jira/browse/ARROW-6287
Project: Apache Arrow
Issue Type: Sub-task
Components: Rust, Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove
Fix For: 0.15.0

This refactor is a step towards implementing the new query execution that supports partitions and parallel execution.
[jira] [Commented] (ARROW-6282) Support lossy compression
[ https://issues.apache.org/jira/browse/ARROW-6282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16909778#comment-16909778 ]

Martin Radev commented on ARROW-6282:

Hello Dominik, are you going to work on this new feature?

I actually already began working on this feature, though not directly for Arrow. In particular, my work focuses on investigating, designing, proposing and implementing an extension to Apache Parquet to support lossy and lossless floating-point compression. My initial report can be read here: [https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view?usp=sharing]

I investigated two lossy compressors: ZFP and SZ. I concluded that, despite SZ's better compression ratio, it cannot be introduced to Parquet since the implementation is not mature enough: the API is poorly designed, it is not thread-safe, and I observed two segfaults locally. The developers have also been slow to respond. For example, this issue I opened has not led to any discussion: [https://github.com/disheng222/SZ/issues/29]

ZFP seems to be the safer choice for use in Parquet. There is still some open design work on how the user should specify which data may be discarded; in particular, it should be designed in such a way that other lossy compressors can be added in the future. These are the error modes I have observed in my investigation: absolute error, relative error, and number of mantissa bits to discard.

Can you please share what stage you are currently at in working on this feature? I think we can collaborate.

> Support lossy compression
>
> Key: ARROW-6282
> URL: https://issues.apache.org/jira/browse/ARROW-6282
> Project: Apache Arrow
> Issue Type: New Feature
> Reporter: Dominik Moritz
> Priority: Major
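Of the error modes listed above, the "number of mantissa bits to discard" mode is the simplest to sketch. The helper below is purely illustrative (an invented function, not ZFP's or SZ's API): zeroing low mantissa bits bounds the relative error while producing long runs of zero bits that a subsequent lossless codec compresses far better.

```python
import numpy as np

def truncate_mantissa(values, bits_to_drop):
    """Zero the low `bits_to_drop` mantissa bits of a float64 array.

    The result differs from the input by a relative error of less than
    2 ** (bits_to_drop - 52), but the zeroed bits make the bytes much
    more compressible for a lossless codec applied afterwards.
    """
    assert 0 <= bits_to_drop < 52  # float64 has a 52-bit mantissa
    mask = ~np.uint64((1 << bits_to_drop) - 1)
    # Reinterpret the float bits as integers, clear the low bits, reinterpret back.
    return (values.view(np.uint64) & mask).view(np.float64)

x = np.linspace(0.0, 1.0, 1000)
y = truncate_mantissa(x, 40)
# Dropping 40 of 52 mantissa bits bounds the relative error by 2**-12.
assert np.all(np.abs(y - x) <= np.abs(x) * 2.0 ** -12)
```

Absolute-error and relative-error modes would pick `bits_to_drop` per value (or per block) from the data's exponent range rather than taking it as a fixed parameter.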
[jira] [Created] (ARROW-6286) [GLib] Add support for LargeList type
Yosuke Shiro created ARROW-6286:

Summary: [GLib] Add support for LargeList type
Key: ARROW-6286
URL: https://issues.apache.org/jira/browse/ARROW-6286
Project: Apache Arrow
Issue Type: New Feature
Components: GLib
Reporter: Yosuke Shiro
Assignee: Yosuke Shiro
[jira] [Updated] (ARROW-6285) [GLib] Add support for LargeBinary and LargeString types
[ https://issues.apache.org/jira/browse/ARROW-6285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-6285:
Labels: pull-request-available (was: )

> [GLib] Add support for LargeBinary and LargeString types
>
> Key: ARROW-6285
> URL: https://issues.apache.org/jira/browse/ARROW-6285
> Project: Apache Arrow
> Issue Type: New Feature
> Components: GLib
> Reporter: Yosuke Shiro
> Assignee: Yosuke Shiro
> Priority: Major
> Labels: pull-request-available
[jira] [Created] (ARROW-6285) [GLib] Add support for LargeBinary and LargeString types
Yosuke Shiro created ARROW-6285:

Summary: [GLib] Add support for LargeBinary and LargeString types
Key: ARROW-6285
URL: https://issues.apache.org/jira/browse/ARROW-6285
Project: Apache Arrow
Issue Type: New Feature
Components: GLib
Reporter: Yosuke Shiro
Assignee: Yosuke Shiro
[jira] [Updated] (ARROW-6284) [C++] Allow references in std::tuple when converting tuple to arrow array
[ https://issues.apache.org/jira/browse/ARROW-6284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-6284:
Labels: pull-request-available (was: )

> [C++] Allow references in std::tuple when converting tuple to arrow array
>
> Key: ARROW-6284
> URL: https://issues.apache.org/jira/browse/ARROW-6284
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Omer Ozarslan
> Priority: Minor
> Labels: pull-request-available
>
> This allows using std::tuple (e.g. std::tie) to convert user data types. More
> details will be provided in the PR.
[jira] [Created] (ARROW-6284) [C++] Allow references in std::tuple when converting tuple to arrow array
Omer Ozarslan created ARROW-6284:

Summary: [C++] Allow references in std::tuple when converting tuple to arrow array
Key: ARROW-6284
URL: https://issues.apache.org/jira/browse/ARROW-6284
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Omer Ozarslan

This allows using std::tuple (e.g. std::tie) to convert user data types. More details will be provided in the PR.
[jira] [Updated] (ARROW-6101) [Rust] [DataFusion] Create physical plan from logical plan
[ https://issues.apache.org/jira/browse/ARROW-6101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-6101:
Labels: pull-request-available (was: )

> [Rust] [DataFusion] Create physical plan from logical plan
>
> Key: ARROW-6101
> URL: https://issues.apache.org/jira/browse/ARROW-6101
> Project: Apache Arrow
> Issue Type: Sub-task
> Reporter: Andy Grove
> Assignee: Andy Grove
> Priority: Major
> Labels: pull-request-available
>
> Once the physical plan is in place and can be executed, I will implement
> logic to convert the logical plan to a physical plan and remove the legacy
> code for directly executing a logical plan.
[jira] [Created] (ARROW-6283) [Rust] [DataFusion] Implement operator to write query results to partitioned CSV
Andy Grove created ARROW-6283:

Summary: [Rust] [DataFusion] Implement operator to write query results to partitioned CSV
Key: ARROW-6283
URL: https://issues.apache.org/jira/browse/ARROW-6283
Project: Apache Arrow
Issue Type: Sub-task
Components: Rust, Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove
[jira] [Updated] (ARROW-5227) [Rust] [DataFusion] Re-implement query execution with an extensible physical query plan
[ https://issues.apache.org/jira/browse/ARROW-5227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Grove updated ARROW-5227:
Summary: [Rust] [DataFusion] Re-implement query execution with an extensible physical query plan (was: [Rust] [DataFusion] Implement parallel query execution)

> [Rust] [DataFusion] Re-implement query execution with an extensible physical
> query plan
>
> Key: ARROW-5227
> URL: https://issues.apache.org/jira/browse/ARROW-5227
> Project: Apache Arrow
> Issue Type: New Feature
> Components: Rust, Rust - DataFusion
> Reporter: Andy Grove
> Assignee: Andy Grove
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.0.0
> Time Spent: 1h
> Remaining Estimate: 0h
>
> This story (maybe it should have been an epic with hindsight) is to
> re-implement query execution in DataFusion using a physical plan that
> supports partitions and parallel execution.
> This will replace the current query execution which happens directly from the
> logical plan.
> The new physical plan is based on traits and is therefore extensible by other
> projects that use Arrow. For example, another project could add physical
> plans for distributed compute.
> See design doc at
> [https://docs.google.com/document/d/1ATZGIs8ry_kJeoTgmJjLrg6Ssb5VE7lNzWuz_4p6EWk/edit?usp=sharing]
> for more info
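The design described above (a plan exposes partitions; each partition executes independently) can be sketched outside Rust as well. The following is a rough, hypothetical Python stand-in for the trait-based physical plan, with invented names; it is not DataFusion's actual API:

```python
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor

class PhysicalPlan(ABC):
    """Stand-in for the physical-plan trait: a plan knows its partitions."""

    @abstractmethod
    def partitions(self):
        """Return the list of partitions this plan produces."""

class Partition(ABC):
    @abstractmethod
    def execute(self):
        """Run this partition independently and return its record batches."""

class CsvScan(PhysicalPlan):
    """Hypothetical leaf node: one partition per CSV file."""

    def __init__(self, paths):
        self.paths = paths

    def partitions(self):
        return [CsvPartition(p) for p in self.paths]

class CsvPartition(Partition):
    def __init__(self, path):
        self.path = path

    def execute(self):
        # A real implementation would read record batches from the file.
        return [f"batches from {self.path}"]

# Parallel execution: run each partition on its own worker.
plan = CsvScan(["a.csv", "b.csv", "c.csv"])
with ThreadPoolExecutor() as pool:
    results = list(pool.map(lambda p: p.execute(), plan.partitions()))
```

Because the plan is an abstract interface, other projects can plug in their own nodes (e.g. for distributed compute) without changes to the core engine, which is the extensibility point the description emphasizes.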
[jira] [Updated] (ARROW-5227) [Rust] [DataFusion] Implement parallel query execution
[ https://issues.apache.org/jira/browse/ARROW-5227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Grove updated ARROW-5227:
Description:
This story (maybe it should have been an epic with hindsight) is to re-implement query execution in DataFusion using a physical plan that supports partitions and parallel execution.
This will replace the current query execution which happens directly from the logical plan.
The new physical plan is based on traits and is therefore extensible by other projects that use Arrow. For example, another project could add physical plans for distributed compute.
See design doc at [https://docs.google.com/document/d/1ATZGIs8ry_kJeoTgmJjLrg6Ssb5VE7lNzWuz_4p6EWk/edit?usp=sharing] for more info

was:
Implement parallel query execution to take advantage of multiple cores when running queries.
See design doc at [https://docs.google.com/document/d/1ATZGIs8ry_kJeoTgmJjLrg6Ssb5VE7lNzWuz_4p6EWk/edit?usp=sharing] for more info

> [Rust] [DataFusion] Implement parallel query execution
>
> Key: ARROW-5227
> URL: https://issues.apache.org/jira/browse/ARROW-5227
> Project: Apache Arrow
> Issue Type: New Feature
> Components: Rust, Rust - DataFusion
> Reporter: Andy Grove
> Assignee: Andy Grove
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.0.0
> Time Spent: 1h
> Remaining Estimate: 0h
[jira] [Updated] (ARROW-5227) [Rust] [DataFusion] Implement parallel query execution
[ https://issues.apache.org/jira/browse/ARROW-5227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Grove updated ARROW-5227:
Description:
Implement parallel query execution to take advantage of multiple cores when running queries.
See design doc at [https://docs.google.com/document/d/1ATZGIs8ry_kJeoTgmJjLrg6Ssb5VE7lNzWuz_4p6EWk/edit?usp=sharing] for more info

was:
Implement parallel query execution to take advantage of multiple cores when running queries.
See design doc at [https://docs.google.com/document/d/1ATZGIs8ry_kJeoTgmJjLrg6Ssb5VE7lNzWuz_4p6EWk/edit?usp=sharing] for more info

> [Rust] [DataFusion] Implement parallel query execution
>
> Key: ARROW-5227
> URL: https://issues.apache.org/jira/browse/ARROW-5227
> Project: Apache Arrow
> Issue Type: New Feature
> Components: Rust, Rust - DataFusion
> Reporter: Andy Grove
> Assignee: Andy Grove
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.0.0
> Time Spent: 1h
> Remaining Estimate: 0h
[jira] [Commented] (ARROW-4588) [JS] add logging
[ https://issues.apache.org/jira/browse/ARROW-4588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16909640#comment-16909640 ]

Dominik Moritz commented on ARROW-4588:

I don't think we have logging set up yet.

> [JS] add logging
>
> Key: ARROW-4588
> URL: https://issues.apache.org/jira/browse/ARROW-4588
> Project: Apache Arrow
> Issue Type: New Feature
> Components: JavaScript
> Reporter: Dominik Moritz
> Priority: Major
>
> As discussed in https://github.com/apache/arrow/pull/3634, the JavaScript
> library will need some logging infrastructure. The goal for this
> implementation is a lightweight logger that can easily be configured not to
> write to the console.
[jira] [Created] (ARROW-6282) Support lossy compression
Dominik Moritz created ARROW-6282:

Summary: Support lossy compression
Key: ARROW-6282
URL: https://issues.apache.org/jira/browse/ARROW-6282
Project: Apache Arrow
Issue Type: New Feature
Reporter: Dominik Moritz

Arrow dataframes with large columns of integers or floats can be compressed using gzip or brotli. However, in some cases it will be acceptable to compress the data lossily to achieve even higher compression ratios. The main use case for this is visualization, where small inaccuracies matter less.