[jira] [Created] (ARROW-3258) [GLib] CI is failed on macOS
Kouhei Sutou created ARROW-3258:
-----------------------------------

Summary: [GLib] CI is failed on macOS
Key: ARROW-3258
URL: https://issues.apache.org/jira/browse/ARROW-3258
Project: Apache Arrow
Issue Type: Improvement
Components: GLib
Affects Versions: 0.10.0
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou

{code}
==> Installing postgis dependency: numpy
==> Downloading https://homebrew.bintray.com/bottles/numpy-1.15.1.sierra.bottle.tar.gz
==> Pouring numpy-1.15.1.sierra.bottle.tar.gz
Error: The `brew link` step did not complete successfully
The formula built, but is not symlinked into /usr/local
Could not symlink lib/python2.7/site-packages/numpy/__config__.py
Target /usr/local/lib/python2.7/site-packages/numpy/__config__.py already exists.
You may want to remove it:
  rm '/usr/local/lib/python2.7/site-packages/numpy/__config__.py'

To force the link and overwrite all conflicting files:
  brew link --overwrite numpy

To list all files that would be deleted:
  brew link --overwrite --dry-run numpy
{code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
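The brew output above already names the workaround. One way to apply it on the Travis macOS workers would be a build-configuration step like the following sketch (the `before_install` placement and the `|| true` guard are illustrative assumptions, not the committed fix):

```yaml
# .travis.yml (sketch): pre-empt the numpy link conflict reported above by
# letting Homebrew overwrite the files left behind by a previously installed
# numpy before the postgis install runs.
before_install:
  - if [ "$TRAVIS_OS_NAME" = "osx" ]; then brew link --overwrite numpy || true; fi
```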
[jira] [Updated] (ARROW-3257) [C++] Stop using IMPORTED_LINK_INTERFACE_LIBRARIES
[ https://issues.apache.org/jira/browse/ARROW-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-3257:
----------------------------------
    Labels: pull-request-available  (was: )

> [C++] Stop using IMPORTED_LINK_INTERFACE_LIBRARIES
> --------------------------------------------------
>
>                 Key: ARROW-3257
>                 URL: https://issues.apache.org/jira/browse/ARROW-3257
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>    Affects Versions: 0.10.0
>            Reporter: Kouhei Sutou
>            Assignee: Kouhei Sutou
>            Priority: Minor
>              Labels: pull-request-available
>
> It's deprecated in CMake 3.2, which is our minimum required version:
> https://cmake.org/cmake/help/v3.2/prop_tgt/IMPORTED_LINK_INTERFACE_LIBRARIES.html
> The documentation says that we should use INTERFACE_LINK_LIBRARIES instead:
> https://cmake.org/cmake/help/v3.2/prop_tgt/INTERFACE_LINK_LIBRARIES.html
[jira] [Created] (ARROW-3257) [C++] Stop using IMPORTED_LINK_INTERFACE_LIBRARIES
Kouhei Sutou created ARROW-3257:
-----------------------------------

Summary: [C++] Stop using IMPORTED_LINK_INTERFACE_LIBRARIES
Key: ARROW-3257
URL: https://issues.apache.org/jira/browse/ARROW-3257
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Affects Versions: 0.10.0
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou

It's deprecated in CMake 3.2, which is our minimum required version:
https://cmake.org/cmake/help/v3.2/prop_tgt/IMPORTED_LINK_INTERFACE_LIBRARIES.html

The documentation says that we should use INTERFACE_LINK_LIBRARIES instead:
https://cmake.org/cmake/help/v3.2/prop_tgt/INTERFACE_LINK_LIBRARIES.html
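The migration the issue asks for amounts to setting INTERFACE_LINK_LIBRARIES instead of the deprecated per-target property when declaring imported libraries. A minimal sketch (target and variable names here are illustrative, not Arrow's actual CMake code):

```cmake
# Declare an imported static library, as Arrow's build does for its
# bundled third-party dependencies.
add_library(thirdparty_lib STATIC IMPORTED)
set_target_properties(thirdparty_lib PROPERTIES
  IMPORTED_LOCATION "${THIRDPARTY_STATIC_LIB}")

# Deprecated since CMake 3.2:
#   set_target_properties(thirdparty_lib PROPERTIES
#     IMPORTED_LINK_INTERFACE_LIBRARIES "${THIRDPARTY_DEPS}")

# Preferred replacement:
set_target_properties(thirdparty_lib PROPERTIES
  INTERFACE_LINK_LIBRARIES "${THIRDPARTY_DEPS}")
```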
[jira] [Updated] (ARROW-3256) [JS] File footer and message metadata is inconsistent
[ https://issues.apache.org/jira/browse/ARROW-3256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-3256:
--------------------------------
    Description:
I added some assertions to the C++ library and found that the body length in the file footer and the IPC message were different

{code}
##
JS producing, C++ consuming
##

==
Testing file /home/travis/build/apache/arrow/integration/data/struct_example.json
==
-- Creating binary inputs
node --no-warnings /home/travis/build/apache/arrow/js/bin/json-to-arrow.js -a /tmp/tmplbm3vbwz/3d2269c960f148b6b94e5f881c0bf9ca_struct_example.json_to_arrow -j /home/travis/build/apache/arrow/integration/data/struct_example.json
-- Validating file
/home/travis/build/apache/arrow/cpp-build/debug/json-integration-test --integration --arrow=/tmp/tmplbm3vbwz/3d2269c960f148b6b94e5f881c0bf9ca_struct_example.json_to_arrow --json=/home/travis/build/apache/arrow/integration/data/struct_example.json --mode=VALIDATE
Command failed: /home/travis/build/apache/arrow/cpp-build/debug/json-integration-test --integration --arrow=/tmp/tmplbm3vbwz/3d2269c960f148b6b94e5f881c0bf9ca_struct_example.json_to_arrow --json=/home/travis/build/apache/arrow/integration/data/struct_example.json --mode=VALIDATE
With output:
--
/home/travis/build/apache/arrow/cpp/src/arrow/ipc/reader.cc:581 Check failed: (message->body_length()) == (block.body_length)
{code}

I'm not sure what's wrong.
I'll remove the assertions for now

    was:
I added some assertions to the C++ library and found that the body length in the file footer and the IPC message were different

{code}
##
JS producing, C++ consuming
##

==
Testing file /home/travis/build/apache/arrow/integration/data/struct_example.json
==
-- Creating binary inputs
node --no-warnings /home/travis/build/apache/arrow/js/bin/json-to-arrow.js -a /tmp/tmplbm3vbwz/3d2269c960f148b6b94e5f881c0bf9ca_struct_example.json_to_arrow -j /home/travis/build/apache/arrow/integration/data/struct_example.json
-- Validating file
/home/travis/build/apache/arrow/cpp-build/debug/json-integration-test --integration --arrow=/tmp/tmplbm3vbwz/3d2269c960f148b6b94e5f881c0bf9ca_struct_example.json_to_arrow --json=/home/travis/build/apache/arrow/integration/data/struct_example.json --mode=VALIDATE
Command failed: /home/travis/build/apache/arrow/cpp-build/debug/json-integration-test --integration --arrow=/tmp/tmplbm3vbwz/3d2269c960f148b6b94e5f881c0bf9ca_struct_example.json_to_arrow --json=/home/travis/build/apache/arrow/integration/data/struct_example.json --mode=VALIDATE
With output:
--
/home/travis/build/apache/arrow/cpp/src/arrow/ipc/reader.cc:581 Check failed: (message->body_length()) == (block.body_length)
{code}

It appears that the order of the lengths is flipped in https://github.com/apache/arrow/blob/master/js/src/ipc/writer/binary.ts#L77
[jira] [Created] (ARROW-3256) [JS] File footer and message metadata is inconsistent
Wes McKinney created ARROW-3256:
---------------------------------

Summary: [JS] File footer and message metadata is inconsistent
Key: ARROW-3256
URL: https://issues.apache.org/jira/browse/ARROW-3256
Project: Apache Arrow
Issue Type: Bug
Components: JavaScript
Reporter: Wes McKinney
Fix For: JS-0.4.0

I added some assertions to the C++ library and found that the body length in the file footer and the IPC message were different (the failing integration run is quoted above). It appears that the order of the lengths is flipped in https://github.com/apache/arrow/blob/master/js/src/ipc/writer/binary.ts#L77
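The invariant the C++ check in reader.cc enforces can be sketched outside Arrow as follows (the record type and function here are hypothetical stand-ins, not Arrow's actual classes): each block recorded in the file footer must agree with the body length written in the corresponding IPC message header, so building a footer block with the two length fields swapped trips the check.

```python
from collections import namedtuple

# Hypothetical stand-in for a record-batch block in the file footer.
FileBlock = namedtuple('FileBlock', ['offset', 'metadata_length', 'body_length'])

def validate_footer(blocks, message_body_lengths):
    """Mirror the reader.cc assertion: footer and message headers must agree."""
    for block, body_length in zip(blocks, message_body_lengths):
        if block.body_length != body_length:
            raise ValueError(
                f"footer says body_length={block.body_length}, "
                f"message says {body_length}")

# A consistent footer passes; one built with the lengths swapped does not.
good = [FileBlock(offset=8, metadata_length=256, body_length=1024)]
validate_footer(good, [1024])  # no error
swapped = [FileBlock(offset=8, metadata_length=1024, body_length=256)]
# validate_footer(swapped, [1024]) raises ValueError
```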
[jira] [Updated] (ARROW-3196) Enable merge_arrow_pr.py script to merge Parquet patches and set fix versions
[ https://issues.apache.org/jira/browse/ARROW-3196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-3196:
----------------------------------
    Labels: pull-request-available  (was: )

> Enable merge_arrow_pr.py script to merge Parquet patches and set fix versions
> -----------------------------------------------------------------------------
>
>                 Key: ARROW-3196
>                 URL: https://issues.apache.org/jira/browse/ARROW-3196
>             Project: Apache Arrow
>          Issue Type: New Feature
>            Reporter: Wes McKinney
>            Assignee: Wes McKinney
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.11.0
>
> Follow up to ARROW-3075
[jira] [Resolved] (ARROW-3251) [C++] Conversion warnings in cast.cc
[ https://issues.apache.org/jira/browse/ARROW-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-3251. - Resolution: Fixed Fix Version/s: 0.11.0 Issue resolved by pull request 2575 [https://github.com/apache/arrow/pull/2575] > [C++] Conversion warnings in cast.cc > > > Key: ARROW-3251 > URL: https://issues.apache.org/jira/browse/ARROW-3251 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 0.11.0 > > Time Spent: 20m > Remaining Estimate: 0h > > This is with gcc 7.3.0 and {{-Wconversion}}. > {code} > ../src/arrow/compute/kernels/cast.cc: In instantiation of ‘void > arrow::compute::CastFunctor std::enable_if I>::value>::type>::operator()(arrow::compute::FunctionContext*, const > arrow::compute::CastOptions&, const arrow::ArrayData&, arrow::ArrayData*) > [with O = arrow::Int64Type; I = arrow::DoubleType; typename > std::enable_if::value>::type = void]’: > ../src/arrow/compute/kernels/cast.cc:1105:1: required from here > ../src/arrow/compute/kernels/cast.cc:291:45: warning: conversion to ‘in_type > {aka double}’ from ‘long int’ may alter its value [-Wconversion] >if (ARROW_PREDICT_FALSE(out_value != *in_data)) { >~~^ > ../src/arrow/util/macros.h:37:50: note: in definition of macro > ‘ARROW_PREDICT_FALSE’ > #define ARROW_PREDICT_FALSE(x) (__builtin_expect(x, 0)) > ^ > ../src/arrow/compute/kernels/cast.cc:301:45: warning: conversion to ‘in_type > {aka double}’ from ‘long int’ may alter its value [-Wconversion] >if (ARROW_PREDICT_FALSE(out_value != *in_data)) { >~~^ > ../src/arrow/util/macros.h:37:50: note: in definition of macro > ‘ARROW_PREDICT_FALSE’ > #define ARROW_PREDICT_FALSE(x) (__builtin_expect(x, 0)) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
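The code the warning points at guards float-to-integer casts with a round-trip comparison; the gist, in plain NumPy rather than Arrow's templated C++, is below. In the C++ version, `-Wconversion` fires because comparing the `int64` result against the `double` input converts the `int64` back to `double` implicitly; the fix is simply to make that conversion explicit.

```python
import numpy as np

def checked_cast_to_int64(values):
    """Cast float64 values to int64, rejecting any value the cast would alter.

    This mirrors the `out_value != *in_data` round-trip check in cast.cc:
    cast forward, cast back, and compare against the original input.
    """
    out = values.astype(np.int64)
    if not np.array_equal(out.astype(np.float64), values):
        raise ValueError("cast from float64 to int64 would alter a value")
    return out

checked_cast_to_int64(np.array([1.0, 2.0]))   # round-trips exactly, succeeds
# checked_cast_to_int64(np.array([1.5])) raises ValueError: 1.5 -> 1 -> 1.0
```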
[jira] [Commented] (ARROW-3253) [CI] Investigate Azure CI
[ https://issues.apache.org/jira/browse/ARROW-3253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16618138#comment-16618138 ]

Wes McKinney commented on ARROW-3253:
-------------------------------------
Does Anaconda have all the Windows build deps? That would be OK with me if that works

> [CI] Investigate Azure CI
> -------------------------
>
>                 Key: ARROW-3253
>                 URL: https://issues.apache.org/jira/browse/ARROW-3253
>             Project: Apache Arrow
>          Issue Type: Task
>          Components: C++, Continuous Integration
>            Reporter: Antoine Pitrou
>            Priority: Major
>
> C++ builds on AppVeyor have become slower and slower. Some of it may be due to the parquet-cpp repository merge, but I also suspect CPU resources on AppVeyor have become much tighter.
> We should perhaps investigate Microsoft's Azure CI services as an alternative:
> https://azure.microsoft.com/en-gb/services/devops/pipelines/
[jira] [Commented] (ARROW-3253) [CI] Investigate Azure CI
[ https://issues.apache.org/jira/browse/ARROW-3253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16618059#comment-16618059 ]

Antoine Pitrou commented on ARROW-3253:
---------------------------------------
Anaconda is already backed by a CDN, I think. So perhaps we can just ditch the use of conda-forge (which would also make conda dependency resolution faster)?
[jira] [Commented] (ARROW-3253) [CI] Investigate Azure CI
[ https://issues.apache.org/jira/browse/ARROW-3253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16618049#comment-16618049 ]

Wes McKinney commented on ARROW-3253:
-------------------------------------
It might be a dark path, but we could look at snapshotting the conda packages and putting them in a CDN
[jira] [Updated] (ARROW-3187) [Plasma] Make Plasma Log pluggable with glog
[ https://issues.apache.org/jira/browse/ARROW-3187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3187: Fix Version/s: 0.12.0 > [Plasma] Make Plasma Log pluggable with glog > > > Key: ARROW-3187 > URL: https://issues.apache.org/jira/browse/ARROW-3187 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Yuhong Guo >Assignee: Yuhong Guo >Priority: Major > Labels: pull-request-available > Fix For: 0.12.0 > > Time Spent: 6h 20m > Remaining Estimate: 0h > > Make Plasma pluggable with glog using Macro. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3238) [Python] Can't read pyarrow string columns in fastparquet
[ https://issues.apache.org/jira/browse/ARROW-3238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3238: Summary: [Python] Can't read pyarrow string columns in fastparquet (was: Can't read pyarrow string columns in fastparquet) > [Python] Can't read pyarrow string columns in fastparquet > - > > Key: ARROW-3238 > URL: https://issues.apache.org/jira/browse/ARROW-3238 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Theo Walker >Priority: Major > Labels: parquet > > Writing really long strings from pyarrow causes exception in fastparquet read. > {code:java} > Traceback (most recent call last): > File "/Users/twalker/repos/cloud-atlas/diag/right.py", line 47, in > read_fastparquet() > File "/Users/twalker/repos/cloud-atlas/diag/right.py", line 41, in > read_fastparquet > dff = pf.to_pandas(['A']) > File > "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/api.py", > line 426, in to_pandas > index=index, assign=parts) > File > "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/api.py", > line 258, in read_row_group > scheme=self.file_scheme) > File > "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", > line 344, in read_row_group > cats, selfmade, assign=assign) > File > "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", > line 321, in read_row_group_arrays > catdef=out.get(name+'-catdef', None)) > File > "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", > line 235, in read_col > skip_nulls, selfmade=selfmade) > File > "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", > line 99, in read_data_page > raw_bytes = _read_page(f, header, metadata) > File > "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", > line 31, in _read_page > page_header.uncompressed_page_size) > AssertionError: found 175532 raw bytes (expected 200026){code} > If written with 
compression, it reports compression errors instead: > {code:java} > SNAPPY: snappy.UncompressError: Error while decompressing: invalid input > GZIP: zlib.error: Error -3 while decompressing data: incorrect header > check{code} > > > Minimal code to reproduce: > {code:java} > import os > import pandas as pd > import pyarrow > import pyarrow.parquet as arrow_pq > from fastparquet import ParquetFile > # data to generate > ROW_LENGTH = 4 # decreasing below 32750ish eliminates exception > N_ROWS = 10 > # file write params > ROW_GROUP_SIZE = 5 # Lower numbers eliminate exception, but strange data is > read (e.g. Nones) > FILENAME = 'test.parquet' > def write_arrow(): > df = pd.DataFrame({'A': ['A'*ROW_LENGTH for _ in range(N_ROWS)]}) > if os.path.isfile(FILENAME): > os.remove(FILENAME) > arrow_table = pyarrow.Table.from_pandas(df) > arrow_pq.write_table(arrow_table, > FILENAME, > use_dictionary=False, > compression='NONE', > row_group_size=ROW_GROUP_SIZE) > def read_arrow(): > print "arrow:" > table2 = arrow_pq.read_table(FILENAME) > print table2.to_pandas().head() > def read_fastparquet(): > print "fastparquet:" > pf = ParquetFile(FILENAME) > dff = pf.to_pandas(['A']) > print dff.head() > write_arrow() > read_arrow() > read_fastparquet(){code} > > Versions: > {code:java} > fastparquet==0.1.6 > pyarrow==0.10.0 > pandas==0.22.0 > sys.version '2.7.15 |Anaconda custom (64-bit)| (default, May 1 2018, > 18:37:05) \n[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]'{code} > Also opened issue here: https://github.com/dask/fastparquet/issues/375 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
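The "found X raw bytes (expected Y)" mismatch can be sanity-checked by hand, since the uncompressed size of a PLAIN-encoded BYTE_ARRAY data page is just the sum of a 4-byte length prefix plus the payload for each value (the 40000-character string below is an illustrative size, not a value taken from the report):

```python
def plain_byte_array_page_size(strings):
    # Parquet PLAIN encoding stores each BYTE_ARRAY value as a 4-byte
    # little-endian length prefix followed by the raw bytes.
    return sum(4 + len(s.encode('utf-8')) for s in strings)

# A row group of five 40000-character strings:
plain_byte_array_page_size(['A' * 40000] * 5)  # → 200020
```

Comparing such a hand computation against both the page header's `uncompressed_page_size` and the byte count the reader actually finds helps localize which side miscounts.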
[jira] [Updated] (ARROW-3238) [Python] Can't read pyarrow string columns in fastparquet
[ https://issues.apache.org/jira/browse/ARROW-3238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-3238:
--------------------------------
    Labels: parquet  (was: )
[jira] [Updated] (ARROW-3238) [Python] Can't read pyarrow string columns in fastparquet
[ https://issues.apache.org/jira/browse/ARROW-3238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-3238:
--------------------------------
    Component/s: Python
[jira] [Commented] (ARROW-3253) [CI] Investigate Azure CI
[ https://issues.apache.org/jira/browse/ARROW-3253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617954#comment-16617954 ]

Antoine Pitrou commented on ARROW-3253:
---------------------------------------
Ironically the toolchain is also rather slow to fetch... at least when using conda-forge.
[jira] [Created] (ARROW-3255) [C++/Python] Migrate Travis CI jobs off Xcode 6.4
Wes McKinney created ARROW-3255:
---------------------------------

Summary: [C++/Python] Migrate Travis CI jobs off Xcode 6.4
Key: ARROW-3255
URL: https://issues.apache.org/jira/browse/ARROW-3255
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Wes McKinney
Fix For: 0.12.0

Travis CI says they are winding down their support for Xcode 6.4, which we use in our CI as the minimum Xcode that can build the Arrow libraries:

"Running builds with Xcode 6.4 in Travis CI is deprecated and will be removed in January 2019. If Xcode 6.4 is critical to your builds, please contact our support team at supp...@travis-ci.com to discuss options. Services are not supported on osx"

We should decide whether we want to continue to support this version of Xcode, and what the implications are if we do not
[jira] [Created] (ARROW-3254) [C++] Add option to ADD_ARROW_TEST to compose a test executable from multiple .cc files containing unit tests
Wes McKinney created ARROW-3254:
---------------------------------

Summary: [C++] Add option to ADD_ARROW_TEST to compose a test executable from multiple .cc files containing unit tests
Key: ARROW-3254
URL: https://issues.apache.org/jira/browse/ARROW-3254
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Wes McKinney
Fix For: 0.12.0

Currently there is a 1-1 correspondence between a .cc file containing unit tests and a test executable. There are good reasons (like readability and code organization) to split up a large test suite among many files. But there are downsides:

* Linking test executables is slow, especially on Windows
* Test executables take up quite a bit of space (the debug/ directory on Linux after a full build is ~1GB)

I suggest enabling ADD_ARROW_TEST to accept a list of files which will be built together into a single test executable. This will allow us to combine a number of our unit tests and save time and space
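The proposal could look roughly like the following CMake sketch (the SOURCES keyword and helper names are illustrative assumptions, not Arrow's actual build macros): keep the existing one-file default, but let callers pass several translation units that link into one executable.

```cmake
# Sketch: extend ADD_ARROW_TEST with an optional multi-value SOURCES argument.
function(ADD_ARROW_TEST REL_TEST_NAME)
  cmake_parse_arguments(ARG "" "" "SOURCES" ${ARGN})
  if(NOT ARG_SOURCES)
    # Preserve today's 1-1 behavior: foo-test builds from foo-test.cc
    set(ARG_SOURCES "${REL_TEST_NAME}.cc")
  endif()
  add_executable(${REL_TEST_NAME} ${ARG_SOURCES})
  target_link_libraries(${REL_TEST_NAME} gtest_main)
  add_test(NAME ${REL_TEST_NAME} COMMAND ${REL_TEST_NAME})
endfunction()

# One test executable, several unit-test .cc files, one link step:
# add_arrow_test(arrow-ipc-test SOURCES read-write-test.cc json-test.cc)
```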
[jira] [Commented] (ARROW-3253) [CI] Investigate Azure CI
[ https://issues.apache.org/jira/browse/ARROW-3253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617932#comment-16617932 ]

Wes McKinney commented on ARROW-3253:
-------------------------------------
I think our CI should just use the toolchain always for performance and we should move our "thirdparty testing" to a Crossbow job, so we can verify nightly or on demand that all the projects will build automatically from source
[jira] [Commented] (ARROW-3253) [CI] Investigate Azure CI
[ https://issues.apache.org/jira/browse/ARROW-3253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617931#comment-16617931 ]

Wes McKinney commented on ARROW-3253:
-------------------------------------
Ouch. That build has some other problems -- it's building Thrift from source which is really slow:

{code}
-- THRIFT_HOME:
-- Thrift compiler/libraries NOT found: (THRIFT_INCLUDE_DIR-NOTFOUND, THRIFT_STATIC_LIB-NOTFOUND). Looked in system search paths.
-- Thrift include dir: C:/projects/arrow/cpp/build/thrift_ep/src/thrift_ep-install/include
-- Thrift static library: C:/projects/arrow/cpp/build/thrift_ep/src/thrift_ep-install/lib/thriftmd.lib
-- Thrift compiler: C:/projects/arrow/cpp/build/thrift_ep/src/thrift_ep-install/bin/thrift
-- Thrift version: 0.11.0
{code}
[jira] [Assigned] (ARROW-3183) [Python] get_library_dirs on Windows can give the wrong directory
[ https://issues.apache.org/jira/browse/ARROW-3183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-3183: --- Assignee: Victor Uriarte > [Python] get_library_dirs on Windows can give the wrong directory > - > > Key: ARROW-3183 > URL: https://issues.apache.org/jira/browse/ARROW-3183 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.9.0, 0.10.0 > Environment: Windows 10 > Anaconda Python 3.6 >Reporter: Victor Uriarte >Assignee: Victor Uriarte >Priority: Minor > Labels: pull-request-available > Fix For: 0.11.0 > > Time Spent: 1h 50m > Remaining Estimate: 0h > > Python Version: Anaconda 3.6 > PyArrow Version: 0.9.0 and 0.10.0 > Installed by: conda > {{The function pa.get_library_dirs() points to the wrong directory}} > {{import pyarrow as pa}} > {{print(pa.get_library_dirs())}} > returns (Notice the extra lib in the middle of the 2nd string): > {{['C:\\Anaconda\\lib\\site-packages\\pyarrow', > 'C:\\Anaconda\\lib\\Library\\lib']}} > but it should be: > {{['C:\\Anaconda\\lib\\site-packages\\pyarrow', > 'C:\\Anaconda\\Library\\lib']}} > Not sure if this is dependent on how `pyarrow` was installed on the system. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-3183) [Python] get_library_dirs on Windows can give the wrong directory
[ https://issues.apache.org/jira/browse/ARROW-3183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-3183. - Resolution: Fixed Issue resolved by pull request 2518 [https://github.com/apache/arrow/pull/2518] > [Python] get_library_dirs on Windows can give the wrong directory > - > > Key: ARROW-3183 > URL: https://issues.apache.org/jira/browse/ARROW-3183 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.9.0, 0.10.0 > Environment: Windows 10 > Anaconda Python 3.6 >Reporter: Victor Uriarte >Priority: Minor > Labels: pull-request-available > Fix For: 0.11.0 > > Time Spent: 1h 50m > Remaining Estimate: 0h > > Python Version: Anaconda 3.6 > PyArrow Version: 0.9.0 and 0.10.0 > Installed by: conda > {{The function pa.get_library_dirs() points to the wrong directory}} > {{import pyarrow as pa}} > {{print(pa.get_library_dirs())}} > returns (Notice the extra lib in the middle of the 2nd string): > {{['C:\\Anaconda\\lib\\site-packages\\pyarrow', > 'C:\\Anaconda\\lib\\Library\\lib']}} > but it should be: > {{['C:\\Anaconda\\lib\\site-packages\\pyarrow', > 'C:\\Anaconda\\Library\\lib']}} > Not sure if this is dependent on how `pyarrow` was installed on the system. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
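To make the bug above concrete: the wrong entry came from joining `Library\lib` onto the *lib* directory instead of the install prefix. A minimal sketch of the two path constructions (the `C:\Anaconda` prefix is a hypothetical install location, not Arrow's actual code):

```python
import ntpath  # Windows path semantics, usable on any platform

prefix = r"C:\Anaconda"  # hypothetical conda install prefix

# Buggy: appends "Library\lib" onto the lib directory, yielding an extra "lib"
buggy = ntpath.join(prefix, "lib", "Library", "lib")

# Intended: "Library\lib" hangs directly off the install prefix
fixed = ntpath.join(prefix, "Library", "lib")

print(buggy)  # C:\Anaconda\lib\Library\lib
print(fixed)  # C:\Anaconda\Library\lib
```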
[jira] [Commented] (ARROW-3253) [CI] Investigate Azure CI
[ https://issues.apache.org/jira/browse/ARROW-3253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617908#comment-16617908 ] Antoine Pitrou commented on ARROW-3253: --- It seems the C++ build phase is ballooning. See here, 19 minutes to end up with a compilation failure (no unittest executed): https://ci.appveyor.com/project/pitrou/arrow/build/1.0.732 > [CI] Investigate Azure CI > - > > Key: ARROW-3253 > URL: https://issues.apache.org/jira/browse/ARROW-3253 > Project: Apache Arrow > Issue Type: Task > Components: C++, Continuous Integration >Reporter: Antoine Pitrou >Priority: Major > > C++ builds on AppVeyor have become slower and slower. Some of it may be due > to the parquet-cpp repository merge, but I also suspect CPU resources on > AppVeyor have become much tighter. > We should perhaps investigate Microsoft's Azure CI services as an alternative: > https://azure.microsoft.com/en-gb/services/devops/pipelines/ -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-3190) [C++] "WriteableFile" is misspelled, should be renamed "WritableFile" with deprecation for old name
[ https://issues.apache.org/jira/browse/ARROW-3190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-3190. - Resolution: Fixed Issue resolved by pull request 2569 [https://github.com/apache/arrow/pull/2569] > [C++] "WriteableFile" is misspelled, should be renamed "WritableFile" with > deprecation for old name > --- > > Key: ARROW-3190 > URL: https://issues.apache.org/jira/browse/ARROW-3190 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Blocker > Labels: pull-request-available > Fix For: 0.11.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > See e.g. > https://docs.oracle.com/javase/7/docs/api/java/nio/channels/WritableByteChannel.html -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3253) [CI] Investigate Azure CI
[ https://issues.apache.org/jira/browse/ARROW-3253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617885#comment-16617885 ] Wes McKinney commented on ARROW-3253: - I suggest splitting the C++ unit tests into a separate build from the Python unit tests as one way to speed things up. The build times aren't _too_ bad yet, though: https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/build/1.0.7968 > [CI] Investigate Azure CI > - > > Key: ARROW-3253 > URL: https://issues.apache.org/jira/browse/ARROW-3253 > Project: Apache Arrow > Issue Type: Task > Components: C++, Continuous Integration >Reporter: Antoine Pitrou >Priority: Major > > C++ builds on AppVeyor have become slower and slower. Some of it may be due > to the parquet-cpp repository merge, but I also suspect CPU resources on > AppVeyor have become much tighter. > We should perhaps investigate Microsoft's Azure CI services as an alternative: > https://azure.microsoft.com/en-gb/services/devops/pipelines/ -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3253) [CI] Investigate Azure CI
Antoine Pitrou created ARROW-3253: - Summary: [CI] Investigate Azure CI Key: ARROW-3253 URL: https://issues.apache.org/jira/browse/ARROW-3253 Project: Apache Arrow Issue Type: Task Components: C++, Continuous Integration Reporter: Antoine Pitrou C++ builds on AppVeyor have become slower and slower. Some of it may be due to the parquet-cpp repository merge, but I also suspect CPU resources on AppVeyor have become much tighter. We should perhaps investigate Microsoft's Azure CI services as an alternative: https://azure.microsoft.com/en-gb/services/devops/pipelines/ -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-300) [Format] Add buffer compression option to IPC file format
[ https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617883#comment-16617883 ] Wes McKinney commented on ARROW-300: Moving this to 0.12. I will make a proposal for compressed record batches after the 0.11 release goes out. My gut instinct on this would be to create a {{CompressedBuffer}} metadata type and a {{CompressedRecordBatch}} message. Some reasons: * Does not complicate or bloat the existing RecordBatch message type * Supports buffer-level compression (each buffer can be compressed or not) Readers can choose to materialize right away or on demand -- in C++, we could create an {{arrow::CompressedRecordBatch}} class that does late materialization. This does not necessarily accommodate other kinds of type-specific compression, like RLE encoding, though it might be that RLE can be used on the values buffer of primitive types, e.g. {code} CompressedBuffer { CompressionType type; int64 offset; int64 compressed_size; int64 uncompressed_size; } {code} So if we wanted to use the Parquet RLE_BITPACKED_HYBRID compression style for integers, say, we could do that. Another question here is how to handle compressions that may have additional parameters. {{CompressionType}} or {{Compression}} could be a union, but that would make the message sizes larger (though maybe that's OK) > [Format] Add buffer compression option to IPC file format > - > > Key: ARROW-300 > URL: https://issues.apache.org/jira/browse/ARROW-300 > Project: Apache Arrow > Issue Type: New Feature > Components: Format >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Fix For: 0.12.0 > > > It may be useful if data is to be sent over the wire to compress the data > buffers themselves as their being written in the file layout. > I would propose that we keep this extremely simple with a global buffer > compression setting in the file Footer. 
Probably only two compressors worth > supporting out of the box would be zlib (higher compression ratios) and lz4 > (better performance). > What does everyone think? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
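To make the {{CompressedBuffer}} proposal concrete, here is a hedged Python sketch pairing that metadata with a zlib round trip. The field names mirror the struct sketched in the comment; everything else (the function name, the string compression tag) is made up for illustration and is not an Arrow API:

```python
import zlib
from dataclasses import dataclass

@dataclass
class CompressedBuffer:
    # mirrors the metadata struct sketched in the proposal above
    compression_type: str  # stand-in for the CompressionType enum
    offset: int
    compressed_size: int
    uncompressed_size: int

def compress_buffer(data: bytes, offset: int = 0):
    """Compress one buffer and record the metadata a reader would need."""
    body = zlib.compress(data)
    return CompressedBuffer("zlib", offset, len(body), len(data)), body

# Buffer-level compression: each buffer gets its own metadata entry,
# so a reader can decompress eagerly or defer (late materialization).
meta, body = compress_buffer(b"\x00" * 4096)
assert zlib.decompress(body) == b"\x00" * 4096
assert meta.compressed_size == len(body)
```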
[jira] [Updated] (ARROW-300) [Format] Add buffer compression option to IPC file format
[ https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-300: --- Fix Version/s: (was: 0.13.0) 0.12.0 > [Format] Add buffer compression option to IPC file format > - > > Key: ARROW-300 > URL: https://issues.apache.org/jira/browse/ARROW-300 > Project: Apache Arrow > Issue Type: New Feature > Components: Format >Reporter: Wes McKinney >Priority: Major > Fix For: 0.12.0 > > > It may be useful if data is to be sent over the wire to compress the data > buffers themselves as their being written in the file layout. > I would propose that we keep this extremely simple with a global buffer > compression setting in the file Footer. Probably only two compressors worth > supporting out of the box would be zlib (higher compression ratios) and lz4 > (better performance). > What does everyone think? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-300) [Format] Add buffer compression option to IPC file format
[ https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-300: -- Assignee: Wes McKinney > [Format] Add buffer compression option to IPC file format > - > > Key: ARROW-300 > URL: https://issues.apache.org/jira/browse/ARROW-300 > Project: Apache Arrow > Issue Type: New Feature > Components: Format >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Fix For: 0.12.0 > > > It may be useful if data is to be sent over the wire to compress the data > buffers themselves as their being written in the file layout. > I would propose that we keep this extremely simple with a global buffer > compression setting in the file Footer. Probably only two compressors worth > supporting out of the box would be zlib (higher compression ratios) and lz4 > (better performance). > What does everyone think? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-3249) [Python] Run flake8 on integration_test.py
[ https://issues.apache.org/jira/browse/ARROW-3249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-3249: --- Assignee: Wes McKinney > [Python] Run flake8 on integration_test.py > -- > > Key: ARROW-3249 > URL: https://issues.apache.org/jira/browse/ARROW-3249 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Fix For: 0.11.0 > > > We should keep this code clean, too -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-3196) Enable merge_arrow_py.py script to merge Parquet patches and set fix versions
[ https://issues.apache.org/jira/browse/ARROW-3196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-3196: --- Assignee: Wes McKinney > Enable merge_arrow_py.py script to merge Parquet patches and set fix versions > - > > Key: ARROW-3196 > URL: https://issues.apache.org/jira/browse/ARROW-3196 > Project: Apache Arrow > Issue Type: New Feature >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Fix For: 0.11.0 > > > Follow up to ARROW-3075 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-3198) [Website] Blog post regarding Arrow-Parquet C++ monorepo effort
[ https://issues.apache.org/jira/browse/ARROW-3198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-3198: --- Assignee: Wes McKinney > [Website] Blog post regarding Arrow-Parquet C++ monorepo effort > --- > > Key: ARROW-3198 > URL: https://issues.apache.org/jira/browse/ARROW-3198 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Fix For: 0.11.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-3076) [Website] Add Google Analytics tags to generated API documentation
[ https://issues.apache.org/jira/browse/ARROW-3076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-3076: --- Assignee: Wes McKinney > [Website] Add Google Analytics tags to generated API documentation > -- > > Key: ARROW-3076 > URL: https://issues.apache.org/jira/browse/ARROW-3076 > Project: Apache Arrow > Issue Type: Improvement > Components: Website >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Fix For: 0.11.0 > > > It would be helpful to see which parts of the documentation are seeing traffic -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-3197) [C++] Add instructions to cpp/README.md about Parquet-only development and Arrow+Parquet
[ https://issues.apache.org/jira/browse/ARROW-3197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-3197: --- Assignee: Wes McKinney > [C++] Add instructions to cpp/README.md about Parquet-only development and > Arrow+Parquet > > > Key: ARROW-3197 > URL: https://issues.apache.org/jira/browse/ARROW-3197 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Labels: parquet > Fix For: 0.11.0 > > > There are two distinct development workflows -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3056) [Python] Indicate in NativeFile docstrings methods that are part of the RawIOBase API but not implemented
[ https://issues.apache.org/jira/browse/ARROW-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-3056: -- Labels: pull-request-available (was: ) > [Python] Indicate in NativeFile docstrings methods that are part of the > RawIOBase API but not implemented > - > > Key: ARROW-3056 > URL: https://issues.apache.org/jira/browse/ARROW-3056 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 0.11.0 > > > see https://github.com/apache/arrow/issues/2422 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-3212) [C++] Create deterministic IPC metadata
[ https://issues.apache.org/jira/browse/ARROW-3212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-3212: --- Assignee: Wes McKinney > [C++] Create deterministic IPC metadata > --- > > Key: ARROW-3212 > URL: https://issues.apache.org/jira/browse/ARROW-3212 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Fix For: 0.11.0 > > > Currently, the amount of padding bytes written after the IPC metadata header > depends on the current position of the {{OutputStream}} passed. So if the > message begins on an unaligned (not multiple of 8) offset, then the content > of the metadata will be different than if it did. This seems like a leaky > abstraction -- aligning the stream should probably be handled separately from > writing the IPC protocol. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
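The alignment concern described above is easy to state: the number of padding bytes depends only on the stream position modulo 8, which is why metadata written at an unaligned offset differs byte-for-byte. A small illustrative helper (not Arrow's actual code):

```python
def padding_to_8(offset: int) -> int:
    # bytes of padding needed so the next write starts on an 8-byte boundary
    return (-offset) % 8

# A message starting at offset 5 needs 3 padding bytes; at offset 8, none.
print(padding_to_8(5), padding_to_8(8))  # 3 0
```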
[jira] [Assigned] (ARROW-3056) [Python] Indicate in NativeFile docstrings methods that are part of the RawIOBase API but not implemented
[ https://issues.apache.org/jira/browse/ARROW-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-3056: --- Assignee: Wes McKinney > [Python] Indicate in NativeFile docstrings methods that are part of the > RawIOBase API but not implemented > - > > Key: ARROW-3056 > URL: https://issues.apache.org/jira/browse/ARROW-3056 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Fix For: 0.11.0 > > > see https://github.com/apache/arrow/issues/2422 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-2600) [Python] Add additional LocalFileSystem filesystem methods
[ https://issues.apache.org/jira/browse/ARROW-2600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Hagerman reassigned ARROW-2600: Assignee: (was: Alex Hagerman) > [Python] Add additional LocalFileSystem filesystem methods > -- > > Key: ARROW-2600 > URL: https://issues.apache.org/jira/browse/ARROW-2600 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Alex Hagerman >Priority: Minor > Labels: filesystem, pull-request-available > Fix For: 0.12.0 > > Time Spent: 10m > Remaining Estimate: 0h > > Related to https://issues.apache.org/jira/browse/ARROW-1319 I noticed the > methods Martin listed are also not part of the LocalFileSystem class. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-25) [C++] Implement delimited file scanner / CSV reader
[ https://issues.apache.org/jira/browse/ARROW-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-25: Labels: csv pull-request-available (was: csv) > [C++] Implement delimited file scanner / CSV reader > --- > > Key: ARROW-25 > URL: https://issues.apache.org/jira/browse/ARROW-25 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Antoine Pitrou >Priority: Major > Labels: csv, pull-request-available > > Like Parquet and binary file formats, text files will be an important data > medium for converting to and from in-memory Arrow data. > pandas has some (Apache-compatible) business logic we can learn from here (as > one of the gold-standard CSV readers in production use) > https://github.com/pydata/pandas/blob/master/pandas/src/parser/tokenizer.h > https://github.com/pydata/pandas/blob/master/pandas/parser.pyx > While very fast, this should be largely written from scratch to target > the Arrow memory layout, but we can reuse certain aspects like the tokenizer > DFA (which originally came from the Python interpreter csv module > implementation) > https://github.com/pydata/pandas/blob/master/pandas/src/parser/tokenizer.c#L713 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3251) [C++] Conversion warnings in cast.cc
[ https://issues.apache.org/jira/browse/ARROW-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-3251: -- Labels: pull-request-available (was: ) > [C++] Conversion warnings in cast.cc > > > Key: ARROW-3251 > URL: https://issues.apache.org/jira/browse/ARROW-3251 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > > This is with gcc 7.3.0 and {{-Wconversion}}. > {code} > ../src/arrow/compute/kernels/cast.cc: In instantiation of ‘void > arrow::compute::CastFunctor std::enable_if I>::value>::type>::operator()(arrow::compute::FunctionContext*, const > arrow::compute::CastOptions&, const arrow::ArrayData&, arrow::ArrayData*) > [with O = arrow::Int64Type; I = arrow::DoubleType; typename > std::enable_if::value>::type = void]’: > ../src/arrow/compute/kernels/cast.cc:1105:1: required from here > ../src/arrow/compute/kernels/cast.cc:291:45: warning: conversion to ‘in_type > {aka double}’ from ‘long int’ may alter its value [-Wconversion] >if (ARROW_PREDICT_FALSE(out_value != *in_data)) { >~~^ > ../src/arrow/util/macros.h:37:50: note: in definition of macro > ‘ARROW_PREDICT_FALSE’ > #define ARROW_PREDICT_FALSE(x) (__builtin_expect(x, 0)) > ^ > ../src/arrow/compute/kernels/cast.cc:301:45: warning: conversion to ‘in_type > {aka double}’ from ‘long int’ may alter its value [-Wconversion] >if (ARROW_PREDICT_FALSE(out_value != *in_data)) { >~~^ > ../src/arrow/util/macros.h:37:50: note: in definition of macro > ‘ARROW_PREDICT_FALSE’ > #define ARROW_PREDICT_FALSE(x) (__builtin_expect(x, 0)) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3252) [C++] Do not hard code the "v" part of versions in thirdparty toolchain
Wes McKinney created ARROW-3252: --- Summary: [C++] Do not hard code the "v" part of versions in thirdparty toolchain Key: ARROW-3252 URL: https://issues.apache.org/jira/browse/ARROW-3252 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 0.11.0 When I changed Flatbuffers from "v1.8.0" to a git hash, it broke the dependency download script. We should move all the version strings to versions.txt rather than having some "v${FOO_URL}" in ThirdpartyToolchain.cmake -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-3251) [C++] Conversion warnings in cast.cc
[ https://issues.apache.org/jira/browse/ARROW-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou reassigned ARROW-3251: - Assignee: Antoine Pitrou > [C++] Conversion warnings in cast.cc > > > Key: ARROW-3251 > URL: https://issues.apache.org/jira/browse/ARROW-3251 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Major > > This is with gcc 7.3.0 and {{-Wconversion}}. > {code} > ../src/arrow/compute/kernels/cast.cc: In instantiation of ‘void > arrow::compute::CastFunctor std::enable_if I>::value>::type>::operator()(arrow::compute::FunctionContext*, const > arrow::compute::CastOptions&, const arrow::ArrayData&, arrow::ArrayData*) > [with O = arrow::Int64Type; I = arrow::DoubleType; typename > std::enable_if::value>::type = void]’: > ../src/arrow/compute/kernels/cast.cc:1105:1: required from here > ../src/arrow/compute/kernels/cast.cc:291:45: warning: conversion to ‘in_type > {aka double}’ from ‘long int’ may alter its value [-Wconversion] >if (ARROW_PREDICT_FALSE(out_value != *in_data)) { >~~^ > ../src/arrow/util/macros.h:37:50: note: in definition of macro > ‘ARROW_PREDICT_FALSE’ > #define ARROW_PREDICT_FALSE(x) (__builtin_expect(x, 0)) > ^ > ../src/arrow/compute/kernels/cast.cc:301:45: warning: conversion to ‘in_type > {aka double}’ from ‘long int’ may alter its value [-Wconversion] >if (ARROW_PREDICT_FALSE(out_value != *in_data)) { >~~^ > ../src/arrow/util/macros.h:37:50: note: in definition of macro > ‘ARROW_PREDICT_FALSE’ > #define ARROW_PREDICT_FALSE(x) (__builtin_expect(x, 0)) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3251) [C++] Conversion warnings in cast.cc
Antoine Pitrou created ARROW-3251: - Summary: [C++] Conversion warnings in cast.cc Key: ARROW-3251 URL: https://issues.apache.org/jira/browse/ARROW-3251 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Antoine Pitrou This is with gcc 7.3.0 and {{-Wconversion}}. {code} ../src/arrow/compute/kernels/cast.cc: In instantiation of ‘void arrow::compute::CastFunctor::value>::type>::operator()(arrow::compute::FunctionContext*, const arrow::compute::CastOptions&, const arrow::ArrayData&, arrow::ArrayData*) [with O = arrow::Int64Type; I = arrow::DoubleType; typename std::enable_if::value>::type = void]’: ../src/arrow/compute/kernels/cast.cc:1105:1: required from here ../src/arrow/compute/kernels/cast.cc:291:45: warning: conversion to ‘in_type {aka double}’ from ‘long int’ may alter its value [-Wconversion] if (ARROW_PREDICT_FALSE(out_value != *in_data)) { ~~^ ../src/arrow/util/macros.h:37:50: note: in definition of macro ‘ARROW_PREDICT_FALSE’ #define ARROW_PREDICT_FALSE(x) (__builtin_expect(x, 0)) ^ ../src/arrow/compute/kernels/cast.cc:301:45: warning: conversion to ‘in_type {aka double}’ from ‘long int’ may alter its value [-Wconversion] if (ARROW_PREDICT_FALSE(out_value != *in_data)) { ~~^ ../src/arrow/util/macros.h:37:50: note: in definition of macro ‘ARROW_PREDICT_FALSE’ #define ARROW_PREDICT_FALSE(x) (__builtin_expect(x, 0)) {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
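The comparison the compiler warns about exists because a double cannot represent every 64-bit integer exactly: the significand has 53 bits, so casting a large int64 to double (as in `out_value != *in_data`) can silently round. A quick Python demonstration of the underlying numeric issue:

```python
# doubles have a 53-bit significand, so 2**53 + 1 is not representable
exact = 2 ** 53
inexact = 2 ** 53 + 1

assert int(float(exact)) == exact        # round-trips exactly
assert int(float(inexact)) != inexact    # rounds back down to 2**53
print(float(inexact))  # 9007199254740992.0
```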
[jira] [Resolved] (ARROW-3157) [C++] Improve buffer creation for typed data
[ https://issues.apache.org/jira/browse/ARROW-3157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-3157. - Resolution: Fixed Issue resolved by pull request 2566 [https://github.com/apache/arrow/pull/2566] > [C++] Improve buffer creation for typed data > > > Key: ARROW-3157 > URL: https://issues.apache.org/jira/browse/ARROW-3157 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Philipp Moritz >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available, usability > Fix For: 0.11.0 > > Time Spent: 1h > Remaining Estimate: 0h > > While looking into [https://github.com/apache/arrow/pull/2481,] I noticed > this pattern: > {code:java} > const uint8_t* bytes_array = reinterpret_cast<const uint8_t*>(input); > auto buffer = std::make_shared<Buffer>(bytes_array, > sizeof(float)*input_length);{code} > It's not the end of the world but seems a little verbose to me. It would be > great to have something like this: > {code:java} > auto buffer = MakeBuffer(input, input_length);{code} > I couldn't find it, does it already exist somewhere? Any thoughts on the API? > Potentially specializations to make a buffer out of a std::vector would > also be helpful. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
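As a loose stdlib analog of the convenience being requested, Python's `memoryview.cast` produces a byte-level view over typed data without any manual reinterpretation boilerplate; nothing here is an Arrow API, just an illustration of the "typed data in, byte buffer out" idea:

```python
import array

floats = array.array("f", [1.0, 2.0, 3.0])   # contiguous float32 storage
buf = memoryview(floats).cast("B")           # zero-copy byte-level view

# one call replaces the reinterpret_cast + size arithmetic pattern
assert buf.nbytes == len(floats) * floats.itemsize
```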
[jira] [Resolved] (ARROW-3227) [Python] NativeFile.write shouldn't accept unicode strings
[ https://issues.apache.org/jira/browse/ARROW-3227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-3227. - Resolution: Fixed Issue resolved by pull request 2570 [https://github.com/apache/arrow/pull/2570] > [Python] NativeFile.write shouldn't accept unicode strings > -- > > Key: ARROW-3227 > URL: https://issues.apache.org/jira/browse/ARROW-3227 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.10.0 >Reporter: Antoine Pitrou >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 0.11.0 > > Time Spent: 10m > Remaining Estimate: 0h > > Arrow files are binary, but for some reason {{NativeFile.write}} silently > converts unicode strings to bytes. > {code:python} > >>> b = io.BytesIO() > >>> b.write("foo") > Traceback (most recent call last): > File "", line 1, in > b.write("foo") > TypeError: a bytes-like object is required, not 'str' > >>> f = pa.PythonFile(b) > >>> f.write("foo") > >>> b.getvalue() > b'foo' > >>> f.write("") > >>> b.getvalue() > b'foo\xf0\x9f\x98\x80' > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
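A hedged sketch of the desired behavior, matching the `io.BytesIO` semantics quoted above: reject `str` outright instead of silently encoding it. The wrapper class and its name are made up for illustration; this is not the actual NativeFile fix:

```python
import io

class StrictBinaryWriter:
    """Wrap a binary sink and refuse unicode strings, like io.BytesIO does."""

    def __init__(self, sink):
        self._sink = sink

    def write(self, data):
        if isinstance(data, str):
            raise TypeError("a bytes-like object is required, not 'str'")
        return self._sink.write(data)

f = StrictBinaryWriter(io.BytesIO())
f.write(b"foo")        # bytes are fine
try:
    f.write("foo")     # rejected, unlike the current NativeFile.write
except TypeError as exc:
    print(exc)
```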
[jira] [Resolved] (ARROW-3241) [Plasma] test_plasma_list test failure on Ubuntu 14.04
[ https://issues.apache.org/jira/browse/ARROW-3241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-3241. - Resolution: Fixed Resolved by https://github.com/apache/arrow/commit/c698be339b96aeb74763d70de1cf4c8789148824 > [Plasma] test_plasma_list test failure on Ubuntu 14.04 > -- > > Key: ARROW-3241 > URL: https://issues.apache.org/jira/browse/ARROW-3241 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Blocker > Fix For: 0.11.0 > > > This test fails consistently for me on Ubuntu 14.04 / Python 3.6.5 > {code} > pyarrow/tests/test_plasma.py::test_plasma_list FAILED > > [ 83%] > >>> > > >>> >>> > >>> captured stderr > >>> >>> > Allowing the Plasma store to use up to 0.1GB of memory. > Starting object store with directory /dev/shm and huge page support disabled > >> > > >> >> > >> traceback > >> >> > @pytest.mark.plasma > def test_plasma_list(): > import pyarrow.plasma as plasma > > with plasma.start_plasma_store( > plasma_store_memory=DEFAULT_PLASMA_STORE_MEMORY) \ > as (plasma_store_name, p): > plasma_client = plasma.connect(plasma_store_name, "", 0) > > # Test sizes > u, _, _ = create_object(plasma_client, 11, metadata_size=7, > seal=False) > l1 = plasma_client.list() > assert l1[u]["data_size"] == 11 > assert l1[u]["metadata_size"] == 7 > > # Test ref_count > v = plasma_client.put(np.zeros(3)) > l2 = plasma_client.list() > # Ref count has already been released > assert l2[v]["ref_count"] == 0 > a = plasma_client.get(v) > l3 = plasma_client.list() > > assert l3[v]["ref_count"] == 1 > E assert 0 == 1 > pyarrow/tests/test_plasma.py:825: AssertionError > > > > entering PDB > > > > /home/wesm/code/arrow/python/pyarrow/tests/test_plasma.py(825)test_plasma_list() > -> assert l3[v]["ref_count"] == 1 > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1669) [C++] Consider adding Abseil (Google C++11 standard library extensions) to toolchain
[ https://issues.apache.org/jira/browse/ARROW-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617698#comment-16617698 ] Antoine Pitrou commented on ARROW-1669: --- The baseline glibc version for manylinux1 is too old for Abseil, see ARROW-2461. > [C++] Consider adding Abseil (Google C++11 standard library extensions) to > toolchain > > > Key: ARROW-1669 > URL: https://issues.apache.org/jira/browse/ARROW-1669 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: pull-request-available > Time Spent: 3.5h > Remaining Estimate: 0h > > Google has released a library of C++11-compliant extensions to the STL that > may help make a lot of Arrow code simpler: > https://github.com/abseil/abseil-cpp/ > This code is not header-only and so would require some effort to add to the > toolchain at the moment since it only supports the Bazel build system -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2461) [Python] Build wheels for manylinux2010 tag
[ https://issues.apache.org/jira/browse/ARROW-2461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617695#comment-16617695 ] Antoine Pitrou commented on ARROW-2461: --- I also asked on distutils-sig: https://mail.python.org/mm3/archives/list/distutils-...@python.org/thread/M4MSVY5MPAPXFWHH4PBLE6PEBPOBIA44/ > [Python] Build wheels for manylinux2010 tag > --- > > Key: ARROW-2461 > URL: https://issues.apache.org/jira/browse/ARROW-2461 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Uwe L. Korn >Priority: Blocker > Fix For: 0.12.0 > > > There is now work in progress on an updated manylinux tag based on CentOS6. > We should provide wheels for this tag and the old {{manylinux1}} tag for one > release and then switch to the new tag in the release afterwards. This should > enable us also to raise the minimum compiler requirement to gcc 4.9 (or > higher once conda-forge has migrated to a newer compiler). > The relevant PEP is https://www.python.org/dev/peps/pep-0571/ -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3250) [C++] Create Buffer implementation that takes ownership for the memory from a std::string via std::move
Wes McKinney created ARROW-3250: --- Summary: [C++] Create Buffer implementation that takes ownership for the memory from a std::string via std::move Key: ARROW-3250 URL: https://issues.apache.org/jira/browse/ARROW-3250 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Wes McKinney Fix For: 0.12.0 There are instances where it is useful to be able to take ownership of the memory of a {{std::string}} owned elsewhere and expose it as an {{arrow::Buffer}}, so we could have an interface like {{StlStringBuffer(std::string&& input)}} and transfer the memory that way -- This message was sent by Atlassian JIRA (v7.6.3#76005)
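Python has no direct equivalent of `std::move`, but the zero-copy intent can be illustrated with `memoryview`, which exposes an existing bytes object's memory without copying while keeping it alive. This is purely an analogy, not the proposed C++ API:

```python
payload = b"x" * (1 << 20)     # 1 MiB of existing data
view = memoryview(payload)     # no copy; the view keeps the memory alive

assert view.nbytes == len(payload)
assert view[:4].tobytes() == b"xxxx"
```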
[jira] [Created] (ARROW-3249) [Python] Run flake8 on integration_test.py
Wes McKinney created ARROW-3249: --- Summary: [Python] Run flake8 on integration_test.py Key: ARROW-3249 URL: https://issues.apache.org/jira/browse/ARROW-3249 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Wes McKinney Fix For: 0.11.0 We should keep this code clean, too -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3248) [C++] Arrow tests should have label "arrow"
Antoine Pitrou created ARROW-3248: - Summary: [C++] Arrow tests should have label "arrow" Key: ARROW-3248 URL: https://issues.apache.org/jira/browse/ARROW-3248 Project: Apache Arrow Issue Type: Wish Components: C++ Affects Versions: 0.10.0 Reporter: Antoine Pitrou It would make it possible to execute only them, and not the Parquet unit tests, which for some reason take quite a bit longer to run. -- This message was sent by Atlassian JIRA (v7.6.3#76005)