[jira] [Created] (ARROW-3258) [GLib] CI is failing on macOS

2018-09-17 Thread Kouhei Sutou (JIRA)
Kouhei Sutou created ARROW-3258:
---

 Summary: [GLib] CI is failing on macOS
 Key: ARROW-3258
 URL: https://issues.apache.org/jira/browse/ARROW-3258
 Project: Apache Arrow
  Issue Type: Improvement
  Components: GLib
Affects Versions: 0.10.0
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou


{code}
==> Installing postgis dependency: numpy
==> Downloading 
https://homebrew.bintray.com/bottles/numpy-1.15.1.sierra.bottle.tar.gz
==> Pouring numpy-1.15.1.sierra.bottle.tar.gz
Error: The `brew link` step did not complete successfully
The formula built, but is not symlinked into /usr/local
Could not symlink lib/python2.7/site-packages/numpy/__config__.py
Target /usr/local/lib/python2.7/site-packages/numpy/__config__.py
already exists. You may want to remove it:
 rm '/usr/local/lib/python2.7/site-packages/numpy/__config__.py'
 
To force the link and overwrite all conflicting files:
 brew link --overwrite numpy
 
To list all files that would be deleted:
 brew link --overwrite --dry-run numpy
 {code}
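Homebrew's own suggestion can be folded into the CI bootstrap so the job recovers instead of failing. Below is a minimal sketch of that retry pattern; {{brew_link}} is a hypothetical stand-in for {{brew link}}, stubbed here so the control flow can be shown (and exercised) without Homebrew installed:

```shell
# Retry pattern for the failure above: attempt a plain link first and
# fall back to --overwrite only when conflicting files are left over.
# `brew_link` is a stand-in stub for `brew link`; on a real macOS
# worker you would delete the stub and call `brew link` directly.
brew_link() {
  case "$1" in
    --overwrite) return 0 ;;  # stub: the forced link succeeds
    *) return 1 ;;            # stub: the plain link fails
  esac
}

link_formula() {
  if ! brew_link "$1"; then
    echo "plain link failed for $1; retrying with --overwrite"
    brew_link --overwrite "$1"
  fi
}

link_formula numpy  # prints: plain link failed for numpy; retrying with --overwrite
```

The point of the fallback (rather than always passing {{--overwrite}}) is that a clean worker still gets the safe, non-destructive default.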



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3257) [C++] Stop using IMPORTED_LINK_INTERFACE_LIBRARIES

2018-09-17 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-3257:
--
Labels: pull-request-available  (was: )

> [C++] Stop using IMPORTED_LINK_INTERFACE_LIBRARIES
> ---
>
> Key: ARROW-3257
> URL: https://issues.apache.org/jira/browse/ARROW-3257
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.10.0
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Minor
>  Labels: pull-request-available
>
> Because it's deprecated since CMake 3.2, which is the minimum required
> version:
> https://cmake.org/cmake/help/v3.2/prop_tgt/IMPORTED_LINK_INTERFACE_LIBRARIES.html
> The documentation says we should use INTERFACE_LINK_LIBRARIES instead:
> https://cmake.org/cmake/help/v3.2/prop_tgt/INTERFACE_LINK_LIBRARIES.html





[jira] [Created] (ARROW-3257) [C++] Stop using IMPORTED_LINK_INTERFACE_LIBRARIES

2018-09-17 Thread Kouhei Sutou (JIRA)
Kouhei Sutou created ARROW-3257:
---

 Summary: [C++] Stop using IMPORTED_LINK_INTERFACE_LIBRARIES
 Key: ARROW-3257
 URL: https://issues.apache.org/jira/browse/ARROW-3257
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.10.0
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou


Because it's deprecated since CMake 3.2, which is the minimum required
version:

https://cmake.org/cmake/help/v3.2/prop_tgt/IMPORTED_LINK_INTERFACE_LIBRARIES.html

The documentation says we should use INTERFACE_LINK_LIBRARIES instead:

https://cmake.org/cmake/help/v3.2/prop_tgt/INTERFACE_LINK_LIBRARIES.html
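As a sketch of what the migration looks like in CMake (the target name {{arrow_shared}} and the {{ARROW_LINK_LIBS}} variable below are illustrative, not Arrow's actual build code):

```cmake
# Before: deprecated since CMake 3.2.
set_target_properties(arrow_shared PROPERTIES
  IMPORTED_LINK_INTERFACE_LIBRARIES "${ARROW_LINK_LIBS}")

# After: the replacement recommended by the CMake documentation.
set_target_properties(arrow_shared PROPERTIES
  INTERFACE_LINK_LIBRARIES "${ARROW_LINK_LIBS}")
```

The new property is configuration-independent, which is why the docs prefer it over the old per-configuration imported-target property.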





[jira] [Updated] (ARROW-3256) [JS] File footer and message metadata is inconsistent

2018-09-17 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3256:

Description: 
I added some assertions to the C++ library and found that the body lengths in 
the file footer and the IPC message were different.

{code}
##
JS producing, C++ consuming
##
==
Testing file 
/home/travis/build/apache/arrow/integration/data/struct_example.json
==
-- Creating binary inputs
node --no-warnings /home/travis/build/apache/arrow/js/bin/json-to-arrow.js -a 
/tmp/tmplbm3vbwz/3d2269c960f148b6b94e5f881c0bf9ca_struct_example.json_to_arrow 
-j /home/travis/build/apache/arrow/integration/data/struct_example.json
-- Validating file
/home/travis/build/apache/arrow/cpp-build/debug/json-integration-test 
--integration 
--arrow=/tmp/tmplbm3vbwz/3d2269c960f148b6b94e5f881c0bf9ca_struct_example.json_to_arrow
 --json=/home/travis/build/apache/arrow/integration/data/struct_example.json 
--mode=VALIDATE
Command failed: 
/home/travis/build/apache/arrow/cpp-build/debug/json-integration-test 
--integration 
--arrow=/tmp/tmplbm3vbwz/3d2269c960f148b6b94e5f881c0bf9ca_struct_example.json_to_arrow
 --json=/home/travis/build/apache/arrow/integration/data/struct_example.json 
--mode=VALIDATE
With output:
--
/home/travis/build/apache/arrow/cpp/src/arrow/ipc/reader.cc:581 Check failed: 
(message->body_length()) == (block.body_length)
{code}

I'm not sure what's wrong. I'll remove the assertions for now.

  was:
I added some assertions to the C++ library and found that the body lengths in 
the file footer and the IPC message were different.

{code}
##
JS producing, C++ consuming
##
==
Testing file 
/home/travis/build/apache/arrow/integration/data/struct_example.json
==
-- Creating binary inputs
node --no-warnings /home/travis/build/apache/arrow/js/bin/json-to-arrow.js -a 
/tmp/tmplbm3vbwz/3d2269c960f148b6b94e5f881c0bf9ca_struct_example.json_to_arrow 
-j /home/travis/build/apache/arrow/integration/data/struct_example.json
-- Validating file
/home/travis/build/apache/arrow/cpp-build/debug/json-integration-test 
--integration 
--arrow=/tmp/tmplbm3vbwz/3d2269c960f148b6b94e5f881c0bf9ca_struct_example.json_to_arrow
 --json=/home/travis/build/apache/arrow/integration/data/struct_example.json 
--mode=VALIDATE
Command failed: 
/home/travis/build/apache/arrow/cpp-build/debug/json-integration-test 
--integration 
--arrow=/tmp/tmplbm3vbwz/3d2269c960f148b6b94e5f881c0bf9ca_struct_example.json_to_arrow
 --json=/home/travis/build/apache/arrow/integration/data/struct_example.json 
--mode=VALIDATE
With output:
--
/home/travis/build/apache/arrow/cpp/src/arrow/ipc/reader.cc:581 Check failed: 
(message->body_length()) == (block.body_length)
{code}

It appears that the order of the lengths is flipped in

https://github.com/apache/arrow/blob/master/js/src/ipc/writer/binary.ts#L77


> [JS] File footer and message metadata is inconsistent
> -
>
> Key: ARROW-3256
> URL: https://issues.apache.org/jira/browse/ARROW-3256
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Reporter: Wes McKinney
>Priority: Major
> Fix For: JS-0.4.0
>
>

[jira] [Created] (ARROW-3256) [JS] File footer and message metadata is inconsistent

2018-09-17 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-3256:
---

 Summary: [JS] File footer and message metadata is inconsistent
 Key: ARROW-3256
 URL: https://issues.apache.org/jira/browse/ARROW-3256
 Project: Apache Arrow
  Issue Type: Bug
  Components: JavaScript
Reporter: Wes McKinney
 Fix For: JS-0.4.0


I added some assertions to the C++ library and found that the body lengths in 
the file footer and the IPC message were different.

{code}
##
JS producing, C++ consuming
##
==
Testing file 
/home/travis/build/apache/arrow/integration/data/struct_example.json
==
-- Creating binary inputs
node --no-warnings /home/travis/build/apache/arrow/js/bin/json-to-arrow.js -a 
/tmp/tmplbm3vbwz/3d2269c960f148b6b94e5f881c0bf9ca_struct_example.json_to_arrow 
-j /home/travis/build/apache/arrow/integration/data/struct_example.json
-- Validating file
/home/travis/build/apache/arrow/cpp-build/debug/json-integration-test 
--integration 
--arrow=/tmp/tmplbm3vbwz/3d2269c960f148b6b94e5f881c0bf9ca_struct_example.json_to_arrow
 --json=/home/travis/build/apache/arrow/integration/data/struct_example.json 
--mode=VALIDATE
Command failed: 
/home/travis/build/apache/arrow/cpp-build/debug/json-integration-test 
--integration 
--arrow=/tmp/tmplbm3vbwz/3d2269c960f148b6b94e5f881c0bf9ca_struct_example.json_to_arrow
 --json=/home/travis/build/apache/arrow/integration/data/struct_example.json 
--mode=VALIDATE
With output:
--
/home/travis/build/apache/arrow/cpp/src/arrow/ipc/reader.cc:581 Check failed: 
(message->body_length()) == (block.body_length)
{code}

It appears that the order of the lengths is flipped in

https://github.com/apache/arrow/blob/master/js/src/ipc/writer/binary.ts#L77
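A hypothetical illustration of this class of bug (the {{FileBlock}} shape and names below are invented for illustration, not the actual binary.ts code): if a writer passes the two lengths in the wrong order when recording a footer block, the footer's body length disagrees with the length in the IPC message itself, which is exactly what a consumer-side check can catch.

```typescript
// Illustrative file-block record where constructor argument order matters.
class FileBlock {
  constructor(
    public metadataLength: number,
    public bodyLength: number,
    public offset: number,
  ) {}
}

const metadataLength = 280;
const bodyLength = 1824;

// Correct argument order: the footer agrees with the IPC message.
const good = new FileBlock(metadataLength, bodyLength, 0);

// Swapped order: the footer now reports the wrong body length, the
// kind of mismatch a check like
// `message.bodyLength === block.bodyLength` detects on read.
const bad = new FileBlock(bodyLength, metadataLength, 0);

console.log(good.bodyLength === bodyLength); // true
console.log(bad.bodyLength === bodyLength);  // false
```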





[jira] [Updated] (ARROW-3196) Enable merge_arrow_py.py script to merge Parquet patches and set fix versions

2018-09-17 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-3196:
--
Labels: pull-request-available  (was: )

> Enable merge_arrow_py.py script to merge Parquet patches and set fix versions
> -
>
> Key: ARROW-3196
> URL: https://issues.apache.org/jira/browse/ARROW-3196
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> Follow up to ARROW-3075





[jira] [Resolved] (ARROW-3251) [C++] Conversion warnings in cast.cc

2018-09-17 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-3251.
-
   Resolution: Fixed
Fix Version/s: 0.11.0

Issue resolved by pull request 2575
[https://github.com/apache/arrow/pull/2575]

> [C++] Conversion warnings in cast.cc
> 
>
> Key: ARROW-3251
> URL: https://issues.apache.org/jira/browse/ARROW-3251
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This is with gcc 7.3.0 and {{-Wconversion}}.
> {code}
> ../src/arrow/compute/kernels/cast.cc: In instantiation of ‘void 
> arrow::compute::CastFunctor<O, I, typename 
> std::enable_if<is_number_downcast<O, I>::value>::type>::operator()(arrow::compute::FunctionContext*, const 
> arrow::compute::CastOptions&, const arrow::ArrayData&, arrow::ArrayData*) 
> [with O = arrow::Int64Type; I = arrow::DoubleType; typename 
> std::enable_if<is_number_downcast<O, I>::value>::type = void]’:
> ../src/arrow/compute/kernels/cast.cc:1105:1:   required from here
> ../src/arrow/compute/kernels/cast.cc:291:45: warning: conversion to ‘in_type 
> {aka double}’ from ‘long int’ may alter its value [-Wconversion]
>if (ARROW_PREDICT_FALSE(out_value != *in_data)) {
>~~^
> ../src/arrow/util/macros.h:37:50: note: in definition of macro 
> ‘ARROW_PREDICT_FALSE’
>  #define ARROW_PREDICT_FALSE(x) (__builtin_expect(x, 0))
>   ^
> ../src/arrow/compute/kernels/cast.cc:301:45: warning: conversion to ‘in_type 
> {aka double}’ from ‘long int’ may alter its value [-Wconversion]
>if (ARROW_PREDICT_FALSE(out_value != *in_data)) {
>~~^
> ../src/arrow/util/macros.h:37:50: note: in definition of macro 
> ‘ARROW_PREDICT_FALSE’
>  #define ARROW_PREDICT_FALSE(x) (__builtin_expect(x, 0))
> {code}





[jira] [Commented] (ARROW-3253) [CI] Investigate Azure CI

2018-09-17 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16618138#comment-16618138
 ] 

Wes McKinney commented on ARROW-3253:
-

Does Anaconda have all the Windows build deps? That would be OK with me if that 
works.

> [CI] Investigate Azure CI
> -
>
> Key: ARROW-3253
> URL: https://issues.apache.org/jira/browse/ARROW-3253
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++, Continuous Integration
>Reporter: Antoine Pitrou
>Priority: Major
>
> C++ builds on AppVeyor have become slower and slower. Some of it may be due 
> to the parquet-cpp repository merge, but I also suspect CPU resources on 
> AppVeyor have become much tighter.
> We should perhaps investigate Microsoft's Azure CI services as an alternative:
> https://azure.microsoft.com/en-gb/services/devops/pipelines/





[jira] [Commented] (ARROW-3253) [CI] Investigate Azure CI

2018-09-17 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16618059#comment-16618059
 ] 

Antoine Pitrou commented on ARROW-3253:
---

Anaconda is already backed by a CDN, I think. So perhaps we can just ditch the 
use of conda-forge (which would also make conda dependency resolution faster)?

> [CI] Investigate Azure CI
> -
>
> Key: ARROW-3253
> URL: https://issues.apache.org/jira/browse/ARROW-3253
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++, Continuous Integration
>Reporter: Antoine Pitrou
>Priority: Major
>





[jira] [Commented] (ARROW-3253) [CI] Investigate Azure CI

2018-09-17 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16618049#comment-16618049
 ] 

Wes McKinney commented on ARROW-3253:
-

It might be a dark path, but we could look at snapshotting the conda packages 
and putting them on a CDN.

> [CI] Investigate Azure CI
> -
>
> Key: ARROW-3253
> URL: https://issues.apache.org/jira/browse/ARROW-3253
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++, Continuous Integration
>Reporter: Antoine Pitrou
>Priority: Major
>





[jira] [Updated] (ARROW-3187) [Plasma] Make Plasma Log pluggable with glog

2018-09-17 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3187:

Fix Version/s: 0.12.0

> [Plasma] Make Plasma Log pluggable with glog
> 
>
> Key: ARROW-3187
> URL: https://issues.apache.org/jira/browse/ARROW-3187
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Yuhong Guo
>Assignee: Yuhong Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>  Time Spent: 6h 20m
>  Remaining Estimate: 0h
>
> Make Plasma pluggable with glog using Macro.





[jira] [Updated] (ARROW-3238) [Python] Can't read pyarrow string columns in fastparquet

2018-09-17 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3238:

Summary: [Python] Can't read pyarrow string columns in fastparquet  (was: 
Can't read pyarrow string columns in fastparquet)

> [Python] Can't read pyarrow string columns in fastparquet
> -
>
> Key: ARROW-3238
> URL: https://issues.apache.org/jira/browse/ARROW-3238
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Theo Walker
>Priority: Major
>  Labels: parquet
>
> Writing really long strings from pyarrow causes an exception when reading with fastparquet.
> {code:java}
> Traceback (most recent call last):
>   File "/Users/twalker/repos/cloud-atlas/diag/right.py", line 47, in <module>
>     read_fastparquet()
>   File "/Users/twalker/repos/cloud-atlas/diag/right.py", line 41, in read_fastparquet
>     dff = pf.to_pandas(['A'])
>   File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/api.py", line 426, in to_pandas
>     index=index, assign=parts)
>   File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/api.py", line 258, in read_row_group
>     scheme=self.file_scheme)
>   File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 344, in read_row_group
>     cats, selfmade, assign=assign)
>   File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 321, in read_row_group_arrays
>     catdef=out.get(name+'-catdef', None))
>   File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 235, in read_col
>     skip_nulls, selfmade=selfmade)
>   File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 99, in read_data_page
>     raw_bytes = _read_page(f, header, metadata)
>   File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 31, in _read_page
>     page_header.uncompressed_page_size)
> AssertionError: found 175532 raw bytes (expected 200026){code}
> If written with compression, it reports compression errors instead:
> {code:java}
> SNAPPY: snappy.UncompressError: Error while decompressing: invalid input
> GZIP: zlib.error: Error -3 while decompressing data: incorrect header 
> check{code}
>  
>  
> Minimal code to reproduce:
> {code:java}
> import os
> import pandas as pd
> import pyarrow
> import pyarrow.parquet as arrow_pq
> from fastparquet import ParquetFile
>
> # data to generate
> ROW_LENGTH = 4 # decreasing below 32750ish eliminates exception
> N_ROWS = 10
>
> # file write params
> ROW_GROUP_SIZE = 5 # Lower numbers eliminate exception, but strange data is read (e.g. Nones)
> FILENAME = 'test.parquet'
>
> def write_arrow():
>     df = pd.DataFrame({'A': ['A'*ROW_LENGTH for _ in range(N_ROWS)]})
>     if os.path.isfile(FILENAME):
>         os.remove(FILENAME)
>     arrow_table = pyarrow.Table.from_pandas(df)
>     arrow_pq.write_table(arrow_table,
>                          FILENAME,
>                          use_dictionary=False,
>                          compression='NONE',
>                          row_group_size=ROW_GROUP_SIZE)
>
> def read_arrow():
>     print "arrow:"
>     table2 = arrow_pq.read_table(FILENAME)
>     print table2.to_pandas().head()
>
> def read_fastparquet():
>     print "fastparquet:"
>     pf = ParquetFile(FILENAME)
>     dff = pf.to_pandas(['A'])
>     print dff.head()
>
> write_arrow()
> read_arrow()
> read_fastparquet(){code}
>  
> Versions:
> {code:java}
> fastparquet==0.1.6
> pyarrow==0.10.0
> pandas==0.22.0
> sys.version '2.7.15 |Anaconda custom (64-bit)| (default, May 1 2018, 
> 18:37:05) \n[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]'{code}
> Also opened issue here: https://github.com/dask/fastparquet/issues/375





[jira] [Updated] (ARROW-3238) [Python] Can't read pyarrow string columns in fastparquet

2018-09-17 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3238:

Labels: parquet  (was: )

> [Python] Can't read pyarrow string columns in fastparquet
> -
>
> Key: ARROW-3238
> URL: https://issues.apache.org/jira/browse/ARROW-3238
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Theo Walker
>Priority: Major
>  Labels: parquet
>





[jira] [Updated] (ARROW-3238) [Python] Can't read pyarrow string columns in fastparquet

2018-09-17 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3238:

Component/s: Python

> [Python] Can't read pyarrow string columns in fastparquet
> -
>
> Key: ARROW-3238
> URL: https://issues.apache.org/jira/browse/ARROW-3238
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Theo Walker
>Priority: Major
>  Labels: parquet
>





[jira] [Commented] (ARROW-3253) [CI] Investigate Azure CI

2018-09-17 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617954#comment-16617954
 ] 

Antoine Pitrou commented on ARROW-3253:
---

Ironically, the toolchain is also rather slow to fetch... at least when using 
conda-forge.

> [CI] Investigate Azure CI
> -
>
> Key: ARROW-3253
> URL: https://issues.apache.org/jira/browse/ARROW-3253
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++, Continuous Integration
>Reporter: Antoine Pitrou
>Priority: Major
>





[jira] [Created] (ARROW-3255) [C++/Python] Migrate Travis CI jobs off Xcode 6.4

2018-09-17 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-3255:
---

 Summary: [C++/Python] Migrate Travis CI jobs off Xcode 6.4
 Key: ARROW-3255
 URL: https://issues.apache.org/jira/browse/ARROW-3255
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.12.0


Travis CI says they are winding down their support for Xcode 6.4, which we use 
in our CI as the minimum Xcode version that can build the Arrow libraries:

"Running builds with Xcode 6.4 in Travis CI is deprecated and will be removed 
in January 2019.
If Xcode 6.4 is critical to your builds, please contact our support team at 
supp...@travis-ci.com to discuss options.
Services are not supported on osx"

We should decide whether we want to continue supporting this version of Xcode, 
and what the implications are if we do not.





[jira] [Created] (ARROW-3254) [C++] Add option to ADD_ARROW_TEST to compose a test executable from multiple .cc files containing unit tests

2018-09-17 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-3254:
---

 Summary: [C++] Add option to ADD_ARROW_TEST to compose a test 
executable from multiple .cc files containing unit tests
 Key: ARROW-3254
 URL: https://issues.apache.org/jira/browse/ARROW-3254
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.12.0


Currently there is a 1-1 correspondence between a .cc file containing unit 
tests and a test executable. There are good reasons (readability, code 
organization) to split a large test suite among many files. But there are 
downsides:

* Linking test executables is slow, especially on Windows
* Test executables take up quite a bit of space (the debug/ directory on Linux 
after a full build is ~1GB)

I suggest enabling ADD_ARROW_TEST to accept a list of files to be built 
together into a single test executable. This will allow us to combine a number 
of our unit tests and save time and space.
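A purely hypothetical sketch of what such a call could look like; the SOURCES keyword and the file names are invented for illustration, since the actual signature is not specified here:

```cmake
# Today: one test executable per .cc file.
ADD_ARROW_TEST(ipc-read-write-test)

# Proposed (hypothetical signature): several .cc files linked into a
# single test executable, cutting link time and on-disk size.
ADD_ARROW_TEST(arrow-ipc-test
  SOURCES ipc-read-write-test.cc ipc-json-test.cc ipc-metadata-test.cc)
```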





[jira] [Commented] (ARROW-3253) [CI] Investigate Azure CI

2018-09-17 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617932#comment-16617932
 ] 

Wes McKinney commented on ARROW-3253:
-

I think our CI should always use the toolchain, for performance, and we should 
move our "thirdparty testing" to a Crossbow job, so we can verify nightly or on 
demand that all the projects build automatically from source.

> [CI] Investigate Azure CI
> -
>
> Key: ARROW-3253
> URL: https://issues.apache.org/jira/browse/ARROW-3253
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++, Continuous Integration
>Reporter: Antoine Pitrou
>Priority: Major
>





[jira] [Commented] (ARROW-3253) [CI] Investigate Azure CI

2018-09-17 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617931#comment-16617931
 ] 

Wes McKinney commented on ARROW-3253:
-

Ouch. That build has some other problems -- it's building Thrift from source, 
which is really slow:

{code}
-- THRIFT_HOME: 
-- Thrift compiler/libraries NOT found:  (THRIFT_INCLUDE_DIR-NOTFOUND, 
THRIFT_STATIC_LIB-NOTFOUND). Looked in system search paths.
-- Thrift include dir: 
C:/projects/arrow/cpp/build/thrift_ep/src/thrift_ep-install/include
-- Thrift static library: 
C:/projects/arrow/cpp/build/thrift_ep/src/thrift_ep-install/lib/thriftmd.lib
-- Thrift compiler: 
C:/projects/arrow/cpp/build/thrift_ep/src/thrift_ep-install/bin/thrift
-- Thrift version: 0.11.0
{code}

> [CI] Investigate Azure CI
> -
>
> Key: ARROW-3253
> URL: https://issues.apache.org/jira/browse/ARROW-3253
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++, Continuous Integration
>Reporter: Antoine Pitrou
>Priority: Major
>
> C++ builds on AppVeyor have become slower and slower. Some of it may be due 
> to the parquet-cpp repository merge, but I also suspect CPU resources on 
> AppVeyor have become much tighter.
> We should perhaps investigate Microsoft's Azure CI services as an alternative:
> https://azure.microsoft.com/en-gb/services/devops/pipelines/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-3183) [Python] get_library_dirs on Windows can give the wrong directory

2018-09-17 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-3183:
---

Assignee: Victor Uriarte

> [Python] get_library_dirs on Windows can give the wrong directory
> -
>
> Key: ARROW-3183
> URL: https://issues.apache.org/jira/browse/ARROW-3183
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0, 0.10.0
> Environment: Windows 10
> Anaconda Python 3.6
>Reporter: Victor Uriarte
>Assignee: Victor Uriarte
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Python Version: Anaconda 3.6
>  PyArrow Version: 0.9.0 and 0.10.0
>  Installed by: conda
> {{The function pa.get_library_dirs() points to the wrong directory}}
> {{import pyarrow as pa}}
>  {{print(pa.get_library_dirs())}}
> returns (Notice the extra lib in the middle of the 2nd string): 
> {{['C:\\Anaconda\\lib\\site-packages\\pyarrow', 
> 'C:\\Anaconda\\lib\\Library\\lib']}}
> but it should be:
> {{['C:\\Anaconda\\lib\\site-packages\\pyarrow', 
> 'C:\\Anaconda\\Library\\lib']}}
> Not sure if this is dependent on how `pyarrow` was installed on the system.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-3183) [Python] get_library_dirs on Windows can give the wrong directory

2018-09-17 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-3183.
-
Resolution: Fixed

Issue resolved by pull request 2518
[https://github.com/apache/arrow/pull/2518]

> [Python] get_library_dirs on Windows can give the wrong directory
> -
>
> Key: ARROW-3183
> URL: https://issues.apache.org/jira/browse/ARROW-3183
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0, 0.10.0
> Environment: Windows 10
> Anaconda Python 3.6
>Reporter: Victor Uriarte
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Python Version: Anaconda 3.6
>  PyArrow Version: 0.9.0 and 0.10.0
>  Installed by: conda
> {{The function pa.get_library_dirs() points to the wrong directory}}
> {{import pyarrow as pa}}
>  {{print(pa.get_library_dirs())}}
> returns (Notice the extra lib in the middle of the 2nd string): 
> {{['C:\\Anaconda\\lib\\site-packages\\pyarrow', 
> 'C:\\Anaconda\\lib\\Library\\lib']}}
> but it should be:
> {{['C:\\Anaconda\\lib\\site-packages\\pyarrow', 
> 'C:\\Anaconda\\Library\\lib']}}
> Not sure if this is dependent on how `pyarrow` was installed on the system.
>  
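A minimal sketch of the kind of fix involved: derive the conda {{Library\lib}} directory from the interpreter prefix instead of appending to the stdlib directory. The helper name is illustrative, not the actual change in pull request 2518.

```python
import os
import sys

def conda_windows_library_dir(prefix=None):
    # On conda for Windows, native import libraries live under
    # <prefix>\Library\lib. Joining 'Library/lib' onto the stdlib dir
    # (<prefix>\lib) would yield the bogus <prefix>\lib\Library\lib
    # path reported above.
    prefix = prefix if prefix is not None else sys.prefix
    return os.path.join(prefix, 'Library', 'lib')
```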



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3253) [CI] Investigate Azure CI

2018-09-17 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617908#comment-16617908
 ] 

Antoine Pitrou commented on ARROW-3253:
---

It seems the C++ build phase is ballooning. See here: 19 minutes to end up with 
a compilation failure (no unit tests executed):
https://ci.appveyor.com/project/pitrou/arrow/build/1.0.732

> [CI] Investigate Azure CI
> -
>
> Key: ARROW-3253
> URL: https://issues.apache.org/jira/browse/ARROW-3253
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++, Continuous Integration
>Reporter: Antoine Pitrou
>Priority: Major
>
> C++ builds on AppVeyor have become slower and slower. Some of it may be due 
> to the parquet-cpp repository merge, but I also suspect CPU resources on 
> AppVeyor have become much tighter.
> We should perhaps investigate Microsoft's Azure CI services as an alternative:
> https://azure.microsoft.com/en-gb/services/devops/pipelines/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-3190) [C++] "WriteableFile" is misspelled, should be renamed "WritableFile" with deprecation for old name

2018-09-17 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-3190.
-
Resolution: Fixed

Issue resolved by pull request 2569
[https://github.com/apache/arrow/pull/2569]

> [C++] "WriteableFile" is misspelled, should be renamed "WritableFile" with 
> deprecation for old name
> ---
>
> Key: ARROW-3190
> URL: https://issues.apache.org/jira/browse/ARROW-3190
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> See e.g. 
> https://docs.oracle.com/javase/7/docs/api/java/nio/channels/WritableByteChannel.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3253) [CI] Investigate Azure CI

2018-09-17 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617885#comment-16617885
 ] 

Wes McKinney commented on ARROW-3253:
-

I suggest splitting the C++ unit tests into a separate build from the Python 
unit tests as one way to speed things up. The build times aren't _too_ bad yet, 
though:

https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/build/1.0.7968

> [CI] Investigate Azure CI
> -
>
> Key: ARROW-3253
> URL: https://issues.apache.org/jira/browse/ARROW-3253
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++, Continuous Integration
>Reporter: Antoine Pitrou
>Priority: Major
>
> C++ builds on AppVeyor have become slower and slower. Some of it may be due 
> to the parquet-cpp repository merge, but I also suspect CPU resources on 
> AppVeyor have become much tighter.
> We should perhaps investigate Microsoft's Azure CI services as an alternative:
> https://azure.microsoft.com/en-gb/services/devops/pipelines/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3253) [CI] Investigate Azure CI

2018-09-17 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-3253:
-

 Summary: [CI] Investigate Azure CI
 Key: ARROW-3253
 URL: https://issues.apache.org/jira/browse/ARROW-3253
 Project: Apache Arrow
  Issue Type: Task
  Components: C++, Continuous Integration
Reporter: Antoine Pitrou


C++ builds on AppVeyor have become slower and slower. Some of it may be due to 
the parquet-cpp repository merge, but I also suspect CPU resources on AppVeyor 
have become much tighter.

We should perhaps investigate Microsoft's Azure CI services as an alternative:
https://azure.microsoft.com/en-gb/services/devops/pipelines/




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-300) [Format] Add buffer compression option to IPC file format

2018-09-17 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617883#comment-16617883
 ] 

Wes McKinney commented on ARROW-300:


Moving this to 0.12. I will make a proposal for compressed record batches after 
the 0.11 release goes out.

My gut instinct on this would be to create a {{CompressedBuffer}} metadata type 
and a {{CompressedRecordBatch}} message. Some reasons:

* Does not complicate or bloat the existing RecordBatch message type
* Support buffer-level compression (each buffer can be compressed or not)

Readers can choose to materialize right away or on demand -- in C++, we can 
create an {{arrow::CompressedRecordBatch}} class that does late materialization 
if we want.

This does not necessarily accommodate other kinds of type-specific compression, 
like RLE encoding, though it may be that RLE can be used on the values buffer of 
primitive types, e.g.:

{code}
CompressedBuffer {
  CompressionType type;
  int64 offset;
  int64 compressed_size;
  int64 uncompressed_size;
}
{code}

So if we wanted to use the Parquet RLE_BITPACKED_HYBRID compression style for 
integers, say, we could do that.

Another question here is how to handle compressions that may have additional 
parameters. {{CompressionType}} or {{Compression}} could be a union, but that 
would make the message sizes larger (though maybe that's OK).
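As a rough illustration of per-buffer compression with the metadata fields sketched above, here is a Python toy where zlib stands in for whichever codecs are ultimately chosen. The names mirror the {{CompressedBuffer}} sketch and are not a proposed Arrow API.

```python
import zlib
from dataclasses import dataclass

@dataclass
class CompressedBufferMeta:
    # Mirrors the CompressedBuffer fields sketched above.
    compression_type: str
    offset: int
    compressed_size: int
    uncompressed_size: int

def compress_buffer(data: bytes, offset: int = 0):
    """Compress one buffer and record the metadata a reader needs to
    materialize it later (right away or on demand)."""
    compressed = zlib.compress(data)
    meta = CompressedBufferMeta('zlib', offset, len(compressed), len(data))
    return meta, compressed

def materialize(meta: CompressedBufferMeta, payload: bytes) -> bytes:
    out = zlib.decompress(payload)
    assert len(out) == meta.uncompressed_size
    return out
```

Because each buffer carries its own metadata, a record batch can mix compressed and uncompressed buffers, which is the buffer-level flexibility noted above.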

> [Format] Add buffer compression option to IPC file format
> -
>
> Key: ARROW-300
> URL: https://issues.apache.org/jira/browse/ARROW-300
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Format
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.12.0
>
>
> It may be useful if data is to be sent over the wire to compress the data 
> buffers themselves as they're being written in the file layout.
> I would propose that we keep this extremely simple with a global buffer 
> compression setting in the file Footer. Probably only two compressors worth 
> supporting out of the box would be zlib (higher compression ratios) and lz4 
> (better performance).
> What does everyone think?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-300) [Format] Add buffer compression option to IPC file format

2018-09-17 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-300:
---
Fix Version/s: (was: 0.13.0)
   0.12.0

> [Format] Add buffer compression option to IPC file format
> -
>
> Key: ARROW-300
> URL: https://issues.apache.org/jira/browse/ARROW-300
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Format
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.12.0
>
>
> It may be useful if data is to be sent over the wire to compress the data 
> buffers themselves as they're being written in the file layout.
> I would propose that we keep this extremely simple with a global buffer 
> compression setting in the file Footer. Probably only two compressors worth 
> supporting out of the box would be zlib (higher compression ratios) and lz4 
> (better performance).
> What does everyone think?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-300) [Format] Add buffer compression option to IPC file format

2018-09-17 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-300:
--

Assignee: Wes McKinney

> [Format] Add buffer compression option to IPC file format
> -
>
> Key: ARROW-300
> URL: https://issues.apache.org/jira/browse/ARROW-300
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Format
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.12.0
>
>
> It may be useful if data is to be sent over the wire to compress the data 
> buffers themselves as they're being written in the file layout.
> I would propose that we keep this extremely simple with a global buffer 
> compression setting in the file Footer. Probably only two compressors worth 
> supporting out of the box would be zlib (higher compression ratios) and lz4 
> (better performance).
> What does everyone think?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-3249) [Python] Run flake8 on integration_test.py

2018-09-17 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-3249:
---

Assignee: Wes McKinney

> [Python] Run flake8 on integration_test.py
> --
>
> Key: ARROW-3249
> URL: https://issues.apache.org/jira/browse/ARROW-3249
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.11.0
>
>
> We should keep this code clean, too



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-3196) Enable merge_arrow_py.py script to merge Parquet patches and set fix versions

2018-09-17 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-3196:
---

Assignee: Wes McKinney

> Enable merge_arrow_py.py script to merge Parquet patches and set fix versions
> -
>
> Key: ARROW-3196
> URL: https://issues.apache.org/jira/browse/ARROW-3196
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.11.0
>
>
> Follow up to ARROW-3075



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-3198) [Website] Blog post regarding Arrow-Parquet C++ monorepo effort

2018-09-17 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-3198:
---

Assignee: Wes McKinney

> [Website] Blog post regarding Arrow-Parquet C++ monorepo effort
> ---
>
> Key: ARROW-3198
> URL: https://issues.apache.org/jira/browse/ARROW-3198
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-3076) [Website] Add Google Analytics tags to generated API documentation

2018-09-17 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-3076:
---

Assignee: Wes McKinney

> [Website] Add Google Analytics tags to generated API documentation
> --
>
> Key: ARROW-3076
> URL: https://issues.apache.org/jira/browse/ARROW-3076
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Website
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.11.0
>
>
> It would be helpful to see which parts of the documentation are seeing traffic



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-3197) [C++] Add instructions to cpp/README.md about Parquet-only development and Arrow+Parquet

2018-09-17 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-3197:
---

Assignee: Wes McKinney

> [C++] Add instructions to cpp/README.md about Parquet-only development and 
> Arrow+Parquet
> 
>
> Key: ARROW-3197
> URL: https://issues.apache.org/jira/browse/ARROW-3197
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: parquet
> Fix For: 0.11.0
>
>
> There are two distinct development workflows



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3056) [Python] Indicate in NativeFile docstrings methods that are part of the RawIOBase API but not implemented

2018-09-17 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-3056:
--
Labels: pull-request-available  (was: )

> [Python] Indicate in NativeFile docstrings methods that are part of the 
> RawIOBase API but not implemented
> -
>
> Key: ARROW-3056
> URL: https://issues.apache.org/jira/browse/ARROW-3056
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> see https://github.com/apache/arrow/issues/2422



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-3212) [C++] Create deterministic IPC metadata

2018-09-17 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-3212:
---

Assignee: Wes McKinney

> [C++] Create deterministic IPC metadata
> ---
>
> Key: ARROW-3212
> URL: https://issues.apache.org/jira/browse/ARROW-3212
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.11.0
>
>
> Currently, the number of padding bytes written after the IPC metadata header 
> depends on the current position of the {{OutputStream}} passed. So if the 
> message begins on an unaligned (not a multiple of 8) offset, the content of 
> the metadata will differ from what it would be at an aligned offset. This 
> seems like a leaky abstraction -- aligning the stream should probably be 
> handled separately from writing the IPC protocol.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-3056) [Python] Indicate in NativeFile docstrings methods that are part of the RawIOBase API but not implemented

2018-09-17 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-3056:
---

Assignee: Wes McKinney

> [Python] Indicate in NativeFile docstrings methods that are part of the 
> RawIOBase API but not implemented
> -
>
> Key: ARROW-3056
> URL: https://issues.apache.org/jira/browse/ARROW-3056
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.11.0
>
>
> see https://github.com/apache/arrow/issues/2422



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-2600) [Python] Add additional LocalFileSystem filesystem methods

2018-09-17 Thread Alex Hagerman (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Hagerman reassigned ARROW-2600:


Assignee: (was: Alex Hagerman)

> [Python] Add additional LocalFileSystem filesystem methods
> --
>
> Key: ARROW-2600
> URL: https://issues.apache.org/jira/browse/ARROW-2600
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Alex Hagerman
>Priority: Minor
>  Labels: filesystem, pull-request-available
> Fix For: 0.12.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Related to https://issues.apache.org/jira/browse/ARROW-1319 I noticed the 
> methods Martin listed are also not part of the LocalFileSystem class.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-25) [C++] Implement delimited file scanner / CSV reader

2018-09-17 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-25:

Labels: csv pull-request-available  (was: csv)

> [C++] Implement delimited file scanner / CSV reader
> ---
>
> Key: ARROW-25
> URL: https://issues.apache.org/jira/browse/ARROW-25
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: csv, pull-request-available
>
> Like Parquet and binary file formats, text files will be an important data 
> medium for converting to and from in-memory Arrow data. 
> pandas has some (Apache-compatible) business logic we can learn from here (as 
> one of the gold-standard CSV readers in production use)
> https://github.com/pydata/pandas/blob/master/pandas/src/parser/tokenizer.h
> https://github.com/pydata/pandas/blob/master/pandas/parser.pyx
> While very fast, this should be largely written from scratch to target 
> the Arrow memory layout, but we can reuse certain aspects like the tokenizer 
> DFA (which originally came from the Python interpreter's csv module 
> implementation)
> https://github.com/pydata/pandas/blob/master/pandas/src/parser/tokenizer.c#L713
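The state-machine approach mentioned above can be sketched as a toy tokenizer with three states: inside a field, inside quotes, and just after a quote inside quotes. This is an illustration of the technique, not the pandas tokenizer.

```python
def tokenize_csv_line(line, delimiter=',', quotechar='"'):
    """DFA-style tokenizer for a single CSV record (no embedded newlines)."""
    FIELD, QUOTED, QUOTE = 0, 1, 2   # states
    state, field, fields = FIELD, [], []
    for ch in line:
        if state == FIELD:
            if ch == delimiter:
                fields.append(''.join(field)); field = []
            elif ch == quotechar and not field:
                state = QUOTED           # field opens with a quote
            else:
                field.append(ch)
        elif state == QUOTED:
            if ch == quotechar:
                state = QUOTE            # maybe closing, maybe escaped
            else:
                field.append(ch)
        else:  # QUOTE: just saw a quote while inside a quoted field
            if ch == quotechar:          # doubled quote -> literal quote
                field.append(quotechar); state = QUOTED
            elif ch == delimiter:
                fields.append(''.join(field)); field = []; state = FIELD
            else:                        # malformed input; take literally
                field.append(ch); state = FIELD
    fields.append(''.join(field))
    return fields
```

The production version would additionally stream across chunk boundaries and carry state between calls, which is where most of the tokenizer's complexity lives.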



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3251) [C++] Conversion warnings in cast.cc

2018-09-17 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-3251:
--
Labels: pull-request-available  (was: )

> [C++] Conversion warnings in cast.cc
> 
>
> Key: ARROW-3251
> URL: https://issues.apache.org/jira/browse/ARROW-3251
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>
> This is with gcc 7.3.0 and {{-Wconversion}}.
> {code}
> ../src/arrow/compute/kernels/cast.cc: In instantiation of ‘void 
> arrow::compute::CastFunctor<O, I, typename 
> std::enable_if<is_float_truncate<O, I>::value>::type>::operator()(arrow::compute::FunctionContext*, const 
> arrow::compute::CastOptions&, const arrow::ArrayData&, arrow::ArrayData*) 
> [with O = arrow::Int64Type; I = arrow::DoubleType; typename 
> std::enable_if<is_float_truncate<O, I>::value>::type = void]’:
> ../src/arrow/compute/kernels/cast.cc:1105:1:   required from here
> ../src/arrow/compute/kernels/cast.cc:291:45: warning: conversion to ‘in_type 
> {aka double}’ from ‘long int’ may alter its value [-Wconversion]
>if (ARROW_PREDICT_FALSE(out_value != *in_data)) {
>~~^
> ../src/arrow/util/macros.h:37:50: note: in definition of macro 
> ‘ARROW_PREDICT_FALSE’
>  #define ARROW_PREDICT_FALSE(x) (__builtin_expect(x, 0))
>   ^
> ../src/arrow/compute/kernels/cast.cc:301:45: warning: conversion to ‘in_type 
> {aka double}’ from ‘long int’ may alter its value [-Wconversion]
>if (ARROW_PREDICT_FALSE(out_value != *in_data)) {
>~~^
> ../src/arrow/util/macros.h:37:50: note: in definition of macro 
> ‘ARROW_PREDICT_FALSE’
>  #define ARROW_PREDICT_FALSE(x) (__builtin_expect(x, 0))
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3252) [C++] Do not hard code the "v" part of versions in thirdparty toolchain

2018-09-17 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-3252:
---

 Summary: [C++] Do not hard code the "v" part of versions in 
thirdparty toolchain
 Key: ARROW-3252
 URL: https://issues.apache.org/jira/browse/ARROW-3252
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.11.0


When I changed Flatbuffers from "v1.8.0" to a git hash, it broke the dependency 
download script. We should move all the version strings to versions.txt rather 
than having some "v${FOO_URL}" in ThirdpartyToolchain.cmake.
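The intended shape can be sketched as a single versions file of {{NAME=VALUE}} lines that both CMake and the download script read, so the "v" prefix (or a git hash) appears in exactly one place. A minimal Python reader under that assumed file format:

```python
def load_versions(text):
    # Parse NAME=VALUE lines, skipping blanks and comments; the value
    # carries any 'v' prefix or git hash verbatim, so no other file
    # needs to know which form a given dependency uses.
    versions = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith('#'):
            name, _, value = line.partition('=')
            versions[name.strip()] = value.strip()
    return versions
```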



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-3251) [C++] Conversion warnings in cast.cc

2018-09-17 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-3251:
-

Assignee: Antoine Pitrou

> [C++] Conversion warnings in cast.cc
> 
>
> Key: ARROW-3251
> URL: https://issues.apache.org/jira/browse/ARROW-3251
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>
> This is with gcc 7.3.0 and {{-Wconversion}}.
> {code}
> ../src/arrow/compute/kernels/cast.cc: In instantiation of ‘void 
> arrow::compute::CastFunctor<O, I, typename 
> std::enable_if<is_float_truncate<O, I>::value>::type>::operator()(arrow::compute::FunctionContext*, const 
> arrow::compute::CastOptions&, const arrow::ArrayData&, arrow::ArrayData*) 
> [with O = arrow::Int64Type; I = arrow::DoubleType; typename 
> std::enable_if<is_float_truncate<O, I>::value>::type = void]’:
> ../src/arrow/compute/kernels/cast.cc:1105:1:   required from here
> ../src/arrow/compute/kernels/cast.cc:291:45: warning: conversion to ‘in_type 
> {aka double}’ from ‘long int’ may alter its value [-Wconversion]
>if (ARROW_PREDICT_FALSE(out_value != *in_data)) {
>~~^
> ../src/arrow/util/macros.h:37:50: note: in definition of macro 
> ‘ARROW_PREDICT_FALSE’
>  #define ARROW_PREDICT_FALSE(x) (__builtin_expect(x, 0))
>   ^
> ../src/arrow/compute/kernels/cast.cc:301:45: warning: conversion to ‘in_type 
> {aka double}’ from ‘long int’ may alter its value [-Wconversion]
>if (ARROW_PREDICT_FALSE(out_value != *in_data)) {
>~~^
> ../src/arrow/util/macros.h:37:50: note: in definition of macro 
> ‘ARROW_PREDICT_FALSE’
>  #define ARROW_PREDICT_FALSE(x) (__builtin_expect(x, 0))
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3251) [C++] Conversion warnings in cast.cc

2018-09-17 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-3251:
-

 Summary: [C++] Conversion warnings in cast.cc
 Key: ARROW-3251
 URL: https://issues.apache.org/jira/browse/ARROW-3251
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Antoine Pitrou


This is with gcc 7.3.0 and {{-Wconversion}}.

{code}
../src/arrow/compute/kernels/cast.cc: In instantiation of ‘void 
arrow::compute::CastFunctor<O, I, typename 
std::enable_if<is_float_truncate<O, I>::value>::type>::operator()(arrow::compute::FunctionContext*, const 
arrow::compute::CastOptions&, const arrow::ArrayData&, arrow::ArrayData*) [with 
O = arrow::Int64Type; I = arrow::DoubleType; typename 
std::enable_if<is_float_truncate<O, I>::value>::type = void]’:
../src/arrow/compute/kernels/cast.cc:1105:1:   required from here
../src/arrow/compute/kernels/cast.cc:291:45: warning: conversion to ‘in_type 
{aka double}’ from ‘long int’ may alter its value [-Wconversion]
   if (ARROW_PREDICT_FALSE(out_value != *in_data)) {
   ~~^
../src/arrow/util/macros.h:37:50: note: in definition of macro 
‘ARROW_PREDICT_FALSE’
 #define ARROW_PREDICT_FALSE(x) (__builtin_expect(x, 0))
  ^
../src/arrow/compute/kernels/cast.cc:301:45: warning: conversion to ‘in_type 
{aka double}’ from ‘long int’ may alter its value [-Wconversion]
   if (ARROW_PREDICT_FALSE(out_value != *in_data)) {
   ~~^
../src/arrow/util/macros.h:37:50: note: in definition of macro 
‘ARROW_PREDICT_FALSE’
 #define ARROW_PREDICT_FALSE(x) (__builtin_expect(x, 0))
{code}
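The comparison that triggers the warning is the safe-cast truncation check: the round-tripped integer is compared against the original double. A quick Python illustration of why such a check exists at all -- doubles have a 53-bit significand, so large int64 values need not round-trip:

```python
big = 2 ** 53
assert float(big) == big           # 2**53 is exactly representable as a double
assert float(big + 1) != big + 1   # 2**53 + 1 is not: it rounds to 2**53
assert int(float(big + 1)) == big  # the round trip silently loses the +1
```

Silencing the warning therefore means making the int64-to-double conversion in the comparison explicit, not removing the check.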



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-3157) [C++] Improve buffer creation for typed data

2018-09-17 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-3157.
-
Resolution: Fixed

Issue resolved by pull request 2566
[https://github.com/apache/arrow/pull/2566]

> [C++] Improve buffer creation for typed data
> 
>
> Key: ARROW-3157
> URL: https://issues.apache.org/jira/browse/ARROW-3157
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Philipp Moritz
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available, usability
> Fix For: 0.11.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> While looking into [https://github.com/apache/arrow/pull/2481,] I noticed 
> this pattern:
> {code:java}
> const uint8_t* bytes_array = reinterpret_cast<const uint8_t*>(input);
> auto buffer = std::make_shared<Buffer>(bytes_array, 
> sizeof(float) * input_length);{code}
> It's not the end of the world but seems a little verbose to me. It would be 
> great to have something like this:
> {code:java}
> auto buffer = MakeBuffer(input, input_length);{code}
> I couldn't find it -- does it already exist somewhere? Any thoughts on the 
> API? Potentially, specializations to make a buffer out of a std::vector<T> 
> would also be helpful.
>  
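A Python analogue of the requested convenience, using {{memoryview}} to reinterpret typed data as raw bytes the way the {{reinterpret_cast}} above does. The helper name is illustrative; this is not the Arrow C++ API.

```python
import array

def make_buffer(values):
    # Zero-copy: reinterpret the typed array's memory as raw bytes,
    # analogous to wrapping a float* in an arrow::Buffer.
    return memoryview(values).cast('B')

data = array.array('f', [1.0, 2.0, 3.0])
buf = make_buffer(data)
```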



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-3227) [Python] NativeFile.write shouldn't accept unicode strings

2018-09-17 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-3227.
-
Resolution: Fixed

Issue resolved by pull request 2570
[https://github.com/apache/arrow/pull/2570]

> [Python] NativeFile.write shouldn't accept unicode strings
> --
>
> Key: ARROW-3227
> URL: https://issues.apache.org/jira/browse/ARROW-3227
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.10.0
>Reporter: Antoine Pitrou
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Arrow files are binary, but for some reason {{NativeFile.write}} silently 
> converts unicode strings to bytes.
> {code:python}
> >>> b = io.BytesIO()
> >>> b.write("foo")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> b.write("foo")
> TypeError: a bytes-like object is required, not 'str'
> >>> f = pa.PythonFile(b)
> >>> f.write("foo")
> >>> b.getvalue()
> b'foo'
> >>> f.write("")
> >>> b.getvalue()
> b'foo\xf0\x9f\x98\x80'
> {code}
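The fix direction can be sketched as a thin wrapper that mirrors {{io.RawIOBase}} semantics -- reject {{str}} instead of silently encoding it. This is an illustrative helper, not the actual pyarrow change from pull request 2570.

```python
import io

def checked_write(sink, data):
    # Match io.RawIOBase: binary sinks accept bytes-like objects only.
    if isinstance(data, str):
        raise TypeError("a bytes-like object is required, not 'str'")
    return sink.write(data)

b = io.BytesIO()
checked_write(b, b'foo')
```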



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-3241) [Plasma] test_plasma_list test failure on Ubuntu 14.04

2018-09-17 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-3241.
-
Resolution: Fixed

Resolved by 
https://github.com/apache/arrow/commit/c698be339b96aeb74763d70de1cf4c8789148824

> [Plasma] test_plasma_list test failure on Ubuntu 14.04
> --
>
> Key: ARROW-3241
> URL: https://issues.apache.org/jira/browse/ARROW-3241
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Blocker
> Fix For: 0.11.0
>
>
> This test fails consistently for me on Ubuntu 14.04 / Python 3.6.5
> {code}
> pyarrow/tests/test_plasma.py::test_plasma_list FAILED  [ 83%]
> =============================== captured stderr ================================
> Allowing the Plasma store to use up to 0.1GB of memory.
> Starting object store with directory /dev/shm and huge page support disabled
> =================================== traceback ==================================
> @pytest.mark.plasma
> def test_plasma_list():
> import pyarrow.plasma as plasma
> 
> with plasma.start_plasma_store(
> plasma_store_memory=DEFAULT_PLASMA_STORE_MEMORY) \
> as (plasma_store_name, p):
> plasma_client = plasma.connect(plasma_store_name, "", 0)
> 
> # Test sizes
> u, _, _ = create_object(plasma_client, 11, metadata_size=7, 
> seal=False)
> l1 = plasma_client.list()
> assert l1[u]["data_size"] == 11
> assert l1[u]["metadata_size"] == 7
> 
> # Test ref_count
> v = plasma_client.put(np.zeros(3))
> l2 = plasma_client.list()
> # Ref count has already been released
> assert l2[v]["ref_count"] == 0
> a = plasma_client.get(v)
> l3 = plasma_client.list()
> >   assert l3[v]["ref_count"] == 1
> E   assert 0 == 1
> pyarrow/tests/test_plasma.py:825: AssertionError
> ================================= entering PDB =================================
> > /home/wesm/code/arrow/python/pyarrow/tests/test_plasma.py(825)test_plasma_list()
> -> assert l3[v]["ref_count"] == 1
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1669) [C++] Consider adding Abseil (Google C++11 standard library extensions) to toolchain

2018-09-17 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617698#comment-16617698
 ] 

Antoine Pitrou commented on ARROW-1669:
---

The baseline glibc version for manylinux1 is too old for Abseil, see ARROW-2461.

> [C++] Consider adding Abseil (Google C++11 standard library extensions) to 
> toolchain
> 
>
> Key: ARROW-1669
> URL: https://issues.apache.org/jira/browse/ARROW-1669
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> Google has released a library of C++11-compliant extensions to the STL that 
> may help make a lot of Arrow code simpler:
> https://github.com/abseil/abseil-cpp/
> This code is not header-only, so it would require some effort to add to the 
> toolchain at the moment, since Abseil only supports the Bazel build system.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2461) [Python] Build wheels for manylinux2010 tag

2018-09-17 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617695#comment-16617695
 ] 

Antoine Pitrou commented on ARROW-2461:
---

I also asked on distutils-sig:
https://mail.python.org/mm3/archives/list/distutils-...@python.org/thread/M4MSVY5MPAPXFWHH4PBLE6PEBPOBIA44/

> [Python] Build wheels for manylinux2010 tag
> ---
>
> Key: ARROW-2461
> URL: https://issues.apache.org/jira/browse/ARROW-2461
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Uwe L. Korn
>Priority: Blocker
> Fix For: 0.12.0
>
>
> There is now work in progress on an updated manylinux tag based on CentOS 6. 
> We should provide wheels for this tag and the old {{manylinux1}} tag for one 
> release, and then switch to the new tag in the following release. This should 
> also enable us to raise the minimum compiler requirement to gcc 4.9 (or 
> higher once conda-forge has migrated to a newer compiler).
> The relevant PEP is https://www.python.org/dev/peps/pep-0571/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3250) [C++] Create Buffer implementation that takes ownership for the memory from a std::string via std::move

2018-09-17 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-3250:
---

 Summary: [C++] Create Buffer implementation that takes ownership 
for the memory from a std::string via std::move
 Key: ARROW-3250
 URL: https://issues.apache.org/jira/browse/ARROW-3250
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.12.0


There are instances where it is useful to be able to take ownership of a 
{{std::string}} owned by another object and expose it as an {{arrow::Buffer}}, 
so we could have an interface like {{StlStringBuffer(std::string&& input)}} and 
transfer the memory that way
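
A minimal standalone sketch of the idea (this is not the actual Arrow API: the 
{{arrow::Buffer}} base class is omitted to keep the example self-contained, and 
the class name {{StlStringBuffer}} is simply taken from the description above):

```cpp
#include <cstdint>
#include <string>
#include <utility>

// Sketch only: a buffer that takes ownership of a std::string's memory
// via std::move, so no copy of the payload is made. In Arrow this would
// derive from arrow::Buffer; that base class is omitted here so the
// example compiles on its own.
class StlStringBuffer {
 public:
  explicit StlStringBuffer(std::string&& input) : input_(std::move(input)) {}

  // The data pointer and size view the moved-in string directly.
  const uint8_t* data() const {
    return reinterpret_cast<const uint8_t*>(input_.data());
  }
  int64_t size() const { return static_cast<int64_t>(input_.size()); }

 private:
  std::string input_;  // owns the memory for the buffer's lifetime
};
```

After {{StlStringBuffer buf(std::move(s))}}, the source string {{s}} is left in 
a valid but unspecified state and the buffer owns the bytes, which is exactly 
the transfer of ownership the issue asks for.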



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3249) [Python] Run flake8 on integration_test.py

2018-09-17 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-3249:
---

 Summary: [Python] Run flake8 on integration_test.py
 Key: ARROW-3249
 URL: https://issues.apache.org/jira/browse/ARROW-3249
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney
 Fix For: 0.11.0


We should keep this code clean, too.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3248) [C++] Arrow tests should have label "arrow"

2018-09-17 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-3248:
-

 Summary: [C++] Arrow tests should have label "arrow"
 Key: ARROW-3248
 URL: https://issues.apache.org/jira/browse/ARROW-3248
 Project: Apache Arrow
  Issue Type: Wish
  Components: C++
Affects Versions: 0.10.0
Reporter: Antoine Pitrou


It would help to run only the Arrow tests, excluding the Parquet unit tests, 
which for some reason take quite a bit longer to run.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)