[jira] [Commented] (ARROW-3280) [Python] Difficulty running tests after conda install
[ https://issues.apache.org/jira/browse/ARROW-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16622941#comment-16622941 ] Matthew Rocklin commented on ARROW-3280: I got as far as the following. My apologies for the rambling issue. Feel free to close. I'm also happy to reopen something else a bit cleaner.
{code:java}
mrocklin@carbon:~/workspace/arrow/python$ python setup.py build_ext --build-type=$ARROW_BUILD_TYPE \
>     --with-parquet --with-plasma --inplace
your setuptools is too old (<12)
setuptools_scm functionality is degraded
Traceback (most recent call last):
  File "setup.py", line 589, in <module>
    url="https://arrow.apache.org/"
  File "/home/mrocklin/Software/anaconda/envs/pyarrow-dev/lib/python3.6/site-packages/setuptools/__init__.py", line 140, in setup
    return distutils.core.setup(**attrs)
  File "/home/mrocklin/Software/anaconda/envs/pyarrow-dev/lib/python3.6/distutils/core.py", line 108, in setup
    _setup_distribution = dist = klass(attrs)
  File "/home/mrocklin/Software/anaconda/envs/pyarrow-dev/lib/python3.6/site-packages/setuptools/dist.py", line 370, in __init__
    k: v for k, v in attrs.items()
  File "/home/mrocklin/Software/anaconda/envs/pyarrow-dev/lib/python3.6/distutils/dist.py", line 281, in __init__
    self.finalize_options()
  File "/home/mrocklin/Software/anaconda/envs/pyarrow-dev/lib/python3.6/site-packages/setuptools/dist.py", line 529, in finalize_options
    ep.load()(self, ep.name, value)
  File "/home/mrocklin/workspace/arrow/python/.eggs/setuptools_scm-1.15.1rc1-py3.6.egg/setuptools_scm/integration.py", line 19, in version_keyword
  File "/home/mrocklin/workspace/arrow/python/.eggs/setuptools_scm-1.15.1rc1-py3.6.egg/setuptools_scm/__init__.py", line 117, in get_version
  File "/home/mrocklin/workspace/arrow/python/.eggs/setuptools_scm-1.15.1rc1-py3.6.egg/setuptools_scm/__init__.py", line 69, in _do_parse
  File "setup.py", line 519, in parse_git
    return parse(root, **kwargs)
  File "/home/mrocklin/workspace/arrow/python/.eggs/setuptools_scm-1.15.1rc1-py3.6.egg/setuptools_scm/git.py", line 99, in parse
ValueError: invalid literal for int() with base 10: 'ge2c4b09d'
{code}
> [Python] Difficulty running tests after conda install > - > > Key: ARROW-3280 > URL: https://issues.apache.org/jira/browse/ARROW-3280 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.10.0 > Environment: conda create -n test-arrow pytest ipython pandas nomkl > pyarrow -c conda-forge > Ubuntu 16.04 >Reporter: Matthew Rocklin >Priority: Minor > Labels: python > > I install PyArrow from conda-forge, and then try running tests (or import > generally) > {code:java} > conda create -n test-arrow pytest ipython pandas nomkl pyarrow -c conda-forge > {code} > {code:java} > mrocklin@carbon:~/workspace/arrow/python$ py.test > pyarrow/tests/test_parquet.py > Traceback (most recent call last): > File > "/home/mrocklin/Software/anaconda/lib/python3.6/site-packages/_pytest/config.py", > line 328, in _getconftestmodules > return self._path2confmods[path] > KeyError: > local('/home/mrocklin/workspace/arrow/python/pyarrow/tests/test_parquet.py')During > handling of the above exception, another exception occurred: > Traceback (most recent call last): > File > "/home/mrocklin/Software/anaconda/lib/python3.6/site-packages/_pytest/config.py", > line 328, in _getconftestmodules > return self._path2confmods[path] > KeyError: local('/home/mrocklin/workspace/arrow/python/pyarrow/tests')During > handling of the above exception, another exception occurred: > Traceback (most recent call last): > File > "/home/mrocklin/Software/anaconda/lib/python3.6/site-packages/_pytest/config.py", > line 359, in _importconftest > return self._conftestpath2mod[conftestpath] > KeyError: > local('/home/mrocklin/workspace/arrow/python/pyarrow/tests/conftest.py')During > handling of the above exception, another exception occurred: > Traceback (most recent call last): > File > 
"/home/mrocklin/Software/anaconda/lib/python3.6/site-packages/_pytest/config.py", > line 365, in _importconftest > mod = conftestpath.pyimport() > File > "/home/mrocklin/Software/anaconda/lib/python3.6/site-packages/py/_path/local.py", > line 668, in pyimport > __import__(modname) > File "/home/mrocklin/workspace/arrow/python/pyarrow/__init__.py", line 54, in > > from pyarrow.lib import cpu_count, set_cpu_count > ModuleNotFoundError: No module named 'pyarrow.lib' > ERROR: could not load > /home/mrocklin/workspace/arrow/python/pyarrow/tests/conftest.py{code} > Probably this is something wrong with my environment, but I thought I'd > report it as a usability bug -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3280) [Python] Difficulty running tests after conda install
[ https://issues.apache.org/jira/browse/ARROW-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16622938#comment-16622938 ] Matthew Rocklin commented on ARROW-3280: Actually, let me back up. I wrote a test and wanted to run it. I tried running py.test from that directory and got the error that I list above. That's probably reasonable, given that Python is probably confused about paths given that I'm in a directory named arrow. I googled online for arrow developer notes, and eventually found that I was supposed to look in the directory for a README file. That file didn't have anything about testing in it explicitly. I see now that it has a "Build from source" section that links to external docs. I'll go and try that and see what happens. > [Python] Difficulty running tests after conda install > - > > Key: ARROW-3280 > URL: https://issues.apache.org/jira/browse/ARROW-3280 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.10.0 > Environment: conda create -n test-arrow pytest ipython pandas nomkl > pyarrow -c conda-forge > Ubuntu 16.04 >Reporter: Matthew Rocklin >Priority: Minor > Labels: python > > I install PyArrow from conda-forge, and then try running tests (or import > generally) > {code:java} > conda create -n test-arrow pytest ipython pandas nomkl pyarrow -c conda-forge > {code} > {code:java} > mrocklin@carbon:~/workspace/arrow/python$ py.test > pyarrow/tests/test_parquet.py > Traceback (most recent call last): > File > "/home/mrocklin/Software/anaconda/lib/python3.6/site-packages/_pytest/config.py", > line 328, in _getconftestmodules > return self._path2confmods[path] > KeyError: > local('/home/mrocklin/workspace/arrow/python/pyarrow/tests/test_parquet.py')During > handling of the above exception, another exception occurred: > Traceback (most recent call last): > File > "/home/mrocklin/Software/anaconda/lib/python3.6/site-packages/_pytest/config.py", > line 328, in _getconftestmodules > 
return self._path2confmods[path] > KeyError: local('/home/mrocklin/workspace/arrow/python/pyarrow/tests')During > handling of the above exception, another exception occurred: > Traceback (most recent call last): > File > "/home/mrocklin/Software/anaconda/lib/python3.6/site-packages/_pytest/config.py", > line 359, in _importconftest > return self._conftestpath2mod[conftestpath] > KeyError: > local('/home/mrocklin/workspace/arrow/python/pyarrow/tests/conftest.py')During > handling of the above exception, another exception occurred: > Traceback (most recent call last): > File > "/home/mrocklin/Software/anaconda/lib/python3.6/site-packages/_pytest/config.py", > line 365, in _importconftest > mod = conftestpath.pyimport() > File > "/home/mrocklin/Software/anaconda/lib/python3.6/site-packages/py/_path/local.py", > line 668, in pyimport > __import__(modname) > File "/home/mrocklin/workspace/arrow/python/pyarrow/__init__.py", line 54, in > > from pyarrow.lib import cpu_count, set_cpu_count > ModuleNotFoundError: No module named 'pyarrow.lib' > ERROR: could not load > /home/mrocklin/workspace/arrow/python/pyarrow/tests/conftest.py{code} > Probably this is something wrong with my environment, but I thought I'd > report it as a usability bug -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3280) [Python] Difficulty running tests after conda install
Matthew Rocklin created ARROW-3280: -- Summary: [Python] Difficulty running tests after conda install Key: ARROW-3280 URL: https://issues.apache.org/jira/browse/ARROW-3280 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.10.0 Environment: conda create -n test-arrow pytest ipython pandas nomkl pyarrow -c conda-forge Ubuntu 16.04 Reporter: Matthew Rocklin I install PyArrow from conda-forge, and then try running tests (or import generally)
{code:java}
conda create -n test-arrow pytest ipython pandas nomkl pyarrow -c conda-forge
{code}
{code:java}
mrocklin@carbon:~/workspace/arrow/python$ py.test pyarrow/tests/test_parquet.py
Traceback (most recent call last):
  File "/home/mrocklin/Software/anaconda/lib/python3.6/site-packages/_pytest/config.py", line 328, in _getconftestmodules
    return self._path2confmods[path]
KeyError: local('/home/mrocklin/workspace/arrow/python/pyarrow/tests/test_parquet.py')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/mrocklin/Software/anaconda/lib/python3.6/site-packages/_pytest/config.py", line 328, in _getconftestmodules
    return self._path2confmods[path]
KeyError: local('/home/mrocklin/workspace/arrow/python/pyarrow/tests')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/mrocklin/Software/anaconda/lib/python3.6/site-packages/_pytest/config.py", line 359, in _importconftest
    return self._conftestpath2mod[conftestpath]
KeyError: local('/home/mrocklin/workspace/arrow/python/pyarrow/tests/conftest.py')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/mrocklin/Software/anaconda/lib/python3.6/site-packages/_pytest/config.py", line 365, in _importconftest
    mod = conftestpath.pyimport()
  File "/home/mrocklin/Software/anaconda/lib/python3.6/site-packages/py/_path/local.py", line 668, in pyimport
    __import__(modname)
  File "/home/mrocklin/workspace/arrow/python/pyarrow/__init__.py", line 54, in <module>
    from pyarrow.lib import cpu_count, set_cpu_count
ModuleNotFoundError: No module named 'pyarrow.lib'
ERROR: could not load /home/mrocklin/workspace/arrow/python/pyarrow/tests/conftest.py
{code}
Probably this is something wrong with my environment, but I thought I'd report it as a usability bug -- This message was sent by Atlassian JIRA (v7.6.3#76005)
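The {{ModuleNotFoundError}} above is consistent with a common import-path pitfall (my assumption, not confirmed in the ticket): running py.test from inside the source checkout puts the unbuilt {{pyarrow/}} source directory ahead of the conda-installed package on {{sys.path}}, so the compiled {{pyarrow.lib}} extension is never found. A minimal stdlib-only sketch of that shadowing effect, using a made-up package name ({{shadowpkg}} is hypothetical):

```python
import os
import sys
import tempfile

# Build a source-tree-like package whose __init__ imports a compiled
# submodule that was never built. "shadowpkg" is a hypothetical name.
workdir = tempfile.mkdtemp()
pkgdir = os.path.join(workdir, "shadowpkg")
os.makedirs(pkgdir)
with open(os.path.join(pkgdir, "__init__.py"), "w") as f:
    f.write("from shadowpkg.lib import cpu_count\n")  # shadowpkg/lib.py does not exist

sys.path.insert(0, workdir)  # like launching py.test from inside the checkout
err = None
try:
    import shadowpkg  # noqa: F401
except ModuleNotFoundError as exc:
    err = exc

print(err)  # No module named 'shadowpkg.lib'
```

Running the tests from a directory outside the checkout (or building the extension in place first) avoids the shadowing.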
[jira] [Commented] (ARROW-3245) [Python] Infer index and/or filtering from parquet column statistics
[ https://issues.apache.org/jira/browse/ARROW-3245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16622923#comment-16622923 ] Wes McKinney commented on ARROW-3245: - Obviously this function _should_ try to use local filesystem methods if no argument is provided. Even better would be to instantiate {{ParquetDatasetPiece}} with a reference to the filesystem in use. Neither change is particularly difficult > [Python] Infer index and/or filtering from parquet column statistics > > > Key: ARROW-3245 > URL: https://issues.apache.org/jira/browse/ARROW-3245 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Martin Durant >Priority: Major > Labels: parquet > > The metadata included in parquet generally gives the min/max of data for each > chunk of each column. This allows early filtering out of whole chunks if they > do not meet some criterion, and can greatly reduce reading burden in some > circumstances. In Dask, we care about this for setting an index and its > "divisions" (start/stop values for each data partition) and for directly > avoiding including some chunks in the graph of tasks to be processed. > Similarly, filtering may be applied on the values of fields defined by the > directory partitioning. > Currently, dask using the fastparquet backend is able to infer possible > columns to use as an index, perform filtering on that index and do general > filtering on any column which has statistical or partitioning information. It > would be very helpful to have such facilities via pyarrow also. > This is probably the most important of the requests from Dask. > (please forgive that some of this has already been mentioned elsewhere; this > is one of the entries in the list at > [https://github.com/dask/fastparquet/issues/374] as a feature that is useful > in fastparquet) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3245) [Python] Infer index and/or filtering from parquet column statistics
[ https://issues.apache.org/jira/browse/ARROW-3245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16622921#comment-16622921 ] Matthew Rocklin commented on ARROW-3245: After some fooling around this worked for me
{code:java}
import pyarrow.parquet as pq
import pandas as pd
import functools

df = pd.DataFrame({'a': [1, 0]})
df.to_parquet('out.parq', engine='pyarrow')
pf = pq.ParquetDataset('out.parq')
piece = pf.pieces[0]
piece.get_metadata(functools.partial(open, mode='rb'))
{code}
I had to dive into the source a bit to figure out how to interpret the docstring. > [Python] Infer index and/or filtering from parquet column statistics > > > Key: ARROW-3245 > URL: https://issues.apache.org/jira/browse/ARROW-3245 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Martin Durant >Priority: Major > Labels: parquet > > The metadata included in parquet generally gives the min/max of data for each > chunk of each column. This allows early filtering out of whole chunks if they > do not meet some criterion, and can greatly reduce reading burden in some > circumstances. In Dask, we care about this for setting an index and its > "divisions" (start/stop values for each data partition) and for directly > avoiding including some chunks in the graph of tasks to be processed. > Similarly, filtering may be applied on the values of fields defined by the > directory partitioning. > Currently, dask using the fastparquet backend is able to infer possible > columns to use as an index, perform filtering on that index and do general > filtering on any column which has statistical or partitioning information. It > would be very helpful to have such facilities via pyarrow also. > This is probably the most important of the requests from Dask. 
> (please forgive that some of this has already been mentioned elsewhere; this > is one of the entries in the list at > [https://github.com/dask/fastparquet/issues/374] as a feature that is useful > in fastparquet) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3244) [Python] Multi-file parquet loading without scan
[ https://issues.apache.org/jira/browse/ARROW-3244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16622920#comment-16622920 ] Wes McKinney commented on ARROW-3244: - The pyarrow perspective is essentially agnostic to the data access pattern, but we'd like to provide APIs to do as the user wishes with the files. The basic pattern of a partitioned dataset read by a single node works fine now (that's the {{ParquetDataset}} object). Let's come up with a concrete API ask and the desired semantics with regards to when precisely the underlying file system is to be accessed, and if this is not available now, we can slate it for one of the upcoming releases. > [Python] Multi-file parquet loading without scan > > > Key: ARROW-3244 > URL: https://issues.apache.org/jira/browse/ARROW-3244 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Martin Durant >Priority: Major > Labels: parquet > > A number of mechanisms are possible to avoid having to access and read the > parquet footers in a data set consisting of a number of files. In the case of > a large number of data files (perhaps split with directory partitioning) and > remote storage, this can be a significant overhead. This is significant from > the point of view of Dask, which must have the metadata available in the > client before setting up computational graphs. > > Here are some suggestions of what could be done. > > * some parquet writing frameworks include a `_metadata` file, which contains > all the information from the footers of the various files. If this file is > present, then this data can be read from one place, with a single file > access. 
For a large number of files, parsing the thrift information may, by > itself, be a non-negligible overhead. > * the schema (dtypes) can be found in a `_common_metadata`, or from any one > of the data-files, then the schema could be assumed (perhaps at the user's > option) to be the same for all of the files. However, the information about > the directory partitioning would not be available. Although Dask may infer > the information from the filenames, it would be preferable to go through the > machinery with parquet-cpp, and view the whole data-set as a single object. > Note that the files will still need to have the footer read to access the > data, for the byte offsets, but from Dask's point of view, this would be > deferred to tasks running in parallel. > (please forgive that some of this has already been mentioned elsewhere; this > is one of the entries in the list at > [https://github.com/dask/fastparquet/issues/374] as a feature that is useful > in fastparquet) > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
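The lookup order proposed in the issue description (a consolidated `_metadata` file, then `_common_metadata` for the schema only, then a full footer scan) can be sketched with a small helper; the function name and return convention below are mine, not pyarrow's:

```python
import os

def pick_metadata_source(dataset_dir, filenames):
    """Hypothetical helper: decide where dataset metadata should come from.

    Returns (path, kind), where kind is "full" (one read covers every
    footer), "schema-only" (row-group offsets still need per-file reads),
    or "scan-footers" (worst case: one access per data file).
    """
    if "_metadata" in filenames:
        return os.path.join(dataset_dir, "_metadata"), "full"
    if "_common_metadata" in filenames:
        return os.path.join(dataset_dir, "_common_metadata"), "schema-only"
    return None, "scan-footers"
```

For remote storage the difference between "full" and "scan-footers" is one file access versus one per data file, which is the overhead the issue is about.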
[jira] [Commented] (ARROW-3244) [Python] Multi-file parquet loading without scan
[ https://issues.apache.org/jira/browse/ARROW-3244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16622917#comment-16622917 ] Matthew Rocklin commented on ARROW-3244: What happens today when someone reads a multi-file parquet dataset with dask dataframe? We read a single file to get the schema and then just build tasks for everything else? Or do we need to read through each of the files in order to find out how many row blocks are in each? On the Arrow side is this in scope? Is this already implemented? Are there mechanisms to construct the metadata files from within Arrow? If not, and if this is in scope then what is the right way / place to add this behavior? > [Python] Multi-file parquet loading without scan > > > Key: ARROW-3244 > URL: https://issues.apache.org/jira/browse/ARROW-3244 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Martin Durant >Priority: Major > Labels: parquet > > A number of mechanisms are possible to avoid having to access and read the > parquet footers in a data set consisting of a number of files. In the case of > a large number of data files (perhaps split with directory partitioning) and > remote storage, this can be a significant overhead. This is significant from > the point of view of Dask, which must have the metadata available in the > client before setting up computational graphs. > > Here are some suggestions of what could be done. > > * some parquet writing frameworks include a `_metadata` file, which contains > all the information from the footers of the various files. If this file is > present, then this data can be read from one place, with a single file > access. 
For a large number of files, parsing the thrift information may, by > itself, be a non-negligible overhead. > * the schema (dtypes) can be found in a `_common_metadata`, or from any one > of the data-files, then the schema could be assumed (perhaps at the user's > option) to be the same for all of the files. However, the information about > the directory partitioning would not be available. Although Dask may infer > the information from the filenames, it would be preferable to go through the > machinery with parquet-cpp, and view the whole data-set as a single object. > Note that the files will still need to have the footer read to access the > data, for the byte offsets, but from Dask's point of view, this would be > deferred to tasks running in parallel. > (please forgive that some of this has already been mentioned elsewhere; this > is one of the entries in the list at > [https://github.com/dask/fastparquet/issues/374] as a feature that is useful > in fastparquet) > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3278) [Python] Retrieve StructType's and StructArray's field by name
[ https://issues.apache.org/jira/browse/ARROW-3278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16622804#comment-16622804 ] Wes McKinney commented on ARROW-3278: - This should be {{field_by_name}} as with {{Schema}} > [Python] Retrieve StructType's and StructArray's field by name > -- > > Key: ARROW-3278 > URL: https://issues.apache.org/jira/browse/ARROW-3278 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Krisztian Szucs >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3070) [Release] Host binary artifacts for RCs and releases on ASF Bintray account instead of dist/mirror system
[ https://issues.apache.org/jira/browse/ARROW-3070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16622800#comment-16622800 ] Wes McKinney commented on ARROW-3070: - We should do what Apache Aurora is doing https://github.com/apache/aurora-packaging#hash-sign-and-upload-the-binaries cc [~xhochy] > [Release] Host binary artifacts for RCs and releases on ASF Bintray account > instead of dist/mirror system > - > > Key: ARROW-3070 > URL: https://issues.apache.org/jira/browse/ARROW-3070 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Wes McKinney >Priority: Major > Fix For: 0.11.0 > > > Since the artifacts are large this is a better place for them. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-3270) [Release] Adjust release verification scripts to recent parquet migration
[ https://issues.apache.org/jira/browse/ARROW-3270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-3270. - Resolution: Fixed Issue resolved by pull request 2591 [https://github.com/apache/arrow/pull/2591] > [Release] Adjust release verification scripts to recent parquet migration > - > > Key: ARROW-3270 > URL: https://issues.apache.org/jira/browse/ARROW-3270 > Project: Apache Arrow > Issue Type: Improvement > Components: Packaging >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 0.11.0 > > Time Spent: 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3245) [Python] Infer index and/or filtering from parquet column statistics
[ https://issues.apache.org/jira/browse/ARROW-3245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16622789#comment-16622789 ] Wes McKinney commented on ARROW-3245: - You have to pass a function to that method > [Python] Infer index and/or filtering from parquet column statistics > > > Key: ARROW-3245 > URL: https://issues.apache.org/jira/browse/ARROW-3245 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Martin Durant >Priority: Major > Labels: parquet > > The metadata included in parquet generally gives the min/max of data for each > chunk of each column. This allows early filtering out of whole chunks if they > do not meet some criterion, and can greatly reduce reading burden in some > circumstances. In Dask, we care about this for setting an index and its > "divisions" (start/stop values for each data partition) and for directly > avoiding including some chunks in the graph of tasks to be processed. > Similarly, filtering may be applied on the values of fields defined by the > directory partitioning. > Currently, dask using the fastparquet backend is able to infer possible > columns to use as an index, perform filtering on that index and do general > filtering on any column which has statistical or partitioning information. It > would be very helpful to have such facilities via pyarrow also. > This is probably the most important of the requests from Dask. > (please forgive that some of this has already been mentioned elsewhere; this > is one of the entries in the list at > [https://github.com/dask/fastparquet/issues/374] as a feature that is useful > in fastparquet) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
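Concretely, the convention as I read it from this comment is that {{get_metadata}} receives a callable mapping a path to an open, readable file object, e.g. {{functools.partial(open, mode='rb')}}. {{PieceSketch}} below is a stand-in to illustrate the pattern, not pyarrow's actual {{ParquetDatasetPiece}}:

```python
import functools
import os
import tempfile

class PieceSketch:
    """Stand-in for ParquetDatasetPiece, illustrating the calling convention."""
    def __init__(self, path):
        self.path = path

    def get_metadata(self, open_file_func):
        # The caller supplies the opener, so the same piece works for
        # local files, S3, HDFS, etc.
        with open_file_func(self.path) as f:
            return f.read(4)  # real parquet files begin with the magic b'PAR1'

# Write a throwaway file that starts with the parquet magic bytes.
fd, path = tempfile.mkstemp()
os.write(fd, b"PAR1...")
os.close(fd)

magic = PieceSketch(path).get_metadata(functools.partial(open, mode="rb"))
```

Defaulting {{open_file_func}} to the builtin {{open}} for local paths, as suggested in the follow-up comment, would make the explicit callable unnecessary in the common case.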
[jira] [Resolved] (ARROW-3249) [Python] Run flake8 on integration_test.py and crossbow.py
[ https://issues.apache.org/jira/browse/ARROW-3249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-3249. - Resolution: Fixed Issue resolved by pull request 2590 [https://github.com/apache/arrow/pull/2590] > [Python] Run flake8 on integration_test.py and crossbow.py > -- > > Key: ARROW-3249 > URL: https://issues.apache.org/jira/browse/ARROW-3249 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 0.11.0 > > Time Spent: 20m > Remaining Estimate: 0h > > We should keep this code clean, too -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-3261) [Python] Add "field" method to select fields from StructArray
[ https://issues.apache.org/jira/browse/ARROW-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-3261. - Resolution: Fixed Issue resolved by pull request 2586 [https://github.com/apache/arrow/pull/2586] > [Python] Add "field" method to select fields from StructArray > - > > Key: ARROW-3261 > URL: https://issues.apache.org/jira/browse/ARROW-3261 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available, usability > Fix For: 0.11.0 > > Time Spent: 50m > Remaining Estimate: 0h > > This would improve usability. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-3146) [C++] Barebones Flight RPC server and client implementations
[ https://issues.apache.org/jira/browse/ARROW-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-3146. - Resolution: Fixed Fix Version/s: (was: 0.12.0) 0.11.0 Issue resolved by pull request 2547 [https://github.com/apache/arrow/pull/2547] > [C++] Barebones Flight RPC server and client implementations > > > Key: ARROW-3146 > URL: https://issues.apache.org/jira/browse/ARROW-3146 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 0.11.0 > > Time Spent: 4h 50m > Remaining Estimate: 0h > > Unsecure transport only (SSL support will require a fair bit of toolchain > work) > Depends on ARROW-249 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3279) [C++] Allow linking Arrow tests dynamically on Windows
[ https://issues.apache.org/jira/browse/ARROW-3279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-3279: -- Labels: pull-request-available (was: ) > [C++] Allow linking Arrow tests dynamically on Windows > -- > > Key: ARROW-3279 > URL: https://issues.apache.org/jira/browse/ARROW-3279 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.10.0 >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > > On Windows, C++ modules are compiled once for each library kind (static, > shared). This means we do twice the work on e.g. AppVeyor. We should be able > to link the Arrow tests with the Arrow DLL instead, at least on Windows. > Things are a bit more complicated for Parquet because of PARQUET-1420. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3279) [C++] Allow linking Arrow tests dynamically on Windows
Antoine Pitrou created ARROW-3279: - Summary: [C++] Allow linking Arrow tests dynamically on Windows Key: ARROW-3279 URL: https://issues.apache.org/jira/browse/ARROW-3279 Project: Apache Arrow Issue Type: Bug Components: C++ Affects Versions: 0.10.0 Reporter: Antoine Pitrou Assignee: Antoine Pitrou On Windows, C++ modules are compiled once for each library kind (static, shared). This means we do twice the work on e.g. AppVeyor. We should be able to link the Arrow tests with the Arrow DLL instead, at least on Windows. Things are a bit more complicated for Parquet because of PARQUET-1420. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3245) [Python] Infer index and/or filtering from parquet column statistics
[ https://issues.apache.org/jira/browse/ARROW-3245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16621990#comment-16621990 ] Martin Durant commented on ARROW-3245: -- (pyarrow 0.10.0)
{code:java}
In [7]: df = pd.DataFrame({'a': [1, 0]})

In [8]: df.to_parquet('out.parq', engine='pyarrow')

In [9]: pf = pq.ParquetDataset('out.parq')

In [10]: pf.pieces[0].get_metadata()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-10> in <module>()
----> 1 pf.pieces[0].get_metadata()

~/anaconda/envs/tester/lib/python3.6/site-packages/pyarrow/parquet.py in get_metadata(self, open_file_func)
    412         file's metadata
    413         """
--> 414         return self._open(open_file_func).metadata
    415
    416     def _open(self, open_file_func=None):

~/anaconda/envs/tester/lib/python3.6/site-packages/pyarrow/parquet.py in _open(self, open_file_func)
    418         Returns instance of ParquetFile
    419         """
--> 420         reader = open_file_func(self.path)
    421         if not isinstance(reader, ParquetFile):
    422             reader = ParquetFile(reader)

TypeError: 'NoneType' object is not callable
{code}
> [Python] Infer index and/or filtering from parquet column statistics > > > Key: ARROW-3245 > URL: https://issues.apache.org/jira/browse/ARROW-3245 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Martin Durant >Priority: Major > Labels: parquet > > The metadata included in parquet generally gives the min/max of data for each > chunk of each column. This allows early filtering out of whole chunks if they > do not meet some criterion, and can greatly reduce reading burden in some > circumstances. In Dask, we care about this for setting an index and its > "divisions" (start/stop values for each data partition) and for directly > avoiding including some chunks in the graph of tasks to be processed. > Similarly, filtering may be applied on the values of fields defined by the > directory partitioning. 
> Currently, dask using the fastparquet backend is able to infer possible > columns to use as an index, perform filtering on that index and do general > filtering on any column which has statistical or partitioning information. It > would be very helpful to have such facilities via pyarrow also. > This is probably the most important of the requests from Dask. > (please forgive that some of this has already been mentioned elsewhere; this > is one of the entries in the list at > [https://github.com/dask/fastparquet/issues/374] as a feature that is useful > in fastparquet) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3070) [Release] Host binary artifacts for RCs and releases on ASF Bintray account instead of dist/mirror system
[ https://issues.apache.org/jira/browse/ARROW-3070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16621978#comment-16621978 ] Krisztian Szucs commented on ARROW-3070: I've never used Bintray before, but there were descriptor.json files previously in arrow-dist. Will it be fully manual for 0.11? > [Release] Host binary artifacts for RCs and releases on ASF Bintray account > instead of dist/mirror system > - > > Key: ARROW-3070 > URL: https://issues.apache.org/jira/browse/ARROW-3070 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Wes McKinney >Priority: Major > Fix For: 0.11.0 > > > Since the artifacts are large this is a better place for them. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3271) [Python] Manylinux1 builds timing out in Travis CI
[ https://issues.apache.org/jira/browse/ARROW-3271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16621974#comment-16621974 ] Uwe L. Korn commented on ARROW-3271:
--
We could limit the manylinux1 builds to e.g. a single Python version to improve the build times. They are quite important, as they represent the lower bound of the compiler versions we support.

> [Python] Manylinux1 builds timing out in Travis CI
> --------------------------------------------------
>
> Key: ARROW-3271
> URL: https://issues.apache.org/jira/browse/ARROW-3271
> Project: Apache Arrow
> Issue Type: Bug
> Components: Continuous Integration
> Reporter: Wes McKinney
> Priority: Major
> Fix For: 0.11.0
>
> Not sure why this is happening -- I think the docker pull has been a lot
> slower of late
[jira] [Commented] (ARROW-3243) [C++] Upgrade jemalloc to version 5
[ https://issues.apache.org/jira/browse/ARROW-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16621971#comment-16621971 ] Uwe L. Korn commented on ARROW-3243:
--
The patch we have is relevant solely for jemalloc 4; it is already included in the released jemalloc 5 branch. Sadly, jemalloc 5 had some changes that made it unusable in the {{manylinux1}} setting. It could be that these have been resolved in the meantime, in which case we could switch to a newer version. You can simply try this by changing the installation script. Otherwise we probably have to wait until we have changed our wheels to be based on {{manylinux2010}}.

> [C++] Upgrade jemalloc to version 5
> -----------------------------------
>
> Key: ARROW-3243
> URL: https://issues.apache.org/jira/browse/ARROW-3243
> Project: Apache Arrow
> Issue Type: Improvement
> Reporter: Philipp Moritz
> Priority: Major
>
> Is it possible/feasible to upgrade jemalloc to version 5 and assume that
> version? I'm asking because I've been working towards replacing dlmalloc in
> plasma with jemalloc, which makes some of the code much nicer and removes
> some of the issues we had with dlmalloc, but it requires jemalloc APIs that
> are only available starting from jemalloc version 5, in particular, I'm using
> the extent_hooks_t capability.
> For now I can submit a patch that uses a different version of jemalloc in
> plasma and then we can figure out how to deal with it (maybe there is a way
> to make it work with older versions). What are your thoughts?
[jira] [Commented] (ARROW-3141) [Python] Tensorflow support in pyarrow wheels pins numpy>=1.14
[ https://issues.apache.org/jira/browse/ARROW-3141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16621952#comment-16621952 ] Uwe L. Korn commented on ARROW-3141:
--
I'm not really sure how long people stay on old NumPy versions. I guess we can increase the minimum version. Still, we should be very careful about the NumPy version in our builds and should not let it update automatically.

> [Python] Tensorflow support in pyarrow wheels pins numpy>=1.14
> --------------------------------------------------------------
>
> Key: ARROW-3141
> URL: https://issues.apache.org/jira/browse/ARROW-3141
> Project: Apache Arrow
> Issue Type: Bug
> Components: Packaging, Python
> Affects Versions: 0.10.0
> Reporter: Uwe L. Korn
> Priority: Major
> Fix For: 0.11.0
>
> This was introduced by https://github.com/apache/arrow/pull/2104/files
> Two options:
> * Don't build with tensorflow support by default
> * Increase our minimum supported NumPy version to 1.14 overall
[jira] [Updated] (ARROW-3270) [Release] Adjust release verification scripts to recent parquet migration
[ https://issues.apache.org/jira/browse/ARROW-3270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-3270:
--
Labels: pull-request-available (was: )

> [Release] Adjust release verification scripts to recent parquet migration
> -------------------------------------------------------------------------
>
> Key: ARROW-3270
> URL: https://issues.apache.org/jira/browse/ARROW-3270
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Packaging
> Reporter: Krisztian Szucs
> Assignee: Krisztian Szucs
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.11.0
[jira] [Assigned] (ARROW-3270) [Release] Adjust release verification scripts to recent parquet migration
[ https://issues.apache.org/jira/browse/ARROW-3270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs reassigned ARROW-3270:
--
Assignee: Krisztian Szucs

> [Release] Adjust release verification scripts to recent parquet migration
> -------------------------------------------------------------------------
>
> Key: ARROW-3270
> URL: https://issues.apache.org/jira/browse/ARROW-3270
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Packaging
> Reporter: Krisztian Szucs
> Assignee: Krisztian Szucs
> Priority: Major
> Fix For: 0.11.0
[jira] [Resolved] (ARROW-3267) [Python] Create empty table from schema
[ https://issues.apache.org/jira/browse/ARROW-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs resolved ARROW-3267.
--
Resolution: Fixed
Issue resolved by pull request 2589
[https://github.com/apache/arrow/pull/2589]

> [Python] Create empty table from schema
> ---------------------------------------
>
> Key: ARROW-3267
> URL: https://issues.apache.org/jira/browse/ARROW-3267
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Uwe L. Korn
> Assignee: Uwe L. Korn
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.11.0
> Time Spent: 2h 10m
> Remaining Estimate: 0h
>
> When one knows the expected schema for its input data but has no input data
> for a data pipeline, it is necessary to construct an empty table as a
> sentinel value to pass through.
> This is a small but often useful convenience function.
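The convenience being added here can be illustrated with a small pure-Python stand-in (the `empty_table` name and the schema representation below are illustrative, not pyarrow's actual API): an empty table still carries the full column structure, so downstream pipeline code can rely on the schema even when there are zero rows.

```python
# Illustrative stand-in for "empty table from schema"; not pyarrow's real API.
def empty_table(schema):
    """Build a column dict with the schema's names but zero rows."""
    return {name: [] for name, dtype in schema}

schema = [("a", "int64"), ("b", "string")]
table = empty_table(schema)
print(list(table), sum(len(col) for col in table.values()))  # → ['a', 'b'] 0
```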
[jira] [Updated] (ARROW-3249) [Python] Run flake8 on integration_test.py and crossbow.py
[ https://issues.apache.org/jira/browse/ARROW-3249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-3249:
--
Labels: pull-request-available (was: )

> [Python] Run flake8 on integration_test.py and crossbow.py
> ----------------------------------------------------------
>
> Key: ARROW-3249
> URL: https://issues.apache.org/jira/browse/ARROW-3249
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Wes McKinney
> Assignee: Krisztian Szucs
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.11.0
>
> We should keep this code clean, too
[jira] [Updated] (ARROW-3249) [Python] Run flake8 on integration_test.py and crossbow.py
[ https://issues.apache.org/jira/browse/ARROW-3249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs updated ARROW-3249:
--
Summary: [Python] Run flake8 on integration_test.py and crossbow.py (was: [Python] Run flake8 on integration_test.py)

> [Python] Run flake8 on integration_test.py and crossbow.py
> ----------------------------------------------------------
>
> Key: ARROW-3249
> URL: https://issues.apache.org/jira/browse/ARROW-3249
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Wes McKinney
> Assignee: Krisztian Szucs
> Priority: Major
> Fix For: 0.11.0
>
> We should keep this code clean, too
[jira] [Commented] (ARROW-3249) [Python] Run flake8 on integration_test.py
[ https://issues.apache.org/jira/browse/ARROW-3249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16621840#comment-16621840 ] Krisztian Szucs commented on ARROW-3249:
--
Crossbow too

> [Python] Run flake8 on integration_test.py
> ------------------------------------------
>
> Key: ARROW-3249
> URL: https://issues.apache.org/jira/browse/ARROW-3249
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Wes McKinney
> Assignee: Krisztian Szucs
> Priority: Major
> Fix For: 0.11.0
>
> We should keep this code clean, too
[jira] [Assigned] (ARROW-3249) [Python] Run flake8 on integration_test.py
[ https://issues.apache.org/jira/browse/ARROW-3249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs reassigned ARROW-3249:
--
Assignee: Krisztian Szucs (was: Wes McKinney)

> [Python] Run flake8 on integration_test.py
> ------------------------------------------
>
> Key: ARROW-3249
> URL: https://issues.apache.org/jira/browse/ARROW-3249
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Wes McKinney
> Assignee: Krisztian Szucs
> Priority: Major
> Fix For: 0.11.0
>
> We should keep this code clean, too
[jira] [Created] (ARROW-3278) [Python] Retrieve StructType's and StructArray's field by name
Krisztian Szucs created ARROW-3278:
--
Summary: [Python] Retrieve StructType's and StructArray's field by name
Key: ARROW-3278
URL: https://issues.apache.org/jira/browse/ARROW-3278
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Reporter: Krisztian Szucs
[jira] [Resolved] (ARROW-3069) [Release] Stop using SHA1 checksums per ASF policy
[ https://issues.apache.org/jira/browse/ARROW-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe L. Korn resolved ARROW-3069.
--
Resolution: Fixed
Issue resolved by pull request 2584
[https://github.com/apache/arrow/pull/2584]

> [Release] Stop using SHA1 checksums per ASF policy
> --------------------------------------------------
>
> Key: ARROW-3069
> URL: https://issues.apache.org/jira/browse/ARROW-3069
> Project: Apache Arrow
> Issue Type: Improvement
> Reporter: Wes McKinney
> Assignee: Krisztian Szucs
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.11.0
> Time Spent: 10m
> Remaining Estimate: 0h
>
> https://www.apache.org/dev/release-distribution#sigs-and-sums
[jira] [Resolved] (ARROW-3262) [Python] Implement __getitem__ with integers on pyarrow.Column
[ https://issues.apache.org/jira/browse/ARROW-3262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe L. Korn resolved ARROW-3262.
--
Resolution: Fixed
Issue resolved by pull request 2585
[https://github.com/apache/arrow/pull/2585]

> [Python] Implement __getitem__ with integers on pyarrow.Column
> --------------------------------------------------------------
>
> Key: ARROW-3262
> URL: https://issues.apache.org/jira/browse/ARROW-3262
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Wes McKinney
> Assignee: Krisztian Szucs
> Priority: Major
> Labels: pull-request-available, usability
> Fix For: 0.11.0
> Time Spent: 20m
> Remaining Estimate: 0h
>
> This would improve interactive usability
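A pyarrow `Column` is backed by chunked data, so an integer `__getitem__` has to locate the right chunk before it can return an element. A rough pure-Python sketch of that lookup (the `ChunkedColumn` class is illustrative, not pyarrow's implementation):

```python
class ChunkedColumn:
    """Illustrative chunked column with integer indexing; not pyarrow's code."""

    def __init__(self, chunks):
        self.chunks = chunks  # list of per-chunk sequences

    def __len__(self):
        return sum(len(c) for c in self.chunks)

    def __getitem__(self, i):
        if i < 0:
            i += len(self)  # support negative indices like regular sequences
        if i < 0:
            raise IndexError("index out of bounds")
        # Walk the chunks, subtracting each chunk's length until i falls inside one.
        for chunk in self.chunks:
            if i < len(chunk):
                return chunk[i]
            i -= len(chunk)
        raise IndexError("index out of bounds")

col = ChunkedColumn([[1, 2], [3, 4, 5]])
print(col[0], col[3], col[-1])  # → 1 4 5
```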
[jira] [Updated] (ARROW-3267) [Python] Create empty table from schema
[ https://issues.apache.org/jira/browse/ARROW-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-3267:
--
Labels: pull-request-available (was: )

> [Python] Create empty table from schema
> ---------------------------------------
>
> Key: ARROW-3267
> URL: https://issues.apache.org/jira/browse/ARROW-3267
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Uwe L. Korn
> Assignee: Uwe L. Korn
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.11.0
>
> When one knows the expected schema for its input data but has no input data
> for a data pipeline, it is necessary to construct an empty table as a
> sentinel value to pass through.
> This is a small but often useful convenience function.
[jira] [Commented] (ARROW-3267) [Python] Create empty table from schema
[ https://issues.apache.org/jira/browse/ARROW-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16621664#comment-16621664 ] Uwe L. Korn commented on ARROW-3267:
--
[~Paul.Rogers] We already have the necessary builder infrastructure; this function is mainly to have something to pass around when there is no data. Also, the {{Table}} instance is not meant to be modified, i.e. it will stay empty throughout the pipeline.

> [Python] Create empty table from schema
> ---------------------------------------
>
> Key: ARROW-3267
> URL: https://issues.apache.org/jira/browse/ARROW-3267
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Uwe L. Korn
> Assignee: Uwe L. Korn
> Priority: Major
> Fix For: 0.11.0
>
> When one knows the expected schema for its input data but has no input data
> for a data pipeline, it is necessary to construct an empty table as a
> sentinel value to pass through.
> This is a small but often useful convenience function.