[jira] [Commented] (ARROW-3280) [Python] Difficulty running tests after conda install
[ https://issues.apache.org/jira/browse/ARROW-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16622941#comment-16622941 ] Matthew Rocklin commented on ARROW-3280: I got as far as the following. My apologies for the rambling issue. Feel free to close. I'm also happy to reopen something else a bit cleaner.
{code:java}
mrocklin@carbon:~/workspace/arrow/python$ python setup.py build_ext --build-type=$ARROW_BUILD_TYPE \
>     --with-parquet --with-plasma --inplace
your setuptools is too old (<12)
setuptools_scm functionality is degraded
Traceback (most recent call last):
  File "setup.py", line 589, in <module>
    url="https://arrow.apache.org/"
  File "/home/mrocklin/Software/anaconda/envs/pyarrow-dev/lib/python3.6/site-packages/setuptools/__init__.py", line 140, in setup
    return distutils.core.setup(**attrs)
  File "/home/mrocklin/Software/anaconda/envs/pyarrow-dev/lib/python3.6/distutils/core.py", line 108, in setup
    _setup_distribution = dist = klass(attrs)
  File "/home/mrocklin/Software/anaconda/envs/pyarrow-dev/lib/python3.6/site-packages/setuptools/dist.py", line 370, in __init__
    k: v for k, v in attrs.items()
  File "/home/mrocklin/Software/anaconda/envs/pyarrow-dev/lib/python3.6/distutils/dist.py", line 281, in __init__
    self.finalize_options()
  File "/home/mrocklin/Software/anaconda/envs/pyarrow-dev/lib/python3.6/site-packages/setuptools/dist.py", line 529, in finalize_options
    ep.load()(self, ep.name, value)
  File "/home/mrocklin/workspace/arrow/python/.eggs/setuptools_scm-1.15.1rc1-py3.6.egg/setuptools_scm/integration.py", line 19, in version_keyword
  File "/home/mrocklin/workspace/arrow/python/.eggs/setuptools_scm-1.15.1rc1-py3.6.egg/setuptools_scm/__init__.py", line 117, in get_version
  File "/home/mrocklin/workspace/arrow/python/.eggs/setuptools_scm-1.15.1rc1-py3.6.egg/setuptools_scm/__init__.py", line 69, in _do_parse
  File "setup.py", line 519, in parse_git
    return parse(root, **kwargs)
  File "/home/mrocklin/workspace/arrow/python/.eggs/setuptools_scm-1.15.1rc1-py3.6.egg/setuptools_scm/git.py", line 99, in parse
ValueError: invalid literal for int() with base 10: 'ge2c4b09d'
{code}
> [Python] Difficulty running tests after conda install > - > > Key: ARROW-3280 > URL: https://issues.apache.org/jira/browse/ARROW-3280 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.10.0 > Environment: conda create -n test-arrow pytest ipython pandas nomkl > pyarrow -c conda-forge > Ubuntu 16.04 >Reporter: Matthew Rocklin >Priority: Minor > Labels: python > > I install PyArrow from conda-forge, and then try running tests (or import > generally) > {code:java} > conda create -n test-arrow pytest ipython pandas nomkl pyarrow -c conda-forge > {code} > {code:java} > mrocklin@carbon:~/workspace/arrow/python$ py.test > pyarrow/tests/test_parquet.py > Traceback (most recent call last): > File > "/home/mrocklin/Software/anaconda/lib/python3.6/site-packages/_pytest/config.py", > line 328, in _getconftestmodules > return self._path2confmods[path] > KeyError: > local('/home/mrocklin/workspace/arrow/python/pyarrow/tests/test_parquet.py')During > handling of the above exception, another exception occurred: > Traceback (most recent call last): > File > "/home/mrocklin/Software/anaconda/lib/python3.6/site-packages/_pytest/config.py", > line 328, in _getconftestmodules > return self._path2confmods[path] > KeyError: local('/home/mrocklin/workspace/arrow/python/pyarrow/tests')During > handling of the above exception, another exception occurred: > Traceback (most recent call last): > File > "/home/mrocklin/Software/anaconda/lib/python3.6/site-packages/_pytest/config.py", > line 359, in _importconftest > return self._conftestpath2mod[conftestpath] > KeyError: > local('/home/mrocklin/workspace/arrow/python/pyarrow/tests/conftest.py')During > handling of the above exception, another exception occurred: > Traceback (most recent call last): > File > 
"/home/mrocklin/Software/anaconda/lib/python3.6/site-packages/_pytest/config.py", > line 365, in _importconftest > mod = conftestpath.pyimport() > File > "/home/mrocklin/Software/anaconda/lib/python3.6/site-packages/py/_path/local.py", > line 668, in pyimport > __import__(modname) > File "/home/mrocklin/workspace/arrow/python/pyarrow/__init__.py", line 54, in > > from pyarrow.lib import cpu_count, set_cpu_count > ModuleNotFoundError: No module named 'pyarrow.lib' > ERROR: could not load > /home/mrocklin/workspace/arrow/python/pyarrow/tests/conftest.py{code} > Probably this is something wrong with my environment, but I thought I'd > report it as a usability bug -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3280) [Python] Difficulty running tests after conda install
[ https://issues.apache.org/jira/browse/ARROW-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16622938#comment-16622938 ] Matthew Rocklin commented on ARROW-3280: Actually, let me back up. I wrote a test and wanted to run it. I tried running py.test from that directory and got the error that I list above. That's probably reasonable, given that Python is probably confused about paths given that I'm in a directory named arrow. I googled online for arrow developer notes, and eventually found that I was supposed to look in the directory for a README file. That file didn't have anything about testing in it explicitly. I see now that it has a "Build from source" section that links to external docs. I'll go and try that and see what happens. > [Python] Difficulty running tests after conda install > - > > Key: ARROW-3280 > URL: https://issues.apache.org/jira/browse/ARROW-3280 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.10.0 > Environment: conda create -n test-arrow pytest ipython pandas nomkl > pyarrow -c conda-forge > Ubuntu 16.04 >Reporter: Matthew Rocklin >Priority: Minor > Labels: python > > I install PyArrow from conda-forge, and then try running tests (or import > generally) > {code:java} > conda create -n test-arrow pytest ipython pandas nomkl pyarrow -c conda-forge > {code} > {code:java} > mrocklin@carbon:~/workspace/arrow/python$ py.test > pyarrow/tests/test_parquet.py > Traceback (most recent call last): > File > "/home/mrocklin/Software/anaconda/lib/python3.6/site-packages/_pytest/config.py", > line 328, in _getconftestmodules > return self._path2confmods[path] > KeyError: > local('/home/mrocklin/workspace/arrow/python/pyarrow/tests/test_parquet.py')During > handling of the above exception, another exception occurred: > Traceback (most recent call last): > File > "/home/mrocklin/Software/anaconda/lib/python3.6/site-packages/_pytest/config.py", > line 328, in _getconftestmodules > 
return self._path2confmods[path] > KeyError: local('/home/mrocklin/workspace/arrow/python/pyarrow/tests')During > handling of the above exception, another exception occurred: > Traceback (most recent call last): > File > "/home/mrocklin/Software/anaconda/lib/python3.6/site-packages/_pytest/config.py", > line 359, in _importconftest > return self._conftestpath2mod[conftestpath] > KeyError: > local('/home/mrocklin/workspace/arrow/python/pyarrow/tests/conftest.py')During > handling of the above exception, another exception occurred: > Traceback (most recent call last): > File > "/home/mrocklin/Software/anaconda/lib/python3.6/site-packages/_pytest/config.py", > line 365, in _importconftest > mod = conftestpath.pyimport() > File > "/home/mrocklin/Software/anaconda/lib/python3.6/site-packages/py/_path/local.py", > line 668, in pyimport > __import__(modname) > File "/home/mrocklin/workspace/arrow/python/pyarrow/__init__.py", line 54, in > > from pyarrow.lib import cpu_count, set_cpu_count > ModuleNotFoundError: No module named 'pyarrow.lib' > ERROR: could not load > /home/mrocklin/workspace/arrow/python/pyarrow/tests/conftest.py{code} > Probably this is something wrong with my environment, but I thought I'd > report it as a usability bug -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3280) [Python] Difficulty running tests after conda install
Matthew Rocklin created ARROW-3280: -- Summary: [Python] Difficulty running tests after conda install Key: ARROW-3280 URL: https://issues.apache.org/jira/browse/ARROW-3280 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.10.0 Environment: conda create -n test-arrow pytest ipython pandas nomkl pyarrow -c conda-forge Ubuntu 16.04 Reporter: Matthew Rocklin I install PyArrow from conda-forge, and then try running tests (or import generally)
{code:java}
conda create -n test-arrow pytest ipython pandas nomkl pyarrow -c conda-forge
{code}
{code:java}
mrocklin@carbon:~/workspace/arrow/python$ py.test pyarrow/tests/test_parquet.py
Traceback (most recent call last):
  File "/home/mrocklin/Software/anaconda/lib/python3.6/site-packages/_pytest/config.py", line 328, in _getconftestmodules
    return self._path2confmods[path]
KeyError: local('/home/mrocklin/workspace/arrow/python/pyarrow/tests/test_parquet.py')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/mrocklin/Software/anaconda/lib/python3.6/site-packages/_pytest/config.py", line 328, in _getconftestmodules
    return self._path2confmods[path]
KeyError: local('/home/mrocklin/workspace/arrow/python/pyarrow/tests')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/mrocklin/Software/anaconda/lib/python3.6/site-packages/_pytest/config.py", line 359, in _importconftest
    return self._conftestpath2mod[conftestpath]
KeyError: local('/home/mrocklin/workspace/arrow/python/pyarrow/tests/conftest.py')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/mrocklin/Software/anaconda/lib/python3.6/site-packages/_pytest/config.py", line 365, in _importconftest
    mod = conftestpath.pyimport()
  File "/home/mrocklin/Software/anaconda/lib/python3.6/site-packages/py/_path/local.py", line 668, in pyimport
    __import__(modname)
  File "/home/mrocklin/workspace/arrow/python/pyarrow/__init__.py", line 54, in <module>
    from pyarrow.lib import cpu_count, set_cpu_count
ModuleNotFoundError: No module named 'pyarrow.lib'
ERROR: could not load /home/mrocklin/workspace/arrow/python/pyarrow/tests/conftest.py
{code}
Probably this is something wrong with my environment, but I thought I'd report it as a usability bug -- This message was sent by Atlassian JIRA (v7.6.3#76005)
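The {{ModuleNotFoundError}} above is consistent with a common import-path pitfall (my assumption, not confirmed in the ticket): running py.test from inside the source checkout puts the unbuilt {{pyarrow/}} source directory ahead of the conda-installed package on {{sys.path}}, so the compiled {{pyarrow.lib}} extension is never found. A minimal stdlib-only sketch of that shadowing effect, using a made-up package name ({{shadowpkg}} is hypothetical):

```python
import os
import sys
import tempfile

# Build a source-tree-like package whose __init__ imports a compiled
# submodule that was never built. "shadowpkg" is a hypothetical name.
workdir = tempfile.mkdtemp()
pkgdir = os.path.join(workdir, "shadowpkg")
os.makedirs(pkgdir)
with open(os.path.join(pkgdir, "__init__.py"), "w") as f:
    f.write("from shadowpkg.lib import cpu_count\n")  # shadowpkg/lib.py does not exist

sys.path.insert(0, workdir)  # like launching py.test from inside the checkout
err = None
try:
    import shadowpkg  # noqa: F401
except ModuleNotFoundError as exc:
    err = exc

print(err)  # No module named 'shadowpkg.lib'
```

Running the tests from a directory outside the checkout (or building the extension in place first) avoids the shadowing.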
[jira] [Commented] (ARROW-3245) [Python] Infer index and/or filtering from parquet column statistics
[ https://issues.apache.org/jira/browse/ARROW-3245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16622923#comment-16622923 ] Wes McKinney commented on ARROW-3245: - Obviously this function _should_ try to use local filesystem methods if no argument is provided. Even better would be to instantiate {{ParquetDatasetPiece}} with a reference to the filesystem in use. Neither change is particularly difficult > [Python] Infer index and/or filtering from parquet column statistics > > > Key: ARROW-3245 > URL: https://issues.apache.org/jira/browse/ARROW-3245 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Martin Durant >Priority: Major > Labels: parquet > > The metadata included in parquet generally gives the min/max of data for each > chunk of each column. This allows early filtering out of whole chunks if they > do not meet some criterion, and can greatly reduce reading burden in some > circumstances. In Dask, we care about this for setting an index and its > "divisions" (start/stop values for each data partition) and for directly > avoiding including some chunks in the graph of tasks to be processed. > Similarly, filtering may be applied on the values of fields defined by the > directory partitioning. > Currently, dask using the fastparquet backend is able to infer possible > columns to use as an index, perform filtering on that index and do general > filtering on any column which has statistical or partitioning information. It > would be very helpful to have such facilities via pyarrow also. > This is probably the most important of the requests from Dask. > (please forgive that some of this has already been mentioned elsewhere; this > is one of the entries in the list at > [https://github.com/dask/fastparquet/issues/374] as a feature that is useful > in fastparquet) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3245) [Python] Infer index and/or filtering from parquet column statistics
[ https://issues.apache.org/jira/browse/ARROW-3245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16622921#comment-16622921 ] Matthew Rocklin commented on ARROW-3245: After some fooling around this worked for me
{code:java}
import pyarrow.parquet as pq
import pandas as pd
import functools

df = pd.DataFrame({'a': [1, 0]})
df.to_parquet('out.parq', engine='pyarrow')
pf = pq.ParquetDataset('out.parq')
piece = pf.pieces[0]
piece.get_metadata(functools.partial(open, mode='rb'))
{code}
I had to dive into the source a bit to figure out how to interpret the docstring. > [Python] Infer index and/or filtering from parquet column statistics > > > Key: ARROW-3245 > URL: https://issues.apache.org/jira/browse/ARROW-3245 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Martin Durant >Priority: Major > Labels: parquet > > The metadata included in parquet generally gives the min/max of data for each > chunk of each column. This allows early filtering out of whole chunks if they > do not meet some criterion, and can greatly reduce reading burden in some > circumstances. In Dask, we care about this for setting an index and its > "divisions" (start/stop values for each data partition) and for directly > avoiding including some chunks in the graph of tasks to be processed. > Similarly, filtering may be applied on the values of fields defined by the > directory partitioning. > Currently, dask using the fastparquet backend is able to infer possible > columns to use as an index, perform filtering on that index and do general > filtering on any column which has statistical or partitioning information. It > would be very helpful to have such facilities via pyarrow also. > This is probably the most important of the requests from Dask. 
> (please forgive that some of this has already been mentioned elsewhere; this > is one of the entries in the list at > [https://github.com/dask/fastparquet/issues/374] as a feature that is useful > in fastparquet) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3244) [Python] Multi-file parquet loading without scan
[ https://issues.apache.org/jira/browse/ARROW-3244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16622920#comment-16622920 ] Wes McKinney commented on ARROW-3244: - The pyarrow perspective is essentially agnostic to the data access pattern, but we'd like to provide APIs to do as the user wishes with the files. The basic pattern of a partitioned dataset read by a single node works fine now (that's the {{ParquetDataset}} object). Let's come up with a concrete API ask and the desired semantics with regards to when precisely the underlying file system is to be accessed, and if this is not available now, we can slate it for one of the upcoming releases. > [Python] Multi-file parquet loading without scan > > > Key: ARROW-3244 > URL: https://issues.apache.org/jira/browse/ARROW-3244 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Martin Durant >Priority: Major > Labels: parquet > > A number of mechanisms are possible to avoid having to access and read the > parquet footers in a data set consisting of a number of files. In the case of > a large number of data files (perhaps split with directory partitioning) and > remote storage, this can be a significant overhead. This is significant from > the point of view of Dask, which must have the metadata available in the > client before setting up computational graphs. > > Here are some suggestions of what could be done. > > * some parquet writing frameworks include a `_metadata` file, which contains > all the information from the footers of the various files. If this file is > present, then this data can be read from one place, with a single file > access. 
For a large number of files, parsing the thrift information may, by > itself, be a non-negligible overhead. > * the schema (dtypes) can be found in a `_common_metadata`, or from any one > of the data-files, then the schema could be assumed (perhaps at the user's > option) to be the same for all of the files. However, the information about > the directory partitioning would not be available. Although Dask may infer > the information from the filenames, it would be preferable to go through the > machinery with parquet-cpp, and view the whole data-set as a single object. > Note that the files will still need to have the footer read to access the > data, for the byte offsets, but from Dask's point of view, this would be > deferred to tasks running in parallel. > (please forgive that some of this has already been mentioned elsewhere; this > is one of the entries in the list at > [https://github.com/dask/fastparquet/issues/374] as a feature that is useful > in fastparquet) > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
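The lookup order proposed in the issue description (a consolidated `_metadata` file, then `_common_metadata` for the schema only, then a full footer scan) can be sketched with a small helper; the function name and return convention below are mine, not pyarrow's:

```python
import os

def pick_metadata_source(dataset_dir, filenames):
    """Hypothetical helper: decide where dataset metadata should come from.

    Returns (path, kind), where kind is "full" (one read covers every
    footer), "schema-only" (row-group offsets still need per-file reads),
    or "scan-footers" (worst case: one access per data file).
    """
    if "_metadata" in filenames:
        return os.path.join(dataset_dir, "_metadata"), "full"
    if "_common_metadata" in filenames:
        return os.path.join(dataset_dir, "_common_metadata"), "schema-only"
    return None, "scan-footers"
```

For remote storage the difference between "full" and "scan-footers" is one file access versus one per data file, which is the overhead the issue is about.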
[jira] [Commented] (ARROW-3244) [Python] Multi-file parquet loading without scan
[ https://issues.apache.org/jira/browse/ARROW-3244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16622917#comment-16622917 ] Matthew Rocklin commented on ARROW-3244: What happens today when someone reads a multi-file parquet dataset with dask dataframe? We read a single file to get the schema and then just build tasks for everything else? Or do we need to read through each of the files in order to find out how many row blocks are in each? On the Arrow side is this in scope? Is this already implemented? Are there mechanisms to construct the metadata files from within Arrow? If not, and if this is in scope then what is the right way / place to add this behavior? > [Python] Multi-file parquet loading without scan > > > Key: ARROW-3244 > URL: https://issues.apache.org/jira/browse/ARROW-3244 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Martin Durant >Priority: Major > Labels: parquet > > A number of mechanisms are possible to avoid having to access and read the > parquet footers in a data set consisting of a number of files. In the case of > a large number of data files (perhaps split with directory partitioning) and > remote storage, this can be a significant overhead. This is significant from > the point of view of Dask, which must have the metadata available in the > client before setting up computational graphs. > > Here are some suggestions of what could be done. > > * some parquet writing frameworks include a `_metadata` file, which contains > all the information from the footers of the various files. If this file is > present, then this data can be read from one place, with a single file > access. 
For a large number of files, parsing the thrift information may, by > itself, be a non-negligible overhead. > * the schema (dtypes) can be found in a `_common_metadata`, or from any one > of the data-files, then the schema could be assumed (perhaps at the user's > option) to be the same for all of the files. However, the information about > the directory partitioning would not be available. Although Dask may infer > the information from the filenames, it would be preferable to go through the > machinery with parquet-cpp, and view the whole data-set as a single object. > Note that the files will still need to have the footer read to access the > data, for the byte offsets, but from Dask's point of view, this would be > deferred to tasks running in parallel. > (please forgive that some of this has already been mentioned elsewhere; this > is one of the entries in the list at > [https://github.com/dask/fastparquet/issues/374] as a feature that is useful > in fastparquet) > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3278) [Python] Retrieve StructType's and StructArray's field by name
[ https://issues.apache.org/jira/browse/ARROW-3278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16622804#comment-16622804 ] Wes McKinney commented on ARROW-3278: - This should be {{field_by_name}} as with {{Schema}} > [Python] Retrieve StructType's and StructArray's field by name > -- > > Key: ARROW-3278 > URL: https://issues.apache.org/jira/browse/ARROW-3278 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Krisztian Szucs >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3070) [Release] Host binary artifacts for RCs and releases on ASF Bintray account instead of dist/mirror system
[ https://issues.apache.org/jira/browse/ARROW-3070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16622800#comment-16622800 ] Wes McKinney commented on ARROW-3070: - We should do what Apache Aurora is doing https://github.com/apache/aurora-packaging#hash-sign-and-upload-the-binaries cc [~xhochy] > [Release] Host binary artifacts for RCs and releases on ASF Bintray account > instead of dist/mirror system > - > > Key: ARROW-3070 > URL: https://issues.apache.org/jira/browse/ARROW-3070 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Wes McKinney >Priority: Major > Fix For: 0.11.0 > > > Since the artifacts are large this is a better place for them. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-3270) [Release] Adjust release verification scripts to recent parquet migration
[ https://issues.apache.org/jira/browse/ARROW-3270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-3270. - Resolution: Fixed Issue resolved by pull request 2591 [https://github.com/apache/arrow/pull/2591] > [Release] Adjust release verification scripts to recent parquet migration > - > > Key: ARROW-3270 > URL: https://issues.apache.org/jira/browse/ARROW-3270 > Project: Apache Arrow > Issue Type: Improvement > Components: Packaging >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 0.11.0 > > Time Spent: 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3245) [Python] Infer index and/or filtering from parquet column statistics
[ https://issues.apache.org/jira/browse/ARROW-3245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16622789#comment-16622789 ] Wes McKinney commented on ARROW-3245: - You have to pass a function to that method > [Python] Infer index and/or filtering from parquet column statistics > > > Key: ARROW-3245 > URL: https://issues.apache.org/jira/browse/ARROW-3245 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Martin Durant >Priority: Major > Labels: parquet > > The metadata included in parquet generally gives the min/max of data for each > chunk of each column. This allows early filtering out of whole chunks if they > do not meet some criterion, and can greatly reduce reading burden in some > circumstances. In Dask, we care about this for setting an index and its > "divisions" (start/stop values for each data partition) and for directly > avoiding including some chunks in the graph of tasks to be processed. > Similarly, filtering may be applied on the values of fields defined by the > directory partitioning. > Currently, dask using the fastparquet backend is able to infer possible > columns to use as an index, perform filtering on that index and do general > filtering on any column which has statistical or partitioning information. It > would be very helpful to have such facilities via pyarrow also. > This is probably the most important of the requests from Dask. > (please forgive that some of this has already been mentioned elsewhere; this > is one of the entries in the list at > [https://github.com/dask/fastparquet/issues/374] as a feature that is useful > in fastparquet) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
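Concretely, the convention as I read it from this comment is that {{get_metadata}} receives a callable mapping a path to an open, readable file object, e.g. {{functools.partial(open, mode='rb')}}. {{PieceSketch}} below is a stand-in to illustrate the pattern, not pyarrow's actual {{ParquetDatasetPiece}}:

```python
import functools
import os
import tempfile

class PieceSketch:
    """Stand-in for ParquetDatasetPiece, illustrating the calling convention."""
    def __init__(self, path):
        self.path = path

    def get_metadata(self, open_file_func):
        # The caller supplies the opener, so the same piece works for
        # local files, S3, HDFS, etc.
        with open_file_func(self.path) as f:
            return f.read(4)  # real parquet files begin with the magic b'PAR1'

# Write a throwaway file that starts with the parquet magic bytes.
fd, path = tempfile.mkstemp()
os.write(fd, b"PAR1...")
os.close(fd)

magic = PieceSketch(path).get_metadata(functools.partial(open, mode="rb"))
```

Defaulting {{open_file_func}} to the builtin {{open}} for local paths, as suggested in the follow-up comment, would make the explicit callable unnecessary in the common case.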
[jira] [Resolved] (ARROW-3249) [Python] Run flake8 on integration_test.py and crossbow.py
[ https://issues.apache.org/jira/browse/ARROW-3249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-3249. - Resolution: Fixed Issue resolved by pull request 2590 [https://github.com/apache/arrow/pull/2590] > [Python] Run flake8 on integration_test.py and crossbow.py > -- > > Key: ARROW-3249 > URL: https://issues.apache.org/jira/browse/ARROW-3249 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 0.11.0 > > Time Spent: 20m > Remaining Estimate: 0h > > We should keep this code clean, too -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-3261) [Python] Add "field" method to select fields from StructArray
[ https://issues.apache.org/jira/browse/ARROW-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-3261. - Resolution: Fixed Issue resolved by pull request 2586 [https://github.com/apache/arrow/pull/2586] > [Python] Add "field" method to select fields from StructArray > - > > Key: ARROW-3261 > URL: https://issues.apache.org/jira/browse/ARROW-3261 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available, usability > Fix For: 0.11.0 > > Time Spent: 50m > Remaining Estimate: 0h > > This would improve usability. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-3146) [C++] Barebones Flight RPC server and client implementations
[ https://issues.apache.org/jira/browse/ARROW-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-3146. - Resolution: Fixed Fix Version/s: (was: 0.12.0) 0.11.0 Issue resolved by pull request 2547 [https://github.com/apache/arrow/pull/2547] > [C++] Barebones Flight RPC server and client implementations > > > Key: ARROW-3146 > URL: https://issues.apache.org/jira/browse/ARROW-3146 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 0.11.0 > > Time Spent: 4h 50m > Remaining Estimate: 0h > > Unsecure transport only (SSL support will require a fair bit of toolchain > work) > Depends on ARROW-249 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3279) [C++] Allow linking Arrow tests dynamically on Windows
[ https://issues.apache.org/jira/browse/ARROW-3279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-3279: -- Labels: pull-request-available (was: ) > [C++] Allow linking Arrow tests dynamically on Windows > -- > > Key: ARROW-3279 > URL: https://issues.apache.org/jira/browse/ARROW-3279 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.10.0 >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > > On Windows, C++ modules are compiled once for each library kind (static, > shared). This means we do twice the work on e.g. AppVeyor. We should be able > to link the Arrow tests with the Arrow DLL instead, at least on Windows. > Things are a bit more complicated for Parquet because of PARQUET-1420. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3279) [C++] Allow linking Arrow tests dynamically on Windows
Antoine Pitrou created ARROW-3279: - Summary: [C++] Allow linking Arrow tests dynamically on Windows Key: ARROW-3279 URL: https://issues.apache.org/jira/browse/ARROW-3279 Project: Apache Arrow Issue Type: Bug Components: C++ Affects Versions: 0.10.0 Reporter: Antoine Pitrou Assignee: Antoine Pitrou On Windows, C++ modules are compiled once for each library kind (static, shared). This means we do twice the work on e.g. AppVeyor. We should be able to link the Arrow tests with the Arrow DLL instead, at least on Windows. Things are a bit more complicated for Parquet because of PARQUET-1420. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3245) [Python] Infer index and/or filtering from parquet column statistics
[ https://issues.apache.org/jira/browse/ARROW-3245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16621990#comment-16621990 ] Martin Durant commented on ARROW-3245: -- (pyarrow 0.10.0)
{code:java}
In [7]: df = pd.DataFrame({'a': [1, 0]})

In [8]: df.to_parquet('out.parq', engine='pyarrow')

In [9]: pf = pq.ParquetDataset('out.parq')

In [10]: pf.pieces[0].get_metadata()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-10> in <module>()
----> 1 pf.pieces[0].get_metadata()

~/anaconda/envs/tester/lib/python3.6/site-packages/pyarrow/parquet.py in get_metadata(self, open_file_func)
    412         file's metadata
    413         """
--> 414         return self._open(open_file_func).metadata
    415
    416     def _open(self, open_file_func=None):

~/anaconda/envs/tester/lib/python3.6/site-packages/pyarrow/parquet.py in _open(self, open_file_func)
    418         Returns instance of ParquetFile
    419         """
--> 420         reader = open_file_func(self.path)
    421         if not isinstance(reader, ParquetFile):
    422             reader = ParquetFile(reader)

TypeError: 'NoneType' object is not callable
{code}
> [Python] Infer index and/or filtering from parquet column statistics > > > Key: ARROW-3245 > URL: https://issues.apache.org/jira/browse/ARROW-3245 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Martin Durant >Priority: Major > Labels: parquet > > The metadata included in parquet generally gives the min/max of data for each > chunk of each column. This allows early filtering out of whole chunks if they > do not meet some criterion, and can greatly reduce reading burden in some > circumstances. In Dask, we care about this for setting an index and its > "divisions" (start/stop values for each data partition) and for directly > avoiding including some chunks in the graph of tasks to be processed. > Similarly, filtering may be applied on the values of fields defined by the > directory partitioning. 
> Currently, dask using the fastparquet backend is able to infer possible > columns to use as an index, perform filtering on that index and do general > filtering on any column which has statistical or partitioning information. It > would be very helpful to have such facilities via pyarrow also. > This is probably the most important of the requests from Dask. > (please forgive that some of this has already been mentioned elsewhere; this > is one of the entries in the list at > [https://github.com/dask/fastparquet/issues/374] as a feature that is useful > in fastparquet) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3070) [Release] Host binary artifacts for RCs and releases on ASF Bintray account instead of dist/mirror system
[ https://issues.apache.org/jira/browse/ARROW-3070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16621978#comment-16621978 ] Krisztian Szucs commented on ARROW-3070: I've never used Bintray before, but there were descriptor.json files previously in arrow-dist. Will it be fully manual for 0.11? > [Release] Host binary artifacts for RCs and releases on ASF Bintray account > instead of dist/mirror system > - > > Key: ARROW-3070 > URL: https://issues.apache.org/jira/browse/ARROW-3070 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Wes McKinney >Priority: Major > Fix For: 0.11.0 > > > Since the artifacts are large this is a better place for them. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3271) [Python] Manylinux1 builds timing out in Travis CI
[ https://issues.apache.org/jira/browse/ARROW-3271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16621974#comment-16621974 ] Uwe L. Korn commented on ARROW-3271:
--
We could limit the manylinux1 builds to e.g. a single Python version to improve the build times. They are quite important, as they represent the lower bound of the compiler versions we support.

> [Python] Manylinux1 builds timing out in Travis CI
> --------------------------------------------------
>
> Key: ARROW-3271
> URL: https://issues.apache.org/jira/browse/ARROW-3271
> Project: Apache Arrow
> Issue Type: Bug
> Components: Continuous Integration
> Reporter: Wes McKinney
> Priority: Major
> Fix For: 0.11.0
>
> Not sure why this is happening -- I think the docker pull has been a lot
> slower of late
[jira] [Commented] (ARROW-3243) [C++] Upgrade jemalloc to version 5
[ https://issues.apache.org/jira/browse/ARROW-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16621971#comment-16621971 ] Uwe L. Korn commented on ARROW-3243:
--
The patch we have is relevant solely for jemalloc 4; it is already included in the released jemalloc 5 branch. Sadly, jemalloc 5 had some changes that made it unusable in the {{manylinux1}} setting. It could be that these have been resolved in the meantime, in which case we could switch to a newer version. You can simply try this by changing the installation script. Otherwise we probably have to wait until we have changed our wheels to be based on {{manylinux2010}}.

> [C++] Upgrade jemalloc to version 5
> -----------------------------------
>
> Key: ARROW-3243
> URL: https://issues.apache.org/jira/browse/ARROW-3243
> Project: Apache Arrow
> Issue Type: Improvement
> Reporter: Philipp Moritz
> Priority: Major
>
> Is it possible/feasible to upgrade jemalloc to version 5 and assume that
> version? I'm asking because I've been working towards replacing dlmalloc in
> plasma with jemalloc, which makes some of the code much nicer and removes
> some of the issues we had with dlmalloc, but it requires jemalloc APIs that
> are only available starting from jemalloc version 5, in particular, I'm using
> the extent_hooks_t capability.
> For now I can submit a patch that uses a different version of jemalloc in
> plasma and then we can figure out how to deal with it (maybe there is a way
> to make it work with older versions). What are your thoughts?
[jira] [Commented] (ARROW-3141) [Python] Tensorflow support in pyarrow wheels pins numpy>=1.14
[ https://issues.apache.org/jira/browse/ARROW-3141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16621952#comment-16621952 ] Uwe L. Korn commented on ARROW-3141:
--
I'm not really sure how long people stay on old NumPy versions. I guess we can increase the minimum version. Still, we should be very careful about the NumPy version in our builds and should not let it update automatically.

> [Python] Tensorflow support in pyarrow wheels pins numpy>=1.14
> --------------------------------------------------------------
>
> Key: ARROW-3141
> URL: https://issues.apache.org/jira/browse/ARROW-3141
> Project: Apache Arrow
> Issue Type: Bug
> Components: Packaging, Python
> Affects Versions: 0.10.0
> Reporter: Uwe L. Korn
> Priority: Major
> Fix For: 0.11.0
>
> This was introduced by https://github.com/apache/arrow/pull/2104/files
> Two options:
> * Don't build with tensorflow support by default
> * Increase our minimum supported NumPy version to 1.14 overall
[jira] [Updated] (ARROW-3270) [Release] Adjust release verification scripts to recent parquet migration
[ https://issues.apache.org/jira/browse/ARROW-3270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-3270:
--
Labels: pull-request-available (was: )

> [Release] Adjust release verification scripts to recent parquet migration
> -------------------------------------------------------------------------
>
> Key: ARROW-3270
> URL: https://issues.apache.org/jira/browse/ARROW-3270
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Packaging
> Reporter: Krisztian Szucs
> Assignee: Krisztian Szucs
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.11.0
[jira] [Assigned] (ARROW-3270) [Release] Adjust release verification scripts to recent parquet migration
[ https://issues.apache.org/jira/browse/ARROW-3270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs reassigned ARROW-3270:
--
Assignee: Krisztian Szucs

> [Release] Adjust release verification scripts to recent parquet migration
> -------------------------------------------------------------------------
>
> Key: ARROW-3270
> URL: https://issues.apache.org/jira/browse/ARROW-3270
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Packaging
> Reporter: Krisztian Szucs
> Assignee: Krisztian Szucs
> Priority: Major
> Fix For: 0.11.0
[jira] [Resolved] (ARROW-3267) [Python] Create empty table from schema
[ https://issues.apache.org/jira/browse/ARROW-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs resolved ARROW-3267.
--
Resolution: Fixed
Issue resolved by pull request 2589
[https://github.com/apache/arrow/pull/2589]

> [Python] Create empty table from schema
> ---------------------------------------
>
> Key: ARROW-3267
> URL: https://issues.apache.org/jira/browse/ARROW-3267
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Uwe L. Korn
> Assignee: Uwe L. Korn
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.11.0
> Time Spent: 2h 10m
> Remaining Estimate: 0h
>
> When one knows the expected schema for its input data but has no input data
> for a data pipeline, it is necessary to construct an empty table as a
> sentinel value to pass through.
> This is a small but often useful convenience function.
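The convenience being added here can be illustrated with a small pure-Python stand-in (the `empty_table` name and the schema representation below are illustrative, not pyarrow's actual API): an empty table still carries the full column structure, so downstream pipeline code can rely on the schema even when there are zero rows.

```python
# Illustrative stand-in for "empty table from schema"; not pyarrow's real API.
def empty_table(schema):
    """Build a column dict with the schema's names but zero rows."""
    return {name: [] for name, dtype in schema}

schema = [("a", "int64"), ("b", "string")]
table = empty_table(schema)
print(list(table), sum(len(col) for col in table.values()))  # → ['a', 'b'] 0
```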
[jira] [Updated] (ARROW-3249) [Python] Run flake8 on integration_test.py and crossbow.py
[ https://issues.apache.org/jira/browse/ARROW-3249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-3249:
--
Labels: pull-request-available (was: )

> [Python] Run flake8 on integration_test.py and crossbow.py
> ----------------------------------------------------------
>
> Key: ARROW-3249
> URL: https://issues.apache.org/jira/browse/ARROW-3249
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Wes McKinney
> Assignee: Krisztian Szucs
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.11.0
>
> We should keep this code clean, too
[jira] [Updated] (ARROW-3249) [Python] Run flake8 on integration_test.py and crossbow.py
[ https://issues.apache.org/jira/browse/ARROW-3249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs updated ARROW-3249:
--
Summary: [Python] Run flake8 on integration_test.py and crossbow.py (was: [Python] Run flake8 on integration_test.py)

> [Python] Run flake8 on integration_test.py and crossbow.py
> ----------------------------------------------------------
>
> Key: ARROW-3249
> URL: https://issues.apache.org/jira/browse/ARROW-3249
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Wes McKinney
> Assignee: Krisztian Szucs
> Priority: Major
> Fix For: 0.11.0
>
> We should keep this code clean, too
[jira] [Commented] (ARROW-3249) [Python] Run flake8 on integration_test.py
[ https://issues.apache.org/jira/browse/ARROW-3249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16621840#comment-16621840 ] Krisztian Szucs commented on ARROW-3249:
--
Crossbow too

> [Python] Run flake8 on integration_test.py
> ------------------------------------------
>
> Key: ARROW-3249
> URL: https://issues.apache.org/jira/browse/ARROW-3249
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Wes McKinney
> Assignee: Krisztian Szucs
> Priority: Major
> Fix For: 0.11.0
>
> We should keep this code clean, too
[jira] [Assigned] (ARROW-3249) [Python] Run flake8 on integration_test.py
[ https://issues.apache.org/jira/browse/ARROW-3249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs reassigned ARROW-3249:
--
Assignee: Krisztian Szucs (was: Wes McKinney)

> [Python] Run flake8 on integration_test.py
> ------------------------------------------
>
> Key: ARROW-3249
> URL: https://issues.apache.org/jira/browse/ARROW-3249
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Wes McKinney
> Assignee: Krisztian Szucs
> Priority: Major
> Fix For: 0.11.0
>
> We should keep this code clean, too
[jira] [Created] (ARROW-3278) [Python] Retrieve StructType's and StructArray's field by name
Krisztian Szucs created ARROW-3278:
--
Summary: [Python] Retrieve StructType's and StructArray's field by name
Key: ARROW-3278
URL: https://issues.apache.org/jira/browse/ARROW-3278
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Reporter: Krisztian Szucs
[jira] [Resolved] (ARROW-3069) [Release] Stop using SHA1 checksums per ASF policy
[ https://issues.apache.org/jira/browse/ARROW-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe L. Korn resolved ARROW-3069.
--
Resolution: Fixed
Issue resolved by pull request 2584
[https://github.com/apache/arrow/pull/2584]

> [Release] Stop using SHA1 checksums per ASF policy
> --------------------------------------------------
>
> Key: ARROW-3069
> URL: https://issues.apache.org/jira/browse/ARROW-3069
> Project: Apache Arrow
> Issue Type: Improvement
> Reporter: Wes McKinney
> Assignee: Krisztian Szucs
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.11.0
> Time Spent: 10m
> Remaining Estimate: 0h
>
> https://www.apache.org/dev/release-distribution#sigs-and-sums
[jira] [Resolved] (ARROW-3262) [Python] Implement __getitem__ with integers on pyarrow.Column
[ https://issues.apache.org/jira/browse/ARROW-3262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe L. Korn resolved ARROW-3262.
--
Resolution: Fixed
Issue resolved by pull request 2585
[https://github.com/apache/arrow/pull/2585]

> [Python] Implement __getitem__ with integers on pyarrow.Column
> --------------------------------------------------------------
>
> Key: ARROW-3262
> URL: https://issues.apache.org/jira/browse/ARROW-3262
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Wes McKinney
> Assignee: Krisztian Szucs
> Priority: Major
> Labels: pull-request-available, usability
> Fix For: 0.11.0
> Time Spent: 20m
> Remaining Estimate: 0h
>
> This would improve interactive usability
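A pyarrow `Column` is backed by chunked data, so an integer `__getitem__` has to locate the right chunk before it can return an element. A rough pure-Python sketch of that lookup (the `ChunkedColumn` class is illustrative, not pyarrow's implementation):

```python
class ChunkedColumn:
    """Illustrative chunked column with integer indexing; not pyarrow's code."""

    def __init__(self, chunks):
        self.chunks = chunks  # list of per-chunk sequences

    def __len__(self):
        return sum(len(c) for c in self.chunks)

    def __getitem__(self, i):
        if i < 0:
            i += len(self)  # support negative indices like regular sequences
        if i < 0:
            raise IndexError("index out of bounds")
        # Walk the chunks, subtracting each chunk's length until i falls inside one.
        for chunk in self.chunks:
            if i < len(chunk):
                return chunk[i]
            i -= len(chunk)
        raise IndexError("index out of bounds")

col = ChunkedColumn([[1, 2], [3, 4, 5]])
print(col[0], col[3], col[-1])  # → 1 4 5
```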
[jira] [Updated] (ARROW-3267) [Python] Create empty table from schema
[ https://issues.apache.org/jira/browse/ARROW-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-3267:
--
Labels: pull-request-available (was: )

> [Python] Create empty table from schema
> ---------------------------------------
>
> Key: ARROW-3267
> URL: https://issues.apache.org/jira/browse/ARROW-3267
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Uwe L. Korn
> Assignee: Uwe L. Korn
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.11.0
>
> When one knows the expected schema for its input data but has no input data
> for a data pipeline, it is necessary to construct an empty table as a
> sentinel value to pass through.
> This is a small but often useful convenience function.
[jira] [Commented] (ARROW-3267) [Python] Create empty table from schema
[ https://issues.apache.org/jira/browse/ARROW-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16621664#comment-16621664 ] Uwe L. Korn commented on ARROW-3267:
--
[~Paul.Rogers] We already have the necessary builder infrastructure; this function is mainly to have something to pass around when there is no data. Also, the {{Table}} instance is not meant to be modified, i.e. it will stay empty throughout the pipeline.

> [Python] Create empty table from schema
> ---------------------------------------
>
> Key: ARROW-3267
> URL: https://issues.apache.org/jira/browse/ARROW-3267
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Uwe L. Korn
> Assignee: Uwe L. Korn
> Priority: Major
> Fix For: 0.11.0
>
> When one knows the expected schema for its input data but has no input data
> for a data pipeline, it is necessary to construct an empty table as a
> sentinel value to pass through.
> This is a small but often useful convenience function.