[jira] [Commented] (SPARK-25933) Fix pstats reference for spark.python.profile.dump in configuration.md
[ https://issues.apache.org/jira/browse/SPARK-25933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16674043#comment-16674043 ]

Alex Hagerman commented on SPARK-25933:
---------------------------------------

https://github.com/apache/spark/pull/22933

> Fix pstats reference for spark.python.profile.dump in configuration.md
> ----------------------------------------------------------------------
>
>                 Key: SPARK-25933
>                 URL: https://issues.apache.org/jira/browse/SPARK-25933
>             Project: Spark
>          Issue Type: Documentation
>          Components: Documentation
>    Affects Versions: 2.3.2
>            Reporter: Alex Hagerman
>            Priority: Trivial
>              Labels: documentation, pull-request-available
>             Fix For: 2.3.2
>
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> ptats.Stats() should be pstats.Stats() in
> https://spark.apache.org/docs/latest/configuration.html for
> spark.python.profile.dump.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
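For readers hitting the same doc line: spark.python.profile.dump writes cProfile stats files that are meant to be loaded back with the stdlib pstats module. A minimal stdlib-only sketch of that round trip (the directory and "rdd_0.pstats" file name here are made up for illustration, not what Spark emits):

```python
import cProfile
import os
import pstats
import tempfile

# Produce a stats dump the same way a profiler would: cProfile stats
# serialized to a file, later reloaded with pstats.Stats().
path = os.path.join(tempfile.mkdtemp(), "rdd_0.pstats")  # hypothetical name
cProfile.run("sum(range(1000))", path)

# This is the call the docs misspelled as ptats.Stats():
stats = pstats.Stats(path)
stats.sort_stats("cumulative").print_stats(5)
```

The same pstats.Stats() call works on any file produced by a profiler dump directory, one file per profiled stage.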
[jira] [Updated] (SPARK-25933) Fix pstats reference for spark.python.profile.dump in configuration.md
[ https://issues.apache.org/jira/browse/SPARK-25933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Hagerman updated SPARK-25933:
----------------------------------
    Labels: documentation pull-request-available  (was: documentation)

> Fix pstats reference for spark.python.profile.dump in configuration.md
> ----------------------------------------------------------------------
>
>                 Key: SPARK-25933
>                 URL: https://issues.apache.org/jira/browse/SPARK-25933
>             Project: Spark
>          Issue Type: Documentation
>          Components: Documentation
>    Affects Versions: 2.3.2
>            Reporter: Alex Hagerman
>            Priority: Trivial
>              Labels: documentation, pull-request-available
>             Fix For: 2.3.2
>
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> ptats.Stats() should be pstats.Stats() in
> https://spark.apache.org/docs/latest/configuration.html for
> spark.python.profile.dump.
[jira] [Created] (SPARK-25933) Fix pstats reference for spark.python.profile.dump in configuration.md
Alex Hagerman created SPARK-25933:
-------------------------------------

             Summary: Fix pstats reference for spark.python.profile.dump in configuration.md
                 Key: SPARK-25933
                 URL: https://issues.apache.org/jira/browse/SPARK-25933
             Project: Spark
          Issue Type: Documentation
          Components: Documentation
    Affects Versions: 2.3.2
            Reporter: Alex Hagerman
             Fix For: 2.3.2


ptats.Stats() should be pstats.Stats() in https://spark.apache.org/docs/latest/configuration.html for spark.python.profile.dump.
[jira] [Assigned] (ARROW-2600) [Python] Add additional LocalFileSystem filesystem methods
[ https://issues.apache.org/jira/browse/ARROW-2600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Hagerman reassigned ARROW-2600:
------------------------------------
    Assignee: (was: Alex Hagerman)

> [Python] Add additional LocalFileSystem filesystem methods
> ----------------------------------------------------------
>
>                 Key: ARROW-2600
>                 URL: https://issues.apache.org/jira/browse/ARROW-2600
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Alex Hagerman
>            Priority: Minor
>              Labels: filesystem, pull-request-available
>             Fix For: 0.12.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Related to https://issues.apache.org/jira/browse/ARROW-1319 I noticed the
> methods Martin listed are also not part of the LocalFileSystem class.
[jira] [Updated] (ARROW-2760) [Python] Remove legacy property definition syntax from parquet module and test them
[ https://issues.apache.org/jira/browse/ARROW-2760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Hagerman updated ARROW-2760:
---------------------------------
    Component/s: Python

> [Python] Remove legacy property definition syntax from parquet module and
> test them
> -------------------------------------------------------------------------
>
>                 Key: ARROW-2760
>                 URL: https://issues.apache.org/jira/browse/ARROW-2760
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Krisztian Szucs
>            Assignee: Krisztian Szucs
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.10.0
>
>          Time Spent: 4.5h
>  Remaining Estimate: 0h
[jira] [Commented] (ARROW-955) [Docs] Guide for building Python from source on Ubuntu 14.04 LTS without conda
[ https://issues.apache.org/jira/browse/ARROW-955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16538419#comment-16538419 ]

Alex Hagerman commented on ARROW-955:
-------------------------------------

Does this still need to happen with the updated dev docs? I know they are 16.04, but 14.04 is in maintenance and EOLs Q1 next year. Would it be better to validate builds on 18.04, the new LTS?

https://www.ubuntu.com/info/release-end-of-life
https://arrow.apache.org/docs/python/development.html#developing-on-linux-and-macos

> [Docs] Guide for building Python from source on Ubuntu 14.04 LTS without conda
> ------------------------------------------------------------------------------
>
>                 Key: ARROW-955
>                 URL: https://issues.apache.org/jira/browse/ARROW-955
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>         Environment: Ubuntu - 3.19.0-80-generic #88~14.04.1-Ubuntu
>                      Python 2.7.6
>            Reporter: Devang Shah
>            Priority: Major
>
> I built pyarrow, arrow, and parquet-cpp from source - so that I could use the
> new read_row_group() interface and in general, have access to the latest
> versions. I ran into many issues during the build but was ultimately
> successful (notes below). However, I am not able to import pyarrow.parquet
> due to the following issue:
>
> >>> import pyarrow.parquet
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "pyarrow/__init__.py", line 28, in <module>
>     import pyarrow._config
> ImportError: No module named _config
>
> This is similar to an issue reported in github/conda-forge/pyarrow-feedstock,
> where also I posted this... but I think this forum is more direct and
> appropriate - so re-posting here.
> I used instructions at https://arrow.apache.org/docs/python/install.html to
> build arrow/cpp, parquet-cpp, and then pyarrow, with the following deviations
> (I view them as possibly bugs in the instructions):
>
> arrow/cpp build:
>   export ARROW_HOME=$HOME/local
>   I had to specify -DARROW_PYTHON=on and -DPARQUET_ARROW=ON to the cmake
>   command (besides the -DCMAKE_INSTALL_PREFIX=$ARROW_HOME)
>
> parquet-cpp build:
>   export ARROW_HOME=$HOME/local
>   cmake -DARROW_HOME=$HOME/local -DPARQUET_ARROW_LINKAGE=static -DPARQUET_ARROW=ON .
>   make
>   sudo make install
>   this installs parquet libs in the std systems location (/usr/local/lib)
>   so that the pyarrow build (see below) can find the parquet libs
>
> pyarrow build:
>   export ARROW_HOME=$HOME/local (not a deviation; just repeating here)
>   export LD_LIBRARY_PATH=$HOME/local/lib:$HOME/parquet4/parquet-cpp/build/latest
>   sudo python setup.py build_ext --with-parquet --with-jemalloc --build-type=release install
>   sudo python setup.py install
>   (sudo is needed to install in /usr/local/lib/python2.7/dist-packages)
>
> These are the steps and modifications to the instructions needed for me to
> build the pyarrow.parquet package. However, when I now try to import the
> package I get the error specified above.
> Maybe I did something wrong in my steps, which I kind of put together by
> searching for these issues... but really can't tell what. It took me almost a
> whole day to get to the point where I can build pyarrow and parquet, and now
> I can't use what I built.
> Any comments, help appreciated! Thanks in advance.
[jira] [Updated] (ARROW-2586) Make child builders of ListBuilder and StructBuilder shared_ptr's
[ https://issues.apache.org/jira/browse/ARROW-2586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Hagerman updated ARROW-2586:
---------------------------------
    Component/s: C++

> Make child builders of ListBuilder and StructBuilder shared_ptr's
> -----------------------------------------------------------------
>
>                 Key: ARROW-2586
>                 URL: https://issues.apache.org/jira/browse/ARROW-2586
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Joshua Storck
>            Assignee: Joshua Storck
>            Priority: Major
>              Labels: pull-request-available
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This is needed for changes in this PR that make it possible to deserialize
> arbitrary nested structures in parquet (ARROW-1644):
> https://github.com/apache/parquet-cpp/pull/462
[jira] [Updated] (ARROW-2658) [Python] Serialize and Deserialize Table objects
[ https://issues.apache.org/jira/browse/ARROW-2658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Hagerman updated ARROW-2658:
---------------------------------
    Summary: [Python] Serialize and Deserialize Table objects  (was: Serialize and Deserialize Table objects)

> [Python] Serialize and Deserialize Table objects
> ------------------------------------------------
>
>                 Key: ARROW-2658
>                 URL: https://issues.apache.org/jira/browse/ARROW-2658
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Python
>            Reporter: Kunal Gosar
>            Priority: Major
>
> Add support for serializing and deserializing pyarrow Tables. This would
> allow using Table objects in plasma, and DataFrames can be converted to a
> Table object as an intermediary for serialization. Currently I see the
> following when trying this operation:
> {code:java}
> In [36]: pa.serialize(t)
> ---------------------------------------------------------------------------
> SerializationCallbackError                Traceback (most recent call last)
> <ipython-input-...> in <module>()
> ----> 1 pa.serialize(t)
>
> ~/dev/arrow/python/pyarrow/serialization.pxi in pyarrow.lib.serialize()
>     336
>     337     with nogil:
> --> 338         check_status(SerializeObject(context, wrapped_value, &serialized.data))
>     339     return serialized
>     340
>
> ~/dev/arrow/python/pyarrow/serialization.pxi in pyarrow.lib.SerializationContext._serialize_callback()
>     134
>     135         if not found:
> --> 136             raise SerializationCallbackError(
>     137                 "pyarrow does not know how to "
>     138                 "serialize objects of type {}.".format(type(obj)), obj)
>
> SerializationCallbackError: pyarrow does not know how to serialize objects of type <class 'pyarrow.lib.Table'>.
> {code}
[jira] [Updated] (ARROW-2658) Serialize and Deserialize Table objects
[ https://issues.apache.org/jira/browse/ARROW-2658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Hagerman updated ARROW-2658:
---------------------------------
    Component/s: Python

> Serialize and Deserialize Table objects
> ---------------------------------------
>
>                 Key: ARROW-2658
>                 URL: https://issues.apache.org/jira/browse/ARROW-2658
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Python
>            Reporter: Kunal Gosar
>            Priority: Major
>
> Add support for serializing and deserializing pyarrow Tables. This would
> allow using Table objects in plasma, and DataFrames can be converted to a
> Table object as an intermediary for serialization. Currently I see the
> following when trying this operation:
> {code:java}
> In [36]: pa.serialize(t)
> ---------------------------------------------------------------------------
> SerializationCallbackError                Traceback (most recent call last)
> <ipython-input-...> in <module>()
> ----> 1 pa.serialize(t)
>
> ~/dev/arrow/python/pyarrow/serialization.pxi in pyarrow.lib.serialize()
>     336
>     337     with nogil:
> --> 338         check_status(SerializeObject(context, wrapped_value, &serialized.data))
>     339     return serialized
>     340
>
> ~/dev/arrow/python/pyarrow/serialization.pxi in pyarrow.lib.SerializationContext._serialize_callback()
>     134
>     135         if not found:
> --> 136             raise SerializationCallbackError(
>     137                 "pyarrow does not know how to "
>     138                 "serialize objects of type {}.".format(type(obj)), obj)
>
> SerializationCallbackError: pyarrow does not know how to serialize objects of type <class 'pyarrow.lib.Table'>.
> {code}
[jira] [Updated] (ARROW-2710) pyarrow.lib.ArrowIOError when running PyTorch DataLoader in multiprocessing
[ https://issues.apache.org/jira/browse/ARROW-2710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Hagerman updated ARROW-2710:
---------------------------------
    Component/s: Python

> pyarrow.lib.ArrowIOError when running PyTorch DataLoader in multiprocessing
> ---------------------------------------------------------------------------
>
>                 Key: ARROW-2710
>                 URL: https://issues.apache.org/jira/browse/ARROW-2710
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.8.0, 0.9.0
>         Environment: Tested on several Linux OSs.
>            Reporter: Michael Andrews
>            Priority: Major
>
> Unable to open a parquet file via {{pq.ParquetFile(filename)}} when called
> using the PyTorch DataLoader in multiprocessing mode. Affects versions
> pyarrow > 0.7.1.
> As detailed in [https://github.com/apache/arrow/issues/1946].
[jira] [Updated] (ARROW-2710) [Python] pyarrow.lib.ArrowIOError when running PyTorch DataLoader in multiprocessing
[ https://issues.apache.org/jira/browse/ARROW-2710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Hagerman updated ARROW-2710:
---------------------------------
    Summary: [Python] pyarrow.lib.ArrowIOError when running PyTorch DataLoader in multiprocessing  (was: pyarrow.lib.ArrowIOError when running PyTorch DataLoader in multiprocessing)

> [Python] pyarrow.lib.ArrowIOError when running PyTorch DataLoader in
> multiprocessing
> --------------------------------------------------------------------
>
>                 Key: ARROW-2710
>                 URL: https://issues.apache.org/jira/browse/ARROW-2710
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.8.0, 0.9.0
>         Environment: Tested on several Linux OSs.
>            Reporter: Michael Andrews
>            Priority: Major
>
> Unable to open a parquet file via {{pq.ParquetFile(filename)}} when called
> using the PyTorch DataLoader in multiprocessing mode. Affects versions
> pyarrow > 0.7.1.
> As detailed in [https://github.com/apache/arrow/issues/1946].
[jira] [Updated] (ARROW-2787) [Python] Memory Issue passing table from python to c++ via cython
[ https://issues.apache.org/jira/browse/ARROW-2787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Hagerman updated ARROW-2787:
---------------------------------
    Summary: [Python] Memory Issue passing table from python to c++ via cython  (was: Memory Issue passing table from python to c++ via cython)

> [Python] Memory Issue passing table from python to c++ via cython
> -----------------------------------------------------------------
>
>                 Key: ARROW-2787
>                 URL: https://issues.apache.org/jira/browse/ARROW-2787
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Integration, Python
>    Affects Versions: 0.9.0
>         Environment: clang6
>            Reporter: Joseph Toth
>            Priority: Major
>              Labels: cython
>
> I wanted to create a simple example of reading a table in Python and pass it
> to C++, but I'm doing something wrong or there is a memory issue. When the
> table gets to C++ and I print out column names it also prints out a lot of
> junk and what looks like pydocs. Let me know if you need any more info.
> Thanks!
>
> *demo.py*
> import numpy
> import pandas as pd
> import pyarrow as pa
> from psy.automl import cyth
> from absl import app
>
> def main(argv):
>     sup = pd.DataFrame({
>         'int': [1, 2],
>         'str': ['a', 'b']
>     })
>     table = pa.Table.from_pandas(sup)
>     cyth.c_t(table)
>
> *cyth.pyx*
> import pandas as pd
> import pyarrow as pa
> from pyarrow.lib cimport *
>
> cdef extern from "cyth.h" namespace "psy":
>     void t(shared_ptr[CTable])
>
> def c_t(obj):
>     # These prints work
>     # for i in range(obj.num_columns):
>     #     print(obj.column(i).name)
>     cdef shared_ptr[CTable] tbl = pyarrow_unwrap_table(obj)
>     t(tbl)
>
> *cyth.h*
> #include <iostream>
> #include <memory>
> #include "arrow/api.h"
> #include "arrow/python/api.h"
> #include "Python.h"
>
> namespace psy {
> void t(std::shared_ptr<arrow::Table> pytable) {
>   // This works
>   std::cout << "NUM" << pytable->num_columns();
>   // This prints a lot of garbage
>   for (int i = 0; i < pytable->num_columns(); i++) {
>     std::cout << pytable->column(i)->name();
>   }
> }
> }
[jira] [Updated] (ARROW-2787) Memory Issue passing table from python to c++ via cython
[ https://issues.apache.org/jira/browse/ARROW-2787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Hagerman updated ARROW-2787:
---------------------------------
    Labels: cython  (was: )

> Memory Issue passing table from python to c++ via cython
> --------------------------------------------------------
>
>                 Key: ARROW-2787
>                 URL: https://issues.apache.org/jira/browse/ARROW-2787
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Integration, Python
>    Affects Versions: 0.9.0
>         Environment: clang6
>            Reporter: Joseph Toth
>            Priority: Major
>              Labels: cython
>
> I wanted to create a simple example of reading a table in Python and pass it
> to C++, but I'm doing something wrong or there is a memory issue. When the
> table gets to C++ and I print out column names it also prints out a lot of
> junk and what looks like pydocs. Let me know if you need any more info.
> Thanks!
>
> *demo.py*
> import numpy
> import pandas as pd
> import pyarrow as pa
> from psy.automl import cyth
> from absl import app
>
> def main(argv):
>     sup = pd.DataFrame({
>         'int': [1, 2],
>         'str': ['a', 'b']
>     })
>     table = pa.Table.from_pandas(sup)
>     cyth.c_t(table)
>
> *cyth.pyx*
> import pandas as pd
> import pyarrow as pa
> from pyarrow.lib cimport *
>
> cdef extern from "cyth.h" namespace "psy":
>     void t(shared_ptr[CTable])
>
> def c_t(obj):
>     # These prints work
>     # for i in range(obj.num_columns):
>     #     print(obj.column(i).name)
>     cdef shared_ptr[CTable] tbl = pyarrow_unwrap_table(obj)
>     t(tbl)
>
> *cyth.h*
> #include <iostream>
> #include <memory>
> #include "arrow/api.h"
> #include "arrow/python/api.h"
> #include "Python.h"
>
> namespace psy {
> void t(std::shared_ptr<arrow::Table> pytable) {
>   // This works
>   std::cout << "NUM" << pytable->num_columns();
>   // This prints a lot of garbage
>   for (int i = 0; i < pytable->num_columns(); i++) {
>     std::cout << pytable->column(i)->name();
>   }
> }
> }
[jira] [Updated] (ARROW-2787) Memory Issue passing table from python to c++ via cython
[ https://issues.apache.org/jira/browse/ARROW-2787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Hagerman updated ARROW-2787:
---------------------------------
    Component/s: Python

> Memory Issue passing table from python to c++ via cython
> --------------------------------------------------------
>
>                 Key: ARROW-2787
>                 URL: https://issues.apache.org/jira/browse/ARROW-2787
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Integration, Python
>    Affects Versions: 0.9.0
>         Environment: clang6
>            Reporter: Joseph Toth
>            Priority: Major
>              Labels: cython
>
> I wanted to create a simple example of reading a table in Python and pass it
> to C++, but I'm doing something wrong or there is a memory issue. When the
> table gets to C++ and I print out column names it also prints out a lot of
> junk and what looks like pydocs. Let me know if you need any more info.
> Thanks!
>
> *demo.py*
> import numpy
> import pandas as pd
> import pyarrow as pa
> from psy.automl import cyth
> from absl import app
>
> def main(argv):
>     sup = pd.DataFrame({
>         'int': [1, 2],
>         'str': ['a', 'b']
>     })
>     table = pa.Table.from_pandas(sup)
>     cyth.c_t(table)
>
> *cyth.pyx*
> import pandas as pd
> import pyarrow as pa
> from pyarrow.lib cimport *
>
> cdef extern from "cyth.h" namespace "psy":
>     void t(shared_ptr[CTable])
>
> def c_t(obj):
>     # These prints work
>     # for i in range(obj.num_columns):
>     #     print(obj.column(i).name)
>     cdef shared_ptr[CTable] tbl = pyarrow_unwrap_table(obj)
>     t(tbl)
>
> *cyth.h*
> #include <iostream>
> #include <memory>
> #include "arrow/api.h"
> #include "arrow/python/api.h"
> #include "Python.h"
>
> namespace psy {
> void t(std::shared_ptr<arrow::Table> pytable) {
>   // This works
>   std::cout << "NUM" << pytable->num_columns();
>   // This prints a lot of garbage
>   for (int i = 0; i < pytable->num_columns(); i++) {
>     std::cout << pytable->column(i)->name();
>   }
> }
> }
[jira] [Updated] (ARROW-2709) [Python] write_to_dataset poor performance when splitting
[ https://issues.apache.org/jira/browse/ARROW-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Hagerman updated ARROW-2709:
---------------------------------
    Summary: [Python] write_to_dataset poor performance when splitting  (was: write_to_dataset poor performance when splitting)

> [Python] write_to_dataset poor performance when splitting
> ---------------------------------------------------------
>
>                 Key: ARROW-2709
>                 URL: https://issues.apache.org/jira/browse/ARROW-2709
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Olaf
>            Priority: Critical
>              Labels: parquet
>
> Hello,
> Posting this from github (master [~wesmckinn] asked for it :) )
> https://github.com/apache/arrow/issues/2138
>
> {code:java}
> import pandas as pd
> import numpy as np
> import pyarrow.parquet as pq
> import pyarrow as pa
>
> idx = pd.date_range('2017-01-01 12:00:00.000', '2017-03-01 12:00:00.000', freq='T')
> df = pd.DataFrame({'numeric_col': np.random.rand(len(idx)),
>                    'string_col': pd.util.testing.rands_array(8, len(idx))},
>                   index=idx)
> {code}
>
> {code:java}
> df["dt"] = df.index
> df["dt"] = df["dt"].dt.date
> table = pa.Table.from_pandas(df)
> pq.write_to_dataset(table, root_path='dataset_name',
>                     partition_cols=['dt'], flavor='spark')
> {code}
>
> {{This works but is inefficient memory-wise. The arrow table is a copy of the
> large pandas dataframe and quickly saturates the RAM.}}
>
> {{Thanks!}}
[jira] [Updated] (ARROW-2274) [Python] ObjectID from string
[ https://issues.apache.org/jira/browse/ARROW-2274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Hagerman updated ARROW-2274:
---------------------------------
    Summary: [Python] ObjectID from string  (was: ObjectID from string)

> [Python] ObjectID from string
> -----------------------------
>
>                 Key: ARROW-2274
>                 URL: https://issues.apache.org/jira/browse/ARROW-2274
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Eric Feldman
>            Priority: Critical
>
> I want to have an ObjectID from a string.
> The problem is that if I'm creating a new ObjectID from a string and inserting
> a value associated with that id, the next time I generate an ObjectID from
> that string, the id is different.
> I'm looking for something like a key-value store, is it possible?
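One workaround pattern for the key/value-style lookup asked about above (an illustration, not an existing plasma API): hash the string key down to the 20 bytes a plasma ObjectID is built from, so the same key always maps to the same ID. The helper name object_id_for_key is hypothetical:

```python
import hashlib

PLASMA_ID_SIZE = 20  # plasma ObjectIDs are 20 bytes long

def object_id_for_key(key: str) -> bytes:
    """Derive a stable 20-byte ID from a string key (hypothetical helper).

    The same key always yields the same bytes; those bytes can then be
    passed to plasma's ObjectID constructor, so a string key maps to a
    fixed object in the store, giving key/value-style lookups.
    """
    # sha1 digests are exactly 20 bytes, matching the ObjectID size.
    return hashlib.sha1(key.encode("utf-8")).digest()[:PLASMA_ID_SIZE]

# Deterministic: the same string produces the same ID across processes.
a = object_id_for_key("my-table")
b = object_id_for_key("my-table")
```

Note this trades the random-ID collision guarantees for determinism; distinct keys could in principle collide, which is the reporter's own risk to accept.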
[jira] [Updated] (ARROW-2709) write_to_dataset poor performance when splitting
[ https://issues.apache.org/jira/browse/ARROW-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Hagerman updated ARROW-2709:
---------------------------------
    Labels: parquet  (was: )

> write_to_dataset poor performance when splitting
> ------------------------------------------------
>
>                 Key: ARROW-2709
>                 URL: https://issues.apache.org/jira/browse/ARROW-2709
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Olaf
>            Priority: Critical
>              Labels: parquet
>
> Hello,
> Posting this from github (master [~wesmckinn] asked for it :) )
> https://github.com/apache/arrow/issues/2138
>
> {code:java}
> import pandas as pd
> import numpy as np
> import pyarrow.parquet as pq
> import pyarrow as pa
>
> idx = pd.date_range('2017-01-01 12:00:00.000', '2017-03-01 12:00:00.000', freq='T')
> df = pd.DataFrame({'numeric_col': np.random.rand(len(idx)),
>                    'string_col': pd.util.testing.rands_array(8, len(idx))},
>                   index=idx)
> {code}
>
> {code:java}
> df["dt"] = df.index
> df["dt"] = df["dt"].dt.date
> table = pa.Table.from_pandas(df)
> pq.write_to_dataset(table, root_path='dataset_name',
>                     partition_cols=['dt'], flavor='spark')
> {code}
>
> {{This works but is inefficient memory-wise. The arrow table is a copy of the
> large pandas dataframe and quickly saturates the RAM.}}
>
> {{Thanks!}}
[jira] [Updated] (ARROW-2709) write_to_dataset poor performance when splitting
[ https://issues.apache.org/jira/browse/ARROW-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Hagerman updated ARROW-2709:
---------------------------------
    Component/s: Python

> write_to_dataset poor performance when splitting
> ------------------------------------------------
>
>                 Key: ARROW-2709
>                 URL: https://issues.apache.org/jira/browse/ARROW-2709
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Olaf
>            Priority: Critical
>
> Hello,
> Posting this from github (master [~wesmckinn] asked for it :) )
> https://github.com/apache/arrow/issues/2138
>
> {code:java}
> import pandas as pd
> import numpy as np
> import pyarrow.parquet as pq
> import pyarrow as pa
>
> idx = pd.date_range('2017-01-01 12:00:00.000', '2017-03-01 12:00:00.000', freq='T')
> df = pd.DataFrame({'numeric_col': np.random.rand(len(idx)),
>                    'string_col': pd.util.testing.rands_array(8, len(idx))},
>                   index=idx)
> {code}
>
> {code:java}
> df["dt"] = df.index
> df["dt"] = df["dt"].dt.date
> table = pa.Table.from_pandas(df)
> pq.write_to_dataset(table, root_path='dataset_name',
>                     partition_cols=['dt'], flavor='spark')
> {code}
>
> {{This works but is inefficient memory-wise. The arrow table is a copy of the
> large pandas dataframe and quickly saturates the RAM.}}
>
> {{Thanks!}}
[jira] [Created] (ARROW-2601) [Python] MemoryPool bytes_allocated causes seg
Alex Hagerman created ARROW-2601:
------------------------------------

             Summary: [Python] MemoryPool bytes_allocated causes seg
                 Key: ARROW-2601
                 URL: https://issues.apache.org/jira/browse/ARROW-2601
             Project: Apache Arrow
          Issue Type: Bug
    Affects Versions: 0.9.0
            Reporter: Alex Hagerman
             Fix For: 0.10.0


Python 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 18:21:58)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa
>>> mp = pa.MemoryPool()
>>> arr = pa.array([1,2,3], memory_pool=mp)
>>> mp.bytes_allocated()
Segmentation fault (core dumped)

I'll dig into this further, but should bytes_allocated be returning anything when called like this? Or should it raise NotImplemented?
[jira] [Created] (ARROW-2600) [Python] Add additional LocalFileSystem filesystem methods
Alex Hagerman created ARROW-2600:
------------------------------------

             Summary: [Python] Add additional LocalFileSystem filesystem methods
                 Key: ARROW-2600
                 URL: https://issues.apache.org/jira/browse/ARROW-2600
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Python
            Reporter: Alex Hagerman
            Assignee: Alex Hagerman
             Fix For: 0.10.0


Related to https://issues.apache.org/jira/browse/ARROW-1319 I noticed the methods Martin listed are also not part of the LocalFileSystem class.
[jira] [Commented] (ARROW-2428) [Python] Support ExtensionArrays in to_pandas conversion
[ https://issues.apache.org/jira/browse/ARROW-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16473253#comment-16473253 ]

Alex Hagerman commented on ARROW-2428:
--------------------------------------

[~xhochy] I was reading through the meta issue and trying to understand what we have to make sure to pass. Do you think this has settled enough to begin work? It appears pandas will expect a class defining the type, and I'm guessing the objects in the Arrow column will be instances of that user type? Do we expect Arrow columns to meet all the requirements of ExtensionArray? I was specifically looking at this to understand what options have to be passed and what the ExtensionArray requires.

https://github.com/pandas-dev/pandas/pull/19174/files#diff-e448fe09dbe8aed468d89a4c90e65cff

> [Python] Support ExtensionArrays in to_pandas conversion
> --------------------------------------------------------
>
>                 Key: ARROW-2428
>                 URL: https://issues.apache.org/jira/browse/ARROW-2428
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Uwe L. Korn
>            Priority: Major
>              Labels: beginner
>             Fix For: 1.0.0
>
> With the next release of pandas, it will be possible to define custom column
> types that back a {{pandas.Series}}. Thus we will not be able to cover all
> possible column types in the {{to_pandas}} conversion by default, as we won't
> be aware of all extension arrays.
> To enable users to create {{ExtensionArray}} instances from Arrow columns in
> the {{to_pandas}} conversion, we should provide a hook in the {{to_pandas}}
> call where they can overload the default conversion routines with the ones
> that produce their {{ExtensionArray}} instances.
> This should avoid additional copies in the case where we would nowadays first
> convert the Arrow column into a default pandas column (probably of object
> type) and the user would afterwards convert it to a more efficient
> {{ExtensionArray}}. This hook will be especially useful when you build
> {{ExtensionArrays}} whose storage is backed by Arrow.
> The meta-issue that tracks the implementation inside of pandas is:
> https://github.com/pandas-dev/pandas/issues/19696
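For discussion's sake, the hook proposed above boils down to a registry of per-type conversion callbacks that a to_pandas-style conversion consults before falling back to its default (object-typed) path. A toy, pure-Python sketch of that shape (every name here is hypothetical, not the API that eventually shipped in pyarrow or pandas):

```python
# Hypothetical sketch: map a type key to a user conversion callback,
# consulted before the default conversion. Registered callbacks get the
# raw column values and return whatever column representation they like.
_converters = {}

def register_converter(type_name, func):
    """Register a user hook for columns of the given (string) type key."""
    _converters[type_name] = func

def to_pandas_column(type_name, values):
    # User hook wins; otherwise fall back to the default object conversion,
    # modeled here as a plain list() copy.
    convert = _converters.get(type_name, list)
    return convert(values)

# A user who owns a custom extension type registers their own conversion:
register_converter("decimal128", lambda vals: [str(v) for v in vals])
```

The point of the hook is exactly what the issue describes: the user callback runs on the Arrow-side data directly, so no intermediate object-typed column is materialized first.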
[jira] [Assigned] (ARROW-1964) [Python] Expose Builder classes
[ https://issues.apache.org/jira/browse/ARROW-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Hagerman reassigned ARROW-1964:
------------------------------------
    Assignee: (was: Alex Hagerman)

> [Python] Expose Builder classes
> -------------------------------
>
>                 Key: ARROW-1964
>                 URL: https://issues.apache.org/jira/browse/ARROW-1964
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Python
>            Reporter: Uwe L. Korn
>            Priority: Major
>              Labels: beginner, pull-request-available
>             Fix For: 1.0.0
>
>          Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> Having the builder classes available from Python would be very helpful.
> Currently the construction of an Arrow array always needs a Python list
> or numpy array as an intermediate. As the builders in combination with
> jemalloc are very efficient at building up non-chunked memory, it would be
> nice to use them directly in certain cases.
> The most useful builders are the
> [StringBuilder|https://github.com/apache/arrow/blob/5030e235047bdffabf6a900dd39b64eeeb96bdc8/cpp/src/arrow/builder.h#L714]
> and
> [DictionaryBuilder|https://github.com/apache/arrow/blob/5030e235047bdffabf6a900dd39b64eeeb96bdc8/cpp/src/arrow/builder.h#L872],
> as they provide functionality to create columns that are not easily
> constructed using NumPy methods in Python.
> The basic approach would be to wrap the C++ classes in
> https://github.com/apache/arrow/blob/master/python/pyarrow/includes/libarrow.pxd
> so that they can be used from Cython. Afterwards, we should start a new file
> {{python/pyarrow/builder.pxi}} where we have classes that take typical Python
> objects like {{str}} and pass them on to the C++ classes. In the end, these
> classes should also return (Python-accessible) {{pyarrow.Array}} instances.
[jira] [Assigned] (ARROW-1964) [Python] Expose Builder classes
[ https://issues.apache.org/jira/browse/ARROW-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Hagerman reassigned ARROW-1964:
------------------------------------
    Assignee: Alex Hagerman

> [Python] Expose Builder classes
> -------------------------------
>
>                 Key: ARROW-1964
>                 URL: https://issues.apache.org/jira/browse/ARROW-1964
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Python
>            Reporter: Uwe L. Korn
>            Assignee: Alex Hagerman
>            Priority: Major
>              Labels: beginner, pull-request-available
>             Fix For: 1.0.0
>
>          Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Having the builder classes available from Python would be very helpful.
> Currently the construction of an Arrow array always needs a Python list
> or numpy array as an intermediate. As the builders in combination with
> jemalloc are very efficient at building up non-chunked memory, it would be
> nice to use them directly in certain cases.
> The most useful builders are the
> [StringBuilder|https://github.com/apache/arrow/blob/5030e235047bdffabf6a900dd39b64eeeb96bdc8/cpp/src/arrow/builder.h#L714]
> and
> [DictionaryBuilder|https://github.com/apache/arrow/blob/5030e235047bdffabf6a900dd39b64eeeb96bdc8/cpp/src/arrow/builder.h#L872],
> as they provide functionality to create columns that are not easily
> constructed using NumPy methods in Python.
> The basic approach would be to wrap the C++ classes in
> https://github.com/apache/arrow/blob/master/python/pyarrow/includes/libarrow.pxd
> so that they can be used from Cython. Afterwards, we should start a new file
> {{python/pyarrow/builder.pxi}} where we have classes that take typical Python
> objects like {{str}} and pass them on to the C++ classes. In the end, these
> classes should also return (Python-accessible) {{pyarrow.Array}} instances.
[jira] [Commented] (ARROW-2339) [Python] Add a fast path for int hashing
[ https://issues.apache.org/jira/browse/ARROW-2339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16437747#comment-16437747 ] Alex Hagerman commented on ARROW-2339: -- Good to know. I'll look at the open tickets and priority to see if there is something else to pick up. Also don't want to hold up things if I can't work on something for a few days. > [Python] Add a fast path for int hashing > > > Key: ARROW-2339 > URL: https://issues.apache.org/jira/browse/ARROW-2339 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Alex Hagerman >Assignee: Alex Hagerman >Priority: Major > Fix For: 0.10.0 > > > Create a __hash__ fast path for Int scalars that avoids using as_py(). > > https://issues.apache.org/jira/browse/ARROW-640 > [https://github.com/apache/arrow/pull/1765/files/4497b69db8039cfeaa7a25f593f3a3e6c7984604] > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2339) [Python] Add a fast path for int hashing
[ https://issues.apache.org/jira/browse/ARROW-2339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16437729#comment-16437729 ] Alex Hagerman commented on ARROW-2339: -- That will be interesting! Got it. Thank you for the direction. > [Python] Add a fast path for int hashing > > > Key: ARROW-2339 > URL: https://issues.apache.org/jira/browse/ARROW-2339 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Alex Hagerman >Assignee: Alex Hagerman >Priority: Major > Fix For: 0.10.0 > > > Create a __hash__ fast path for Int scalars that avoids using as_py(). > > https://issues.apache.org/jira/browse/ARROW-640 > [https://github.com/apache/arrow/pull/1765/files/4497b69db8039cfeaa7a25f593f3a3e6c7984604] > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2339) [Python] Add a fast path for int hashing
[ https://issues.apache.org/jira/browse/ARROW-2339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16437664#comment-16437664 ] Alex Hagerman commented on ARROW-2339: -- [~pitrou] [~wesmckinn] sorry I've been absent on this; work has had me tied up day and night, but I'm hoping to work some more on it over the weekend. I was wondering if you had any thoughts on using xxHash, MurmurHash or FNV-1a for this? I was going to do some timing this weekend, as well as testing for collisions on various ints as you mentioned on the original ticket. Do you know if we can use existing implementations of the hash from C or C++ with wrappers? I didn't know what the ASF rules might be on that with regard to licenses (only ASF or MIT/BSD allowed) and adding the Cython wrappers to PyArrow. If it's better just to do a new implementation I'll work on that too, but I didn't want to reinvent the wheel if I didn't need to. > [Python] Add a fast path for int hashing > > > Key: ARROW-2339 > URL: https://issues.apache.org/jira/browse/ARROW-2339 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Alex Hagerman >Assignee: Alex Hagerman >Priority: Major > Fix For: 0.10.0 > > > Create a __hash__ fast path for Int scalars that avoids using as_py(). > > https://issues.apache.org/jira/browse/ARROW-640 > [https://github.com/apache/arrow/pull/1765/files/4497b69db8039cfeaa7a25f593f3a3e6c7984604] > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
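Of the candidate hash functions named in the comment, FNV-1a is simple enough to sketch in pure Python. This is only an illustration of the algorithm (XOR each byte into the state, then multiply by the FNV prime, truncated to 64 bits); it is not what was ultimately merged into Arrow, and `fnv1a_int64` is an illustrative helper showing how a fixed-width int scalar could be fed to it.

```python
FNV64_OFFSET = 14695981039346656037  # FNV-1a 64-bit offset basis
FNV64_PRIME = 1099511628211          # FNV-1a 64-bit prime


def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a: XOR each byte into the state, then multiply by the prime."""
    h = FNV64_OFFSET
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & 0xFFFFFFFFFFFFFFFF  # truncate to 64 bits
    return h


def fnv1a_int64(value: int) -> int:
    # Hypothetical fast path for an int scalar: hash its fixed-width
    # little-endian encoding instead of converting via as_py().
    return fnv1a_64(value.to_bytes(8, "little", signed=True))
```

A pure-C implementation wrapped in Cython, as the comment suggests, would avoid the per-byte Python loop entirely; the algorithm itself is identical.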
[jira] [Created] (ARROW-2395) [Python] Correct flake8 errors outside of benchmarks
Alex Hagerman created ARROW-2395: Summary: [Python] Correct flake8 errors outside of benchmarks Key: ARROW-2395 URL: https://issues.apache.org/jira/browse/ARROW-2395 Project: Apache Arrow Issue Type: Improvement Reporter: Alex Hagerman Assignee: Alex Hagerman Fix For: 0.10.0 Fix flake8 warnings for files outside of the benchmarks directory. !https://user-images.githubusercontent.com/2118138/38217076-f08a67da-369a-11e8-8166-b3a9ed7d9a60.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2394) [Python] Correct flake8 errors in benchmarks
Alex Hagerman created ARROW-2394: Summary: [Python] Correct flake8 errors in benchmarks Key: ARROW-2394 URL: https://issues.apache.org/jira/browse/ARROW-2394 Project: Apache Arrow Issue Type: Improvement Reporter: Alex Hagerman Assignee: Alex Hagerman Fix For: 0.10.0 Fix linting issues so that flake8 can be run for all files in the Python directory. !https://user-images.githubusercontent.com/2118138/38217076-f08a67da-369a-11e8-8166-b3a9ed7d9a60.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-2325) [Python] Update setup.py to use Markdown project description
[ https://issues.apache.org/jira/browse/ARROW-2325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Hagerman reassigned ARROW-2325: Assignee: Alex Hagerman > [Python] Update setup.py to use Markdown project description > > > Key: ARROW-2325 > URL: https://issues.apache.org/jira/browse/ARROW-2325 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Alex Hagerman >Priority: Major > Fix For: 0.10.0 > > > New stuff in PyPI > https://dustingram.com/articles/2018/03/16/markdown-descriptions-on-pypi -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2339) [Python] Add a fast path for int hashing
Alex Hagerman created ARROW-2339: Summary: [Python] Add a fast path for int hashing Key: ARROW-2339 URL: https://issues.apache.org/jira/browse/ARROW-2339 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Alex Hagerman Assignee: Alex Hagerman Fix For: 0.10.0 Create a __hash__ fast path for Int scalars that avoids using as_py(). https://issues.apache.org/jira/browse/ARROW-640 [https://github.com/apache/arrow/pull/1765/files/4497b69db8039cfeaa7a25f593f3a3e6c7984604] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
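The contract such a fast path must satisfy can be sketched without pyarrow. `IntValue` below is a hypothetical stand-in for `pyarrow.lib.Int64Value`, not the real class: the idea is that `__hash__` hashes the underlying integer directly, skipping the `as_py()` object construction, while matching `hash(int)` so scalars and plain Python ints can coexist in sets and dicts.

```python
class IntValue:
    """Hypothetical stand-in for pyarrow.lib.Int64Value."""

    def __init__(self, value):
        self._value = value

    def as_py(self):
        # Slow path: materialize a full Python object (what __hash__ avoids).
        return int(self._value)

    def __eq__(self, other):
        if isinstance(other, IntValue):
            return self._value == other._value
        return self._value == other

    def __hash__(self):
        # Fast path: hash the underlying value directly, no as_py() round trip.
        # Matching hash(int) preserves the eq/hash contract with plain ints.
        return hash(self._value)


values = [IntValue(1), IntValue(1), IntValue(1), IntValue(2)]
unique = set(values)
```

With this in place, `set(arr)` from the original ARROW-640 report would collapse the duplicates instead of raising `TypeError: unhashable type`.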
[jira] [Commented] (ARROW-640) [Python] Arrow scalar values should have a sensible __hash__ and comparison
[ https://issues.apache.org/jira/browse/ARROW-640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16404189#comment-16404189 ] Alex Hagerman commented on ARROW-640: - I've added the __hash__ for ints and opened a PR. __eq__ was already in place using as_py() in relation to the original ticket. Happy to look into the other types and explore different ways to handle hashing them, as well as any extension of as_py that might be needed, if some direction or new tickets could be provided. Otherwise I'll look at what else is open that I might be able to help with. Timing information is below. import pyarrow as pa arr = pa.array([1,1,2,1]) a = arr[0] %timeit a.__hash__() 265 ns ± 1.72 ns per loop (mean ± std. dev. of 7 runs, 100 loops each) > [Python] Arrow scalar values should have a sensible __hash__ and comparison > --- > > Key: ARROW-640 > URL: https://issues.apache.org/jira/browse/ARROW-640 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Miki Tebeka >Assignee: Alex Hagerman >Priority: Major > Fix For: 0.10.0 > > > {noformat} > In [86]: arr = pa.from_pylist([1, 1, 1, 2]) > In [87]: set(arr) > Out[87]: {1, 2, 1, 1} > In [88]: arr[0] == arr[1] > Out[88]: False > In [89]: arr > Out[89]: > > [ > 1, > 1, > 1, > 2 > ] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
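The `%timeit` call in the comment is IPython-specific; the same measurement can be reproduced with the stdlib `timeit` module. Since pyarrow may not be available, `_TimedValue` here is a hypothetical wrapper standing in for the scalar, used only to contrast hashing directly against hashing via an `as_py()`-style conversion.

```python
import timeit


class _TimedValue:
    """Hypothetical scalar wrapper used only for this timing sketch."""

    def __init__(self, value):
        self._value = value

    def as_py(self):
        # Simulates the conversion step the fast path is meant to skip.
        return int(self._value)

    def __hash__(self):
        return hash(self._value)  # direct fast path


a = _TimedValue(1)
fast = timeit.timeit(lambda: hash(a), number=100_000)
slow = timeit.timeit(lambda: hash(a.as_py()), number=100_000)
print(f"direct __hash__: {fast:.4f}s  via as_py(): {slow:.4f}s")
```

Absolute numbers will differ from the 265 ns figure in the comment, which was measured on the real Cython implementation; only the relative shape of the comparison is meaningful here.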
[jira] [Commented] (ARROW-640) [Python] Arrow scalar values should have a sensible __hash__ and comparison
[ https://issues.apache.org/jira/browse/ARROW-640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16399783#comment-16399783 ] Alex Hagerman commented on ARROW-640: - Sounds good. Just to verify: Integer only, or Number types in general? I've got a deployment happening during the day right now, so I'll hopefully be able to wrap up version one this weekend and do a PR for review. You mentioned that for items like StructValue the as_py fallback won't work; similarly with ListValue, I would expect both of these to raise TypeError: unhashable type, but I'll check the current behavior. Depending on what that is, do you have any thoughts on whether hash() should raise TypeError on mutable types, matching standard Python behavior? I wanted to check so I don't conflict with existing expected behavior, if this has been handled previously, and to look at tying it in with __eq__. > [Python] Arrow scalar values should have a sensible __hash__ and comparison > --- > > Key: ARROW-640 > URL: https://issues.apache.org/jira/browse/ARROW-640 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Miki Tebeka >Assignee: Alex Hagerman >Priority: Major > Fix For: 0.10.0 > > > {noformat} > In [86]: arr = pa.from_pylist([1, 1, 1, 2]) > In [87]: set(arr) > Out[87]: {1, 2, 1, 1} > In [88]: arr[0] == arr[1] > Out[88]: False > In [89]: arr > Out[89]: > > [ > 1, > 1, > 1, > 2 > ] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
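The standard Python behavior the comment refers to is that mutable containers opt out of hashing: defining `__eq__` without `__hash__` already makes a class unhashable, and setting `__hash__ = None` states that intent explicitly. `ListValue` below is a hypothetical list-like scalar, not the pyarrow class.

```python
class ListValue:
    """Hypothetical list-like scalar; mutable, so it opts out of hashing."""

    def __init__(self, items):
        self._items = list(items)

    def __eq__(self, other):
        if isinstance(other, ListValue):
            return self._items == other._items
        return self._items == other

    # Defining __eq__ already resets __hash__ to None; writing it out
    # documents that ListValue is unhashable, like the built-in list.
    __hash__ = None


try:
    hash(ListValue([1, 2]))
    raised = False
except TypeError:
    raised = True
```

This mirrors `hash([1, 2])` raising `TypeError: unhashable type: 'list'`, which is presumably the precedent the comment asks about.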
[jira] [Commented] (ARROW-640) [Python] Arrow scalar values should have a sensible __hash__ and comparison
[ https://issues.apache.org/jira/browse/ARROW-640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16397909#comment-16397909 ] Alex Hagerman commented on ARROW-640: - Thanks [~pitrou] this was actually what I had implemented locally so glad to see I was on the right track. Tonight I was working on doing a little bit of benchmarking and writing the tests. Any specific loads or types you might want to see related to the speed concern? Or is it better to get a consistent hash implementation like this setup in a PR and then worry about speed? > [Python] Arrow scalar values should have a sensible __hash__ and comparison > --- > > Key: ARROW-640 > URL: https://issues.apache.org/jira/browse/ARROW-640 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Miki Tebeka >Assignee: Alex Hagerman >Priority: Major > Fix For: 0.10.0 > > > {noformat} > In [86]: arr = pa.from_pylist([1, 1, 1, 2]) > In [87]: set(arr) > Out[87]: {1, 2, 1, 1} > In [88]: arr[0] == arr[1] > Out[88]: False > In [89]: arr > Out[89]: > > [ > 1, > 1, > 1, > 2 > ] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (ARROW-640) [Python] Arrow scalar values should have a sensible __hash__ and comparison
[ https://issues.apache.org/jira/browse/ARROW-640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16394627#comment-16394627 ] Alex Hagerman edited comment on ARROW-640 at 3/11/18 9:02 PM: -- I think this has changed since the original ticket. The comparison appears to be working. Tested this with string and numbers. Also getting an error on set now. Going to continue looking into this, but if anybody has thoughts on this I'd be happy to hear them. Also from_pylist appears to have been removed, but I didn't find it searching the change log on github only an addition in 0.3. I'm going to look at the history of __eq__ on ArrayValue and as_py then work on what would make sense for __hash__. {code:java} %load_ext Cython import pyarrow as pa pylist = [1,1,1,2] arr = pa.array(pylist) arr [ 1, 1, 1, 2 ] arr[0] == arr[1] True arr[0] == arr[3] False word_list = ['test', 'not the same', 'test', 'nope'] word_list[0] == word_list[2] True word_list[0] == word_list[1] False pa.array.__eq__ set(arr) --- TypeError Traceback (most recent call last) in () > 1 set(arr) TypeError: unhashable type: 'pyarrow.lib.Int64Value' arr_list = pa.from_pylist([1, 1, 1, 2]) --- AttributeErrorTraceback (most recent call last) in () > 1 arr_list = pa.from_pylist([1, 1, 1, 2]) AttributeError: module 'pyarrow' has no attribute 'from_pylist' {code} was (Author: alexhagerman): I think this has changed since the original ticket. The comparison appears to be working. Tested this with string and numbers. Also getting an error on set now. Going to continue looking into this, but if anybody has thoughts on this I'd be happy to hear them. Also from_pylist appears to have been removed, but I didn't find it searching the change log on github only an addition in 0.3. I'm going to look at the history or __eq__ on the ScalarValue and as_py then work on what would make sense for __hash__. 
{code:java} %load_ext Cython import pyarrow as pa pylist = [1,1,1,2] arr = pa.array(pylist) arr [ 1, 1, 1, 2 ] arr[0] == arr[1] True arr[0] == arr[3] False word_list = ['test', 'not the same', 'test', 'nope'] word_list[0] == word_list[2] True word_list[0] == word_list[1] False pa.array.__eq__ set(arr) --- TypeError Traceback (most recent call last) in () > 1 set(arr) TypeError: unhashable type: 'pyarrow.lib.Int64Value' arr_list = pa.from_pylist([1, 1, 1, 2]) --- AttributeErrorTraceback (most recent call last) in () > 1 arr_list = pa.from_pylist([1, 1, 1, 2]) AttributeError: module 'pyarrow' has no attribute 'from_pylist' {code} > [Python] Arrow scalar values should have a sensible __hash__ and comparison > --- > > Key: ARROW-640 > URL: https://issues.apache.org/jira/browse/ARROW-640 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Miki Tebeka >Assignee: Alex Hagerman >Priority: Major > Fix For: 0.10.0 > > > {noformat} > In [86]: arr = pa.from_pylist([1, 1, 1, 2]) > In [87]: set(arr) > Out[87]: {1, 2, 1, 1} > In [88]: arr[0] == arr[1] > Out[88]: False > In [89]: arr > Out[89]: > > [ > 1, > 1, > 1, > 2 > ] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (ARROW-640) [Python] Arrow scalar values should have a sensible __hash__ and comparison
[ https://issues.apache.org/jira/browse/ARROW-640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16394627#comment-16394627 ] Alex Hagerman edited comment on ARROW-640 at 3/11/18 9:01 PM: -- I think this has changed since the original ticket. The comparison appears to be working. Tested this with string and numbers. Also getting an error on set now. Going to continue looking into this, but if anybody has thoughts on this I'd be happy to hear them. Also from_pylist appears to have been removed, but I didn't find it searching the change log on github only an addition in 0.3. I'm going to look at the history or __eq__ on the ScalarValue and as_py then work on what would make sense for __hash__. {code:java} %load_ext Cython import pyarrow as pa pylist = [1,1,1,2] arr = pa.array(pylist) arr [ 1, 1, 1, 2 ] arr[0] == arr[1] True arr[0] == arr[3] False word_list = ['test', 'not the same', 'test', 'nope'] word_list[0] == word_list[2] True word_list[0] == word_list[1] False pa.array.__eq__ set(arr) --- TypeError Traceback (most recent call last) in () > 1 set(arr) TypeError: unhashable type: 'pyarrow.lib.Int64Value' arr_list = pa.from_pylist([1, 1, 1, 2]) --- AttributeErrorTraceback (most recent call last) in () > 1 arr_list = pa.from_pylist([1, 1, 1, 2]) AttributeError: module 'pyarrow' has no attribute 'from_pylist' {code} was (Author: alexhagerman): I think this has changed since the original ticket. The comparison appears to be working. Tested this with string and numbers. Also getting an error on set now. Going to continue looking into this, but if anybody has thoughts on this I'd be happy to hear them. Also from_pylist appears to have been removed, but I didn't find it searching the change log on github only an addition in 0.3. 
{code:java} %load_ext Cython import pyarrow as pa pylist = [1,1,1,2] arr = pa.array(pylist) arr [ 1, 1, 1, 2 ] arr[0] == arr[1] True arr[0] == arr[3] False word_list = ['test', 'not the same', 'test', 'nope'] word_list[0] == word_list[2] True word_list[0] == word_list[1] False pa.array.__eq__ set(arr) --- TypeError Traceback (most recent call last) in () > 1 set(arr) TypeError: unhashable type: 'pyarrow.lib.Int64Value' arr_list = pa.from_pylist([1, 1, 1, 2]) --- AttributeErrorTraceback (most recent call last) in () > 1 arr_list = pa.from_pylist([1, 1, 1, 2]) AttributeError: module 'pyarrow' has no attribute 'from_pylist' {code} > [Python] Arrow scalar values should have a sensible __hash__ and comparison > --- > > Key: ARROW-640 > URL: https://issues.apache.org/jira/browse/ARROW-640 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Miki Tebeka >Assignee: Alex Hagerman >Priority: Major > Fix For: 0.10.0 > > > {noformat} > In [86]: arr = pa.from_pylist([1, 1, 1, 2]) > In [87]: set(arr) > Out[87]: {1, 2, 1, 1} > In [88]: arr[0] == arr[1] > Out[88]: False > In [89]: arr > Out[89]: > > [ > 1, > 1, > 1, > 2 > ] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (ARROW-640) [Python] Arrow scalar values should have a sensible __hash__ and comparison
[ https://issues.apache.org/jira/browse/ARROW-640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16394627#comment-16394627 ] Alex Hagerman edited comment on ARROW-640 at 3/11/18 8:16 PM: -- I think this has changed since the original ticket. The comparison appears to be working. Tested this with string and numbers. Also getting an error on set now. Going to continue looking into this, but if anybody has thoughts on this I'd be happy to hear them. Also from_pylist appears to have been removed, but I didn't find it searching the change log on github only an addition in 0.3. {code:java} %load_ext Cython import pyarrow as pa pylist = [1,1,1,2] arr = pa.array(pylist) arr [ 1, 1, 1, 2 ] arr[0] == arr[1] True arr[0] == arr[3] False word_list = ['test', 'not the same', 'test', 'nope'] word_list[0] == word_list[2] True word_list[0] == word_list[1] False pa.array.__eq__ set(arr) --- TypeError Traceback (most recent call last) in () > 1 set(arr) TypeError: unhashable type: 'pyarrow.lib.Int64Value' arr_list = pa.from_pylist([1, 1, 1, 2]) --- AttributeErrorTraceback (most recent call last) in () > 1 arr_list = pa.from_pylist([1, 1, 1, 2]) AttributeError: module 'pyarrow' has no attribute 'from_pylist' {code} was (Author: alexhagerman): I think this has changed since the original ticket. The comparison appears to be working. Tested this with string and numbers. Also getting an error on set now. Going to continue looking into this, but if anybody has thoughts on this I'd be happy to hear them. Also from_pylist appears to have been removed, but I didn't find it searching the change log on github only an addition in 0.3. 
> [Python] Arrow scalar values should have a sensible __hash__ and comparison > --- > > Key: ARROW-640 > URL: https://issues.apache.org/jira/browse/ARROW-640 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Miki Tebeka >Assignee: Alex Hagerman >Priority: Major > Fix For: 0.10.0 > > > {noformat} > In [86]: arr = pa.from_pylist([1, 1, 1, 2]) > In [87]: set(arr) > Out[87]: {1, 2, 1, 1} > In [88]: arr[0] == arr[1] > Out[88]: False > In [89]: arr > Out[89]: > > [ > 1, > 1, > 1, > 2 > ] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-640) [Python] Arrow scalar values should have a sensible __hash__ and comparison
[ https://issues.apache.org/jira/browse/ARROW-640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16394627#comment-16394627 ] Alex Hagerman commented on ARROW-640: - I think this has changed since the original ticket. The comparison appears to be working. Tested this with string and numbers. Also getting an error on set now. Going to continue looking into this, but if anybody has thoughts on this I'd be happy to hear them. Also from_pylist appears to have been removed, but I didn't find it searching the change log on github only an addition in 0.3. ```python %load_ext Cython ``` ```python import pyarrow as pa pylist = [1,1,1,2] arr = pa.array(pylist) arr ``` [ 1, 1, 1, 2 ] ```python arr[0] == arr[1] ``` True ```python set(arr) ``` --- TypeError Traceback (most recent call last) in () > 1 set(arr) TypeError: unhashable type: 'pyarrow.lib.Int64Value' ```python arr_list = pa.from_pylist([1, 1, 1, 2]) ``` --- AttributeError Traceback (most recent call last) in () > 1 arr_list = pa.from_pylist([1, 1, 1, 2]) AttributeError: module 'pyarrow' has no attribute 'from_pylist' > [Python] Arrow scalar values should have a sensible __hash__ and comparison > --- > > Key: ARROW-640 > URL: https://issues.apache.org/jira/browse/ARROW-640 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Miki Tebeka >Assignee: Alex Hagerman >Priority: Major > Fix For: 0.10.0 > > > {noformat} > In [86]: arr = pa.from_pylist([1, 1, 1, 2]) > In [87]: set(arr) > Out[87]: {1, 2, 1, 1} > In [88]: arr[0] == arr[1] > Out[88]: False > In [89]: arr > Out[89]: > > [ > 1, > 1, > 1, > 2 > ] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (ARROW-640) [Python] Arrow scalar values should have a sensible __hash__ and comparison
[ https://issues.apache.org/jira/browse/ARROW-640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16394627#comment-16394627 ] Alex Hagerman edited comment on ARROW-640 at 3/11/18 8:13 PM: -- I think this has changed since the original ticket. The comparison appears to be working. Tested this with string and numbers. Also getting an error on set now. Going to continue looking into this, but if anybody has thoughts on this I'd be happy to hear them. Also from_pylist appears to have been removed, but I didn't find it searching the change log on github only an addition in 0.3. was (Author: alexhagerman): I think this has changed since the original ticket. The comparison appears to be working. Tested this with string and numbers. Also getting an error on set now. Going to continue looking into this, but if anybody has thoughts on this I'd be happy to hear them. Also from_pylist appears to have been removed, but I didn't find it searching the change log on github only an addition in 0.3. 
```python %load_ext Cython ``` ```python import pyarrow as pa pylist = [1,1,1,2] arr = pa.array(pylist) arr ``` [ 1, 1, 1, 2 ] ```python arr[0] == arr[1] ``` True ```python set(arr) ``` --- TypeError Traceback (most recent call last) in () > 1 set(arr) TypeError: unhashable type: 'pyarrow.lib.Int64Value' ```python arr_list = pa.from_pylist([1, 1, 1, 2]) ``` --- AttributeError Traceback (most recent call last) in () > 1 arr_list = pa.from_pylist([1, 1, 1, 2]) AttributeError: module 'pyarrow' has no attribute 'from_pylist' > [Python] Arrow scalar values should have a sensible __hash__ and comparison > --- > > Key: ARROW-640 > URL: https://issues.apache.org/jira/browse/ARROW-640 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Miki Tebeka >Assignee: Alex Hagerman >Priority: Major > Fix For: 0.10.0 > > > {noformat} > In [86]: arr = pa.from_pylist([1, 1, 1, 2]) > In [87]: set(arr) > Out[87]: {1, 2, 1, 1} > In [88]: arr[0] == arr[1] > Out[88]: False > In [89]: arr > Out[89]: > > [ > 1, > 1, > 1, > 2 > ] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-640) [Python] Arrow scalar values should have a sensible __hash__ and comparison
[ https://issues.apache.org/jira/browse/ARROW-640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Hagerman reassigned ARROW-640: --- Assignee: Alex Hagerman > [Python] Arrow scalar values should have a sensible __hash__ and comparison > --- > > Key: ARROW-640 > URL: https://issues.apache.org/jira/browse/ARROW-640 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Miki Tebeka >Assignee: Alex Hagerman >Priority: Major > Fix For: 0.10.0 > > > {noformat} > In [86]: arr = pa.from_pylist([1, 1, 1, 2]) > In [87]: set(arr) > Out[87]: {1, 2, 1, 1} > In [88]: arr[0] == arr[1] > Out[88]: False > In [89]: arr > Out[89]: > > [ > 1, > 1, > 1, > 2 > ] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1391) [Python] Benchmarks for python serialization
[ https://issues.apache.org/jira/browse/ARROW-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16383033#comment-16383033 ] Alex Hagerman commented on ARROW-1391: -- I see recent commits in the repo for the benchmarks. Is this still needed? If so, any guidance on where the nightly location might be, or how to look into this? > [Python] Benchmarks for python serialization > > > Key: ARROW-1391 > URL: https://issues.apache.org/jira/browse/ARROW-1391 > Project: Apache Arrow > Issue Type: Wish >Reporter: Philipp Moritz >Priority: Minor > > It would be great to have a suite of relevant benchmarks for the Python > serialization code in ARROW-759. These could be used to guide profiling and > performance improvements. > Relevant use cases include: > - dictionaries of large numpy arrays that are used to represent weights of a > neural network > - long lists of primitive types like ints, floats or strings > - lists of user defined python objects -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2242) [Python] ParquetFile.read does not accommodate large binary data
[ https://issues.apache.org/jira/browse/ARROW-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382733#comment-16382733 ] Alex Hagerman commented on ARROW-2242: -- I think these may be related? https://github.com/apache/arrow/issues/1677 > [Python] ParquetFile.read does not accommodate large binary data > - > > Key: ARROW-2242 > URL: https://issues.apache.org/jira/browse/ARROW-2242 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 >Reporter: Chris Ellison >Priority: Major > Fix For: 0.9.0 > > > When reading a parquet file with binary data > 2 GiB, we get an ArrowIOError > due to it not creating chunked arrays. Reading each row group individually > and then concatenating the tables works, however. > > {code:java} > import pandas as pd > import pyarrow as pa > import pyarrow.parquet as pq > x = pa.array(list('1' * 2**30)) > demo = 'demo.parquet' > def scenario(): > t = pa.Table.from_arrays([x], ['x']) > writer = pq.ParquetWriter(demo, t.schema) > for i in range(2): > writer.write_table(t) > writer.close() > pf = pq.ParquetFile(demo) > # pyarrow.lib.ArrowIOError: Arrow error: Invalid: BinaryArray cannot > contain more than 2147483646 bytes, have 2147483647 > t2 = pf.read() > # Works, but note, there are 32 row groups, not 2 as suggested by: > # > https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing > tables = [pf.read_row_group(i) for i in range(pf.num_row_groups)] > t3 = pa.concat_tables(tables) > scenario() > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
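The workaround in the ticket reads each row group separately and concatenates the resulting tables. The underlying idea, splitting a payload that would exceed a per-array byte cap into chunks under that cap, can be illustrated with plain Python. `chunk_bytes` is an illustrative helper, not a pyarrow API, and the 2**31 - 2 byte `BinaryArray` limit is scaled down here so the sketch runs instantly.

```python
def chunk_bytes(payload, max_chunk):
    """Split payload into consecutive pieces no larger than max_chunk bytes."""
    if max_chunk <= 0:
        raise ValueError("max_chunk must be positive")
    return [payload[i:i + max_chunk] for i in range(0, len(payload), max_chunk)]


# Scaled-down stand-in for the BinaryArray cap of 2147483646 bytes.
MAX_BYTES = 16
data = b"1" * 40
chunks = chunk_bytes(data, MAX_BYTES)
```

A chunked array built this way preserves the full data while keeping every individual buffer under the limit, which is why the per-row-group read in the ticket succeeds where the single `pf.read()` fails.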