[jira] [Commented] (SPARK-25933) Fix pstats reference for spark.python.profile.dump in configuration.md
[ https://issues.apache.org/jira/browse/SPARK-25933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16674043#comment-16674043 ]

Alex Hagerman commented on SPARK-25933:
---------------------------------------

https://github.com/apache/spark/pull/22933

> Fix pstats reference for spark.python.profile.dump in configuration.md
> ----------------------------------------------------------------------
>
>                 Key: SPARK-25933
>                 URL: https://issues.apache.org/jira/browse/SPARK-25933
>             Project: Spark
>          Issue Type: Documentation
>          Components: Documentation
>    Affects Versions: 2.3.2
>            Reporter: Alex Hagerman
>            Priority: Trivial
>              Labels: documentation, pull-request-available
>             Fix For: 2.3.2
>
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> ptats.Stats() should be pstats.Stats() in
> https://spark.apache.org/docs/latest/configuration.html for
> spark.python.profile.dump.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
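For readers hitting the same doc line: spark.python.profile.dump writes cProfile stats files that are meant to be loaded back with the stdlib pstats module. A minimal stdlib-only sketch of that round trip (the directory and "rdd_0.pstats" file name here are made up for illustration, not what Spark emits):

```python
import cProfile
import os
import pstats
import tempfile

# Produce a stats dump the same way a profiler would: cProfile stats
# serialized to a file, later reloaded with pstats.Stats().
path = os.path.join(tempfile.mkdtemp(), "rdd_0.pstats")  # hypothetical name
cProfile.run("sum(range(1000))", path)

# This is the call the docs misspelled as ptats.Stats():
stats = pstats.Stats(path)
stats.sort_stats("cumulative").print_stats(5)
```

The same pstats.Stats() call works on any file produced by a profiler dump directory, one file per profiled stage.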
[jira] [Updated] (SPARK-25933) Fix pstats reference for spark.python.profile.dump in configuration.md
[ https://issues.apache.org/jira/browse/SPARK-25933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Hagerman updated SPARK-25933:
----------------------------------
    Labels: documentation pull-request-available  (was: documentation)

> Fix pstats reference for spark.python.profile.dump in configuration.md
> ----------------------------------------------------------------------
>
>                 Key: SPARK-25933
>                 URL: https://issues.apache.org/jira/browse/SPARK-25933
>             Project: Spark
>          Issue Type: Documentation
>          Components: Documentation
>    Affects Versions: 2.3.2
>            Reporter: Alex Hagerman
>            Priority: Trivial
>              Labels: documentation, pull-request-available
>             Fix For: 2.3.2
>
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> ptats.Stats() should be pstats.Stats() in
> https://spark.apache.org/docs/latest/configuration.html for
> spark.python.profile.dump.
[jira] [Created] (SPARK-25933) Fix pstats reference for spark.python.profile.dump in configuration.md
Alex Hagerman created SPARK-25933:
-------------------------------------

             Summary: Fix pstats reference for spark.python.profile.dump in configuration.md
                 Key: SPARK-25933
                 URL: https://issues.apache.org/jira/browse/SPARK-25933
             Project: Spark
          Issue Type: Documentation
          Components: Documentation
    Affects Versions: 2.3.2
            Reporter: Alex Hagerman
             Fix For: 2.3.2


ptats.Stats() should be pstats.Stats() in https://spark.apache.org/docs/latest/configuration.html for spark.python.profile.dump.
[jira] [Assigned] (ARROW-2600) [Python] Add additional LocalFileSystem filesystem methods
[ https://issues.apache.org/jira/browse/ARROW-2600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Hagerman reassigned ARROW-2600:
------------------------------------
    Assignee: (was: Alex Hagerman)

> [Python] Add additional LocalFileSystem filesystem methods
> ----------------------------------------------------------
>
>                 Key: ARROW-2600
>                 URL: https://issues.apache.org/jira/browse/ARROW-2600
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Alex Hagerman
>            Priority: Minor
>              Labels: filesystem, pull-request-available
>             Fix For: 0.12.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Related to https://issues.apache.org/jira/browse/ARROW-1319 I noticed the
> methods Martin listed are also not part of the LocalFileSystem class.
[jira] [Updated] (ARROW-2760) [Python] Remove legacy property definition syntax from parquet module and test them
[ https://issues.apache.org/jira/browse/ARROW-2760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Hagerman updated ARROW-2760:
---------------------------------
    Component/s: Python

> [Python] Remove legacy property definition syntax from parquet module and
> test them
> -------------------------------------------------------------------------
>
>                 Key: ARROW-2760
>                 URL: https://issues.apache.org/jira/browse/ARROW-2760
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Krisztian Szucs
>            Assignee: Krisztian Szucs
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.10.0
>
>          Time Spent: 4.5h
>  Remaining Estimate: 0h
[jira] [Commented] (ARROW-955) [Docs] Guide for building Python from source on Ubuntu 14.04 LTS without conda
[ https://issues.apache.org/jira/browse/ARROW-955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16538419#comment-16538419 ]

Alex Hagerman commented on ARROW-955:
-------------------------------------

Does this still need to happen with the updated dev docs? I know they are 16.04, but 14.04 is in maintenance and EOLs Q1 next year. Would it be better to validate builds on 18.04, the new LTS?

https://www.ubuntu.com/info/release-end-of-life
https://arrow.apache.org/docs/python/development.html#developing-on-linux-and-macos

> [Docs] Guide for building Python from source on Ubuntu 14.04 LTS without conda
> ------------------------------------------------------------------------------
>
>                 Key: ARROW-955
>                 URL: https://issues.apache.org/jira/browse/ARROW-955
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>         Environment: Ubuntu - 3.19.0-80-generic #88~14.04.1-Ubuntu
>                      Python 2.7.6
>            Reporter: Devang Shah
>            Priority: Major
>
> I built pyarrow, arrow, and parquet-cpp from source - so that I could use the
> new read_row_group() interface and in general, have access to the latest
> versions. I ran into many issues during the build but was ultimately
> successful (notes below). However, I am not able to import pyarrow.parquet
> due to the following issue:
>
> >>> import pyarrow.parquet
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "pyarrow/__init__.py", line 28, in <module>
>     import pyarrow._config
> ImportError: No module named _config
>
> This is similar to an issue reported in github/conda-forge/pyarrow-feedstock,
> where also I posted this... but I think this forum is more direct and
> appropriate - so re-posting here.
> I used instructions at https://arrow.apache.org/docs/python/install.html to
> build arrow/cpp, parquet-cpp, and then pyarrow, with the following deviations
> (I view them as possibly bugs in the instructions):
>
> arrow/cpp build:
>   export ARROW_HOME=$HOME/local
>   I had to specify -DARROW_PYTHON=on and -DPARQUET_ARROW=ON to the cmake
>   command (besides the -DCMAKE_INSTALL_PREFIX=$ARROW_HOME)
>
> parquet-cpp build:
>   export ARROW_HOME=$HOME/local
>   cmake -DARROW_HOME=$HOME/local -DPARQUET_ARROW_LINKAGE=static -DPARQUET_ARROW=ON .
>   make
>   sudo make install
>   this installs parquet libs in the std systems location (/usr/local/lib)
>   so that the pyarrow build (see below) can find the parquet libs
>
> pyarrow build:
>   export ARROW_HOME=$HOME/local (not a deviation; just repeating here)
>   export LD_LIBRARY_PATH=$HOME/local/lib:$HOME/parquet4/parquet-cpp/build/latest
>   sudo python setup.py build_ext --with-parquet --with-jemalloc --build-type=release install
>   sudo python setup.py install
>   (sudo is needed to install in /usr/local/lib/python2.7/dist-packages)
>
> These are the steps and modifications to the instructions needed for me to
> build the pyarrow.parquet package. However, when I now try to import the
> package I get the error specified above.
> Maybe I did something wrong in my steps, which I kind of put together by
> searching for these issues... but really can't tell what. It took me almost a
> whole day to get to the point where I can build pyarrow and parquet, and now
> I can't use what I built.
> Any comments, help appreciated! Thanks in advance.
[jira] [Updated] (ARROW-2586) Make child builders of ListBuilder and StructBuilder shared_ptr's
[ https://issues.apache.org/jira/browse/ARROW-2586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Hagerman updated ARROW-2586:
---------------------------------
    Component/s: C++

> Make child builders of ListBuilder and StructBuilder shared_ptr's
> -----------------------------------------------------------------
>
>                 Key: ARROW-2586
>                 URL: https://issues.apache.org/jira/browse/ARROW-2586
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Joshua Storck
>            Assignee: Joshua Storck
>            Priority: Major
>              Labels: pull-request-available
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This is needed for changes in this PR that make it possible to deserialize
> arbitrary nested structures in parquet (ARROW-1644):
> https://github.com/apache/parquet-cpp/pull/462
[jira] [Updated] (ARROW-2658) [Python] Serialize and Deserialize Table objects
[ https://issues.apache.org/jira/browse/ARROW-2658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Hagerman updated ARROW-2658:
---------------------------------
    Summary: [Python] Serialize and Deserialize Table objects  (was: Serialize and Deserialize Table objects)

> [Python] Serialize and Deserialize Table objects
> ------------------------------------------------
>
>                 Key: ARROW-2658
>                 URL: https://issues.apache.org/jira/browse/ARROW-2658
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Python
>            Reporter: Kunal Gosar
>            Priority: Major
>
> Add support for serializing and deserializing pyarrow Tables. This would
> allow using Table objects in plasma, and DataFrames can be converted to a
> Table object as an intermediary for serialization. Currently I see the
> following when trying this operation:
> {code:java}
> In [36]: pa.serialize(t)
> ---------------------------------------------------------------------------
> SerializationCallbackError                Traceback (most recent call last)
> <ipython-input-...> in <module>()
> ----> 1 pa.serialize(t)
>
> ~/dev/arrow/python/pyarrow/serialization.pxi in pyarrow.lib.serialize()
>     336
>     337     with nogil:
> --> 338         check_status(SerializeObject(context, wrapped_value, &serialized.data))
>     339     return serialized
>     340
>
> ~/dev/arrow/python/pyarrow/serialization.pxi in pyarrow.lib.SerializationContext._serialize_callback()
>     134
>     135         if not found:
> --> 136             raise SerializationCallbackError(
>     137                 "pyarrow does not know how to "
>     138                 "serialize objects of type {}.".format(type(obj)), obj)
>
> SerializationCallbackError: pyarrow does not know how to serialize objects of type <class 'pyarrow.lib.Table'>.
> {code}
[jira] [Updated] (ARROW-2658) Serialize and Deserialize Table objects
[ https://issues.apache.org/jira/browse/ARROW-2658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Hagerman updated ARROW-2658:
---------------------------------
    Component/s: Python

> Serialize and Deserialize Table objects
> ---------------------------------------
>
>                 Key: ARROW-2658
>                 URL: https://issues.apache.org/jira/browse/ARROW-2658
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Python
>            Reporter: Kunal Gosar
>            Priority: Major
>
> Add support for serializing and deserializing pyarrow Tables. This would
> allow using Table objects in plasma, and DataFrames can be converted to a
> Table object as an intermediary for serialization. Currently I see the
> following when trying this operation:
> {code:java}
> In [36]: pa.serialize(t)
> ---------------------------------------------------------------------------
> SerializationCallbackError                Traceback (most recent call last)
> <ipython-input-...> in <module>()
> ----> 1 pa.serialize(t)
>
> ~/dev/arrow/python/pyarrow/serialization.pxi in pyarrow.lib.serialize()
>     336
>     337     with nogil:
> --> 338         check_status(SerializeObject(context, wrapped_value, &serialized.data))
>     339     return serialized
>     340
>
> ~/dev/arrow/python/pyarrow/serialization.pxi in pyarrow.lib.SerializationContext._serialize_callback()
>     134
>     135         if not found:
> --> 136             raise SerializationCallbackError(
>     137                 "pyarrow does not know how to "
>     138                 "serialize objects of type {}.".format(type(obj)), obj)
>
> SerializationCallbackError: pyarrow does not know how to serialize objects of type <class 'pyarrow.lib.Table'>.
> {code}
[jira] [Updated] (ARROW-2710) pyarrow.lib.ArrowIOError when running PyTorch DataLoader in multiprocessing
[ https://issues.apache.org/jira/browse/ARROW-2710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Hagerman updated ARROW-2710:
---------------------------------
    Component/s: Python

> pyarrow.lib.ArrowIOError when running PyTorch DataLoader in multiprocessing
> ---------------------------------------------------------------------------
>
>                 Key: ARROW-2710
>                 URL: https://issues.apache.org/jira/browse/ARROW-2710
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.8.0, 0.9.0
>         Environment: Tested on several Linux OSs.
>            Reporter: Michael Andrews
>            Priority: Major
>
> Unable to open a parquet file via {{pq.ParquetFile(filename)}} when called
> using the PyTorch DataLoader in multiprocessing mode. Affects versions
> pyarrow > 0.7.1.
> As detailed in [https://github.com/apache/arrow/issues/1946].
[jira] [Updated] (ARROW-2710) [Python] pyarrow.lib.ArrowIOError when running PyTorch DataLoader in multiprocessing
[ https://issues.apache.org/jira/browse/ARROW-2710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Hagerman updated ARROW-2710:
---------------------------------
    Summary: [Python] pyarrow.lib.ArrowIOError when running PyTorch DataLoader in multiprocessing  (was: pyarrow.lib.ArrowIOError when running PyTorch DataLoader in multiprocessing)

> [Python] pyarrow.lib.ArrowIOError when running PyTorch DataLoader in
> multiprocessing
> --------------------------------------------------------------------
>
>                 Key: ARROW-2710
>                 URL: https://issues.apache.org/jira/browse/ARROW-2710
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.8.0, 0.9.0
>         Environment: Tested on several Linux OSs.
>            Reporter: Michael Andrews
>            Priority: Major
>
> Unable to open a parquet file via {{pq.ParquetFile(filename)}} when called
> using the PyTorch DataLoader in multiprocessing mode. Affects versions
> pyarrow > 0.7.1.
> As detailed in [https://github.com/apache/arrow/issues/1946].
[jira] [Updated] (ARROW-2787) [Python] Memory Issue passing table from python to c++ via cython
[ https://issues.apache.org/jira/browse/ARROW-2787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Hagerman updated ARROW-2787:
---------------------------------
    Summary: [Python] Memory Issue passing table from python to c++ via cython  (was: Memory Issue passing table from python to c++ via cython)

> [Python] Memory Issue passing table from python to c++ via cython
> -----------------------------------------------------------------
>
>                 Key: ARROW-2787
>                 URL: https://issues.apache.org/jira/browse/ARROW-2787
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Integration, Python
>    Affects Versions: 0.9.0
>         Environment: clang6
>            Reporter: Joseph Toth
>            Priority: Major
>              Labels: cython
>
> I wanted to create a simple example of reading a table in Python and pass it
> to C++, but I'm doing something wrong or there is a memory issue. When the
> table gets to C++ and I print out column names it also prints out a lot of
> junk and what looks like pydocs. Let me know if you need any more info.
> Thanks!
>
> *demo.py*
> import numpy
> import pandas as pd
> import pyarrow as pa
> from psy.automl import cyth
> from absl import app
>
> def main(argv):
>     sup = pd.DataFrame({
>         'int': [1, 2],
>         'str': ['a', 'b']
>     })
>     table = pa.Table.from_pandas(sup)
>     cyth.c_t(table)
>
> *cyth.pyx*
> import pandas as pd
> import pyarrow as pa
> from pyarrow.lib cimport *
>
> cdef extern from "cyth.h" namespace "psy":
>     void t(shared_ptr[CTable])
>
> def c_t(obj):
>     # These prints work
>     # for i in range(obj.num_columns):
>     #     print(obj.column(i).name)
>     cdef shared_ptr[CTable] tbl = pyarrow_unwrap_table(obj)
>     t(tbl)
>
> *cyth.h*
> #include <iostream>
> #include <memory>
> #include "arrow/api.h"
> #include "arrow/python/api.h"
> #include "Python.h"
>
> namespace psy {
> void t(std::shared_ptr<arrow::Table> pytable) {
>   // This works
>   std::cout << "NUM" << pytable->num_columns();
>   // This prints a lot of garbage
>   for (int i = 0; i < pytable->num_columns(); i++) {
>     std::cout << pytable->column(i)->name();
>   }
> }
> }
[jira] [Updated] (ARROW-2787) Memory Issue passing table from python to c++ via cython
[ https://issues.apache.org/jira/browse/ARROW-2787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Hagerman updated ARROW-2787:
---------------------------------
    Labels: cython  (was: )

> Memory Issue passing table from python to c++ via cython
> --------------------------------------------------------
>
>                 Key: ARROW-2787
>                 URL: https://issues.apache.org/jira/browse/ARROW-2787
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Integration, Python
>    Affects Versions: 0.9.0
>         Environment: clang6
>            Reporter: Joseph Toth
>            Priority: Major
>              Labels: cython
>
> I wanted to create a simple example of reading a table in Python and pass it
> to C++, but I'm doing something wrong or there is a memory issue. When the
> table gets to C++ and I print out column names it also prints out a lot of
> junk and what looks like pydocs. Let me know if you need any more info.
> Thanks!
>
> *demo.py*
> import numpy
> import pandas as pd
> import pyarrow as pa
> from psy.automl import cyth
> from absl import app
>
> def main(argv):
>     sup = pd.DataFrame({
>         'int': [1, 2],
>         'str': ['a', 'b']
>     })
>     table = pa.Table.from_pandas(sup)
>     cyth.c_t(table)
>
> *cyth.pyx*
> import pandas as pd
> import pyarrow as pa
> from pyarrow.lib cimport *
>
> cdef extern from "cyth.h" namespace "psy":
>     void t(shared_ptr[CTable])
>
> def c_t(obj):
>     # These prints work
>     # for i in range(obj.num_columns):
>     #     print(obj.column(i).name)
>     cdef shared_ptr[CTable] tbl = pyarrow_unwrap_table(obj)
>     t(tbl)
>
> *cyth.h*
> #include <iostream>
> #include <memory>
> #include "arrow/api.h"
> #include "arrow/python/api.h"
> #include "Python.h"
>
> namespace psy {
> void t(std::shared_ptr<arrow::Table> pytable) {
>   // This works
>   std::cout << "NUM" << pytable->num_columns();
>   // This prints a lot of garbage
>   for (int i = 0; i < pytable->num_columns(); i++) {
>     std::cout << pytable->column(i)->name();
>   }
> }
> }
[jira] [Updated] (ARROW-2787) Memory Issue passing table from python to c++ via cython
[ https://issues.apache.org/jira/browse/ARROW-2787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Hagerman updated ARROW-2787:
---------------------------------
    Component/s: Python

> Memory Issue passing table from python to c++ via cython
> --------------------------------------------------------
>
>                 Key: ARROW-2787
>                 URL: https://issues.apache.org/jira/browse/ARROW-2787
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Integration, Python
>    Affects Versions: 0.9.0
>         Environment: clang6
>            Reporter: Joseph Toth
>            Priority: Major
>              Labels: cython
>
> I wanted to create a simple example of reading a table in Python and pass it
> to C++, but I'm doing something wrong or there is a memory issue. When the
> table gets to C++ and I print out column names it also prints out a lot of
> junk and what looks like pydocs. Let me know if you need any more info.
> Thanks!
>
> *demo.py*
> import numpy
> import pandas as pd
> import pyarrow as pa
> from psy.automl import cyth
> from absl import app
>
> def main(argv):
>     sup = pd.DataFrame({
>         'int': [1, 2],
>         'str': ['a', 'b']
>     })
>     table = pa.Table.from_pandas(sup)
>     cyth.c_t(table)
>
> *cyth.pyx*
> import pandas as pd
> import pyarrow as pa
> from pyarrow.lib cimport *
>
> cdef extern from "cyth.h" namespace "psy":
>     void t(shared_ptr[CTable])
>
> def c_t(obj):
>     # These prints work
>     # for i in range(obj.num_columns):
>     #     print(obj.column(i).name)
>     cdef shared_ptr[CTable] tbl = pyarrow_unwrap_table(obj)
>     t(tbl)
>
> *cyth.h*
> #include <iostream>
> #include <memory>
> #include "arrow/api.h"
> #include "arrow/python/api.h"
> #include "Python.h"
>
> namespace psy {
> void t(std::shared_ptr<arrow::Table> pytable) {
>   // This works
>   std::cout << "NUM" << pytable->num_columns();
>   // This prints a lot of garbage
>   for (int i = 0; i < pytable->num_columns(); i++) {
>     std::cout << pytable->column(i)->name();
>   }
> }
> }
[jira] [Updated] (ARROW-2709) [Python] write_to_dataset poor performance when splitting
[ https://issues.apache.org/jira/browse/ARROW-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Hagerman updated ARROW-2709:
---------------------------------
    Summary: [Python] write_to_dataset poor performance when splitting  (was: write_to_dataset poor performance when splitting)

> [Python] write_to_dataset poor performance when splitting
> ---------------------------------------------------------
>
>                 Key: ARROW-2709
>                 URL: https://issues.apache.org/jira/browse/ARROW-2709
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Olaf
>            Priority: Critical
>              Labels: parquet
>
> Hello,
> Posting this from github (master [~wesmckinn] asked for it :) )
> https://github.com/apache/arrow/issues/2138
>
> {code:java}
> import pandas as pd
> import numpy as np
> import pyarrow.parquet as pq
> import pyarrow as pa
>
> idx = pd.date_range('2017-01-01 12:00:00.000', '2017-03-01 12:00:00.000', freq='T')
> df = pd.DataFrame({'numeric_col': np.random.rand(len(idx)),
>                    'string_col': pd.util.testing.rands_array(8, len(idx))},
>                   index=idx)
> {code}
>
> {code:java}
> df["dt"] = df.index
> df["dt"] = df["dt"].dt.date
> table = pa.Table.from_pandas(df)
> pq.write_to_dataset(table, root_path='dataset_name',
>                     partition_cols=['dt'], flavor='spark')
> {code}
>
> {{This works but is inefficient memory-wise. The arrow table is a copy of the
> large pandas dataframe and quickly saturates the RAM.}}
>
> {{Thanks!}}
[jira] [Updated] (ARROW-2274) [Python] ObjectID from string
[ https://issues.apache.org/jira/browse/ARROW-2274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Hagerman updated ARROW-2274:
---------------------------------
    Summary: [Python] ObjectID from string  (was: ObjectID from string)

> [Python] ObjectID from string
> -----------------------------
>
>                 Key: ARROW-2274
>                 URL: https://issues.apache.org/jira/browse/ARROW-2274
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Eric Feldman
>            Priority: Critical
>
> I want to have an ObjectID from a string.
> The problem is that if I'm creating a new ObjectID from a string and inserting
> a value associated with that id, the next time I generate an ObjectID from
> that string, the id is different.
> I'm looking for something like a key-value store, is it possible?
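One workaround pattern for the key/value-style lookup asked about above (an illustration, not an existing plasma API): hash the string key down to the 20 bytes a plasma ObjectID is built from, so the same key always maps to the same ID. The helper name object_id_for_key is hypothetical:

```python
import hashlib

PLASMA_ID_SIZE = 20  # plasma ObjectIDs are 20 bytes long

def object_id_for_key(key: str) -> bytes:
    """Derive a stable 20-byte ID from a string key (hypothetical helper).

    The same key always yields the same bytes; those bytes can then be
    passed to plasma's ObjectID constructor, so a string key maps to a
    fixed object in the store, giving key/value-style lookups.
    """
    # sha1 digests are exactly 20 bytes, matching the ObjectID size.
    return hashlib.sha1(key.encode("utf-8")).digest()[:PLASMA_ID_SIZE]

# Deterministic: the same string produces the same ID across processes.
a = object_id_for_key("my-table")
b = object_id_for_key("my-table")
```

Note this trades the random-ID collision guarantees for determinism; distinct keys could in principle collide, which is the reporter's own risk to accept.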
[jira] [Updated] (ARROW-2709) write_to_dataset poor performance when splitting
[ https://issues.apache.org/jira/browse/ARROW-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Hagerman updated ARROW-2709:
---------------------------------
    Labels: parquet  (was: )

> write_to_dataset poor performance when splitting
> ------------------------------------------------
>
>                 Key: ARROW-2709
>                 URL: https://issues.apache.org/jira/browse/ARROW-2709
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Olaf
>            Priority: Critical
>              Labels: parquet
>
> Hello,
> Posting this from github (master [~wesmckinn] asked for it :) )
> https://github.com/apache/arrow/issues/2138
>
> {code:java}
> import pandas as pd
> import numpy as np
> import pyarrow.parquet as pq
> import pyarrow as pa
>
> idx = pd.date_range('2017-01-01 12:00:00.000', '2017-03-01 12:00:00.000', freq='T')
> df = pd.DataFrame({'numeric_col': np.random.rand(len(idx)),
>                    'string_col': pd.util.testing.rands_array(8, len(idx))},
>                   index=idx)
> {code}
>
> {code:java}
> df["dt"] = df.index
> df["dt"] = df["dt"].dt.date
> table = pa.Table.from_pandas(df)
> pq.write_to_dataset(table, root_path='dataset_name',
>                     partition_cols=['dt'], flavor='spark')
> {code}
>
> {{This works but is inefficient memory-wise. The arrow table is a copy of the
> large pandas dataframe and quickly saturates the RAM.}}
>
> {{Thanks!}}
[jira] [Updated] (ARROW-2709) write_to_dataset poor performance when splitting
[ https://issues.apache.org/jira/browse/ARROW-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Hagerman updated ARROW-2709:
---------------------------------
    Component/s: Python

> write_to_dataset poor performance when splitting
> ------------------------------------------------
>
>                 Key: ARROW-2709
>                 URL: https://issues.apache.org/jira/browse/ARROW-2709
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Olaf
>            Priority: Critical
>
> Hello,
> Posting this from github (master [~wesmckinn] asked for it :) )
> https://github.com/apache/arrow/issues/2138
>
> {code:java}
> import pandas as pd
> import numpy as np
> import pyarrow.parquet as pq
> import pyarrow as pa
>
> idx = pd.date_range('2017-01-01 12:00:00.000', '2017-03-01 12:00:00.000', freq='T')
> df = pd.DataFrame({'numeric_col': np.random.rand(len(idx)),
>                    'string_col': pd.util.testing.rands_array(8, len(idx))},
>                   index=idx)
> {code}
>
> {code:java}
> df["dt"] = df.index
> df["dt"] = df["dt"].dt.date
> table = pa.Table.from_pandas(df)
> pq.write_to_dataset(table, root_path='dataset_name',
>                     partition_cols=['dt'], flavor='spark')
> {code}
>
> {{This works but is inefficient memory-wise. The arrow table is a copy of the
> large pandas dataframe and quickly saturates the RAM.}}
>
> {{Thanks!}}
[jira] [Created] (ARROW-2601) [Python] MemoryPool bytes_allocated causes seg
Alex Hagerman created ARROW-2601:
------------------------------------

             Summary: [Python] MemoryPool bytes_allocated causes seg
                 Key: ARROW-2601
                 URL: https://issues.apache.org/jira/browse/ARROW-2601
             Project: Apache Arrow
          Issue Type: Bug
    Affects Versions: 0.9.0
            Reporter: Alex Hagerman
             Fix For: 0.10.0


Python 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 18:21:58)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa
>>> mp = pa.MemoryPool()
>>> arr = pa.array([1,2,3], memory_pool=mp)
>>> mp.bytes_allocated()
Segmentation fault (core dumped)

I'll dig into this further, but should bytes_allocated be returning anything when called like this? Or should it raise NotImplemented?
[jira] [Created] (ARROW-2600) [Python] Add additional LocalFileSystem filesystem methods
Alex Hagerman created ARROW-2600:
------------------------------------

             Summary: [Python] Add additional LocalFileSystem filesystem methods
                 Key: ARROW-2600
                 URL: https://issues.apache.org/jira/browse/ARROW-2600
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Python
            Reporter: Alex Hagerman
            Assignee: Alex Hagerman
             Fix For: 0.10.0


Related to https://issues.apache.org/jira/browse/ARROW-1319 I noticed the methods Martin listed are also not part of the LocalFileSystem class.
[jira] [Commented] (ARROW-2428) [Python] Support ExtensionArrays in to_pandas conversion
[ https://issues.apache.org/jira/browse/ARROW-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16473253#comment-16473253 ]

Alex Hagerman commented on ARROW-2428:
--------------------------------------

[~xhochy] I was reading through the meta issue and trying to understand what we have to make sure to pass. Do you think this has settled enough to begin work? It appears pandas will expect a class defining the type, and I'm guessing the objects in the Arrow column will be instances of that user type? Do we expect Arrow columns to meet all the requirements of ExtensionArray? I was specifically looking at this to understand what options have to be passed and what the ExtensionArray requires.

https://github.com/pandas-dev/pandas/pull/19174/files#diff-e448fe09dbe8aed468d89a4c90e65cff

> [Python] Support ExtensionArrays in to_pandas conversion
> --------------------------------------------------------
>
>                 Key: ARROW-2428
>                 URL: https://issues.apache.org/jira/browse/ARROW-2428
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Uwe L. Korn
>            Priority: Major
>              Labels: beginner
>             Fix For: 1.0.0
>
> With the next release of pandas, it will be possible to define custom column
> types that back a {{pandas.Series}}. Thus we will not be able to cover all
> possible column types in the {{to_pandas}} conversion by default, as we won't
> be aware of all extension arrays.
> To enable users to create {{ExtensionArray}} instances from Arrow columns in
> the {{to_pandas}} conversion, we should provide a hook in the {{to_pandas}}
> call where they can overload the default conversion routines with the ones
> that produce their {{ExtensionArray}} instances.
> This should avoid additional copies in the case where we would nowadays first
> convert the Arrow column into a default pandas column (probably of object
> type) and the user would afterwards convert it to a more efficient
> {{ExtensionArray}}. This hook will be especially useful when you build
> {{ExtensionArrays}} whose storage is backed by Arrow.
> The meta-issue that tracks the implementation inside of pandas is:
> https://github.com/pandas-dev/pandas/issues/19696
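For discussion's sake, the hook proposed above boils down to a registry of per-type conversion callbacks that a to_pandas-style conversion consults before falling back to its default (object-typed) path. A toy, pure-Python sketch of that shape (every name here is hypothetical, not the API that eventually shipped in pyarrow or pandas):

```python
# Hypothetical sketch: map a type key to a user conversion callback,
# consulted before the default conversion. Registered callbacks get the
# raw column values and return whatever column representation they like.
_converters = {}

def register_converter(type_name, func):
    """Register a user hook for columns of the given (string) type key."""
    _converters[type_name] = func

def to_pandas_column(type_name, values):
    # User hook wins; otherwise fall back to the default object conversion,
    # modeled here as a plain list() copy.
    convert = _converters.get(type_name, list)
    return convert(values)

# A user who owns a custom extension type registers their own conversion:
register_converter("decimal128", lambda vals: [str(v) for v in vals])
```

The point of the hook is exactly what the issue describes: the user callback runs on the Arrow-side data directly, so no intermediate object-typed column is materialized first.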
[jira] [Assigned] (ARROW-1964) [Python] Expose Builder classes
[ https://issues.apache.org/jira/browse/ARROW-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Hagerman reassigned ARROW-1964:
------------------------------------
    Assignee: (was: Alex Hagerman)

> [Python] Expose Builder classes
> -------------------------------
>
>                 Key: ARROW-1964
>                 URL: https://issues.apache.org/jira/browse/ARROW-1964
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Python
>            Reporter: Uwe L. Korn
>            Priority: Major
>              Labels: beginner, pull-request-available
>             Fix For: 1.0.0
>
>          Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> Having the builder classes available from Python would be very helpful.
> Currently the construction of an Arrow array always needs a Python list
> or numpy array as an intermediate. As the builders in combination with
> jemalloc are very efficient at building up non-chunked memory, it would be
> nice to use them directly in certain cases.
> The most useful builders are the
> [StringBuilder|https://github.com/apache/arrow/blob/5030e235047bdffabf6a900dd39b64eeeb96bdc8/cpp/src/arrow/builder.h#L714]
> and
> [DictionaryBuilder|https://github.com/apache/arrow/blob/5030e235047bdffabf6a900dd39b64eeeb96bdc8/cpp/src/arrow/builder.h#L872],
> as they provide functionality to create columns that are not easily
> constructed using NumPy methods in Python.
> The basic approach would be to wrap the C++ classes in
> https://github.com/apache/arrow/blob/master/python/pyarrow/includes/libarrow.pxd
> so that they can be used from Cython. Afterwards, we should start a new file
> {{python/pyarrow/builder.pxi}} where we have classes that take typical Python
> objects like {{str}} and pass them on to the C++ classes. In the end, these
> classes should also return (Python-accessible) {{pyarrow.Array}} instances.
[jira] [Assigned] (ARROW-1964) [Python] Expose Builder classes
[ https://issues.apache.org/jira/browse/ARROW-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Hagerman reassigned ARROW-1964:
------------------------------------
    Assignee: Alex Hagerman

> [Python] Expose Builder classes
> -------------------------------
>
>                 Key: ARROW-1964
>                 URL: https://issues.apache.org/jira/browse/ARROW-1964
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Python
>            Reporter: Uwe L. Korn
>            Assignee: Alex Hagerman
>            Priority: Major
>              Labels: beginner, pull-request-available
>             Fix For: 1.0.0
>
>          Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Having the builder classes available from Python would be very helpful.
> Currently the construction of an Arrow array always needs a Python list
> or numpy array as an intermediate. As the builders in combination with
> jemalloc are very efficient at building up non-chunked memory, it would be
> nice to use them directly in certain cases.
> The most useful builders are the
> [StringBuilder|https://github.com/apache/arrow/blob/5030e235047bdffabf6a900dd39b64eeeb96bdc8/cpp/src/arrow/builder.h#L714]
> and
> [DictionaryBuilder|https://github.com/apache/arrow/blob/5030e235047bdffabf6a900dd39b64eeeb96bdc8/cpp/src/arrow/builder.h#L872],
> as they provide functionality to create columns that are not easily
> constructed using NumPy methods in Python.
> The basic approach would be to wrap the C++ classes in
> https://github.com/apache/arrow/blob/master/python/pyarrow/includes/libarrow.pxd
> so that they can be used from Cython. Afterwards, we should start a new file
> {{python/pyarrow/builder.pxi}} where we have classes that take typical Python
> objects like {{str}} and pass them on to the C++ classes. In the end, these
> classes should also return (Python-accessible) {{pyarrow.Array}} instances.
[jira] [Commented] (ARROW-2339) [Python] Add a fast path for int hashing
[ https://issues.apache.org/jira/browse/ARROW-2339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16437747#comment-16437747 ] Alex Hagerman commented on ARROW-2339: -- Good to know. I'll look at the open tickets and priority to see if there is something else to pick up. Also don't want to hold up things if I can't work on something for a few days. > [Python] Add a fast path for int hashing > > > Key: ARROW-2339 > URL: https://issues.apache.org/jira/browse/ARROW-2339 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Alex Hagerman >Assignee: Alex Hagerman >Priority: Major > Fix For: 0.10.0 > > > Create a __hash__ fast path for Int scalars that avoids using as_py(). > > https://issues.apache.org/jira/browse/ARROW-640 > [https://github.com/apache/arrow/pull/1765/files/4497b69db8039cfeaa7a25f593f3a3e6c7984604] > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2339) [Python] Add a fast path for int hashing
[ https://issues.apache.org/jira/browse/ARROW-2339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16437729#comment-16437729 ] Alex Hagerman commented on ARROW-2339: -- That will be interesting! Got it. Thank you for the direction. > [Python] Add a fast path for int hashing > > > Key: ARROW-2339 > URL: https://issues.apache.org/jira/browse/ARROW-2339 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Alex Hagerman >Assignee: Alex Hagerman >Priority: Major > Fix For: 0.10.0 > > > Create a __hash__ fast path for Int scalars that avoids using as_py(). > > https://issues.apache.org/jira/browse/ARROW-640 > [https://github.com/apache/arrow/pull/1765/files/4497b69db8039cfeaa7a25f593f3a3e6c7984604] > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2339) [Python] Add a fast path for int hashing
[ https://issues.apache.org/jira/browse/ARROW-2339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16437664#comment-16437664 ] Alex Hagerman commented on ARROW-2339: -- [~pitrou] [~wesmckinn] sorry I've been absent on this; work has had me tied up day and night, but I'm hoping to work some more on it over the weekend. I was wondering if you had any thoughts on using xxHash, MurmurHash or FNV-1a for this? I was going to do some timing this weekend, as well as testing for collisions on various ints as you mentioned on the original ticket. Do you know if we can use existing implementations of the hash from C or C++ with wrappers? I didn't know what the ASF rules might be on that with regard to licenses (only ASF or MIT/BSD allowed) and adding the Cython wrappers to PyArrow. If it's better just to do a new implementation I'll work on that too, but I didn't want to reinvent the wheel if I didn't need to. > [Python] Add a fast path for int hashing > > > Key: ARROW-2339 > URL: https://issues.apache.org/jira/browse/ARROW-2339 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Alex Hagerman >Assignee: Alex Hagerman >Priority: Major > Fix For: 0.10.0 > > > Create a __hash__ fast path for Int scalars that avoids using as_py(). > > https://issues.apache.org/jira/browse/ARROW-640 > [https://github.com/apache/arrow/pull/1765/files/4497b69db8039cfeaa7a25f593f3a3e6c7984604] > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
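Of the candidate hash functions named in the comment, FNV-1a is simple enough to sketch in pure Python. This is only an illustration of the algorithm (XOR each byte into the state, then multiply by the FNV prime, truncated to 64 bits); it is not what was ultimately merged into Arrow, and `fnv1a_int64` is an illustrative helper showing how a fixed-width int scalar could be fed to it.

```python
FNV64_OFFSET = 14695981039346656037  # FNV-1a 64-bit offset basis
FNV64_PRIME = 1099511628211          # FNV-1a 64-bit prime


def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a: XOR each byte into the state, then multiply by the prime."""
    h = FNV64_OFFSET
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & 0xFFFFFFFFFFFFFFFF  # truncate to 64 bits
    return h


def fnv1a_int64(value: int) -> int:
    # Hypothetical fast path for an int scalar: hash its fixed-width
    # little-endian encoding instead of converting via as_py().
    return fnv1a_64(value.to_bytes(8, "little", signed=True))
```

A pure-C implementation wrapped in Cython, as the comment suggests, would avoid the per-byte Python loop entirely; the algorithm itself is identical.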
[jira] [Created] (ARROW-2395) [Python] Correct flake8 errors outside of benchmarks
Alex Hagerman created ARROW-2395: Summary: [Python] Correct flake8 errors outside of benchmarks Key: ARROW-2395 URL: https://issues.apache.org/jira/browse/ARROW-2395 Project: Apache Arrow Issue Type: Improvement Reporter: Alex Hagerman Assignee: Alex Hagerman Fix For: 0.10.0 Fix flake8 warnings for files outside of the benchmarks directory. !https://user-images.githubusercontent.com/2118138/38217076-f08a67da-369a-11e8-8166-b3a9ed7d9a60.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2394) [Python] Correct flake8 errors in benchmarks
Alex Hagerman created ARROW-2394: Summary: [Python] Correct flake8 errors in benchmarks Key: ARROW-2394 URL: https://issues.apache.org/jira/browse/ARROW-2394 Project: Apache Arrow Issue Type: Improvement Reporter: Alex Hagerman Assignee: Alex Hagerman Fix For: 0.10.0 Fix linting issues so that flake8 can be run for all files in the Python directory. !https://user-images.githubusercontent.com/2118138/38217076-f08a67da-369a-11e8-8166-b3a9ed7d9a60.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-2325) [Python] Update setup.py to use Markdown project description
[ https://issues.apache.org/jira/browse/ARROW-2325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Hagerman reassigned ARROW-2325: Assignee: Alex Hagerman > [Python] Update setup.py to use Markdown project description > > > Key: ARROW-2325 > URL: https://issues.apache.org/jira/browse/ARROW-2325 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Alex Hagerman >Priority: Major > Fix For: 0.10.0 > > > New stuff in PyPI > https://dustingram.com/articles/2018/03/16/markdown-descriptions-on-pypi -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2339) [Python] Add a fast path for int hashing
Alex Hagerman created ARROW-2339: Summary: [Python] Add a fast path for int hashing Key: ARROW-2339 URL: https://issues.apache.org/jira/browse/ARROW-2339 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Alex Hagerman Assignee: Alex Hagerman Fix For: 0.10.0 Create a __hash__ fast path for Int scalars that avoids using as_py(). https://issues.apache.org/jira/browse/ARROW-640 [https://github.com/apache/arrow/pull/1765/files/4497b69db8039cfeaa7a25f593f3a3e6c7984604] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
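The contract such a fast path must satisfy can be sketched without pyarrow. `IntValue` below is a hypothetical stand-in for `pyarrow.lib.Int64Value`, not the real class: the idea is that `__hash__` hashes the underlying integer directly, skipping the `as_py()` object construction, while matching `hash(int)` so scalars and plain Python ints can coexist in sets and dicts.

```python
class IntValue:
    """Hypothetical stand-in for pyarrow.lib.Int64Value."""

    def __init__(self, value):
        self._value = value

    def as_py(self):
        # Slow path: materialize a full Python object (what __hash__ avoids).
        return int(self._value)

    def __eq__(self, other):
        if isinstance(other, IntValue):
            return self._value == other._value
        return self._value == other

    def __hash__(self):
        # Fast path: hash the underlying value directly, no as_py() round trip.
        # Matching hash(int) preserves the eq/hash contract with plain ints.
        return hash(self._value)


values = [IntValue(1), IntValue(1), IntValue(1), IntValue(2)]
unique = set(values)
```

With this in place, `set(arr)` from the original ARROW-640 report would collapse the duplicates instead of raising `TypeError: unhashable type`.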
[jira] [Commented] (ARROW-640) [Python] Arrow scalar values should have a sensible __hash__ and comparison
[ https://issues.apache.org/jira/browse/ARROW-640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16404189#comment-16404189 ] Alex Hagerman commented on ARROW-640: - I've added the __hash__ for ints and opened a PR. __eq__ was already in place using as_py() in relation to the original ticket. Happy to look into the other types and explore different ways to handle hashing them, as well as any extension of as_py that might be needed, if some direction or new tickets could be provided. Otherwise I'll look at what else is open that I might be able to help with. Timing information is below. import pyarrow as pa arr = pa.array([1,1,2,1]) a = arr[0] %timeit a.__hash__() 265 ns ± 1.72 ns per loop (mean ± std. dev. of 7 runs, 100 loops each) > [Python] Arrow scalar values should have a sensible __hash__ and comparison > --- > > Key: ARROW-640 > URL: https://issues.apache.org/jira/browse/ARROW-640 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Miki Tebeka >Assignee: Alex Hagerman >Priority: Major > Fix For: 0.10.0 > > > {noformat} > In [86]: arr = pa.from_pylist([1, 1, 1, 2]) > In [87]: set(arr) > Out[87]: {1, 2, 1, 1} > In [88]: arr[0] == arr[1] > Out[88]: False > In [89]: arr > Out[89]: > > [ > 1, > 1, > 1, > 2 > ] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
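The `%timeit` call in the comment is IPython-specific; the same measurement can be reproduced with the stdlib `timeit` module. Since pyarrow may not be available, `_TimedValue` here is a hypothetical wrapper standing in for the scalar, used only to contrast hashing directly against hashing via an `as_py()`-style conversion.

```python
import timeit


class _TimedValue:
    """Hypothetical scalar wrapper used only for this timing sketch."""

    def __init__(self, value):
        self._value = value

    def as_py(self):
        # Simulates the conversion step the fast path is meant to skip.
        return int(self._value)

    def __hash__(self):
        return hash(self._value)  # direct fast path


a = _TimedValue(1)
fast = timeit.timeit(lambda: hash(a), number=100_000)
slow = timeit.timeit(lambda: hash(a.as_py()), number=100_000)
print(f"direct __hash__: {fast:.4f}s  via as_py(): {slow:.4f}s")
```

Absolute numbers will differ from the 265 ns figure in the comment, which was measured on the real Cython implementation; only the relative shape of the comparison is meaningful here.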
[jira] [Commented] (ARROW-640) [Python] Arrow scalar values should have a sensible __hash__ and comparison
[ https://issues.apache.org/jira/browse/ARROW-640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16399783#comment-16399783 ] Alex Hagerman commented on ARROW-640: - Sounds good. Just to verify: Integer only, or Number types in general? I've got a deployment happening during the day right now, so I'll hopefully be able to wrap up version one this weekend and do a PR for review. You mentioned that for items like StructValue the as_py fallback won't work; similarly with ListValue, I would expect both of these to raise TypeError: unhashable type, but I'll check the current behavior. Depending on what that is, do you have any thoughts on whether hash() should raise TypeError on mutable types, matching standard Python behavior? I wanted to check so I don't conflict with existing expected behavior, if this has been handled previously, and to look at tying it in with __eq__. > [Python] Arrow scalar values should have a sensible __hash__ and comparison > --- > > Key: ARROW-640 > URL: https://issues.apache.org/jira/browse/ARROW-640 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Miki Tebeka >Assignee: Alex Hagerman >Priority: Major > Fix For: 0.10.0 > > > {noformat} > In [86]: arr = pa.from_pylist([1, 1, 1, 2]) > In [87]: set(arr) > Out[87]: {1, 2, 1, 1} > In [88]: arr[0] == arr[1] > Out[88]: False > In [89]: arr > Out[89]: > > [ > 1, > 1, > 1, > 2 > ] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
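The standard Python behavior the comment refers to is that mutable containers opt out of hashing: defining `__eq__` without `__hash__` already makes a class unhashable, and setting `__hash__ = None` states that intent explicitly. `ListValue` below is a hypothetical list-like scalar, not the pyarrow class.

```python
class ListValue:
    """Hypothetical list-like scalar; mutable, so it opts out of hashing."""

    def __init__(self, items):
        self._items = list(items)

    def __eq__(self, other):
        if isinstance(other, ListValue):
            return self._items == other._items
        return self._items == other

    # Defining __eq__ already resets __hash__ to None; writing it out
    # documents that ListValue is unhashable, like the built-in list.
    __hash__ = None


try:
    hash(ListValue([1, 2]))
    raised = False
except TypeError:
    raised = True
```

This mirrors `hash([1, 2])` raising `TypeError: unhashable type: 'list'`, which is presumably the precedent the comment asks about.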
[jira] [Commented] (ARROW-640) [Python] Arrow scalar values should have a sensible __hash__ and comparison
[ https://issues.apache.org/jira/browse/ARROW-640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16397909#comment-16397909 ] Alex Hagerman commented on ARROW-640: - Thanks [~pitrou] this was actually what I had implemented locally so glad to see I was on the right track. Tonight I was working on doing a little bit of benchmarking and writing the tests. Any specific loads or types you might want to see related to the speed concern? Or is it better to get a consistent hash implementation like this setup in a PR and then worry about speed? > [Python] Arrow scalar values should have a sensible __hash__ and comparison > --- > > Key: ARROW-640 > URL: https://issues.apache.org/jira/browse/ARROW-640 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Miki Tebeka >Assignee: Alex Hagerman >Priority: Major > Fix For: 0.10.0 > > > {noformat} > In [86]: arr = pa.from_pylist([1, 1, 1, 2]) > In [87]: set(arr) > Out[87]: {1, 2, 1, 1} > In [88]: arr[0] == arr[1] > Out[88]: False > In [89]: arr > Out[89]: > > [ > 1, > 1, > 1, > 2 > ] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (ARROW-640) [Python] Arrow scalar values should have a sensible __hash__ and comparison
[ https://issues.apache.org/jira/browse/ARROW-640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16394627#comment-16394627 ] Alex Hagerman edited comment on ARROW-640 at 3/11/18 9:02 PM: -- I think this has changed since the original ticket. The comparison appears to be working. Tested this with string and numbers. Also getting an error on set now. Going to continue looking into this, but if anybody has thoughts on this I'd be happy to hear them. Also from_pylist appears to have been removed, but I didn't find it searching the change log on github only an addition in 0.3. I'm going to look at the history of __eq__ on ArrayValue and as_py then work on what would make sense for __hash__. {code:java} %load_ext Cython import pyarrow as pa pylist = [1,1,1,2] arr = pa.array(pylist) arr [ 1, 1, 1, 2 ] arr[0] == arr[1] True arr[0] == arr[3] False word_list = ['test', 'not the same', 'test', 'nope'] word_list[0] == word_list[2] True word_list[0] == word_list[1] False pa.array.__eq__ set(arr) --- TypeError Traceback (most recent call last) in () > 1 set(arr) TypeError: unhashable type: 'pyarrow.lib.Int64Value' arr_list = pa.from_pylist([1, 1, 1, 2]) --- AttributeErrorTraceback (most recent call last) in () > 1 arr_list = pa.from_pylist([1, 1, 1, 2]) AttributeError: module 'pyarrow' has no attribute 'from_pylist' {code} was (Author: alexhagerman): I think this has changed since the original ticket. The comparison appears to be working. Tested this with string and numbers. Also getting an error on set now. Going to continue looking into this, but if anybody has thoughts on this I'd be happy to hear them. Also from_pylist appears to have been removed, but I didn't find it searching the change log on github only an addition in 0.3. I'm going to look at the history or __eq__ on the ScalarValue and as_py then work on what would make sense for __hash__. 
{code:java} %load_ext Cython import pyarrow as pa pylist = [1,1,1,2] arr = pa.array(pylist) arr [ 1, 1, 1, 2 ] arr[0] == arr[1] True arr[0] == arr[3] False word_list = ['test', 'not the same', 'test', 'nope'] word_list[0] == word_list[2] True word_list[0] == word_list[1] False pa.array.__eq__ set(arr) --- TypeError Traceback (most recent call last) in () > 1 set(arr) TypeError: unhashable type: 'pyarrow.lib.Int64Value' arr_list = pa.from_pylist([1, 1, 1, 2]) --- AttributeErrorTraceback (most recent call last) in () > 1 arr_list = pa.from_pylist([1, 1, 1, 2]) AttributeError: module 'pyarrow' has no attribute 'from_pylist' {code} > [Python] Arrow scalar values should have a sensible __hash__ and comparison > --- > > Key: ARROW-640 > URL: https://issues.apache.org/jira/browse/ARROW-640 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Miki Tebeka >Assignee: Alex Hagerman >Priority: Major > Fix For: 0.10.0 > > > {noformat} > In [86]: arr = pa.from_pylist([1, 1, 1, 2]) > In [87]: set(arr) > Out[87]: {1, 2, 1, 1} > In [88]: arr[0] == arr[1] > Out[88]: False > In [89]: arr > Out[89]: > > [ > 1, > 1, > 1, > 2 > ] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (ARROW-640) [Python] Arrow scalar values should have a sensible __hash__ and comparison
[ https://issues.apache.org/jira/browse/ARROW-640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16394627#comment-16394627 ] Alex Hagerman edited comment on ARROW-640 at 3/11/18 9:01 PM: -- I think this has changed since the original ticket. The comparison appears to be working. Tested this with string and numbers. Also getting an error on set now. Going to continue looking into this, but if anybody has thoughts on this I'd be happy to hear them. Also from_pylist appears to have been removed, but I didn't find it searching the change log on github only an addition in 0.3. I'm going to look at the history or __eq__ on the ScalarValue and as_py then work on what would make sense for __hash__. {code:java} %load_ext Cython import pyarrow as pa pylist = [1,1,1,2] arr = pa.array(pylist) arr [ 1, 1, 1, 2 ] arr[0] == arr[1] True arr[0] == arr[3] False word_list = ['test', 'not the same', 'test', 'nope'] word_list[0] == word_list[2] True word_list[0] == word_list[1] False pa.array.__eq__ set(arr) --- TypeError Traceback (most recent call last) in () > 1 set(arr) TypeError: unhashable type: 'pyarrow.lib.Int64Value' arr_list = pa.from_pylist([1, 1, 1, 2]) --- AttributeErrorTraceback (most recent call last) in () > 1 arr_list = pa.from_pylist([1, 1, 1, 2]) AttributeError: module 'pyarrow' has no attribute 'from_pylist' {code} was (Author: alexhagerman): I think this has changed since the original ticket. The comparison appears to be working. Tested this with string and numbers. Also getting an error on set now. Going to continue looking into this, but if anybody has thoughts on this I'd be happy to hear them. Also from_pylist appears to have been removed, but I didn't find it searching the change log on github only an addition in 0.3. 
{code:java} %load_ext Cython import pyarrow as pa pylist = [1,1,1,2] arr = pa.array(pylist) arr [ 1, 1, 1, 2 ] arr[0] == arr[1] True arr[0] == arr[3] False word_list = ['test', 'not the same', 'test', 'nope'] word_list[0] == word_list[2] True word_list[0] == word_list[1] False pa.array.__eq__ set(arr) --- TypeError Traceback (most recent call last) in () > 1 set(arr) TypeError: unhashable type: 'pyarrow.lib.Int64Value' arr_list = pa.from_pylist([1, 1, 1, 2]) --- AttributeErrorTraceback (most recent call last) in () > 1 arr_list = pa.from_pylist([1, 1, 1, 2]) AttributeError: module 'pyarrow' has no attribute 'from_pylist' {code} > [Python] Arrow scalar values should have a sensible __hash__ and comparison > --- > > Key: ARROW-640 > URL: https://issues.apache.org/jira/browse/ARROW-640 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Miki Tebeka >Assignee: Alex Hagerman >Priority: Major > Fix For: 0.10.0 > > > {noformat} > In [86]: arr = pa.from_pylist([1, 1, 1, 2]) > In [87]: set(arr) > Out[87]: {1, 2, 1, 1} > In [88]: arr[0] == arr[1] > Out[88]: False > In [89]: arr > Out[89]: > > [ > 1, > 1, > 1, > 2 > ] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (ARROW-640) [Python] Arrow scalar values should have a sensible __hash__ and comparison
[ https://issues.apache.org/jira/browse/ARROW-640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16394627#comment-16394627 ] Alex Hagerman edited comment on ARROW-640 at 3/11/18 8:16 PM: -- I think this has changed since the original ticket. The comparison appears to be working. Tested this with string and numbers. Also getting an error on set now. Going to continue looking into this, but if anybody has thoughts on this I'd be happy to hear them. Also from_pylist appears to have been removed, but I didn't find it searching the change log on github only an addition in 0.3. {code:java} %load_ext Cython import pyarrow as pa pylist = [1,1,1,2] arr = pa.array(pylist) arr [ 1, 1, 1, 2 ] arr[0] == arr[1] True arr[0] == arr[3] False word_list = ['test', 'not the same', 'test', 'nope'] word_list[0] == word_list[2] True word_list[0] == word_list[1] False pa.array.__eq__ set(arr) --- TypeError Traceback (most recent call last) in () > 1 set(arr) TypeError: unhashable type: 'pyarrow.lib.Int64Value' arr_list = pa.from_pylist([1, 1, 1, 2]) --- AttributeErrorTraceback (most recent call last) in () > 1 arr_list = pa.from_pylist([1, 1, 1, 2]) AttributeError: module 'pyarrow' has no attribute 'from_pylist' {code} was (Author: alexhagerman): I think this has changed since the original ticket. The comparison appears to be working. Tested this with string and numbers. Also getting an error on set now. Going to continue looking into this, but if anybody has thoughts on this I'd be happy to hear them. Also from_pylist appears to have been removed, but I didn't find it searching the change log on github only an addition in 0.3. 
> [Python] Arrow scalar values should have a sensible __hash__ and comparison > --- > > Key: ARROW-640 > URL: https://issues.apache.org/jira/browse/ARROW-640 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Miki Tebeka >Assignee: Alex Hagerman >Priority: Major > Fix For: 0.10.0 > > > {noformat} > In [86]: arr = pa.from_pylist([1, 1, 1, 2]) > In [87]: set(arr) > Out[87]: {1, 2, 1, 1} > In [88]: arr[0] == arr[1] > Out[88]: False > In [89]: arr > Out[89]: > > [ > 1, > 1, > 1, > 2 > ] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-640) [Python] Arrow scalar values should have a sensible __hash__ and comparison
[ https://issues.apache.org/jira/browse/ARROW-640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16394627#comment-16394627 ] Alex Hagerman commented on ARROW-640: - I think this has changed since the original ticket. The comparison appears to be working. Tested this with string and numbers. Also getting an error on set now. Going to continue looking into this, but if anybody has thoughts on this I'd be happy to hear them. Also from_pylist appears to have been removed, but I didn't find it searching the change log on github only an addition in 0.3. ```python %load_ext Cython ``` ```python import pyarrow as pa pylist = [1,1,1,2] arr = pa.array(pylist) arr ``` [ 1, 1, 1, 2 ] ```python arr[0] == arr[1] ``` True ```python set(arr) ``` --- TypeError Traceback (most recent call last) in () > 1 set(arr) TypeError: unhashable type: 'pyarrow.lib.Int64Value' ```python arr_list = pa.from_pylist([1, 1, 1, 2]) ``` --- AttributeError Traceback (most recent call last) in () > 1 arr_list = pa.from_pylist([1, 1, 1, 2]) AttributeError: module 'pyarrow' has no attribute 'from_pylist' > [Python] Arrow scalar values should have a sensible __hash__ and comparison > --- > > Key: ARROW-640 > URL: https://issues.apache.org/jira/browse/ARROW-640 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Miki Tebeka >Assignee: Alex Hagerman >Priority: Major > Fix For: 0.10.0 > > > {noformat} > In [86]: arr = pa.from_pylist([1, 1, 1, 2]) > In [87]: set(arr) > Out[87]: {1, 2, 1, 1} > In [88]: arr[0] == arr[1] > Out[88]: False > In [89]: arr > Out[89]: > > [ > 1, > 1, > 1, > 2 > ] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (ARROW-640) [Python] Arrow scalar values should have a sensible __hash__ and comparison
[ https://issues.apache.org/jira/browse/ARROW-640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16394627#comment-16394627 ] Alex Hagerman edited comment on ARROW-640 at 3/11/18 8:13 PM: -- I think this has changed since the original ticket. The comparison appears to be working. Tested this with string and numbers. Also getting an error on set now. Going to continue looking into this, but if anybody has thoughts on this I'd be happy to hear them. Also from_pylist appears to have been removed, but I didn't find it searching the change log on github only an addition in 0.3. was (Author: alexhagerman): I think this has changed since the original ticket. The comparison appears to be working. Tested this with string and numbers. Also getting an error on set now. Going to continue looking into this, but if anybody has thoughts on this I'd be happy to hear them. Also from_pylist appears to have been removed, but I didn't find it searching the change log on github only an addition in 0.3. 
```python %load_ext Cython ``` ```python import pyarrow as pa pylist = [1,1,1,2] arr = pa.array(pylist) arr ``` [ 1, 1, 1, 2 ] ```python arr[0] == arr[1] ``` True ```python set(arr) ``` --- TypeError Traceback (most recent call last) in () > 1 set(arr) TypeError: unhashable type: 'pyarrow.lib.Int64Value' ```python arr_list = pa.from_pylist([1, 1, 1, 2]) ``` --- AttributeError Traceback (most recent call last) in () > 1 arr_list = pa.from_pylist([1, 1, 1, 2]) AttributeError: module 'pyarrow' has no attribute 'from_pylist' > [Python] Arrow scalar values should have a sensible __hash__ and comparison > --- > > Key: ARROW-640 > URL: https://issues.apache.org/jira/browse/ARROW-640 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Miki Tebeka >Assignee: Alex Hagerman >Priority: Major > Fix For: 0.10.0 > > > {noformat} > In [86]: arr = pa.from_pylist([1, 1, 1, 2]) > In [87]: set(arr) > Out[87]: {1, 2, 1, 1} > In [88]: arr[0] == arr[1] > Out[88]: False > In [89]: arr > Out[89]: > > [ > 1, > 1, > 1, > 2 > ] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-640) [Python] Arrow scalar values should have a sensible __hash__ and comparison
[ https://issues.apache.org/jira/browse/ARROW-640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Hagerman reassigned ARROW-640: --- Assignee: Alex Hagerman > [Python] Arrow scalar values should have a sensible __hash__ and comparison > --- > > Key: ARROW-640 > URL: https://issues.apache.org/jira/browse/ARROW-640 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Miki Tebeka >Assignee: Alex Hagerman >Priority: Major > Fix For: 0.10.0 > > > {noformat} > In [86]: arr = pa.from_pylist([1, 1, 1, 2]) > In [87]: set(arr) > Out[87]: {1, 2, 1, 1} > In [88]: arr[0] == arr[1] > Out[88]: False > In [89]: arr > Out[89]: > > [ > 1, > 1, > 1, > 2 > ] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1391) [Python] Benchmarks for python serialization
[ https://issues.apache.org/jira/browse/ARROW-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16383033#comment-16383033 ] Alex Hagerman commented on ARROW-1391: -- I see recent commits in the repo for the benchmarks. Is this still needed? If so, any guidance on where the nightly location might be, or how to look into this? > [Python] Benchmarks for python serialization > > > Key: ARROW-1391 > URL: https://issues.apache.org/jira/browse/ARROW-1391 > Project: Apache Arrow > Issue Type: Wish >Reporter: Philipp Moritz >Priority: Minor > > It would be great to have a suite of relevant benchmarks for the Python > serialization code in ARROW-759. These could be used to guide profiling and > performance improvements. > Relevant use cases include: > - dictionaries of large numpy arrays that are used to represent weights of a > neural network > - long lists of primitive types like ints, floats or strings > - lists of user defined python objects -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2242) [Python] ParquetFile.read does not accommodate large binary data
[ https://issues.apache.org/jira/browse/ARROW-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382733#comment-16382733 ] Alex Hagerman commented on ARROW-2242: -- I think these may be related? https://github.com/apache/arrow/issues/1677 > [Python] ParquetFile.read does not accommodate large binary data > - > > Key: ARROW-2242 > URL: https://issues.apache.org/jira/browse/ARROW-2242 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 >Reporter: Chris Ellison >Priority: Major > Fix For: 0.9.0 > > > When reading a parquet file with binary data > 2 GiB, we get an ArrowIOError > due to it not creating chunked arrays. Reading each row group individually > and then concatenating the tables works, however. > > {code:java} > import pandas as pd > import pyarrow as pa > import pyarrow.parquet as pq > x = pa.array(list('1' * 2**30)) > demo = 'demo.parquet' > def scenario(): > t = pa.Table.from_arrays([x], ['x']) > writer = pq.ParquetWriter(demo, t.schema) > for i in range(2): > writer.write_table(t) > writer.close() > pf = pq.ParquetFile(demo) > # pyarrow.lib.ArrowIOError: Arrow error: Invalid: BinaryArray cannot > contain more than 2147483646 bytes, have 2147483647 > t2 = pf.read() > # Works, but note, there are 32 row groups, not 2 as suggested by: > # > https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing > tables = [pf.read_row_group(i) for i in range(pf.num_row_groups)] > t3 = pa.concat_tables(tables) > scenario() > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
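The workaround in the ticket reads each row group separately and concatenates the resulting tables. The underlying idea, splitting a payload that would exceed a per-array byte cap into chunks under that cap, can be illustrated with plain Python. `chunk_bytes` is an illustrative helper, not a pyarrow API, and the 2**31 - 2 byte `BinaryArray` limit is scaled down here so the sketch runs instantly.

```python
def chunk_bytes(payload, max_chunk):
    """Split payload into consecutive pieces no larger than max_chunk bytes."""
    if max_chunk <= 0:
        raise ValueError("max_chunk must be positive")
    return [payload[i:i + max_chunk] for i in range(0, len(payload), max_chunk)]


# Scaled-down stand-in for the BinaryArray cap of 2147483646 bytes.
MAX_BYTES = 16
data = b"1" * 40
chunks = chunk_bytes(data, MAX_BYTES)
```

A chunked array built this way preserves the full data while keeping every individual buffer under the limit, which is why the per-row-group read in the ticket succeeds where the single `pf.read()` fails.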