[jira] [Commented] (ARROW-1715) [Python] Implement pickling for Column, ChunkedArray, RecordBatch, Table
[ https://issues.apache.org/jira/browse/ARROW-1715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16535982#comment-16535982 ]

Dave Hirschfeld commented on ARROW-1715:

This has come up in the context of dask.distributed also: https://github.com/dask/distributed/issues/2103

> [Python] Implement pickling for Column, ChunkedArray, RecordBatch, Table
>
> Key: ARROW-1715
> URL: https://issues.apache.org/jira/browse/ARROW-1715
> Project: Apache Arrow
> Issue Type: New Feature
> Components: Python
> Reporter: Wes McKinney
> Priority: Major
> Labels: beginner
> Fix For: 0.11.0
>
> At the moment the types {{pyarrow.Column/ChunkedArray/RecordBatch/Table}}
> cannot be pickled. Although it may not be the fastest way to transport them
> from one process to another, it is a very convenient one. We should
> implement {{__reduce__()}} for all of them.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
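The `__reduce__()` protocol the issue asks for can be sketched with a plain Python class standing in for pyarrow's types (the real implementation would rebuild from Arrow buffers, not Python lists; `ChunkedThing` is purely illustrative):

```python
import pickle

class ChunkedThing:
    """Toy stand-in for a type whose state is a list of chunks."""
    def __init__(self, chunks):
        self.chunks = list(chunks)

    def __reduce__(self):
        # pickle stores (callable, args); unpickling then calls
        # ChunkedThing(self.chunks) to reconstruct the object, so no
        # __getstate__/__setstate__ support is needed.
        return (ChunkedThing, (self.chunks,))

original = ChunkedThing([[1, 2], [3]])
restored = pickle.loads(pickle.dumps(original))
assert restored.chunks == [[1, 2], [3]]
```

Returning `(callable, args)` is the simplest form of the protocol and is enough for immutable, buffer-backed containers like these.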
[jira] [Updated] (ARROW-2811) [Python] Test serialization for determinism
[ https://issues.apache.org/jira/browse/ARROW-2811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-2811:
Labels: pull-request-available (was: )

> [Python] Test serialization for determinism
>
> Key: ARROW-2811
> URL: https://issues.apache.org/jira/browse/ARROW-2811
> Project: Apache Arrow
> Issue Type: Improvement
> Reporter: Philipp Moritz
> Priority: Major
> Labels: pull-request-available
>
> See discussion in https://github.com/apache/arrow/pull/2216
[jira] [Created] (ARROW-2811) [Python] Test serialization for determinism
Philipp Moritz created ARROW-2811:

Summary: [Python] Test serialization for determinism
Key: ARROW-2811
URL: https://issues.apache.org/jira/browse/ARROW-2811
Project: Apache Arrow
Issue Type: Improvement
Reporter: Philipp Moritz

See discussion in https://github.com/apache/arrow/pull/2216
[jira] [Created] (ARROW-2810) [Plasma] Plasma public headers leak flatbuffers.h
Wes McKinney created ARROW-2810:

Summary: [Plasma] Plasma public headers leak flatbuffers.h
Key: ARROW-2810
URL: https://issues.apache.org/jira/browse/ARROW-2810
Project: Apache Arrow
Issue Type: Bug
Components: Plasma (C++)
Reporter: Wes McKinney

In general, it is not a good idea to expose your transitive dependencies in public headers if you can avoid it. I discovered this while working on ARROW-1722, so I'm opening an issue.
[jira] [Resolved] (ARROW-2802) [Docs] Move release management guide to project wiki
[ https://issues.apache.org/jira/browse/ARROW-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kouhei Sutou resolved ARROW-2802.
Resolution: Fixed

Issue resolved by pull request 2226
[https://github.com/apache/arrow/pull/2226]

> [Docs] Move release management guide to project wiki
>
> Key: ARROW-2802
> URL: https://issues.apache.org/jira/browse/ARROW-2802
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Wiki
> Reporter: Wes McKinney
> Assignee: Wes McKinney
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.10.0
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> I have begun doing this here:
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide.
> I think we should remove RELEASE_MANAGEMENT.md and add a note to
> dev/release/README.md to navigate to the Confluence page.
[jira] [Updated] (ARROW-2809) [C++] Decrease verbosity of lint checks in Travis CI
[ https://issues.apache.org/jira/browse/ARROW-2809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-2809:
Labels: pull-request-available (was: )

> [C++] Decrease verbosity of lint checks in Travis CI
>
> Key: ARROW-2809
> URL: https://issues.apache.org/jira/browse/ARROW-2809
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Wes McKinney
> Assignee: Wes McKinney
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.10.0
[jira] [Created] (ARROW-2809) [C++] Decrease verbosity of lint checks in Travis CI
Wes McKinney created ARROW-2809:

Summary: [C++] Decrease verbosity of lint checks in Travis CI
Key: ARROW-2809
URL: https://issues.apache.org/jira/browse/ARROW-2809
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney
Fix For: 0.10.0
[jira] [Updated] (ARROW-2601) [Python] MemoryPool bytes_allocated causes seg
[ https://issues.apache.org/jira/browse/ARROW-2601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-2601:
Labels: pull-request-available (was: )

> [Python] MemoryPool bytes_allocated causes seg
>
> Key: ARROW-2601
> URL: https://issues.apache.org/jira/browse/ARROW-2601
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.9.0
> Reporter: Alex Hagerman
> Assignee: Wes McKinney
> Priority: Minor
> Labels: pull-request-available
> Fix For: 0.10.0
>
> {code}
> Python 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 18:21:58)
> [GCC 7.2.0] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import pyarrow as pa
> >>> mp = pa.MemoryPool()
> >>> arr = pa.array([1,2,3], memory_pool=mp)
> >>> mp.bytes_allocated()
> Segmentation fault (core dumped)
> {code}
> I'll dig into this further, but should bytes_allocated be returning anything
> when called like this? Or should it raise NotImplemented?
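The fix the report hints at can be sketched in plain Python: a wrapper whose native handle was never initialized should raise cleanly rather than dereference a null pointer. The class and method names below mirror the discussion but are illustrative, not pyarrow's actual implementation:

```python
class SafeMemoryPool:
    """Sketch: guard an uninitialized native handle (hypothetical names)."""
    def __init__(self, native_pool=None):
        self._pool = native_pool  # None unless wired to a real allocator

    def bytes_allocated(self):
        if self._pool is None:
            # Raise instead of segfaulting on a null native handle.
            raise NotImplementedError(
                "this pool has no backing allocator; use the default pool")
        return self._pool.bytes_allocated()

pool = SafeMemoryPool()
try:
    pool.bytes_allocated()
except NotImplementedError:
    pass  # raised cleanly instead of crashing
```

The design question in the report then reduces to choosing between this guard and simply forbidding direct construction of an unbacked pool.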
[jira] [Created] (ARROW-2808) [Python] Add unit tests for ProxyMemoryPool, enable new default MemoryPool to be constructed
Wes McKinney created ARROW-2808:

Summary: [Python] Add unit tests for ProxyMemoryPool, enable new default MemoryPool to be constructed
Key: ARROW-2808
URL: https://issues.apache.org/jira/browse/ARROW-2808
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Reporter: Wes McKinney
Fix For: 0.11.0

I could not find unit tests for ProxyMemoryPool.
[jira] [Updated] (ARROW-2784) [C++] MemoryMappedFile::WriteAt allow writing past the end
[ https://issues.apache.org/jira/browse/ARROW-2784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-2784:
Fix Version/s: 0.10.0

> [C++] MemoryMappedFile::WriteAt allow writing past the end
>
> Key: ARROW-2784
> URL: https://issues.apache.org/jira/browse/ARROW-2784
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Affects Versions: 0.9.0
> Reporter: Dimitri Vorona
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.10.0
> Time Spent: 1h 40m
> Remaining Estimate: 0h
>
> There is a missing check in WriteAt; this PR adds it.
[jira] [Assigned] (ARROW-2784) [C++] MemoryMappedFile::WriteAt allow writing past the end
[ https://issues.apache.org/jira/browse/ARROW-2784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney reassigned ARROW-2784:
Assignee: Dimitri Vorona

> [C++] MemoryMappedFile::WriteAt allow writing past the end
>
> Key: ARROW-2784
> URL: https://issues.apache.org/jira/browse/ARROW-2784
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Affects Versions: 0.9.0
> Reporter: Dimitri Vorona
> Assignee: Dimitri Vorona
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.10.0
[jira] [Updated] (ARROW-2553) [C++] Set MACOSX_DEPLOYMENT_TARGET in wheel build
[ https://issues.apache.org/jira/browse/ARROW-2553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-2553:
Component/s: (was: C++) Python

> [C++] Set MACOSX_DEPLOYMENT_TARGET in wheel build
>
> Key: ARROW-2553
> URL: https://issues.apache.org/jira/browse/ARROW-2553
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Packaging, Python
> Reporter: Uwe L. Korn
> Priority: Blocker
> Fix For: 0.10.0
>
> The current `pyarrow` wheels are not usable on older OSX releases due to a
> problem in the newest Xcode SDK. We need to set {{MACOSX_DEPLOYMENT_TARGET}}
> to an older OSX release to avoid getting {{Symbol not found:
> _os_unfair_lock_lock}}.
[jira] [Updated] (ARROW-2553) [Python] Set MACOSX_DEPLOYMENT_TARGET in wheel build
[ https://issues.apache.org/jira/browse/ARROW-2553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-2553:
Summary: [Python] Set MACOSX_DEPLOYMENT_TARGET in wheel build (was: [C++] Set MACOSX_DEPLOYMENT_TARGET in wheel build)

> [Python] Set MACOSX_DEPLOYMENT_TARGET in wheel build
>
> Key: ARROW-2553
> URL: https://issues.apache.org/jira/browse/ARROW-2553
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Packaging, Python
> Reporter: Uwe L. Korn
> Priority: Blocker
> Fix For: 0.10.0
[jira] [Assigned] (ARROW-2300) [Python] python/testing/test_hdfs.sh no longer works
[ https://issues.apache.org/jira/browse/ARROW-2300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney reassigned ARROW-2300:
Assignee: Krisztian Szucs

> [Python] python/testing/test_hdfs.sh no longer works
>
> Key: ARROW-2300
> URL: https://issues.apache.org/jira/browse/ARROW-2300
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Reporter: Wes McKinney
> Assignee: Krisztian Szucs
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 0.10.0
> Time Spent: 1h 50m
> Remaining Estimate: 0h
>
> Tried this on a fresh Ubuntu 16.04 install:
> {code}
> $ ./test_hdfs.sh
> + docker build -t arrow-hdfs-test -f hdfs/Dockerfile .
> Sending build context to Docker daemon 36.86kB
> Step 1/6 : FROM cpcloud86/impala:metastore
> manifest for cpcloud86/impala:metastore not found
> {code}
[jira] [Updated] (ARROW-2802) [Docs] Move release management guide to project wiki
[ https://issues.apache.org/jira/browse/ARROW-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-2802:
Labels: pull-request-available (was: )

> [Docs] Move release management guide to project wiki
>
> Key: ARROW-2802
> URL: https://issues.apache.org/jira/browse/ARROW-2802
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Wiki
> Reporter: Wes McKinney
> Assignee: Wes McKinney
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.10.0
[jira] [Assigned] (ARROW-2802) [Docs] Move release management guide to project wiki
[ https://issues.apache.org/jira/browse/ARROW-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney reassigned ARROW-2802:
Assignee: Wes McKinney

> [Docs] Move release management guide to project wiki
>
> Key: ARROW-2802
> URL: https://issues.apache.org/jira/browse/ARROW-2802
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Wiki
> Reporter: Wes McKinney
> Assignee: Wes McKinney
> Priority: Major
> Fix For: 0.10.0
[jira] [Updated] (ARROW-2656) [Python] Improve ParquetManifest creation time
[ https://issues.apache.org/jira/browse/ARROW-2656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-2656:
Fix Version/s: 0.10.0

> [Python] Improve ParquetManifest creation time
>
> Key: ARROW-2656
> URL: https://issues.apache.org/jira/browse/ARROW-2656
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Robert Gruener
> Assignee: Robert Gruener
> Priority: Major
> Labels: parquet, pull-request-available
> Fix For: 0.10.0
> Time Spent: 1h 20m
> Remaining Estimate: 0h
>
> When a parquet dataset is highly partitioned, the call to the
> [ParquetManifest|https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L588]
> constructor takes a significant amount of time, since it serially visits
> directories to find all parquet files. In a dataset with thousands of
> partition values this can take several minutes on a personal laptop.
> A quick win to vastly improve this performance would be to use a ThreadPool
> so that calls to {{_visit_level}} happen concurrently, instead of wasting a
> ton of time waiting on I/O.
> An even faster option would be optional indexing of dataset metadata in
> something like the {{common_metadata}}. This could contain all files in the
> manifest and their row_group information. It would also allow
> [split_row_groups|https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L746]
> to be implemented efficiently, without needing to open every parquet file in
> the dataset to retrieve the metadata, which is quite time-consuming for
> large datasets. The main problem with the indexing approach is that it
> requires immutability of the dataset, which doesn't seem too unreasonable.
> This specific implementation seems related to
> https://issues.apache.org/jira/browse/ARROW-1983, however that only covers
> the write portion.
[jira] [Updated] (ARROW-2656) [Python] Improve ParquetManifest creation time
[ https://issues.apache.org/jira/browse/ARROW-2656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-2656:
Labels: parquet pull-request-available (was: pull-request-available)

> [Python] Improve ParquetManifest creation time
>
> Key: ARROW-2656
> URL: https://issues.apache.org/jira/browse/ARROW-2656
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Robert Gruener
> Assignee: Robert Gruener
> Priority: Major
> Labels: parquet, pull-request-available
> Fix For: 0.10.0
[jira] [Commented] (ARROW-2654) [Python] Error with errno 22 when loading 3.6 GB Parquet file
[ https://issues.apache.org/jira/browse/ARROW-2654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16535899#comment-16535899 ]

Wes McKinney commented on ARROW-2654:

[~andyreagan] Where is the data stored? The error suggests that the {{mmap}} call failed, but without more detail it's hard for me to tell. Can you please test using the appropriate wheel for your platform from https://github.com/kszucs/crossbow/releases/tag/build-163 ?

I'm moving this issue off the 0.10.0 milestone for now. I noticed that it's not possible to disable memory mapping when reading Parquet files; opened ARROW-2807.

> [Python] Error with errno 22 when loading 3.6 GB Parquet file
>
> Key: ARROW-2654
> URL: https://issues.apache.org/jira/browse/ARROW-2654
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.9.0
> Reporter: Andy Reagan
> Priority: Major
> Labels: parquet
> Fix For: 0.11.0
>
> I saved a file using the pandas to_parquet method, but can't read it back in.
> Here's the full stack trace:
> {code:java}
> Traceback (most recent call last):
>   File "src/data/CLXP_pull.py", line 214, in <module>
>     main()
>   File "/Users/mm51929/projects/2018/03-advisor-recruiting/pyenv/lib/python3.6/site-packages/click/core.py", line 722, in __call__
>     return self.main(*args, **kwargs)
>   File "/Users/mm51929/projects/2018/03-advisor-recruiting/pyenv/lib/python3.6/site-packages/click/core.py", line 697, in main
>     rv = self.invoke(ctx)
>   File "/Users/mm51929/projects/2018/03-advisor-recruiting/pyenv/lib/python3.6/site-packages/click/core.py", line 895, in invoke
>     return ctx.invoke(self.callback, **ctx.params)
>   File "/Users/mm51929/projects/2018/03-advisor-recruiting/pyenv/lib/python3.6/site-packages/click/core.py", line 535, in invoke
>     return callback(*args, **kwargs)
>   File "src/data/CLXP_pull.py", line 188, in main
>     results[fullname] = pd.read_parquet(os.path.join(project_dir, "data", "raw", fullname+".parquet"), engine="pyarrow")
>   File "/Users/mm51929/projects/2018/03-advisor-recruiting/pyenv/lib/python3.6/site-packages/pandas/io/parquet.py", line 257, in read_parquet
>     return impl.read(path, columns=columns, **kwargs)
>   File "/Users/mm51929/projects/2018/03-advisor-recruiting/pyenv/lib/python3.6/site-packages/pandas/io/parquet.py", line 130, in read
>     **kwargs).to_pandas()
>   File "/Users/mm51929/projects/2018/03-advisor-recruiting/pyenv/lib/python3.6/site-packages/pyarrow/parquet.py", line 939, in read_table
>     pf = ParquetFile(source, metadata=metadata)
>   File "/Users/mm51929/projects/2018/03-advisor-recruiting/pyenv/lib/python3.6/site-packages/pyarrow/parquet.py", line 64, in __init__
>     self.reader.open(source, metadata=metadata)
>   File "_parquet.pyx", line 651, in pyarrow._parquet.ParquetReader.open
>   File "error.pxi", line 79, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: Arrow error: IOError: [Errno 22] Invalid argument
> {code}
> Any ideas what could cause this? The file itself is 3.6GB.
> I'm running pandas==0.22.0.
[jira] [Created] (ARROW-2807) [Python] Enable memory-mapping to be toggled in get_reader when reading Parquet files
Wes McKinney created ARROW-2807:

Summary: [Python] Enable memory-mapping to be toggled in get_reader when reading Parquet files
Key: ARROW-2807
URL: https://issues.apache.org/jira/browse/ARROW-2807
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Reporter: Wes McKinney

See relevant discussion in ARROW-2654.
[jira] [Updated] (ARROW-1722) [C++] Add linting script to look for C++/CLI issues
[ https://issues.apache.org/jira/browse/ARROW-1722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-1722:
Labels: pull-request-available (was: )

> [C++] Add linting script to look for C++/CLI issues
>
> Key: ARROW-1722
> URL: https://issues.apache.org/jira/browse/ARROW-1722
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Wes McKinney
> Assignee: Wes McKinney
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.10.0
>
> This includes:
> * Using {{nullptr}} in header files (we must instead use an appropriate
>   macro to use {{__nullptr}} when the host compiler is C++/CLI)
> * Including {{}} in a public header (e.g. header files without "impl"
>   or "internal" in their name)
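A lint pass of the kind described above can be sketched as a simple per-line scan over header text. Note the issue text elides the specific forbidden include; `<mutex>` below is purely a hypothetical example of a leaked standard header, and the whole function is illustrative, not Arrow's actual lint script:

```python
import re

def lint_public_header(text):
    """Return (line_number, message) pairs for C++/CLI-unsafe patterns."""
    problems = []
    for lineno, line in enumerate(text.splitlines(), 1):
        # nullptr must be replaced with a macro expanding to __nullptr
        # when the host compiler is C++/CLI.
        if re.search(r"\bnullptr\b", line):
            problems.append((lineno, "uses nullptr (breaks C++/CLI hosts)"))
        # Hypothetical example of a standard header a public header
        # should not pull in.
        if re.search(r"#include\s*<mutex>", line):
            problems.append((lineno, "includes <mutex> in a public header"))
    return problems

header = "#pragma once\n#include <mutex>\nvoid* p = nullptr;\n"
assert [ln for ln, _ in lint_public_header(header)] == [2, 3]
```

In practice such a check would be wired to run only on public headers (those without "impl" or "internal" in their name), as the issue describes.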
[jira] [Updated] (ARROW-2673) [Python] Add documentation + docstring for ARROW-2661
[ https://issues.apache.org/jira/browse/ARROW-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-2673:
Fix Version/s: (was: 0.10.0) 0.11.0

> [Python] Add documentation + docstring for ARROW-2661
>
> Key: ARROW-2673
> URL: https://issues.apache.org/jira/browse/ARROW-2673
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Wes McKinney
> Priority: Major
> Fix For: 0.11.0
[jira] [Updated] (ARROW-1425) [Python] Document semantic differences between Spark timestamps and Arrow timestamps
[ https://issues.apache.org/jira/browse/ARROW-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-1425:
Fix Version/s: (was: 0.10.0) 0.11.0

> [Python] Document semantic differences between Spark timestamps and Arrow timestamps
>
> Key: ARROW-1425
> URL: https://issues.apache.org/jira/browse/ARROW-1425
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Wes McKinney
> Assignee: Li Jin
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.11.0
>
> The way that Spark treats non-timezone-aware timestamps as session local can
> be problematic when using pyarrow, which may view the data coming from
> toPandas() as time-zone-naive (but with fields as though it were UTC, not
> session local). We should document carefully how to properly handle the data
> coming from Spark to avoid problems.
> cc [~bryanc] [~holdenkarau]
[jira] [Commented] (ARROW-2806) [Python] Inconsistent handling of np.nan
[ https://issues.apache.org/jira/browse/ARROW-2806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16535873#comment-16535873 ]

Wes McKinney commented on ARROW-2806:

And {{pa.array([1., NaN])}} should preserve the NaN.

> [Python] Inconsistent handling of np.nan
>
> Key: ARROW-2806
> URL: https://issues.apache.org/jira/browse/ARROW-2806
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.9.0
> Reporter: Uwe L. Korn
> Priority: Major
> Fix For: 0.10.0
>
> Currently we handle {{np.nan}} differently depending on whether a list or a
> numpy array is the input to {{pa.array()}}:
> {code}
> >>> pa.array(np.array([1, np.nan]))
> [
>   1.0,
>   nan
> ]
> >>> pa.array([1., np.nan])
> [
>   1.0,
>   NA
> ]
> {code}
> I would actually think the last one is the correct one, especially once one
> casts this to an integer column: there the first one produces a column with
> INT_MIN and the second one produces a real null.
> But in {{test_array_conversions_no_sentinel_values}} we check that
> {{np.nan}} does not produce a null.
> Even weirder:
> {code}
> >>> df = pd.DataFrame({'a': [1., None]})
> >>> df
>      a
> 0  1.0
> 1  NaN
> >>> pa.Table.from_pandas(df).column(0)
> chunk 0:
> [
>   1.0,
>   NA
> ]
> {code}
[jira] [Commented] (ARROW-2806) [Python] Inconsistent handling of np.nan
[ https://issues.apache.org/jira/browse/ARROW-2806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16535872#comment-16535872 ]

Wes McKinney commented on ARROW-2806:

Oof. I actually think {{pa.array([1, NaN])}} should either raise an exception or return a DoubleArray with a NaN, unless {{from_pandas=True}}.

> [Python] Inconsistent handling of np.nan
>
> Key: ARROW-2806
> URL: https://issues.apache.org/jira/browse/ARROW-2806
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.9.0
> Reporter: Uwe L. Korn
> Priority: Major
> Fix For: 0.10.0
[jira] [Commented] (ARROW-2806) [Python] Inconsistent handling of np.nan
[ https://issues.apache.org/jira/browse/ARROW-2806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16535803#comment-16535803 ]

Uwe L. Korn commented on ARROW-2806:

[~wesmckinn] Would it be OK with you to change the test so that {{np.nan}} is always a null indicator for Arrow?

> [Python] Inconsistent handling of np.nan
>
> Key: ARROW-2806
> URL: https://issues.apache.org/jira/browse/ARROW-2806
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.9.0
> Reporter: Uwe L. Korn
> Priority: Major
> Fix For: 0.10.0
[jira] [Created] (ARROW-2806) [Python] Inconsistent handling of np.nan
Uwe L. Korn created ARROW-2806:

Summary: [Python] Inconsistent handling of np.nan
Key: ARROW-2806
URL: https://issues.apache.org/jira/browse/ARROW-2806
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.9.0
Reporter: Uwe L. Korn
Fix For: 0.10.0

Currently we handle {{np.nan}} differently depending on whether a list or a numpy array is the input to {{pa.array()}}:

{code}
>>> pa.array(np.array([1, np.nan]))
[
  1.0,
  nan
]
>>> pa.array([1., np.nan])
[
  1.0,
  NA
]
{code}

I would actually think the last one is the correct one, especially once one casts this to an integer column: there the first one produces a column with INT_MIN and the second one produces a real null. But in {{test_array_conversions_no_sentinel_values}} we check that {{np.nan}} does not produce a null.

Even weirder:

{code}
>>> df = pd.DataFrame({'a': [1., None]})
>>> df
     a
0  1.0
1  NaN
>>> pa.Table.from_pandas(df).column(0)
chunk 0:
[
  1.0,
  NA
]
{code}
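The normalization Uwe proposes (treat a float NaN as a null sentinel regardless of whether the input arrived as a list or an ndarray) can be illustrated in plain Python; pyarrow would do the equivalent during conversion, and `nan_to_null` is an illustrative name, not a pyarrow function:

```python
import math

def nan_to_null(values):
    """Map float NaN to None so both input forms yield the same nulls."""
    return [None if isinstance(v, float) and math.isnan(v) else v
            for v in values]

assert nan_to_null([1.0, float("nan")]) == [1.0, None]
```

Under this rule, casting the column to an integer type would produce a genuine null rather than an INT_MIN sentinel, which is the inconsistency the issue reports.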
[jira] [Resolved] (ARROW-2634) [Go] Add LICENSE additions for Go subproject
[ https://issues.apache.org/jira/browse/ARROW-2634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney resolved ARROW-2634.
Resolution: Fixed

Issue resolved by pull request 2221
[https://github.com/apache/arrow/pull/2221]

> [Go] Add LICENSE additions for Go subproject
>
> Key: ARROW-2634
> URL: https://issues.apache.org/jira/browse/ARROW-2634
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Go
> Reporter: Wes McKinney
> Assignee: Wes McKinney
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 0.10.0
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> The Arrow Go codebase contains code from the Go project. This needs to be
> mentioned in the main LICENSE.txt.
[jira] [Updated] (ARROW-2805) [Python] TensorFlow import workaround not working with tensorflow-gpu if CUDA is not installed
[ https://issues.apache.org/jira/browse/ARROW-2805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-2805:
Fix Version/s: (was: JS-0.4.0) 0.10.0

> [Python] TensorFlow import workaround not working with tensorflow-gpu if CUDA
> is not installed
>
> Key: ARROW-2805
> URL: https://issues.apache.org/jira/browse/ARROW-2805
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Philipp Moritz
> Priority: Major
> Labels: pull-request-available, tensorflow
> Fix For: 0.10.0
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> TensorFlow version: 1.7 (GPU-enabled, but CUDA is not installed);
> tensorflow-gpu was installed via pip.
> ```
> import ray
>   File "/home/eric/Desktop/ray-private/python/ray/__init__.py", line 28, in <module>
>     import pyarrow  # noqa: F401
>   File "/home/eric/Desktop/ray-private/python/ray/pyarrow_files/pyarrow/__init__.py", line 55, in <module>
>     compat.import_tensorflow_extension()
>   File "/home/eric/Desktop/ray-private/python/ray/pyarrow_files/pyarrow/compat.py", line 193, in import_tensorflow_extension
>     ctypes.CDLL(ext)
>   File "/usr/lib/python3.5/ctypes/__init__.py", line 347, in __init__
>     self._handle = _dlopen(self._name, mode)
> OSError: libcublas.so.9.0: cannot open shared object file: No such file or
> directory
> ```
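The guard this issue implies can be sketched with stdlib `ctypes`: attempt to dlopen each extension candidate and skip any whose own dependencies (such as CUDA's libcublas) are missing, rather than letting the `OSError` abort the whole import. This is an illustrative sketch, not pyarrow's actual `compat.import_tensorflow_extension`:

```python
import ctypes

def try_load_extensions(candidates):
    """Load what we can; silently skip libraries that fail to dlopen."""
    loaded = []
    for path in candidates:
        try:
            loaded.append(ctypes.CDLL(path))
        except OSError:
            continue  # missing library or unresolved dependency: ignore
    return loaded

# A nonexistent library is skipped instead of raising:
assert try_load_extensions(["libdoes_not_exist_hypothetical.so"]) == []
```

Swallowing `OSError` here is deliberate: the workaround exists only to pre-load symbols when they are available, so a failed load should degrade gracefully instead of breaking `import pyarrow` for CPU-only machines.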