[jira] [Commented] (ARROW-1715) [Python] Implement pickling for Column, ChunkedArray, RecordBatch, Table

2018-07-07 Thread Dave Hirschfeld (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535982#comment-16535982
 ] 

Dave Hirschfeld commented on ARROW-1715:


This has also come up in the context of dask.distributed:
https://github.com/dask/distributed/issues/2103

> [Python] Implement pickling for Column, ChunkedArray, RecordBatch, Table
> 
>
> Key: ARROW-1715
> URL: https://issues.apache.org/jira/browse/ARROW-1715
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: beginner
> Fix For: 0.11.0
>
>
> At the moment the types {{pyarrow.Column/ChunkedArray/RecordBatch/Table}} 
> cannot be pickled. Although pickling may not be the fastest way to transport 
> them from one process to another, it is a very convenient one. We should 
> implement {{__reduce__()}} for all of them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2811) [Python] Test serialization for determinism

2018-07-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2811:
--
Labels: pull-request-available  (was: )

> [Python] Test serialization for determinism
> ---
>
> Key: ARROW-2811
> URL: https://issues.apache.org/jira/browse/ARROW-2811
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Philipp Moritz
>Priority: Major
>  Labels: pull-request-available
>
> See the discussion in https://github.com/apache/arrow/pull/2216





[jira] [Created] (ARROW-2811) [Python] Test serialization for determinism

2018-07-07 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-2811:
-

 Summary: [Python] Test serialization for determinism
 Key: ARROW-2811
 URL: https://issues.apache.org/jira/browse/ARROW-2811
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


See the discussion in https://github.com/apache/arrow/pull/2216





[jira] [Created] (ARROW-2810) [Plasma] Plasma public headers leak flatbuffers.h

2018-07-07 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2810:
---

 Summary: [Plasma] Plasma public headers leak flatbuffers.h
 Key: ARROW-2810
 URL: https://issues.apache.org/jira/browse/ARROW-2810
 Project: Apache Arrow
  Issue Type: Bug
  Components: Plasma (C++)
Reporter: Wes McKinney


In general, it is not a good idea to expose your transitive dependencies in 
public headers if you can avoid it. I discovered this while working on 
ARROW-1722, so I'm opening an issue.





[jira] [Resolved] (ARROW-2802) [Docs] Move release management guide to project wiki

2018-07-07 Thread Kouhei Sutou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-2802.
-
Resolution: Fixed

Issue resolved by pull request 2226
[https://github.com/apache/arrow/pull/2226]

> [Docs] Move release management guide to project wiki
> 
>
> Key: ARROW-2802
> URL: https://issues.apache.org/jira/browse/ARROW-2802
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Wiki
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> I have begun doing this here 
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide. I 
> think we should remove RELEASE_MANAGEMENT.md and add a note to 
> dev/release/README.md to navigate to the Confluence page.





[jira] [Updated] (ARROW-2809) [C++] Decrease verbosity of lint checks in Travis CI

2018-07-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2809:
--
Labels: pull-request-available  (was: )

> [C++] Decrease verbosity of lint checks in Travis CI
> 
>
> Key: ARROW-2809
> URL: https://issues.apache.org/jira/browse/ARROW-2809
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>






[jira] [Created] (ARROW-2809) [C++] Decrease verbosity of lint checks in Travis CI

2018-07-07 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2809:
---

 Summary: [C++] Decrease verbosity of lint checks in Travis CI
 Key: ARROW-2809
 URL: https://issues.apache.org/jira/browse/ARROW-2809
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 0.10.0








[jira] [Updated] (ARROW-2601) [Python] MemoryPool bytes_allocated causes seg

2018-07-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2601:
--
Labels: pull-request-available  (was: )

> [Python] MemoryPool bytes_allocated causes seg
> --
>
> Key: ARROW-2601
> URL: https://issues.apache.org/jira/browse/ARROW-2601
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
>Reporter: Alex Hagerman
>Assignee: Wes McKinney
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> Python 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 18:21:58) 
> [GCC 7.2.0] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import pyarrow as pa
> >>> mp = pa.MemoryPool()
> >>> arr = pa.array([1,2,3], memory_pool=mp)
> >>> mp.bytes_allocated()
> Segmentation fault (core dumped)
> I'll dig into this further, but should bytes_allocated be returning anything 
> when called like this? Or should it raise NotImplemented?





[jira] [Created] (ARROW-2808) [Python] Add unit tests for ProxyMemoryPool, enable new default MemoryPool to be constructed

2018-07-07 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2808:
---

 Summary: [Python] Add unit tests for ProxyMemoryPool, enable new 
default MemoryPool to be constructed
 Key: ARROW-2808
 URL: https://issues.apache.org/jira/browse/ARROW-2808
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney
 Fix For: 0.11.0


I could not find unit tests for ProxyMemoryPool





[jira] [Updated] (ARROW-2784) [C++] MemoryMappedFile::WriteAt allow writing past the end

2018-07-07 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2784:

Fix Version/s: 0.10.0

> [C++] MemoryMappedFile::WriteAt allow writing past the end
> --
>
> Key: ARROW-2784
> URL: https://issues.apache.org/jira/browse/ARROW-2784
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Affects Versions: 0.9.0
>Reporter: Dimitri Vorona
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> There is a missing check in WriteAt; this PR adds it.





[jira] [Assigned] (ARROW-2784) [C++] MemoryMappedFile::WriteAt allow writing past the end

2018-07-07 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-2784:
---

Assignee: Dimitri Vorona

> [C++] MemoryMappedFile::WriteAt allow writing past the end
> --
>
> Key: ARROW-2784
> URL: https://issues.apache.org/jira/browse/ARROW-2784
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Affects Versions: 0.9.0
>Reporter: Dimitri Vorona
>Assignee: Dimitri Vorona
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> There is a missing check in WriteAt; this PR adds it.





[jira] [Updated] (ARROW-2553) [C++] Set MACOSX_DEPLOYMENT_TARGET in wheel build

2018-07-07 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2553:

Component/s: (was: C++)
 Python

> [C++] Set MACOSX_DEPLOYMENT_TARGET in wheel build
> -
>
> Key: ARROW-2553
> URL: https://issues.apache.org/jira/browse/ARROW-2553
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging, Python
>Reporter: Uwe L. Korn
>Priority: Blocker
> Fix For: 0.10.0
>
>
> The current `pyarrow` wheels are not usable on older OSX releases due to a 
> problem in the newest Xcode SDK. We need to set {{MACOSX_DEPLOYMENT_TARGET}} 
> to an older OSX release to avoid getting {{Symbol not found: 
> _os_unfair_lock_lock}}.
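
A minimal sketch of how a build script might pin the variable, using only the standard library. The helper name wheel_build_env and the value "10.9" are assumptions for illustration; the issue does not say which OSX release to target.

```python
import os


def wheel_build_env(deployment_target, base_env=None):
    """Return a copy of the environment with MACOSX_DEPLOYMENT_TARGET set."""
    env = dict(os.environ if base_env is None else base_env)
    # The compiler and linker read this variable to decide which OS
    # release the produced binaries must remain compatible with.
    env["MACOSX_DEPLOYMENT_TARGET"] = deployment_target
    return env


# "10.9" is a placeholder value, not one taken from the issue.
env = wheel_build_env("10.9", base_env={"PATH": "/usr/bin"})
```

The resulting dictionary could then be passed as the env argument of subprocess.run when invoking the wheel build, so the setting does not leak into the calling process.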





[jira] [Updated] (ARROW-2553) [Python] Set MACOSX_DEPLOYMENT_TARGET in wheel build

2018-07-07 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2553:

Summary: [Python] Set MACOSX_DEPLOYMENT_TARGET in wheel build  (was: [C++] 
Set MACOSX_DEPLOYMENT_TARGET in wheel build)

> [Python] Set MACOSX_DEPLOYMENT_TARGET in wheel build
> 
>
> Key: ARROW-2553
> URL: https://issues.apache.org/jira/browse/ARROW-2553
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging, Python
>Reporter: Uwe L. Korn
>Priority: Blocker
> Fix For: 0.10.0
>
>
> The current `pyarrow` wheels are not usable on older OSX releases due to a 
> problem in the newest Xcode SDK. We need to set {{MACOSX_DEPLOYMENT_TARGET}} 
> to an older OSX release to avoid getting {{Symbol not found: 
> _os_unfair_lock_lock}}.





[jira] [Assigned] (ARROW-2300) [Python] python/testing/test_hdfs.sh no longer works

2018-07-07 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-2300:
---

Assignee: Krisztian Szucs

> [Python] python/testing/test_hdfs.sh no longer works
> 
>
> Key: ARROW-2300
> URL: https://issues.apache.org/jira/browse/ARROW-2300
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Krisztian Szucs
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Tried this on a fresh Ubuntu 16.04 install:
> {code}
> $ ./test_hdfs.sh 
> + docker build -t arrow-hdfs-test -f hdfs/Dockerfile .
> Sending build context to Docker daemon  36.86kB
> Step 1/6 : FROM cpcloud86/impala:metastore
> manifest for cpcloud86/impala:metastore not found
> {code}





[jira] [Updated] (ARROW-2802) [Docs] Move release management guide to project wiki

2018-07-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2802:
--
Labels: pull-request-available  (was: )

> [Docs] Move release management guide to project wiki
> 
>
> Key: ARROW-2802
> URL: https://issues.apache.org/jira/browse/ARROW-2802
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Wiki
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> I have begun doing this here 
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide. I 
> think we should remove RELEASE_MANAGEMENT.md and add a note to 
> dev/release/README.md to navigate to the Confluence page.





[jira] [Assigned] (ARROW-2802) [Docs] Move release management guide to project wiki

2018-07-07 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-2802:
---

Assignee: Wes McKinney

> [Docs] Move release management guide to project wiki
> 
>
> Key: ARROW-2802
> URL: https://issues.apache.org/jira/browse/ARROW-2802
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Wiki
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> I have begun doing this here 
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide. I 
> think we should remove RELEASE_MANAGEMENT.md and add a note to 
> dev/release/README.md to navigate to the Confluence page.





[jira] [Updated] (ARROW-2656) [Python] Improve ParquetManifest creation time

2018-07-07 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2656:

Fix Version/s: 0.10.0

> [Python] Improve ParquetManifest creation time 
> ---
>
> Key: ARROW-2656
> URL: https://issues.apache.org/jira/browse/ARROW-2656
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Robert Gruener
>Assignee: Robert Gruener
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 0.10.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> When a parquet dataset is highly partitioned, calling the constructor for 
> [ParquetManifest|https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L588]
>  takes a significant amount of time, since it serially visits directories to 
> find all parquet files. In a dataset with thousands of partition values this 
> can take several minutes on a personal laptop.
> A quick win to vastly improve this performance would be to use a ThreadPool 
> to have calls to {{_visit_level}} happen concurrently, so that less time is 
> wasted waiting on I/O.
> An even faster option could be to allow for optional indexing of dataset 
> metadata in something like the {{common_metadata}}. This could contain all 
> files in the manifest and their row_group information. This would also allow 
> [split_row_groups|https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L746]
>  to be implemented efficiently without needing to open every parquet file in 
> the dataset to retrieve the metadata, which is quite time-consuming for large 
> datasets. The main problem with the indexing approach is that it requires 
> immutability of the dataset, which doesn't seem too unreasonable. This 
> specific implementation seems related to 
> https://issues.apache.org/jira/browse/ARROW-1983, although that only covers 
> the write portion.
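
The ThreadPool idea can be sketched with the standard library alone. This is not the actual {{_visit_level}} code; scan_directory and list_parquet_files are hypothetical helpers that list each directory level concurrently, under the assumption that the time is dominated by I/O.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def scan_directory(path):
    """List one directory, returning (parquet_files, subdirectories)."""
    files, subdirs = [], []
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_dir():
                subdirs.append(entry.path)
            elif entry.name.endswith(".parquet"):
                files.append(entry.path)
    return files, subdirs


def list_parquet_files(root, max_workers=8):
    """Collect *.parquet paths under root, one directory level at a time."""
    found = []
    pending = [root]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while pending:
            next_level = []
            # All directories of the current level are listed concurrently,
            # so the threads overlap their time spent waiting on I/O.
            for files, subdirs in pool.map(scan_directory, pending):
                found.extend(files)
                next_level.extend(subdirs)
            pending = next_level
    return sorted(found)
```

Threads are a reasonable fit here because directory listing releases the GIL while blocked on the filesystem; the benefit would likely be largest on remote or high-latency filesystems.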





[jira] [Updated] (ARROW-2656) [Python] Improve ParquetManifest creation time

2018-07-07 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2656:

Labels: parquet pull-request-available  (was: pull-request-available)






[jira] [Commented] (ARROW-2654) [Python] Error with errno 22 when loading 3.6 GB Parquet file

2018-07-07 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535899#comment-16535899
 ] 

Wes McKinney commented on ARROW-2654:
-

[~andyreagan] where is the data stored? The error suggests that the {{mmap}} 
call failed, but without more detail it's hard for me to tell. Can you please 
test using the appropriate wheel for your platform from the build below?

https://github.com/kszucs/crossbow/releases/tag/build-163

I'm moving this issue off the 0.10.0 release for now.

I noticed that it's not possible to disable memory mapping when reading Parquet 
files, so I opened ARROW-2807.

> [Python] Error with errno 22 when loading 3.6 GB Parquet file
> -
>
> Key: ARROW-2654
> URL: https://issues.apache.org/jira/browse/ARROW-2654
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
>Reporter: Andy Reagan
>Priority: Major
>  Labels: parquet
> Fix For: 0.11.0
>
>
> I saved a file using pandas to_parquet method, but can't read it back in. 
> Here's the full stack trace:
>  
> {code:java}
> Traceback (most recent call last):
>   File "src/data/CLXP_pull.py", line 214, in <module>
>     main()
>   File "/Users/mm51929/projects/2018/03-advisor-recruiting/pyenv/lib/python3.6/site-packages/click/core.py", line 722, in __call__
>     return self.main(*args, **kwargs)
>   File "/Users/mm51929/projects/2018/03-advisor-recruiting/pyenv/lib/python3.6/site-packages/click/core.py", line 697, in main
>     rv = self.invoke(ctx)
>   File "/Users/mm51929/projects/2018/03-advisor-recruiting/pyenv/lib/python3.6/site-packages/click/core.py", line 895, in invoke
>     return ctx.invoke(self.callback, **ctx.params)
>   File "/Users/mm51929/projects/2018/03-advisor-recruiting/pyenv/lib/python3.6/site-packages/click/core.py", line 535, in invoke
>     return callback(*args, **kwargs)
>   File "src/data/CLXP_pull.py", line 188, in main
>     results[fullname] = pd.read_parquet(os.path.join(project_dir, "data", "raw", fullname+".parquet"), engine="pyarrow")
>   File "/Users/mm51929/projects/2018/03-advisor-recruiting/pyenv/lib/python3.6/site-packages/pandas/io/parquet.py", line 257, in read_parquet
>     return impl.read(path, columns=columns, **kwargs)
>   File "/Users/mm51929/projects/2018/03-advisor-recruiting/pyenv/lib/python3.6/site-packages/pandas/io/parquet.py", line 130, in read
>     **kwargs).to_pandas()
>   File "/Users/mm51929/projects/2018/03-advisor-recruiting/pyenv/lib/python3.6/site-packages/pyarrow/parquet.py", line 939, in read_table
>     pf = ParquetFile(source, metadata=metadata)
>   File "/Users/mm51929/projects/2018/03-advisor-recruiting/pyenv/lib/python3.6/site-packages/pyarrow/parquet.py", line 64, in __init__
>     self.reader.open(source, metadata=metadata)
>   File "_parquet.pyx", line 651, in pyarrow._parquet.ParquetReader.open
>   File "error.pxi", line 79, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: Arrow error: IOError: [Errno 22] Invalid argument
> {code}
> Any ideas what could cause this? The file itself is 3.6GB.
> I'm running pandas==0.22.0.





[jira] [Created] (ARROW-2807) [Python] Enable memory-mapping to be toggled in get_reader when reading Parquet files

2018-07-07 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2807:
---

 Summary: [Python] Enable memory-mapping to be toggled in 
get_reader when reading Parquet files
 Key: ARROW-2807
 URL: https://issues.apache.org/jira/browse/ARROW-2807
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney


See relevant discussion in ARROW-2654
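
To illustrate what a toggle of this kind could look like, here is a minimal sketch built on the standard library mmap module. The function read_file_bytes and its memory_map parameter are hypothetical and are not the pyarrow {{get_reader}} API.

```python
import mmap


def read_file_bytes(path, memory_map=True):
    """Read a whole file, optionally through a read-only memory map."""
    with open(path, "rb") as f:
        if memory_map:
            # Length 0 maps the entire file; mapping an empty file fails.
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
                return mapped[:]
        # Fallback path that avoids mmap entirely.
        return f.read()
```

Both paths return the same bytes, so a caller hitting an mmap failure on a given platform or filesystem could retry with memory_map=False.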





[jira] [Updated] (ARROW-1722) [C++] Add linting script to look for C++/CLI issues

2018-07-07 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-1722:
--
Labels: pull-request-available  (was: )

> [C++] Add linting script to look for C++/CLI issues
> ---
>
> Key: ARROW-1722
> URL: https://issues.apache.org/jira/browse/ARROW-1722
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> This includes:
> * Using {{nullptr}} in header files (we must instead use an appropriate macro 
> to use {{__nullptr}} when the host compiler is C++/CLI)
> * Including {{}} in a public header (e.g. header files without "impl" 
> or "internal" in their name)





[jira] [Updated] (ARROW-2673) [Python] Add documentation + docstring for ARROW-2661

2018-07-07 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2673:

Fix Version/s: (was: 0.10.0)
   0.11.0

> [Python] Add documentation + docstring for ARROW-2661
> -
>
> Key: ARROW-2673
> URL: https://issues.apache.org/jira/browse/ARROW-2673
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.11.0
>
>






[jira] [Updated] (ARROW-1425) [Python] Document semantic differences between Spark timestamps and Arrow timestamps

2018-07-07 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1425:

Fix Version/s: (was: 0.10.0)
   0.11.0

> [Python] Document semantic differences between Spark timestamps and Arrow 
> timestamps
> 
>
> Key: ARROW-1425
> URL: https://issues.apache.org/jira/browse/ARROW-1425
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Li Jin
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> The way that Spark treats non-timezone-aware timestamps as session local can 
> be problematic when using pyarrow which may view the data coming from 
> toPandas() as time zone naive (but with fields as though it were UTC, not 
> session local). We should document carefully how to properly handle the data 
> coming from Spark to avoid problems.
> cc [~bryanc] [~holdenkarau]
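
The mismatch described above can be shown with plain datetime objects. This is only a sketch: the five-hour session offset is a made-up value, and the code does not involve Spark or pyarrow.

```python
from datetime import datetime, timedelta, timezone

# A wall-clock value with no time zone attached.
naive = datetime(2018, 7, 7, 12, 0, 0)

# Hypothetical session time zone (UTC-5); an assumption for illustration.
session_tz = timezone(timedelta(hours=-5))

# Interpretation 1: the value is local to the session time zone.
as_session_local = naive.replace(tzinfo=session_tz)

# Interpretation 2: the same fields are read as though they were UTC.
as_utc = naive.replace(tzinfo=timezone.utc)

# The two readings refer to different instants.
skew = as_session_local - as_utc
```

Here skew equals the session offset of five hours, which is the size of the error introduced when the two systems disagree about how to read the same naive fields.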





[jira] [Commented] (ARROW-2806) [Python] Inconsistent handling of np.nan

2018-07-07 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535873#comment-16535873
 ] 

Wes McKinney commented on ARROW-2806:
-

And {{pa.array([1., NaN])}} should preserve the NaN

> [Python] Inconsistent handling of np.nan
> 
>
> Key: ARROW-2806
> URL: https://issues.apache.org/jira/browse/ARROW-2806
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
>Reporter: Uwe L. Korn
>Priority: Major
> Fix For: 0.10.0
>
>
> Currently we handle {{np.nan}} differently between having a list or a numpy 
> array as an input to {{pa.array()}}:
> {code}
> >>> pa.array(np.array([1, np.nan]))
> 
> [
>   1.0,
>   nan
> ]
> >>> pa.array([1., np.nan])
> Out[9]:
> 
> [
>   1.0,
>   NA
> ]
> {code}
> I would actually think the last one is the correct one, especially once one 
> casts this to an integer column: there, the first one produces a column with 
> INT_MIN and the second one produces a real null.
> But in {{test_array_conversions_no_sentinel_values}} we check that 
> {{np.nan}} does not produce a null.
> Even weirder: 
> {code}
> >>> df = pd.DataFrame({'a': [1., None]})
> >>> df
>  a
> 0  1.0
> 1  NaN
> >>> pa.Table.from_pandas(df).column(0)
> 
> chunk 0: 
> [
>   1.0,
>   NA
> ]
> {code}





[jira] [Commented] (ARROW-2806) [Python] Inconsistent handling of np.nan

2018-07-07 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535872#comment-16535872
 ] 

Wes McKinney commented on ARROW-2806:
-

Oof, I actually think {{pa.array([1, NaN])}} should either raise an exception 
or return a DoubleArray with a NaN, unless {{from_pandas=True}}.






[jira] [Commented] (ARROW-2806) [Python] Inconsistent handling of np.nan

2018-07-07 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535803#comment-16535803
 ] 

Uwe L. Korn commented on ARROW-2806:


[~wesmckinn] Would it be ok for you to change the test so that {{np.nan}} is 
always a null indicator for Arrow?






[jira] [Created] (ARROW-2806) [Python] Inconsistent handling of np.nan

2018-07-07 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-2806:
--

 Summary: [Python] Inconsistent handling of np.nan
 Key: ARROW-2806
 URL: https://issues.apache.org/jira/browse/ARROW-2806
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.9.0
Reporter: Uwe L. Korn
 Fix For: 0.10.0


Currently we handle {{np.nan}} differently between having a list or a numpy 
array as an input to {{pa.array()}}:

{code}
>>> pa.array(np.array([1, np.nan]))

[
  1.0,
  nan
]

>>> pa.array([1., np.nan])
Out[9]:

[
  1.0,
  NA
]
{code}

I would actually think the last one is the correct one, especially once one 
casts this to an integer column: there, the first one produces a column with 
INT_MIN and the second one produces a real null.

But in {{test_array_conversions_no_sentinel_values}} we check that {{np.nan}} 
does not produce a null.

Even weirder: 

{code}
>>> df = pd.DataFrame({'a': [1., None]})
>>> df
 a
0  1.0
1  NaN
>>> pa.Table.from_pandas(df).column(0)

chunk 0: 
[
  1.0,
  NA
]
{code}
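
The distinction between a NaN value and a null can be modelled with a values list plus a validity mask. The sketch below is a toy model in plain Python, not pyarrow's conversion code; to_values_and_validity and its nan_as_null flag are hypothetical names.

```python
import math


def to_values_and_validity(items, nan_as_null=False):
    """Split a sequence into float values and a per-element validity mask."""
    values, valid = [], []
    for item in items:
        is_null = item is None or (
            nan_as_null and isinstance(item, float) and math.isnan(item)
        )
        # Null slots still need a placeholder value; only the mask marks them.
        values.append(0.0 if is_null else float(item))
        valid.append(not is_null)
    return values, valid


kept_values, kept_valid = to_values_and_validity([1.0, float("nan")])
null_values, null_valid = to_values_and_validity(
    [1.0, float("nan")], nan_as_null=True
)
```

With the flag off, NaN stays an ordinary (valid) float; with it on, NaN is recorded as a null. The inconsistency reported here amounts to the list path and the numpy path picking different settings of such a switch.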





[jira] [Resolved] (ARROW-2634) [Go] Add LICENSE additions for Go subproject

2018-07-07 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-2634.
-
Resolution: Fixed

Issue resolved by pull request 2221
[https://github.com/apache/arrow/pull/2221]

> [Go] Add LICENSE additions for Go subproject
> 
>
> Key: ARROW-2634
> URL: https://issues.apache.org/jira/browse/ARROW-2634
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Go
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The Arrow Go codebase contains code from the Go project. This needs to be 
> mentioned in the main LICENSE.txt





[jira] [Updated] (ARROW-2805) [Python] TensorFlow import workaround not working with tensorflow-gpu if CUDA is not installed

2018-07-07 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2805:

Fix Version/s: (was: JS-0.4.0)
   0.10.0

> [Python] TensorFlow import workaround not working with tensorflow-gpu if CUDA 
> is not installed
> --
>
> Key: ARROW-2805
> URL: https://issues.apache.org/jira/browse/ARROW-2805
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Philipp Moritz
>Priority: Major
>  Labels: pull-request-available, tensorflow
> Fix For: 0.10.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> TensorFlow version: 1.7 (GPU enabled but CUDA is not installed)
> tensorflow-gpu was installed via pip install
> ```
> import ray
>   File "/home/eric/Desktop/ray-private/python/ray/__init__.py", line 28, in <module>
>     import pyarrow # noqa: F401
>   File "/home/eric/Desktop/ray-private/python/ray/pyarrow_files/pyarrow/__init__.py", line 55, in <module>
>     compat.import_tensorflow_extension()
>   File "/home/eric/Desktop/ray-private/python/ray/pyarrow_files/pyarrow/compat.py", line 193, in import_tensorflow_extension
>     ctypes.CDLL(ext)
>   File "/usr/lib/python3.5/ctypes/__init__.py", line 347, in __init__
>     self._handle = _dlopen(self._name, mode)
> OSError: libcublas.so.9.0: cannot open shared object file: No such file or directory
> ```
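
The traceback shows {{ctypes.CDLL}} raising OSError when a dependent library is missing. One way such a load could be made tolerant is sketched below; try_load_shared_library is a hypothetical helper and is not necessarily how the workaround was actually fixed.

```python
import ctypes


def try_load_shared_library(path):
    """Return a loaded library handle, or None if it cannot be loaded."""
    try:
        return ctypes.CDLL(path)
    except OSError:
        # Raised when the file or one of its dependencies
        # (for example a CUDA library) cannot be found.
        return None


handle = try_load_shared_library("/nonexistent/libdoes_not_exist.so.0")
```

Returning None lets the caller skip the optional extension instead of failing the whole import.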


