[jira] [Resolved] (ARROW-6809) [RUBY] Gem does not install on macOS due to glib2 3.3.7 compilation failure

2019-10-28 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-6809.
-
Fix Version/s: 0.15.1
   1.0.0
   Resolution: Fixed

ARROW-6777 solves this.

> [RUBY] Gem does not install on macOS due to glib2 3.3.7 compilation failure
> ---
>
> Key: ARROW-6809
> URL: https://issues.apache.org/jira/browse/ARROW-6809
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Ruby
>Affects Versions: 0.15.0
> Environment: macOS Mojave 10.14.6
> Ruby 2.6.3p62 (2019-04-16 revision 67580) [x86_64-darwin18]
> Xcode 10.3
>Reporter: Keith Wedinger
>Assignee: Kouhei Sutou
>Priority: Blocker
> Fix For: 1.0.0, 0.15.1
>
>
> *System information:*
>  * macOS Mojave 10.14.6
>  * Ruby 2.6.3p62 (2019-04-16 revision 67580) [x86_64-darwin18] managed via 
> rbenv
> *Reproduction steps:*
> Run {{gem install red-arrow}}
> *Observe:*
> The following errors occur while compiling the dependent gem glib2 3.3.7:
> {code}
> Building native extensions. This could take a while...
> ERROR:  Error installing red-arrow:
>   ERROR: Failed to build gem native extension.
> current directory: 
> /Users/kwedinger/.rbenv/versions/2.6.3/lib/ruby/gems/2.6.0/gems/glib2-3.3.7/ext/glib2
> /Users/kwedinger/.rbenv/versions/2.6.3/bin/ruby -I 
> /Users/kwedinger/.rbenv/versions/2.6.3/lib/ruby/2.6.0 -r 
> ./siteconf20191007-84053-1y4ly2q.rb extconf.rb
> checking for --enable-debug-build option... no
> checking for -Wall option to compiler... yes
> checking for -Waggregate-return option to compiler... yes
> checking for -Wcast-align option to compiler... yes
> checking for -Wextra option to compiler... no
> checking for -Wformat=2 option to compiler... yes
> checking for -Winit-self option to compiler... yes
> checking for -Wlarger-than-65500 option to compiler... yes
> checking for -Wmissing-declarations option to compiler... yes
> checking for -Wmissing-format-attribute option to compiler... yes
> checking for -Wmissing-include-dirs option to compiler... yes
> checking for -Wmissing-noreturn option to compiler... yes
> checking for -Wmissing-prototypes option to compiler... yes
> checking for -Wnested-externs option to compiler... yes
> checking for -Wold-style-definition option to compiler... yes
> checking for -Wpacked option to compiler... yes
> checking for -Wp,-D_FORTIFY_SOURCE=2 option to compiler... yes
> checking for -Wpointer-arith option to compiler... yes
> checking for -Wswitch-default option to compiler... yes
> checking for -Wswitch-enum option to compiler... yes
> checking for -Wundef option to compiler... yes
> checking for -Wout-of-line-declaration option to compiler... yes
> checking for -Wunsafe-loop-optimizations option to compiler... no
> checking for -Wwrite-strings option to compiler... yes
> checking for Homebrew... yes
> checking for gobject-2.0 version (>= 2.12.0)... yes
> checking for gthread-2.0... yes
> checking for unistd.h... yes
> checking for io.h... no
> checking for g_spawn_close_pid() in glib.h... yes
> checking for g_thread_init() in glib.h... yes
> checking for g_main_depth() in glib.h... yes
> checking for g_listenv() in glib.h... yes
> checking for rb_check_array_type() in ruby.h... yes
> checking for rb_check_hash_type() in ruby.h... yes
> checking for rb_exec_recursive() in ruby.h... yes
> checking for rb_errinfo() in ruby.h... yes
> checking for rb_thread_call_without_gvl() in ruby.h... yes
> checking for ruby_native_thread_p() in ruby.h... yes
> checking for rb_thread_call_with_gvl() in ruby.h... yes
> checking for rb_gc_register_mark_object() in ruby.h... yes
> checking for rb_exc_new_str() in ruby.h... yes
> checking for rb_enc_str_new_static() in ruby.h... yes
> checking for curr_thread in ruby.h,node.h... no
> checking for rb_curr_thread in ruby.h,node.h... no
> creating ruby-glib2.pc
> creating glib-enum-types.c
> creating glib-enum-types.h
> creating Makefile
> current directory: 
> /Users/kwedinger/.rbenv/versions/2.6.3/lib/ruby/gems/2.6.0/gems/glib2-3.3.7/ext/glib2
> make "DESTDIR=" clean
> current directory: 
> /Users/kwedinger/.rbenv/versions/2.6.3/lib/ruby/gems/2.6.0/gems/glib2-3.3.7/ext/glib2
> make "DESTDIR="
> compiling rbglib-gc.c
> compiling rbgobj_signal.c
> compiling rbglib_int64.c
> compiling rbglib_convert.c
> compiling rbglib_bookmarkfile.c
> compiling rbglib-variant.c
> compiling glib-enum-types.c
> glib-enum-types.c:632:9: warning: 'G_SPAWN_ERROR_2BIG' is deprecated: Use 
> 'G_SPAWN_ERROR_TOO_BIG' instead [-Wdeprecated-declarations]
>   { G_SPAWN_ERROR_2BIG, "G_SPAWN_ERROR_2BIG", "2big" },
> ^
> /usr/local/Cellar/glib/2.62.1/include/glib-2.0/glib/gspawn.h:76:22: note: 
> 'G_SPAWN_ERROR_2BIG' has been explicitly marked 

[jira] [Updated] (ARROW-7014) [Developer] Write script to verify Linux wheels given local environment with conda or virtualenv

2019-10-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-7014:
--
Labels: pull-request-available  (was: )

> [Developer] Write script to verify Linux wheels given local environment with 
> conda or virtualenv
> 
>
> Key: ARROW-7014
> URL: https://issues.apache.org/jira/browse/ARROW-7014
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Developer Tools, Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Facilitate testing RC wheels. Also test the checksum and signature.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6340) [R] Implements low-level bindings to Dataset classes

2019-10-28 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6340:
---
Description: 
The following classes should be accessible from R:
 * class DataSource
 * class DataSourceDiscovery
 * class Dataset
 * class ScanContext, ScanOptions, ScanTask
 * class ScannerBuilder
 * class Scanner

The end result is reading a directory of parquet files as a single stream. One 
should be able to re-implement [https://github.com/apache/arrow/pull/5720] in 
R. 

See also [https://github.com/apache/arrow/pull/5675/files] for another 
end-to-end example in C++.

  was:
The following classes should be accessible from R:

 * class DataSource
 * class DataSourceDiscovery
 * class Dataset
 * class ScanContext, ScanOptions, ScanTask
 * class ScannerBuilder
 * class Scanner

The end result is reading a directory of parquet files as a single stream. One 
should be able to re-implement [https://github.com/apache/arrow/pull/5720] in R.


> [R] Implements low-level bindings to Dataset classes
> 
>
> Key: ARROW-6340
> URL: https://issues.apache.org/jira/browse/ARROW-6340
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++ - Dataset, R
>Reporter: Francois Saint-Jacques
>Assignee: Romain Francois
>Priority: Major
>  Labels: dataset, pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> The following classes should be accessible from R:
>  * class DataSource
>  * class DataSourceDiscovery
>  * class Dataset
>  * class ScanContext, ScanOptions, ScanTask
>  * class ScannerBuilder
>  * class Scanner
> The end result is reading a directory of parquet files as a single stream. 
> One should be able to re-implement 
> [https://github.com/apache/arrow/pull/5720] in R. 
> See also [https://github.com/apache/arrow/pull/5675/files] for another 
> end-to-end example in C++.
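
For readers unfamiliar with the intended workflow, here is a minimal sketch of the analogous flow through the Python bindings (an illustration only: the path is hypothetical, {{pyarrow.dataset}} landed after this issue was filed, and the issue itself targets R bindings to the C++ classes listed above):

{code}
import pyarrow.dataset as ds

# Discover every Parquet file under a directory as one logical dataset...
dataset = ds.dataset('/path/to/parquet_dir', format='parquet')

# ...then scan it as a single stream of record batches.
for batch in dataset.to_batches():
    print(batch.num_rows)
{code}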



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6987) [CI] Travis OSX failing to install sdk headers

2019-10-28 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6987:
---
Labels:   (was: pull-request-available)

> [CI] Travis OSX failing to install sdk headers
> --
>
> Key: ARROW-6987
> URL: https://issues.apache.org/jira/browse/ARROW-6987
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration
>Reporter: Francois Saint-Jacques
>Priority: Major
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> {code:java}
> sudo installer -pkg /Library/Developer/CommandLineTools/Packages/macOS_SDK_headers_for_macOS_10.14.pkg -target /
> installer: Package name is macOS_SDK_headers_for_macOS_10.14
> installer: Certificate used to sign package is not trusted. Use -allowUntrusted to override.
> The command "$TRAVIS_BUILD_DIR/ci/travis_before_script_cpp.sh --only-library --homebrew" failed and exited with 1 during .
> {code}
> See [https://travis-ci.org/apache/arrow/jobs/602434884#L342-L345]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7017) [C++] Refactor AddKernel to support other operations and types

2019-10-28 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961597#comment-16961597
 ] 

Wes McKinney commented on ARROW-7017:
-

I think the jury is out (for example, I'm not totally convinced) on having LLVM 
as a hard requirement for running simple expressions. I'm not sure what will 
end up being most desirable long term.

> [C++] Refactor AddKernel to support other operations and types
> --
>
> Key: ARROW-7017
> URL: https://issues.apache.org/jira/browse/ARROW-7017
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, C++ - Compute
>Reporter: Francois Saint-Jacques
>Priority: Major
>  Labels: analytics
>
> * Should avoid using builders (and/or NULLs) since the output shape is known 
> at compute time.
>  * Should be refactored to support other operations, e.g. Subtraction, 
> Multiplication.
>  * Should have an overflow/underflow detection mode.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-7017) [C++] Refactor AddKernel to support other operations and types

2019-10-28 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961597#comment-16961597
 ] 

Wes McKinney edited comment on ARROW-7017 at 10/29/19 1:51 AM:
---

I think the jury is out (for example, I'm not totally convinced) on having LLVM 
compilation as a hard requirement for running simple expressions. I'm not sure 
what will end up being most desirable long term.


was (Author: wesmckinn):
I think the jury is out (for example, I'm not totally convinced) on having LLVM 
as a hard requirement for running simple expressions. I'm not sure what will 
end up being most desirable long term.

> [C++] Refactor AddKernel to support other operations and types
> --
>
> Key: ARROW-7017
> URL: https://issues.apache.org/jira/browse/ARROW-7017
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, C++ - Compute
>Reporter: Francois Saint-Jacques
>Priority: Major
>  Labels: analytics
>
> * Should avoid using builders (and/or NULLs) since the output shape is known 
> at compute time.
>  * Should be refactored to support other operations, e.g. Subtraction, 
> Multiplication.
>  * Should have an overflow/underflow detection mode.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6993) [CI] Macos SDK installation fails on Travis

2019-10-28 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961592#comment-16961592
 ] 

Kouhei Sutou commented on ARROW-6993:
-

This duplicates ARROW-6987.

> [CI]  Macos SDK installation fails on Travis
> 
>
> Key: ARROW-6993
> URL: https://issues.apache.org/jira/browse/ARROW-6993
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration
>Reporter: Krisztian Szucs
>Priority: Major
>
> See the failing build at https://travis-ci.org/apache/arrow/jobs/602560324
> Pass the -allowUntrusted flag during the installation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6993) [CI] Macos SDK installation fails on Travis

2019-10-28 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-6993.
-
Resolution: Duplicate

> [CI]  Macos SDK installation fails on Travis
> 
>
> Key: ARROW-6993
> URL: https://issues.apache.org/jira/browse/ARROW-6993
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration
>Reporter: Krisztian Szucs
>Priority: Major
>
> See the failing build at https://travis-ci.org/apache/arrow/jobs/602560324
> Pass the -allowUntrusted flag during the installation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-7013) [C++] arrow-dataset pkgconfig is incomplete

2019-10-28 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-7013.
-
Resolution: Fixed

Issue resolved by pull request 5747
[https://github.com/apache/arrow/pull/5747]

> [C++] arrow-dataset pkgconfig is incomplete
> ---
>
> Key: ARROW-7013
> URL: https://issues.apache.org/jira/browse/ARROW-7013
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Dataset
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Unlike the other *.pc.in files, it doesn't include a {{Libs}} field, so 
> passing the flags reported by pkg-config still leaves the library unfound.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7017) [C++] Refactor AddKernel to support other operations and types

2019-10-28 Thread Jacques Nadeau (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961565#comment-16961565
 ] 

Jacques Nadeau commented on ARROW-7017:
---

What's the thinking behind building these a second time here, as opposed to 
just adding utility methods over Gandiva for specific patterns? In my 
experience it is very rare that people only need to evaluate a single 
expression.

> [C++] Refactor AddKernel to support other operations and types
> --
>
> Key: ARROW-7017
> URL: https://issues.apache.org/jira/browse/ARROW-7017
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, C++ - Compute
>Reporter: Francois Saint-Jacques
>Priority: Major
>  Labels: analytics
>
> * Should avoid using builders (and/or NULLs) since the output shape is known 
> at compute time.
>  * Should be refactored to support other operations, e.g. Subtraction, 
> Multiplication.
>  * Should have an overflow/underflow detection mode.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6993) [CI] Macos SDK installation fails on Travis

2019-10-28 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961561#comment-16961561
 ] 

Neal Richardson commented on ARROW-6993:


[~kou] this is causing the GLib tests to fail on master

> [CI]  Macos SDK installation fails on Travis
> 
>
> Key: ARROW-6993
> URL: https://issues.apache.org/jira/browse/ARROW-6993
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration
>Reporter: Krisztian Szucs
>Priority: Major
>
> See the failing build at https://travis-ci.org/apache/arrow/jobs/602560324
> Pass the -allowUntrusted flag during the installation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7017) [C++] Refactor AddKernel to support other operations and types

2019-10-28 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-7017:
--
Component/s: C++

> [C++] Refactor AddKernel to support other operations and types
> --
>
> Key: ARROW-7017
> URL: https://issues.apache.org/jira/browse/ARROW-7017
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, C++ - Compute
>Reporter: Francois Saint-Jacques
>Priority: Major
>  Labels: analytics
>
> * Should avoid using builders (and/or NULLs) since the output shape is known 
> at compute time.
>  * Should be refactored to support other operations, e.g. Subtraction, 
> Multiplication.
>  * Should have an overflow/underflow detection mode.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7017) [C++] Refactor AddKernel to support other operations and types

2019-10-28 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-7017:
-

 Summary: [C++] Refactor AddKernel to support other operations and 
types
 Key: ARROW-7017
 URL: https://issues.apache.org/jira/browse/ARROW-7017
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Compute
Reporter: Francois Saint-Jacques


* Should avoid using builders (and/or NULLs) since the output shape is known 
at compute time.
 * Should be refactored to support other operations, e.g. Subtraction, 
Multiplication.
 * Should have an overflow/underflow detection mode.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7016) [Developer][Python] Write script to verify Windows wheels given local environment with conda

2019-10-28 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-7016:
---

 Summary: [Developer][Python] Write script to verify Windows wheels 
given local environment with conda
 Key: ARROW-7016
 URL: https://issues.apache.org/jira/browse/ARROW-7016
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Developer Tools, Python
Reporter: Wes McKinney
 Fix For: 1.0.0


Windows version of ARROW-7014



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7014) [Developer] Write script to verify Linux wheels given local environment with conda or virtualenv

2019-10-28 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-7014:
---

 Summary: [Developer] Write script to verify Linux wheels given 
local environment with conda or virtualenv
 Key: ARROW-7014
 URL: https://issues.apache.org/jira/browse/ARROW-7014
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Developer Tools, Python
Reporter: Wes McKinney
 Fix For: 1.0.0


Facilitate testing RC wheels. Also test the checksum and signature.
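
A minimal sketch of what such a script might check, assuming ASF-dist-style {{.sha512}} and {{.asc}} sidecar files next to the wheel (the file layout and function name are assumptions, not the final script):

{code}
import hashlib
import subprocess

def verify_wheel(path):
    # Checksum: recompute SHA-512 and compare against the published sidecar.
    with open(path, 'rb') as f:
        digest = hashlib.sha512(f.read()).hexdigest()
    with open(path + '.sha512') as f:
        published = f.read().split()[0]
    assert digest == published, 'checksum mismatch'

    # Signature: delegate verification of the detached signature to gpg.
    subprocess.check_call(['gpg', '--verify', path + '.asc', path])

    # Install and import (in practice, inside a fresh conda env or virtualenv).
    subprocess.check_call(['pip', 'install', '--force-reinstall', path])
    subprocess.check_call(['python', '-c', 'import pyarrow'])
{code}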



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7015) [Developer] Write script to verify macOS wheels given local environment with conda or virtualenv

2019-10-28 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-7015:
---

 Summary: [Developer] Write script to verify macOS wheels given 
local environment with conda or virtualenv
 Key: ARROW-7015
 URL: https://issues.apache.org/jira/browse/ARROW-7015
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Developer Tools, Python
Reporter: Wes McKinney
 Fix For: 1.0.0


macOS analogue to ARROW-7014



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7013) [C++] arrow-dataset pkgconfig is incomplete

2019-10-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-7013:
--
Labels: pull-request-available  (was: )

> [C++] arrow-dataset pkgconfig is incomplete
> ---
>
> Key: ARROW-7013
> URL: https://issues.apache.org/jira/browse/ARROW-7013
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Dataset
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Unlike the other *.pc.in files, it doesn't include a {{Libs}} field, so 
> passing the flags reported by pkg-config still leaves the library unfound.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7013) [C++] arrow-dataset pkgconfig is incomplete

2019-10-28 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-7013:
--

 Summary: [C++] arrow-dataset pkgconfig is incomplete
 Key: ARROW-7013
 URL: https://issues.apache.org/jira/browse/ARROW-7013
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++ - Dataset
Reporter: Neal Richardson
Assignee: Neal Richardson
 Fix For: 1.0.0


Unlike the other *.pc.in files, it doesn't include a {{Libs}} field, so 
passing the flags reported by pkg-config still leaves the library unfound.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-2880) [Packaging] Script like verify-release-candidate.sh for automated testing of conda and wheel Python packages in ASF dist

2019-10-28 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-2880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961499#comment-16961499
 ] 

Wes McKinney commented on ARROW-2880:
-

We can use Crossbow to do the verification in controlled environments. 

> [Packaging] Script like verify-release-candidate.sh for automated testing of 
> conda and wheel Python packages in ASF dist
> 
>
> Key: ARROW-2880
> URL: https://issues.apache.org/jira/browse/ARROW-2880
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Packaging
>Reporter: Wes McKinney
>Priority: Major
>
> We have a script for verifying a source release candidate. We should make a 
> similar script to test out the wheels and conda packages for the supported 
> Python versions (2.7, 3.5, 3.6, soon 3.7) in an automated fashion.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7007) [C++] Enable mmap option for LocalFs

2019-10-28 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961494#comment-16961494
 ] 

Wes McKinney commented on ARROW-7007:
-

We might consider whether there is another approach to this problem. Consider 
how TensorFlow handles this (I think):

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/platform/file_system.h#L95
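
For context, pyarrow already exposes standalone memory mapping; the request here is to surface that as an option on the local filesystem itself. A sketch of the existing standalone API (the file path is hypothetical):

{code}
import pyarrow as pa

# Open a file memory-mapped instead of via buffered reads.
with pa.memory_map('/tmp/data.arrow', 'r') as source:
    reader = pa.ipc.open_file(source)  # reads are zero-copy from the mapping
    table = reader.read_all()
    print(table.num_rows)
{code}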

> [C++] Enable mmap option for LocalFs
> 
>
> Key: ARROW-7007
> URL: https://issues.apache.org/jira/browse/ARROW-7007
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Francois Saint-Jacques
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6980) [R] dplyr backend for RecordBatch/Table

2019-10-28 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-6980.

Resolution: Fixed

Issue resolved by pull request 5661
[https://github.com/apache/arrow/pull/5661]

> [R] dplyr backend for RecordBatch/Table
> ---
>
> Key: ARROW-6980
> URL: https://issues.apache.org/jira/browse/ARROW-6980
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6784) [C++][R] Move filter and take code from Rcpp to C++ library

2019-10-28 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961441#comment-16961441
 ] 

Neal Richardson commented on ARROW-6784:


Followup issues: ARROW-6959, ARROW-7009, ARROW-7012

> [C++][R] Move filter and take code from Rcpp to C++ library
> ---
>
> Key: ARROW-6784
> URL: https://issues.apache.org/jira/browse/ARROW-6784
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> Followup to ARROW-3808 and some other previous work. Of particular interest:
>  * Filter and Take methods for ChunkedArray, in r/src/compute.cpp
>  * Methods for that and some other things that apply Array and ChunkedArray 
> methods across the columns of a RecordBatch or Table, respectively
>  * RecordBatch__select and Table__select to take columns
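
To make the first bullet above concrete, here is a minimal sketch, via the Python bindings, of the behavior being consolidated into the C++ library: a filter applied across all columns of a table at once (modern pyarrow API, for illustration only):

{code}
import pyarrow as pa
import pyarrow.compute as pc

table = pa.table({'x': [1, 2, 3, 4], 'y': ['a', 'b', 'c', 'd']})
mask = pc.greater(table['x'], 2)  # boolean ChunkedArray

# The filter is applied column by column across the whole table --
# the kind of logic this issue moves out of r/src/compute.cpp.
filtered = table.filter(mask)
print(filtered.to_pydict())  # {'x': [3, 4], 'y': ['c', 'd']}
{code}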



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6784) [C++][R] Move filter and take code from Rcpp to C++ library

2019-10-28 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6784:
---
Summary: [C++][R] Move filter and take code from Rcpp to C++ library  (was: 
[C++][R] Move filter, take, select C++ code from Rcpp to C++ library)

> [C++][R] Move filter and take code from Rcpp to C++ library
> ---
>
> Key: ARROW-6784
> URL: https://issues.apache.org/jira/browse/ARROW-6784
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> Followup to ARROW-3808 and some other previous work. Of particular interest:
>  * Filter and Take methods for ChunkedArray, in r/src/compute.cpp
>  * Methods for that and some other things that apply Array and ChunkedArray 
> methods across the columns of a RecordBatch or Table, respectively
>  * RecordBatch__select and Table__select to take columns



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7012) [C++] Clarify ChunkedArray chunking strategy and policy

2019-10-28 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-7012:
--

 Summary: [C++] Clarify ChunkedArray chunking strategy and policy
 Key: ARROW-7012
 URL: https://issues.apache.org/jira/browse/ARROW-7012
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Neal Richardson


See discussion on ARROW-6784 and [https://github.com/apache/arrow/pull/5686]. 
Among the questions:
 * Do Arrow users control the chunking, or is it an internal implementation 
detail they should not manage?
 * If users control it, how do they control it? E.g. if I call Take and use a 
ChunkedArray for the indices to take, does the chunking follow how the indices 
are chunked? Or should we attempt to preserve the mapping of data to their 
chunks in the input table/chunked array? (See the sketch after this list.)
 * If it's an implementation detail, what is the optimal chunk size? And when 
is it worth reshaping (concatenating, slicing) input data to attain this 
optimal size? 
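
A minimal pyarrow sketch of the Take question above (modern Python API; the chunk layout of the result is exactly what is left unspecified):

{code}
import pyarrow as pa
import pyarrow.compute as pc

# Data stored as two chunks; indices supplied with a different chunking.
data = pa.chunked_array([[10, 20, 30], [40, 50, 60]])
indices = pa.chunked_array([[0, 5], [2]])

result = pc.take(data, indices)
# Open question: should result.num_chunks follow `data`, follow `indices`,
# or be normalized toward some optimal chunk size?
print(result.num_chunks)
{code}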



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7011) [C++] Implement casts from float/double to decimal128

2019-10-28 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-7011:
---

 Summary: [C++] Implement casts from float/double to decimal128
 Key: ARROW-7011
 URL: https://issues.apache.org/jira/browse/ARROW-7011
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney


See also ARROW-5905, ARROW-7010.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7010) [C++] Support lossy casts from decimal128 to float32 and float64/double

2019-10-28 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-7010:
---

 Summary: [C++] Support lossy casts from decimal128 to float32 and 
float64/double
 Key: ARROW-7010
 URL: https://issues.apache.org/jira/browse/ARROW-7010
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


I do not believe such casts are implemented. This can be helpful for people 
analyzing data where the precision of decimal128 is not needed.
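
A sketch of the intended call, using the standard pyarrow cast API (at the time this issue was filed the cast was not implemented and raised ArrowNotImplementedError; the snippet shows the desired behavior, not the then-current state):

{code}
from decimal import Decimal
import pyarrow as pa

arr = pa.array([Decimal('1.25'), Decimal('2.50')], type=pa.decimal128(10, 2))

# The requested lossy cast from decimal128 down to a floating-point type.
floats = arr.cast(pa.float64())
print(floats)  # expected: [1.25, 2.5]
{code}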



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7009) [C++] Refactor filter/take kernels to use Datum instead of overloads

2019-10-28 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-7009:
--

 Summary: [C++] Refactor filter/take kernels to use Datum instead 
of overloads
 Key: ARROW-7009
 URL: https://issues.apache.org/jira/browse/ARROW-7009
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Neal Richardson
 Fix For: 1.0.0


Followup to ARROW-6784. See discussion on 
[https://github.com/apache/arrow/pull/5686], as well as ARROW-6959.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-7006) [Rust] Bump flatbuffers version to avoid vulnerability

2019-10-28 Thread Paddy Horan (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paddy Horan resolved ARROW-7006.

Resolution: Fixed

Issue resolved by pull request 5744
[https://github.com/apache/arrow/pull/5744]

> [Rust] Bump flatbuffers version to avoid vulnerability
> --
>
> Key: ARROW-7006
> URL: https://issues.apache.org/jira/browse/ARROW-7006
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Affects Versions: 0.15.0
>Reporter: Paddy Horan
>Assignee: Paddy Horan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> From GitHub user emilk:
> [{{cargo audit}}|https://github.com/RustSec/cargo-audit] output:
>  
> {{ID:  RUSTSEC-2019-0028
> Crate: flatbuffers
> Version: 0.5.0
> Date:  2019-10-20
> URL:   https://github.com/google/flatbuffers/issues/5530
> Title: Unsound `impl Follow for bool`}}
> The fix should be as simple as editing 
> [https://github.com/apache/arrow/blob/master/rust/arrow/Cargo.toml] from 
> {{flatbuffers = "0.5.0"}} to {{flatbuffers = "0.6.0"}}
> A more long-term improvement is to add a call to {{cargo audit}} in your CI 
> to catch these problems as early as possible.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-6999) [Python] KeyError: '__index_level_0__' passing Table.from_pandas its own schema

2019-10-28 Thread Tom Goodman (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961317#comment-16961317
 ] 

Tom Goodman edited comment on ARROW-6999 at 10/28/19 7:24 PM:
--

[~jorisvandenbossche] please try this with the attached [^test3.hdf] (not empty)
{code:java}
df2 = pd.read_hdf('test3.hdf','foo')
pa.Table.from_pandas(df2, schema=pa.Table.from_pandas(df2).schema){code}
I still get KeyError: '__index_level_0__' (+without+ specifying 
preserve_index).

This may be because the index on test3.hdf is Int64Index and I see [pyarrow 
docs|https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.from_pandas]
 say default behavior is to "store the index as a column", except for range 
indexes.  This unfortunately makes the bug more prevalent.


was (Author: goodiegoodman):
[~jorisvandenbossche] please try this with the attached [^test3.hdf] (not empty)
{code:java}
df2 = pd.read_hdf('test3.hdf','foo')
pa.Table.from_pandas(df2, schema=pa.Table.from_pandas(df2).schema){code}
I still get KeyError: '__index_level_0__' (+without+ specifying 
preserve_index).

This may be because the index on test3.hdf is Int64Index and I see [pyarrow 
docs|https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.from_pandas]
 say default behavior is to "store the index as a column", except for range 
indexes.  This unfortunately makes the bug more prevalent.

> [Python] KeyError: '__index_level_0__' passing Table.from_pandas its own 
> schema
> ---
>
> Key: ARROW-6999
> URL: https://issues.apache.org/jira/browse/ARROW-6999
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.12.0, 0.13.0, 0.14.0, 0.15.0
> Environment: pandas==0.23.4
> pyarrow==0.15.0  # Issue also with 0.14.0, 0.13.0 & 0.12.0, but not 0.11.0
>Reporter: Tom Goodman
>Priority: Major
> Fix For: 1.0.0
>
> Attachments: test3.hdf
>
>
> Steps to reproduce:
>  # Generate any DataFrame's pyarrow Schema using Table.from_pandas
>  # Pass the generated schema as input into Table.from_pandas
>  # Causes KeyError: '__index_level_0__'
> We did not have this issue with pyarrow==0.11.0 which we used to write many 
> partitions across years.  Our goal now is to use pyarrow==0.15.0 and produce 
> schema going forward that are *backwards compatible* (i.e. also have 
> '__index_level_0__'), so we should not need to re-generate all prior years' 
> partitions when we migrate to 0.15.0.
> We cannot set _preserve_index=False_, since that effectively deletes 
> '__index_level_0__', causing inconsistent schema across earlier partitions 
> that had been written using pyarrow==0.11.0.
>  
> {code:java}
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame() 
> schema = pa.Table.from_pandas(df).schema
> pa_table = pa.Table.from_pandas(df, schema=schema)
> {code}
> {noformat}
> Traceback (most recent call last):
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/indexes/base.py",
>  line 3078, in get_loc
> return self._engine.get_loc(key)
>   File "pandas/_libs/index.pyx", line 140, in 
> pandas._libs.index.IndexEngine.get_loc
>   File "pandas/_libs/index.pyx", line 162, in 
> pandas._libs.index.IndexEngine.get_loc
>   File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in 
> pandas._libs.hashtable.PyObjectHashTable.get_item
>   File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in 
> pandas._libs.hashtable.PyObjectHashTable.get_item
> KeyError: '__index_level_0__'
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pyarrow/pandas_compat.py",
>  line 408, in _get_columns_to_convert_given_schema
> col = df[name]
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/frame.py",
>  line 2688, in __getitem__
> return self._getitem_column(key)
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/frame.py",
>  line 2695, in _getitem_column
> return self._get_item_cache(key)
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/generic.py",
>  line 2489, in _get_item_cache
> values = self._data.get(item)
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/internals.py",
>  line 4115, in get
> loc = self.items.get_loc(item)
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/indexes/base.py",
>  line 3080, in get_loc
> return 

[jira] [Updated] (ARROW-7006) [Rust] Bump flatbuffers version to avoid vulnerability

2019-10-28 Thread Paddy Horan (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paddy Horan updated ARROW-7006:
---
  Component/s: Rust
Fix Version/s: 1.0.0

> [Rust] Bump flatbuffers version to avoid vulnerability
> --
>
> Key: ARROW-7006
> URL: https://issues.apache.org/jira/browse/ARROW-7006
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Affects Versions: 0.15.0
>Reporter: Paddy Horan
>Assignee: Paddy Horan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> From GitHub user emilk:
> [{{cargo audit}}|https://github.com/RustSec/cargo-audit] output:
>  
> {{ID:  RUSTSEC-2019-0028
> Crate: flatbuffers
> Version: 0.5.0
> Date:  2019-10-20
> URL:   https://github.com/google/flatbuffers/issues/5530
> Title: Unsound `impl Follow for bool`}}
> The fix should be as simple as editing 
> [https://github.com/apache/arrow/blob/master/rust/arrow/Cargo.toml] from 
> {{flatbuffers = "0.5.0"}} to {{flatbuffers = "0.6.0"}}
> A more long-term improvement is to add a call to {{cargo audit}} in your CI 
> to catch these problems as early as possible.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-6999) [Python] KeyError: '__index_level_0__' passing Table.from_pandas its own schema

2019-10-28 Thread Tom Goodman (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961317#comment-16961317
 ] 

Tom Goodman edited comment on ARROW-6999 at 10/28/19 7:21 PM:
--

[~jorisvandenbossche] please try this with the attached [^test3.hdf] (not empty)
{code:java}
df2 = pd.read_hdf('test3.hdf','foo')
pa.Table.from_pandas(df2, schema=pa.Table.from_pandas(df2).schema){code}
I still get KeyError: '__index_level_0__' (+without+ specifying 
preserve_index).

This may be because the index on test3.hdf is Int64Index and I see [pyarrow 
docs|https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.from_pandas]
 say default behavior is to "store the index as a column", except for range 
indexes.  This unfortunately makes the bug more prevalent.


was (Author: goodiegoodman):
[~jorisvandenbossche] please try this with the attached [^test3.hdf] (not empty)
{code:java}
df2 = pd.read_hdf('test3.hdf','foo')
pa.Table.from_pandas(df2, schema=pa.Table.from_pandas(df2).schema){code}
I still get KeyError: '__index_level_0__' (+without+ specifying 
preserve_index).

This may be because the index on test3.hdf is Int64Index and I see [pyarrow 
docs|https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.from_pandas]
 say default behavior is to "store the index as a column", except for range 
indexes.

> [Python] KeyError: '__index_level_0__' passing Table.from_pandas its own 
> schema
> ---
>
> Key: ARROW-6999
> URL: https://issues.apache.org/jira/browse/ARROW-6999
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.12.0, 0.13.0, 0.14.0, 0.15.0
> Environment: pandas==0.23.4
> pyarrow==0.15.0  # Issue also with 0.14.0, 0.13.0 & 0.12.0, but not 0.11.0
>Reporter: Tom Goodman
>Priority: Major
> Fix For: 1.0.0
>
> Attachments: test3.hdf
>
>
> Steps to reproduce:
>  # Generate any DataFrame's pyarrow Schema using Table.from_pandas
>  # Pass the generated schema as input into Table.from_pandas
>  # Causes KeyError: '__index_level_0__'
> We did not have this issue with pyarrow==0.11.0 which we used to write many 
> partitions across years.  Our goal now is to use pyarrow==0.15.0 and produce 
> schema going forward that are *backwards compatible* (i.e. also have 
> '__index_level_0__'), so we should not need to re-generate all prior years' 
> partitions when we migrate to 0.15.0.
> We cannot set _preserve_index=False_, since that effectively deletes 
> '__index_level_0__', causing inconsistent schema across earlier partitions 
> that had been written using pyarrow==0.11.0.
>  
> {code:java}
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame() 
> schema = pa.Table.from_pandas(df).schema
> pa_table = pa.Table.from_pandas(df, schema=schema)
> {code}
> {noformat}
> Traceback (most recent call last):
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/indexes/base.py",
>  line 3078, in get_loc
> return self._engine.get_loc(key)
>   File "pandas/_libs/index.pyx", line 140, in 
> pandas._libs.index.IndexEngine.get_loc
>   File "pandas/_libs/index.pyx", line 162, in 
> pandas._libs.index.IndexEngine.get_loc
>   File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in 
> pandas._libs.hashtable.PyObjectHashTable.get_item
>   File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in 
> pandas._libs.hashtable.PyObjectHashTable.get_item
> KeyError: '__index_level_0__'
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pyarrow/pandas_compat.py",
>  line 408, in _get_columns_to_convert_given_schema
> col = df[name]
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/frame.py",
>  line 2688, in __getitem__
> return self._getitem_column(key)
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/frame.py",
>  line 2695, in _getitem_column
> return self._get_item_cache(key)
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/generic.py",
>  line 2489, in _get_item_cache
> values = self._data.get(item)
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/internals.py",
>  line 4115, in get
> loc = self.items.get_loc(item)
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/indexes/base.py",
>  line 3080, in get_loc
> return self._engine.get_loc(self._maybe_cast_indexer(key))
>   File 

[jira] [Comment Edited] (ARROW-6999) [Python] KeyError: '__index_level_0__' passing Table.from_pandas its own schema

2019-10-28 Thread Tom Goodman (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961317#comment-16961317
 ] 

Tom Goodman edited comment on ARROW-6999 at 10/28/19 7:15 PM:
--

[~jorisvandenbossche] please try this with the attached [^test3.hdf] (not empty)
{code:java}
df2 = pd.read_hdf('test3.hdf','foo')
pa.Table.from_pandas(df2, schema=pa.Table.from_pandas(df2).schema){code}
I still get KeyError: '__index_level_0__' (+without+ specifying 
preserve_index).

This may be because the index on test3.hdf is Int64Index and I see [pyarrow 
docs|https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.from_pandas]
 say default behavior is to "store the index as a column", except for range 
indexes.


was (Author: goodiegoodman):
[~jorisvandenbossche] please try this with the attached [^test3.hdf] (not empty)
{code:java}
df2 = pd.read_hdf('test3.hdf','foo')
pa.Table.from_pandas(df2, schema=pa.Table.from_pandas(df2).schema){code}
I still get KeyError: '__index_level_0__' (+without+ specifying 
preserve_index), do you?

> [Python] KeyError: '__index_level_0__' passing Table.from_pandas its own 
> schema
> ---
>
> Key: ARROW-6999
> URL: https://issues.apache.org/jira/browse/ARROW-6999
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.12.0, 0.13.0, 0.14.0, 0.15.0
> Environment: pandas==0.23.4
> pyarrow==0.15.0  # Issue also with 0.14.0, 0.13.0 & 0.12.0, but not 0.11.0
>Reporter: Tom Goodman
>Priority: Major
> Fix For: 1.0.0
>
> Attachments: test3.hdf
>
>
> Steps to reproduce:
>  # Generate any DataFrame's pyarrow Schema using Table.from_pandas
>  # Pass the generated schema as input into Table.from_pandas
>  # Causes KeyError: '__index_level_0__'
> We did not have this issue with pyarrow==0.11.0 which we used to write many 
> partitions across years.  Our goal now is to use pyarrow==0.15.0 and produce 
> schema going forward that are *backwards compatible* (i.e. also have 
> '__index_level_0__'), so we should not need to re-generate all prior years' 
> partitions when we migrate to 0.15.0.
> We cannot set _preserve_index=False_, since that effectively deletes 
> '__index_level_0__', causing inconsistent schema across earlier partitions 
> that had been written using pyarrow==0.11.0.
>  
> {code:java}
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame() 
> schema = pa.Table.from_pandas(df).schema
> pa_table = pa.Table.from_pandas(df, schema=schema)
> {code}
> {noformat}
> Traceback (most recent call last):
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/indexes/base.py",
>  line 3078, in get_loc
> return self._engine.get_loc(key)
>   File "pandas/_libs/index.pyx", line 140, in 
> pandas._libs.index.IndexEngine.get_loc
>   File "pandas/_libs/index.pyx", line 162, in 
> pandas._libs.index.IndexEngine.get_loc
>   File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in 
> pandas._libs.hashtable.PyObjectHashTable.get_item
>   File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in 
> pandas._libs.hashtable.PyObjectHashTable.get_item
> KeyError: '__index_level_0__'
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pyarrow/pandas_compat.py",
>  line 408, in _get_columns_to_convert_given_schema
> col = df[name]
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/frame.py",
>  line 2688, in __getitem__
> return self._getitem_column(key)
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/frame.py",
>  line 2695, in _getitem_column
> return self._get_item_cache(key)
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/generic.py",
>  line 2489, in _get_item_cache
> values = self._data.get(item)
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/internals.py",
>  line 4115, in get
> loc = self.items.get_loc(item)
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/indexes/base.py",
>  line 3080, in get_loc
> return self._engine.get_loc(self._maybe_cast_indexer(key))
>   File "pandas/_libs/index.pyx", line 140, in 
> pandas._libs.index.IndexEngine.get_loc
>   File "pandas/_libs/index.pyx", line 162, in 
> pandas._libs.index.IndexEngine.get_loc
>   File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in 
> 

[jira] [Updated] (ARROW-6758) [Release] Install ephemeral node/npm/npx in release verification script

2019-10-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6758:
--
Labels: pull-request-available  (was: )

> [Release] Install ephemeral node/npm/npx in release verification script
> ---
>
> Key: ARROW-6758
> URL: https://issues.apache.org/jira/browse/ARROW-6758
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
>
> Installing node with nvm isn't terribly difficult; adding this to the 
> release verification script would make it easier for people to verify more 
> of the release.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-6758) [Release] Install ephemeral node/npm/npx in release verification script

2019-10-28 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-6758:
---

Assignee: Wes McKinney

> [Release] Install ephemeral node/npm/npx in release verification script
> ---
>
> Key: ARROW-6758
> URL: https://issues.apache.org/jira/browse/ARROW-6758
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>
> Installing node with nvm isn't terribly difficult; adding this to the 
> release verification script would make it easier for people to verify more 
> of the release.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-6999) [Python] KeyError: '__index_level_0__' passing Table.from_pandas its own schema

2019-10-28 Thread Tom Goodman (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961317#comment-16961317
 ] 

Tom Goodman edited comment on ARROW-6999 at 10/28/19 6:32 PM:
--

[~jorisvandenbossche] please try this with the attached [^test3.hdf] (not empty)
{code:java}
df2 = pd.read_hdf('test3.hdf','foo')
pa.Table.from_pandas(df2, schema=pa.Table.from_pandas(df2).schema){code}
I still get KeyError: '__index_level_0__' (+without+ specifying 
preserve_index), do you?


was (Author: goodiegoodman):
[~jorisvandenbossche] please try this with the attached [^test3.hdf] (not empty)
{code:java}
df2 = pd.read_hdf('test3.hdf','foo')
pa.Table.from_pandas(df2, schema=pa.Table.from_pandas(df2).schema){code}
I still get KeyError: '__index_level_0__', do you?

> [Python] KeyError: '__index_level_0__' passing Table.from_pandas its own 
> schema
> ---
>
> Key: ARROW-6999
> URL: https://issues.apache.org/jira/browse/ARROW-6999
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.12.0, 0.13.0, 0.14.0, 0.15.0
> Environment: pandas==0.23.4
> pyarrow==0.15.0  # Issue also with 0.14.0, 0.13.0 & 0.12.0, but not 0.11.0
>Reporter: Tom Goodman
>Priority: Major
> Fix For: 1.0.0
>
> Attachments: test3.hdf
>
>
> Steps to reproduce:
>  # Generate any DataFrame's pyarrow Schema using Table.from_pandas
>  # Pass the generated schema as input into Table.from_pandas
>  # Causes KeyError: '__index_level_0__'
> We did not have this issue with pyarrow==0.11.0 which we used to write many 
> partitions across years.  Our goal now is to use pyarrow==0.15.0 and produce 
> schema going forward that are *backwards compatible* (i.e. also have 
> '__index_level_0__'), so we should not need to re-generate all prior years' 
> partitions when we migrate to 0.15.0.
> We cannot set _preserve_index=False_, since that effectively deletes 
> '__index_level_0__', causing inconsistent schema across earlier partitions 
> that had been written using pyarrow==0.11.0.
>  
> {code:java}
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame() 
> schema = pa.Table.from_pandas(df).schema
> pa_table = pa.Table.from_pandas(df, schema=schema)
> {code}
> {noformat}
> Traceback (most recent call last):
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/indexes/base.py",
>  line 3078, in get_loc
> return self._engine.get_loc(key)
>   File "pandas/_libs/index.pyx", line 140, in 
> pandas._libs.index.IndexEngine.get_loc
>   File "pandas/_libs/index.pyx", line 162, in 
> pandas._libs.index.IndexEngine.get_loc
>   File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in 
> pandas._libs.hashtable.PyObjectHashTable.get_item
>   File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in 
> pandas._libs.hashtable.PyObjectHashTable.get_item
> KeyError: '__index_level_0__'
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pyarrow/pandas_compat.py",
>  line 408, in _get_columns_to_convert_given_schema
> col = df[name]
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/frame.py",
>  line 2688, in __getitem__
> return self._getitem_column(key)
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/frame.py",
>  line 2695, in _getitem_column
> return self._get_item_cache(key)
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/generic.py",
>  line 2489, in _get_item_cache
> values = self._data.get(item)
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/internals.py",
>  line 4115, in get
> loc = self.items.get_loc(item)
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/indexes/base.py",
>  line 3080, in get_loc
> return self._engine.get_loc(self._maybe_cast_indexer(key))
>   File "pandas/_libs/index.pyx", line 140, in 
> pandas._libs.index.IndexEngine.get_loc
>   File "pandas/_libs/index.pyx", line 162, in 
> pandas._libs.index.IndexEngine.get_loc
>   File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in 
> pandas._libs.hashtable.PyObjectHashTable.get_item
>   File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in 
> pandas._libs.hashtable.PyObjectHashTable.get_item
> KeyError: '__index_level_0__'
> During handling of the above exception, another exception occurred:
> Traceback (most recent call 

[jira] [Comment Edited] (ARROW-6999) [Python] KeyError: '__index_level_0__' passing Table.from_pandas its own schema

2019-10-28 Thread Tom Goodman (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961317#comment-16961317
 ] 

Tom Goodman edited comment on ARROW-6999 at 10/28/19 6:13 PM:
--

[~jorisvandenbossche] please try this with the attached [^test3.hdf] (not empty)
{code:java}
df2 = pd.read_hdf('test3.hdf','foo')
pa.Table.from_pandas(df2, schema=pa.Table.from_pandas(df2).schema){code}
I still get KeyError: '__index_level_0__', do you?


was (Author: goodiegoodman):
[~jorisvandenbossche] please try this with the attached test3.hdf (not empty)
{code:java}
df2 = pd.read_hdf('test3.hdf','foo')
pa.Table.from_pandas(df2, schema=pa.Table.from_pandas(df2).schema){code}
I still get KeyError: '__index_level_0__', do you?

> [Python] KeyError: '__index_level_0__' passing Table.from_pandas its own 
> schema
> ---
>
> Key: ARROW-6999
> URL: https://issues.apache.org/jira/browse/ARROW-6999
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.12.0, 0.13.0, 0.14.0, 0.15.0
> Environment: pandas==0.23.4
> pyarrow==0.15.0  # Issue also with 0.14.0, 0.13.0 & 0.12.0, but not 0.11.0
>Reporter: Tom Goodman
>Priority: Major
> Fix For: 1.0.0
>
> Attachments: test3.hdf
>
>
> Steps to reproduce:
>  # Generate any DataFrame's pyarrow Schema using Table.from_pandas
>  # Pass the generated schema as input into Table.from_pandas
>  # Causes KeyError: '__index_level_0__'
> We did not have this issue with pyarrow==0.11.0 which we used to write many 
> partitions across years.  Our goal now is to use pyarrow==0.15.0 and produce 
> schema going forward that are *backwards compatible* (i.e. also have 
> '__index_level_0__'), so we should not need to re-generate all prior years' 
> partitions when we migrate to 0.15.0.
> We cannot set _preserve_index=False_, since that effectively deletes 
> '__index_level_0__', causing inconsistent schema across earlier partitions 
> that had been written using pyarrow==0.11.0.
>  
> {code:java}
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame() 
> schema = pa.Table.from_pandas(df).schema
> pa_table = pa.Table.from_pandas(df, schema=schema)
> {code}
> {noformat}
> Traceback (most recent call last):
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/indexes/base.py",
>  line 3078, in get_loc
> return self._engine.get_loc(key)
>   File "pandas/_libs/index.pyx", line 140, in 
> pandas._libs.index.IndexEngine.get_loc
>   File "pandas/_libs/index.pyx", line 162, in 
> pandas._libs.index.IndexEngine.get_loc
>   File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in 
> pandas._libs.hashtable.PyObjectHashTable.get_item
>   File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in 
> pandas._libs.hashtable.PyObjectHashTable.get_item
> KeyError: '__index_level_0__'
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pyarrow/pandas_compat.py",
>  line 408, in _get_columns_to_convert_given_schema
> col = df[name]
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/frame.py",
>  line 2688, in __getitem__
> return self._getitem_column(key)
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/frame.py",
>  line 2695, in _getitem_column
> return self._get_item_cache(key)
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/generic.py",
>  line 2489, in _get_item_cache
> values = self._data.get(item)
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/internals.py",
>  line 4115, in get
> loc = self.items.get_loc(item)
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/indexes/base.py",
>  line 3080, in get_loc
> return self._engine.get_loc(self._maybe_cast_indexer(key))
>   File "pandas/_libs/index.pyx", line 140, in 
> pandas._libs.index.IndexEngine.get_loc
>   File "pandas/_libs/index.pyx", line 162, in 
> pandas._libs.index.IndexEngine.get_loc
>   File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in 
> pandas._libs.hashtable.PyObjectHashTable.get_item
>   File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in 
> pandas._libs.hashtable.PyObjectHashTable.get_item
> KeyError: '__index_level_0__'
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File 
> 

[jira] [Commented] (ARROW-6999) [Python] KeyError: '__index_level_0__' passing Table.from_pandas its own schema

2019-10-28 Thread Tom Goodman (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961317#comment-16961317
 ] 

Tom Goodman commented on ARROW-6999:


[~jorisvandenbossche] please try this with the attached test3.hdf (not empty)
{code:java}
df2 = pd.read_hdf('test3.hdf','foo')
pa.Table.from_pandas(df2, schema=pa.Table.from_pandas(df2).schema){code}
I still get KeyError: '__index_level_0__', do you?

> [Python] KeyError: '__index_level_0__' passing Table.from_pandas its own 
> schema
> ---
>
> Key: ARROW-6999
> URL: https://issues.apache.org/jira/browse/ARROW-6999
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.12.0, 0.13.0, 0.14.0, 0.15.0
> Environment: pandas==0.23.4
> pyarrow==0.15.0  # Issue also with 0.14.0, 0.13.0 & 0.12.0. but not 0.11.0
>Reporter: Tom Goodman
>Priority: Major
> Fix For: 1.0.0
>
> Attachments: test3.hdf
>
>
> Steps to reproduce:
>  # Generate any DataFrame's pyarrow Schema using Table.from_pandas
>  # Pass the generated schema as input into Table.from_pandas
>  # Causes KeyError: '__index_level_0__'
> We did not have this issue with pyarrow==0.11.0 which we used to write many 
> partitions across years.  Our goal now is to use pyarrow==0.15.0 and produce 
> schema going forward that are *backwards compatible* (i.e. also have 
> '__index_level_0__'), so we should not need to re-generate all prior years' 
> partitions when we migrate to 0.15.0.
> We cannot set _preserve_index=False_, since that effectively deletes 
> '__index_level_0__', causing inconsistent schema across earlier partitions 
> that had been written using pyarrow==0.11.0.
>  
> {code:java}
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame() 
> schema = pa.Table.from_pandas(df).schema
> pa_table = pa.Table.from_pandas(df, schema=schema)
> {code}
> {noformat}
> Traceback (most recent call last):
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/indexes/base.py",
>  line 3078, in get_loc
> return self._engine.get_loc(key)
>   File "pandas/_libs/index.pyx", line 140, in 
> pandas._libs.index.IndexEngine.get_loc
>   File "pandas/_libs/index.pyx", line 162, in 
> pandas._libs.index.IndexEngine.get_loc
>   File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in 
> pandas._libs.hashtable.PyObjectHashTable.get_item
>   File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in 
> pandas._libs.hashtable.PyObjectHashTable.get_item
> KeyError: '__index_level_0__'
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pyarrow/pandas_compat.py",
>  line 408, in _get_columns_to_convert_given_schema
> col = df[name]
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/frame.py",
>  line 2688, in __getitem__
> return self._getitem_column(key)
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/frame.py",
>  line 2695, in _getitem_column
> return self._get_item_cache(key)
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/generic.py",
>  line 2489, in _get_item_cache
> values = self._data.get(item)
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/internals.py",
>  line 4115, in get
> loc = self.items.get_loc(item)
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/indexes/base.py",
>  line 3080, in get_loc
> return self._engine.get_loc(self._maybe_cast_indexer(key))
>   File "pandas/_libs/index.pyx", line 140, in 
> pandas._libs.index.IndexEngine.get_loc
>   File "pandas/_libs/index.pyx", line 162, in 
> pandas._libs.index.IndexEngine.get_loc
>   File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in 
> pandas._libs.hashtable.PyObjectHashTable.get_item
>   File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in 
> pandas._libs.hashtable.PyObjectHashTable.get_item
> KeyError: '__index_level_0__'
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/IPython/core/interactiveshell.py",
>  line 3326, in run_code
> exec(code_obj, self.user_global_ns, self.user_ns)
>   File "", line 5, in 
> pa_table = pa.Table.from_pandas(df, 
> schema=pa.Table.from_pandas(df).schema)
>   File "pyarrow/table.pxi", line 1057, in 

[jira] [Updated] (ARROW-6999) [Python] KeyError: '__index_level_0__' passing Table.from_pandas its own schema

2019-10-28 Thread Tom Goodman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom Goodman updated ARROW-6999:
---
Attachment: test3.hdf

> [Python] KeyError: '__index_level_0__' passing Table.from_pandas its own 
> schema
> ---
>
> Key: ARROW-6999
> URL: https://issues.apache.org/jira/browse/ARROW-6999
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.12.0, 0.13.0, 0.14.0, 0.15.0
> Environment: pandas==0.23.4
> pyarrow==0.15.0  # Issue also with 0.14.0, 0.13.0 & 0.12.0. but not 0.11.0
>Reporter: Tom Goodman
>Priority: Major
> Fix For: 1.0.0
>
> Attachments: test3.hdf
>
>
> Steps to reproduce:
>  # Generate any DataFrame's pyarrow Schema using Table.from_pandas
>  # Pass the generated schema as input into Table.from_pandas
>  # Causes KeyError: '__index_level_0__'
> We did not have this issue with pyarrow==0.11.0 which we used to write many 
> partitions across years.  Our goal now is to use pyarrow==0.15.0 and produce 
> schema going forward that are *backwards compatible* (i.e. also have 
> '__index_level_0__'), so we should not need to re-generate all prior years' 
> partitions when we migrate to 0.15.0.
> We cannot set _preserve_index=False_, since that effectively deletes 
> '__index_level_0__', causing inconsistent schema across earlier partitions 
> that had been written using pyarrow==0.11.0.
>  
> {code:java}
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame() 
> schema = pa.Table.from_pandas(df).schema
> pa_table = pa.Table.from_pandas(df, schema=schema)
> {code}
> {noformat}
> Traceback (most recent call last):
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/indexes/base.py",
>  line 3078, in get_loc
> return self._engine.get_loc(key)
>   File "pandas/_libs/index.pyx", line 140, in 
> pandas._libs.index.IndexEngine.get_loc
>   File "pandas/_libs/index.pyx", line 162, in 
> pandas._libs.index.IndexEngine.get_loc
>   File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in 
> pandas._libs.hashtable.PyObjectHashTable.get_item
>   File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in 
> pandas._libs.hashtable.PyObjectHashTable.get_item
> KeyError: '__index_level_0__'
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pyarrow/pandas_compat.py",
>  line 408, in _get_columns_to_convert_given_schema
> col = df[name]
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/frame.py",
>  line 2688, in __getitem__
> return self._getitem_column(key)
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/frame.py",
>  line 2695, in _getitem_column
> return self._get_item_cache(key)
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/generic.py",
>  line 2489, in _get_item_cache
> values = self._data.get(item)
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/internals.py",
>  line 4115, in get
> loc = self.items.get_loc(item)
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/indexes/base.py",
>  line 3080, in get_loc
> return self._engine.get_loc(self._maybe_cast_indexer(key))
>   File "pandas/_libs/index.pyx", line 140, in 
> pandas._libs.index.IndexEngine.get_loc
>   File "pandas/_libs/index.pyx", line 162, in 
> pandas._libs.index.IndexEngine.get_loc
>   File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in 
> pandas._libs.hashtable.PyObjectHashTable.get_item
>   File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in 
> pandas._libs.hashtable.PyObjectHashTable.get_item
> KeyError: '__index_level_0__'
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/IPython/core/interactiveshell.py",
>  line 3326, in run_code
> exec(code_obj, self.user_global_ns, self.user_ns)
>   File "", line 5, in 
> pa_table = pa.Table.from_pandas(df, 
> schema=pa.Table.from_pandas(df).schema)
>   File "pyarrow/table.pxi", line 1057, in pyarrow.lib.Table.from_pandas
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pyarrow/pandas_compat.py",
>  line 517, in dataframe_to_arrays
> columns)
>   File 
> 

[jira] [Commented] (ARROW-6999) [Python] KeyError: '__index_level_0__' passing Table.from_pandas its own schema

2019-10-28 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961145#comment-16961145
 ] 

Joris Van den Bossche commented on ARROW-6999:
--

So this case is clearly a bug in the new implementation, I would say:

{code}
In [23]: import pandas as pd 
...: import pyarrow as pa 
...: df = pd.DataFrame({'a': [1, 2, 3]})  
...: schema = pa.Table.from_pandas(df, preserve_index=True).schema 
...: pa_table = pa.Table.from_pandas(df, schema=schema, 
preserve_index=True)
   
...
KeyError: "name '__index_level_0__' present in the specified schema is not 
found in the columns or index"
{code}

So if you specify {{preserve_index=True}}, and there is an index in the schema 
that did not have a name in the DataFrame (so ending up as the generated 
{{\_\_index_level_i\_\_}}), the above should work when passing an explicit 
schema matching that.
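
In the meantime, a possible workaround, sketched under the assumption that 
explicitly named index levels are looked up correctly: give the index a name 
before generating the schema, so the generated {{\_\_index_level_0\_\_}} 
placeholder never appears:

{code}
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'a': [1, 2, 3]})
df.index.name = 'idx'  # illustrative name; any explicit name avoids the placeholder
schema = pa.Table.from_pandas(df, preserve_index=True).schema
pa_table = pa.Table.from_pandas(df, schema=schema, preserve_index=True)
{code}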

Will look into fixing this (it's a pity that 0.15.1 is already released, it 
would have been nice to include this).

> [Python] KeyError: '__index_level_0__' passing Table.from_pandas its own 
> schema
> ---
>
> Key: ARROW-6999
> URL: https://issues.apache.org/jira/browse/ARROW-6999
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.12.0, 0.13.0, 0.14.0, 0.15.0
> Environment: pandas==0.23.4
> pyarrow==0.15.0  # Issue also with 0.14.0, 0.13.0 & 0.12.0. but not 0.11.0
>Reporter: Tom Goodman
>Priority: Major
> Fix For: 1.0.0
>
>
> Steps to reproduce:
>  # Generate any DataFrame's pyarrow Schema using Table.from_pandas
>  # Pass the generated schema as input into Table.from_pandas
>  # Causes KeyError: '__index_level_0__'
> We did not have this issue with pyarrow==0.11.0 which we used to write many 
> partitions across years.  Our goal now is to use pyarrow==0.15.0 and produce 
> schema going forward that are *backwards compatible* (i.e. also have 
> '__index_level_0__'), so we should not need to re-generate all prior years' 
> partitions when we migrate to 0.15.0.
> We cannot set _preserve_index=False_, since that effectively deletes 
> '__index_level_0__', causing inconsistent schema across earlier partitions 
> that had been written using pyarrow==0.11.0.
>  
> {code:java}
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame() 
> schema = pa.Table.from_pandas(df).schema
> pa_table = pa.Table.from_pandas(df, schema=schema)
> {code}
> {noformat}
> Traceback (most recent call last):
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/indexes/base.py",
>  line 3078, in get_loc
> return self._engine.get_loc(key)
>   File "pandas/_libs/index.pyx", line 140, in 
> pandas._libs.index.IndexEngine.get_loc
>   File "pandas/_libs/index.pyx", line 162, in 
> pandas._libs.index.IndexEngine.get_loc
>   File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in 
> pandas._libs.hashtable.PyObjectHashTable.get_item
>   File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in 
> pandas._libs.hashtable.PyObjectHashTable.get_item
> KeyError: '__index_level_0__'
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pyarrow/pandas_compat.py",
>  line 408, in _get_columns_to_convert_given_schema
> col = df[name]
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/frame.py",
>  line 2688, in __getitem__
> return self._getitem_column(key)
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/frame.py",
>  line 2695, in _getitem_column
> return self._get_item_cache(key)
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/generic.py",
>  line 2489, in _get_item_cache
> values = self._data.get(item)
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/internals.py",
>  line 4115, in get
> loc = self.items.get_loc(item)
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/indexes/base.py",
>  line 3080, in get_loc
> return self._engine.get_loc(self._maybe_cast_indexer(key))
>   File "pandas/_libs/index.pyx", line 140, in 
> pandas._libs.index.IndexEngine.get_loc
>   File "pandas/_libs/index.pyx", line 162, in 
> pandas._libs.index.IndexEngine.get_loc
>   File "pandas/_libs/hashtable_class_helper.pxi", 

[jira] [Assigned] (ARROW-7008) [Python] pyarrow.chunked_array([array]) fails on array with all-None buffers

2019-10-28 Thread Uwe Korn (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Korn reassigned ARROW-7008:
---

Assignee: Uwe Korn

> [Python] pyarrow.chunked_array([array]) fails on array with all-None buffers
> 
>
> Key: ARROW-7008
> URL: https://issues.apache.org/jira/browse/ARROW-7008
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
>Reporter: Uwe Korn
>Assignee: Uwe Korn
>Priority: Major
>
> Minimal reproducer:
> {code}
> import pyarrow as pa
> pa.chunked_array([pa.array([], 
> type=pa.string()).dictionary_encode().dictionary])
> {code}
> Traceback
> {code}
> (lldb) bt
> * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS 
> (code=1, address=0x20)
>   * frame #0: 0x000112cd5d0e libarrow.15.dylib`arrow::Status 
> arrow::internal::ValidateVisitor::ValidateOffsets<arrow::BinaryArray const>(arrow::BinaryArray const&) + 94
> frame #1: 0x000112cc79a3 libarrow.15.dylib`arrow::Status 
> arrow::VisitArrayInline<arrow::internal::ValidateVisitor>(arrow::Array 
> const&, arrow::internal::ValidateVisitor*) + 915
> frame #2: 0x000112cc747d libarrow.15.dylib`arrow::Array::Validate() 
> const + 829
> frame #3: 0x000112e3ea19 
> libarrow.15.dylib`arrow::ChunkedArray::Validate() const + 89
> frame #4: 0x000112b8eb7d 
> lib.cpython-37m-darwin.so`__pyx_pw_7pyarrow_3lib_135chunked_array(_object*, 
> _object*, _object*) + 3661
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7008) [Python] pyarrow.chunked_array([array]) fails on array with all-None buffers

2019-10-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-7008:
--
Labels: pull-request-available  (was: )

> [Python] pyarrow.chunked_array([array]) fails on array with all-None buffers
> 
>
> Key: ARROW-7008
> URL: https://issues.apache.org/jira/browse/ARROW-7008
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
>Reporter: Uwe Korn
>Assignee: Uwe Korn
>Priority: Major
>  Labels: pull-request-available
>
> Minimal reproducer:
> {code}
> import pyarrow as pa
> pa.chunked_array([pa.array([], 
> type=pa.string()).dictionary_encode().dictionary])
> {code}
> Traceback
> {code}
> (lldb) bt
> * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS 
> (code=1, address=0x20)
>   * frame #0: 0x000112cd5d0e libarrow.15.dylib`arrow::Status 
> arrow::internal::ValidateVisitor::ValidateOffsets<arrow::BinaryArray const>(arrow::BinaryArray const&) + 94
> frame #1: 0x000112cc79a3 libarrow.15.dylib`arrow::Status 
> arrow::VisitArrayInline<arrow::internal::ValidateVisitor>(arrow::Array 
> const&, arrow::internal::ValidateVisitor*) + 915
> frame #2: 0x000112cc747d libarrow.15.dylib`arrow::Array::Validate() 
> const + 829
> frame #3: 0x000112e3ea19 
> libarrow.15.dylib`arrow::ChunkedArray::Validate() const + 89
> frame #4: 0x000112b8eb7d 
> lib.cpython-37m-darwin.so`__pyx_pw_7pyarrow_3lib_135chunked_array(_object*, 
> _object*, _object*) + 3661
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7008) [Python] pyarrow.chunked_array([array]) fails on array with all-None buffers

2019-10-28 Thread Uwe Korn (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961084#comment-16961084
 ] 

Uwe Korn commented on ARROW-7008:
-

No, this is a different one and I can reproduce with 0.15 and master.

> [Python] pyarrow.chunked_array([array]) fails on array with all-None buffers
> 
>
> Key: ARROW-7008
> URL: https://issues.apache.org/jira/browse/ARROW-7008
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
>Reporter: Uwe Korn
>Priority: Major
>
> Minimal reproducer:
> {code}
> import pyarrow as pa
> pa.chunked_array([pa.array([], 
> type=pa.string()).dictionary_encode().dictionary])
> {code}
> Traceback
> {code}
> (lldb) bt
> * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS 
> (code=1, address=0x20)
>   * frame #0: 0x000112cd5d0e libarrow.15.dylib`arrow::Status 
> arrow::internal::ValidateVisitor::ValidateOffsets<arrow::BinaryArray const>(arrow::BinaryArray const&) + 94
> frame #1: 0x000112cc79a3 libarrow.15.dylib`arrow::Status 
> arrow::VisitArrayInline<arrow::internal::ValidateVisitor>(arrow::Array 
> const&, arrow::internal::ValidateVisitor*) + 915
> frame #2: 0x000112cc747d libarrow.15.dylib`arrow::Array::Validate() 
> const + 829
> frame #3: 0x000112e3ea19 
> libarrow.15.dylib`arrow::ChunkedArray::Validate() const + 89
> frame #4: 0x000112b8eb7d 
> lib.cpython-37m-darwin.so`__pyx_pw_7pyarrow_3lib_135chunked_array(_object*, 
> _object*, _object*) + 3661
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7008) [Python] pyarrow.chunked_array([array]) fails on array with all-None buffers

2019-10-28 Thread Artem KOZHEVNIKOV (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961079#comment-16961079
 ] 

Artem KOZHEVNIKOV commented on ARROW-7008:
--

is 

> [Python] pyarrow.chunked_array([array]) fails on array with all-None buffers
> 
>
> Key: ARROW-7008
> URL: https://issues.apache.org/jira/browse/ARROW-7008
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
>Reporter: Uwe Korn
>Priority: Major
>
> Minimal reproducer:
> {code}
> import pyarrow as pa
> pa.chunked_array([pa.array([], 
> type=pa.string()).dictionary_encode().dictionary])
> {code}
> Traceback
> {code}
> (lldb) bt
> * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS 
> (code=1, address=0x20)
>   * frame #0: 0x000112cd5d0e libarrow.15.dylib`arrow::Status 
> arrow::internal::ValidateVisitor::ValidateOffsets<arrow::BinaryArray const>(arrow::BinaryArray const&) + 94
> frame #1: 0x000112cc79a3 libarrow.15.dylib`arrow::Status 
> arrow::VisitArrayInline<arrow::internal::ValidateVisitor>(arrow::Array 
> const&, arrow::internal::ValidateVisitor*) + 915
> frame #2: 0x000112cc747d libarrow.15.dylib`arrow::Array::Validate() 
> const + 829
> frame #3: 0x000112e3ea19 
> libarrow.15.dylib`arrow::ChunkedArray::Validate() const + 89
> frame #4: 0x000112b8eb7d 
> lib.cpython-37m-darwin.so`__pyx_pw_7pyarrow_3lib_135chunked_array(_object*, 
> _object*, _object*) + 3661
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-7008) [Python] pyarrow.chunked_array([array]) fails on array with all-None buffers

2019-10-28 Thread Artem KOZHEVNIKOV (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961079#comment-16961079
 ] 

Artem KOZHEVNIKOV edited comment on ARROW-7008 at 10/28/19 2:10 PM:


is it the same issue as https://issues.apache.org/jira/browse/ARROW-6857 ?


was (Author: artemk):
is 

> [Python] pyarrow.chunked_array([array]) fails on array with all-None buffers
> 
>
> Key: ARROW-7008
> URL: https://issues.apache.org/jira/browse/ARROW-7008
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
>Reporter: Uwe Korn
>Priority: Major
>
> Minimal reproducer:
> {code}
> import pyarrow as pa
> pa.chunked_array([pa.array([], 
> type=pa.string()).dictionary_encode().dictionary])
> {code}
> Traceback
> {code}
> (lldb) bt
> * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS 
> (code=1, address=0x20)
>   * frame #0: 0x000112cd5d0e libarrow.15.dylib`arrow::Status 
> arrow::internal::ValidateVisitor::ValidateOffsets<arrow::BinaryArray const>(arrow::BinaryArray const&) + 94
> frame #1: 0x000112cc79a3 libarrow.15.dylib`arrow::Status 
> arrow::VisitArrayInline<arrow::internal::ValidateVisitor>(arrow::Array 
> const&, arrow::internal::ValidateVisitor*) + 915
> frame #2: 0x000112cc747d libarrow.15.dylib`arrow::Array::Validate() 
> const + 829
> frame #3: 0x000112e3ea19 
> libarrow.15.dylib`arrow::ChunkedArray::Validate() const + 89
> frame #4: 0x000112b8eb7d 
> lib.cpython-37m-darwin.so`__pyx_pw_7pyarrow_3lib_135chunked_array(_object*, 
> _object*, _object*) + 3661
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7008) [Python] pyarrow.chunked_array([array]) fails on array with all-None buffers

2019-10-28 Thread Uwe Korn (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Korn updated ARROW-7008:

Summary: [Python] pyarrow.chunked_array([array]) fails on array with 
all-None buffers  (was: [Python] pyarrow.chunked_array([array]) fails on array 
with )

> [Python] pyarrow.chunked_array([array]) fails on array with all-None buffers
> 
>
> Key: ARROW-7008
> URL: https://issues.apache.org/jira/browse/ARROW-7008
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
>Reporter: Uwe Korn
>Priority: Major
>
> Minimal reproducer:
> {code}
> import pyarrow as pa
> pa.chunked_array([pa.array([], 
> type=pa.string()).dictionary_encode().dictionary])
> {code}
> Traceback
> {code}
> (lldb) bt
> * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS 
> (code=1, address=0x20)
>   * frame #0: 0x000112cd5d0e libarrow.15.dylib`arrow::Status 
> arrow::internal::ValidateVisitor::ValidateOffsets<arrow::BinaryArray const>(arrow::BinaryArray const&) + 94
> frame #1: 0x000112cc79a3 libarrow.15.dylib`arrow::Status 
> arrow::VisitArrayInline<arrow::internal::ValidateVisitor>(arrow::Array 
> const&, arrow::internal::ValidateVisitor*) + 915
> frame #2: 0x000112cc747d libarrow.15.dylib`arrow::Array::Validate() 
> const + 829
> frame #3: 0x000112e3ea19 
> libarrow.15.dylib`arrow::ChunkedArray::Validate() const + 89
> frame #4: 0x000112b8eb7d 
> lib.cpython-37m-darwin.so`__pyx_pw_7pyarrow_3lib_135chunked_array(_object*, 
> _object*, _object*) + 3661
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7008) [Python] pyarrow.chunked_array([array]) fails on array with

2019-10-28 Thread Uwe Korn (Jira)
Uwe Korn created ARROW-7008:
---

 Summary: [Python] pyarrow.chunked_array([array]) fails on array 
with 
 Key: ARROW-7008
 URL: https://issues.apache.org/jira/browse/ARROW-7008
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.15.0
Reporter: Uwe Korn


Minimal reproducer:

{code}
import pyarrow as pa

pa.chunked_array([pa.array([], 
type=pa.string()).dictionary_encode().dictionary])
{code}

Traceback

{code}
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS 
(code=1, address=0x20)
  * frame #0: 0x000112cd5d0e libarrow.15.dylib`arrow::Status 
arrow::internal::ValidateVisitor::ValidateOffsets<arrow::BinaryArray const>(arrow::BinaryArray const&) + 94
frame #1: 0x000112cc79a3 libarrow.15.dylib`arrow::Status 
arrow::VisitArrayInline<arrow::internal::ValidateVisitor>(arrow::Array const&, 
arrow::internal::ValidateVisitor*) + 915
frame #2: 0x000112cc747d libarrow.15.dylib`arrow::Array::Validate() 
const + 829
frame #3: 0x000112e3ea19 
libarrow.15.dylib`arrow::ChunkedArray::Validate() const + 89
frame #4: 0x000112b8eb7d 
lib.cpython-37m-darwin.so`__pyx_pw_7pyarrow_3lib_135chunked_array(_object*, 
_object*, _object*) + 3661
{code}
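
A hedged expansion of the reproducer, making the all-None buffers from the 
(later-updated) summary visible; the exact buffer layout on 0.15 is an 
assumption:

{code}
import pyarrow as pa

dictionary = pa.array([], type=pa.string()).dictionary_encode().dictionary
print(dictionary.buffers())       # all entries None on 0.15, e.g. [None, None, None]
# pa.chunked_array([dictionary])  # crashes (EXC_BAD_ACCESS) before the fix
{code}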



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7007) [C++] Enable mmap option for LocalFs

2019-10-28 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-7007:
--
Component/s: C++

> [C++] Enable mmap option for LocalFs
> 
>
> Key: ARROW-7007
> URL: https://issues.apache.org/jira/browse/ARROW-7007
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Francois Saint-Jacques
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7007) [C++] Enable mmap option for LocalFs

2019-10-28 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-7007:
-

 Summary: [C++] Enable mmap option for LocalFs
 Key: ARROW-7007
 URL: https://issues.apache.org/jira/browse/ARROW-7007
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Francois Saint-Jacques






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7006) [Rust] Bump flatbuffers version to avoid vulnerability

2019-10-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-7006:
--
Labels: pull-request-available  (was: )

> [Rust] Bump flatbuffers version to avoid vulnerability
> --
>
> Key: ARROW-7006
> URL: https://issues.apache.org/jira/browse/ARROW-7006
> Project: Apache Arrow
>  Issue Type: Improvement
>Affects Versions: 0.15.0
>Reporter: Paddy Horan
>Assignee: Paddy Horan
>Priority: Major
>  Labels: pull-request-available
>
> From GitHub user emilk:
> [{{cargo audit}}|https://github.com/RustSec/cargo-audit] output:
>  
> {{ID:  RUSTSEC-2019-0028
> Crate: flatbuffers
> Version: 0.5.0
> Date:  2019-10-20
> URL:   https://github.com/google/flatbuffers/issues/5530
> Title: Unsound `impl Follow for bool`}}
> The fix should be as simple as editing 
> [https://github.com/apache/arrow/blob/master/rust/arrow/Cargo.toml] from 
> {{flatbuffers = "0.5.0"}} to {{flatbuffers = "0.6.0"}}
> A more long-term improvement is to add a call to {{cargo audit}} in your CI to 
> catch these problems as early as possible
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-7006) [Rust] Bump flatbuffers version to avoid vulnerability

2019-10-28 Thread Paddy Horan (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paddy Horan reassigned ARROW-7006:
--

Assignee: Paddy Horan

> [Rust] Bump flatbuffers version to avoid vulnerability
> --
>
> Key: ARROW-7006
> URL: https://issues.apache.org/jira/browse/ARROW-7006
> Project: Apache Arrow
>  Issue Type: Improvement
>Affects Versions: 0.15.0
>Reporter: Paddy Horan
>Assignee: Paddy Horan
>Priority: Major
>
> From GitHub user emilk:
> [{{cargo audit}}|https://github.com/RustSec/cargo-audit] output:
>  
> {{ID:  RUSTSEC-2019-0028
> Crate: flatbuffers
> Version: 0.5.0
> Date:  2019-10-20
> URL:   https://github.com/google/flatbuffers/issues/5530
> Title: Unsound `impl Follow for bool`}}
> The fix should be as simple as editing 
> [https://github.com/apache/arrow/blob/master/rust/arrow/Cargo.toml] from 
> {{flatbuffers = "0.5.0"}} to {{flatbuffers = "0.6.0"}}
> A more long-term improvement is to add a call to {{cargo audit}} in your CI to 
> catch these problems as early as possible
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7006) [Rust] Bump flatbuffers version to avoid vulnerability

2019-10-28 Thread Paddy Horan (Jira)
Paddy Horan created ARROW-7006:
--

 Summary: [Rust] Bump flatbuffers version to avoid vulnerability
 Key: ARROW-7006
 URL: https://issues.apache.org/jira/browse/ARROW-7006
 Project: Apache Arrow
  Issue Type: Improvement
Affects Versions: 0.15.0
Reporter: Paddy Horan


From GitHub user emilk:

[{{cargo audit}}|https://github.com/RustSec/cargo-audit] output:

 

{{ID:       RUSTSEC-2019-0028
Crate:    flatbuffers
Version:  0.5.0
Date:     2019-10-20
URL:      https://github.com/google/flatbuffers/issues/5530
Title:    Unsound `impl Follow for bool`}}

The fix should be as simple as editing 
[https://github.com/apache/arrow/blob/master/rust/arrow/Cargo.toml] from 
{{flatbuffers = "0.5.0"}} to {{flatbuffers = "0.6.0"}}

A more long-term improvement is to add a call to {{cargo audit}} in your CI to 
catch these problems as early as possible

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7005) [Rust] run "cargo audit" in CI

2019-10-28 Thread Paddy Horan (Jira)
Paddy Horan created ARROW-7005:
--

 Summary: [Rust] run "cargo audit" in CI
 Key: ARROW-7005
 URL: https://issues.apache.org/jira/browse/ARROW-7005
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Paddy Horan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6999) [Python] KeyError: '__index_level_0__' passing Table.from_pandas its own schema

2019-10-28 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960931#comment-16960931
 ] 

Joris Van den Bossche commented on ARROW-6999:
--

[~goodiegoodman] thanks for the report!

Your "steps to reproduce" actually do work if you do not use an empty dataframe:

{code}
In [15]: import pandas as pd 
...: import pyarrow as pa 
...: df = pd.DataFrame({'a': [1, 2, 3]})  
...: schema = pa.Table.from_pandas(df).schema 
...: pa_table = pa.Table.from_pandas(df, schema=schema) 

   

In [16]: schema 

   
Out[16]: 
a: int64
metadata

{b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "'
b'stop": 3, "step": 1}], "column_indexes": [{"name": null, "field_'
b'name": null, "pandas_type": "unicode", "numpy_type": "object", "'
b'metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "a", "f'
b'ield_name": "a", "pandas_type": "int64", "numpy_type": "int64", '
b'"metadata": null}], "creator": {"library": "pyarrow", "version":'
b' "0.15.1.dev177+g5df424bd6"}, "pandas_version": "0.26.0.dev0+669'
b'.g3c29114b1"}'}
{code}

The empty dataframe is a tricky edge case regarding the index, because in such a 
case the index is not a RangeIndex but an empty object-dtype Index (see 
ARROW-5104 for a similar report about that aspect).  
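
A minimal illustration of that difference (reprs as on pandas of that era; an 
assumption, exact output varies by version):

{code}
import pandas as pd

pd.DataFrame({'a': [1, 2, 3]}).index  # RangeIndex(start=0, stop=3, step=1)
pd.DataFrame().index                  # Index([], dtype='object'), not a RangeIndex
{code}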

That said, if passing an explicit schema, and if there is a column not found 
that has a "\_\_index_level_i\_\_" pattern, we should try to handle this 
(certainly in case of passing {{preserve_index=True}}).




> [Python] KeyError: '__index_level_0__' passing Table.from_pandas its own 
> schema
> ---
>
> Key: ARROW-6999
> URL: https://issues.apache.org/jira/browse/ARROW-6999
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.12.0, 0.13.0, 0.14.0, 0.15.0
> Environment: pandas==0.23.4
> pyarrow==0.15.0  # Issue also with 0.14.0, 0.13.0 & 0.12.0. but not 0.11.0
>Reporter: Tom Goodman
>Priority: Major
> Fix For: 1.0.0
>
>
> Steps to reproduce:
>  # Generate any DataFrame's pyarrow Schema using Table.from_pandas
>  # Pass the generated schema as input into Table.from_pandas
>  # Causes KeyError: '__index_level_0__'
> We did not have this issue with pyarrow==0.11.0 which we used to write many 
> partitions across years.  Our goal now is to use pyarrow==0.15.0 and produce 
> schema going forward that are *backwards compatible* (i.e. also have 
> '__index_level_0__'), so we should not need to re-generate all prior years' 
> partitions when we migrate to 0.15.0.
> We cannot set _preserve_index=False_, since that effectively deletes 
> '__index_level_0__', causing inconsistent schema across earlier partitions 
> that had been written using pyarrow==0.11.0.
>  
> {code:java}
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame() 
> schema = pa.Table.from_pandas(df).schema
> pa_table = pa.Table.from_pandas(df, schema=schema)
> {code}
> {noformat}
> Traceback (most recent call last):
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/indexes/base.py",
>  line 3078, in get_loc
> return self._engine.get_loc(key)
>   File "pandas/_libs/index.pyx", line 140, in 
> pandas._libs.index.IndexEngine.get_loc
>   File "pandas/_libs/index.pyx", line 162, in 
> pandas._libs.index.IndexEngine.get_loc
>   File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in 
> pandas._libs.hashtable.PyObjectHashTable.get_item
>   File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in 
> pandas._libs.hashtable.PyObjectHashTable.get_item
> KeyError: '__index_level_0__'
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pyarrow/pandas_compat.py",
>  line 408, in _get_columns_to_convert_given_schema
> col = df[name]
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/frame.py",
>  line 2688, in __getitem__
> return self._getitem_column(key)
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/frame.py",
>  line 2695, in _getitem_column
> return self._get_item_cache(key)
>   File 
> 

[jira] [Commented] (ARROW-5379) [Python] support pandas' nullable Integer type in from_pandas

2019-10-28 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960912#comment-16960912
 ] 

Joris Van den Bossche commented on ARROW-5379:
--

So the pandas -> arrow/feather conversion already works with pandas master and 
the latest Arrow release (0.15).

If you want to use this feature without relying on pandas master, you can use 
this monkeypatch (it is essentially what has been added to the development 
version of pandas):

{code}
import pandas as pd
import pyarrow

pd.arrays.IntegerArray.__arrow_array__ = (
    lambda self, type: pyarrow.array(self._data, mask=self._mask, type=type)
)
{code}
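
A short usage sketch, assuming the monkeypatch above has been applied (values 
are illustrative):

{code}
import pandas as pd
import pyarrow

s = pd.Series([0, None, 1, 23]).astype('Int64')
arr = pyarrow.array(s.array)  # dispatches to the patched __arrow_array__
print(arr)                    # int64 arrow array with one null
{code}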

> [Python] support pandas' nullable Integer type in from_pandas
> -
>
> Key: ARROW-5379
> URL: https://issues.apache.org/jira/browse/ARROW-5379
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Major
>
> From https://github.com/apache/arrow/issues/4168. We should add support for 
> pandas' nullable Integer extension dtypes, as those could map nicely to 
> Arrow's integer types. 
> Ideally this happens in a generic way though, and not specific for this 
> extension type, which is discussed in ARROW-5271



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-7002) Support pandas nullable integer type Int64

2019-10-28 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche closed ARROW-7002.

Resolution: Duplicate

Closing as a duplicate of ARROW-5379

> Support pandas nullable integer type Int64
> --
>
> Key: ARROW-7002
> URL: https://issues.apache.org/jira/browse/ARROW-7002
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Christian Roth
>Priority: Major
>
> Pandas has a nullable integer type Int64 which does not seem to be supported 
> by feather yet.
> {code:python}
> from pyarrow import feather
> import pandas as pd
> col1 = pd.Series([0, None, 1, 23]).astype('Int64')
> col2 = pd.Series([1, 3, 2, 1]).astype('Int64')
> df = pd.DataFrame({'a': col1, 'b': col2})
> feather.write_feather(df, '/tmp/foo')
> {code}
> Gives the following error message:
> {code:java}
> ---
> ArrowTypeError                            Traceback (most recent call last)
>  in 
> ----> 1 feather.write_feather(df, '/tmp/foo')
> ~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/feather.py in 
> write_feather(df, dest)
> 181 writer = FeatherWriter(dest)
> 182 try:
> --> 183 writer.write(df)
> 184 except Exception:
> 185 # Try to make sure the resource is closed
> ~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/feather.py in 
> write(self, df)
>  92 # TODO(wesm): Remove this length check, see ARROW-1732
>  93 if len(df.columns) > 0:
> ---> 94 table = Table.from_pandas(df, preserve_index=False)
>  95 for i, name in enumerate(table.schema.names):
>  96 col = table[i]
> ~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/table.pxi in 
> pyarrow.lib.Table.from_pandas()
> ~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/pandas_compat.py 
> in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
> 551 if nthreads == 1:
> 552 arrays = [convert_column(c, f)
> --> 553   for c, f in zip(columns_to_convert, convert_fields)]
> 554 else:
> 555 from concurrent import futures
> ~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/pandas_compat.py 
> in (.0)
> 551 if nthreads == 1:
> 552 arrays = [convert_column(c, f)
> --> 553   for c, f in zip(columns_to_convert, convert_fields)]
> 554 else:
> 555 from concurrent import futures
> ~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/pandas_compat.py 
> in convert_column(col, field)
> 542 e.args += ("Conversion failed for column {0!s} with type 
> {1!s}"
> 543.format(col.name, col.dtype),)
> --> 544 raise e
> 545 if not field_nullable and result.null_count > 0:
> 546 raise ValueError("Field {} was non-nullable but pandas 
> column "
> ~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/pandas_compat.py 
> in convert_column(col, field)
> 536 
> 537 try:
> --> 538 result = pa.array(col, type=type_, from_pandas=True, 
> safe=safe)
> 539 except (pa.ArrowInvalid,
> 540 pa.ArrowNotImplementedError,
> ArrowTypeError: ('Did not pass numpy.dtype object', 'Conversion failed for 
> column a with type Int64')
> {code}
> xref: 
> [https://stackoverflow.com/questions/58571419/exporting-dataframe-with-null-able-int64-from-pandas-to-r]
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-7002) Support pandas nullable integer type Int64

2019-10-28 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960902#comment-16960902
 ] 

Joris Van den Bossche edited comment on ARROW-7002 at 10/28/19 10:18 AM:
-

Writing is already supported with pandas master and latest arrow (v0.15), so it 
is waiting on the next pandas release to have it in a stable version.

{code}
In [1]: from pyarrow import feather 
   ...: import pandas as pd 
   ...:  
   ...: col1 = pd.Series([0, None, 1, 23]).astype('Int64') 
   ...: col2 = pd.Series([1, 3, 2, 1]).astype('Int64') 
   ...:  
   ...: df = pd.DataFrame({'a': col1, 'b': col2}) 
   ...:  
   ...: feather.write_feather(df, '/tmp/foo') 
   ...: 

   

In [2]: pd.read_feather('/tmp/foo') 

   
Out[2]: 
  a  b
0   0.0  1
1   NaN  3
2   1.0  2
3  23.0  1
{code}

So converting to R should work properly. Reading it back in with Python will 
still give you a float array (if there were NaNs), as that is the default 
conversion of arrow integer to pandas. There is work going on to also preserve 
those specific pandas types in that case (see ARROW-2428).
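
A hedged follow-up sketch for restoring the nullable dtype after reading back 
(assumes a pandas version whose {{astype('Int64')}} accepts floats with missing 
values):

{code}
import pandas as pd

df = pd.read_feather('/tmp/foo')
df = df.astype({'a': 'Int64', 'b': 'Int64'})  # float + NaN back to nullable Int64
{code}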


was (Author: jorisvandenbossche):
Writing is already supported with pandas master and latest arrow (0.15), so it 
is waiting on the next pandas release to have it in a stable version.

{code}
In [1]: from pyarrow import feather 
   ...: import pandas as pd 
   ...:  
   ...: col1 = pd.Series([0, None, 1, 23]).astype('Int64') 
   ...: col2 = pd.Series([1, 3, 2, 1]).astype('Int64') 
   ...:  
   ...: df = pd.DataFrame({'a': col1, 'b': col2}) 
   ...:  
   ...: feather.write_feather(df, '/tmp/foo') 
   ...: 

   

In [2]: pd.read_feather('/tmp/foo') 

   
Out[2]: 
  a  b
0   0.0  1
1   NaN  3
2   1.0  2
3  23.0  1
{code}

Reading it back in will still give you a float array (if there were NaNs), as 
that is the default conversion of arrow integer to pandas. There is work going 
on to also preserve those specific pandas types in that case (see ARROW-2428).

> Support pandas nullable integer type Int64
> --
>
> Key: ARROW-7002
> URL: https://issues.apache.org/jira/browse/ARROW-7002
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Christian Roth
>Priority: Major
>
> Pandas has a nullable integer type Int64 which does not seem to be supported 
> by feather yet.
> {code:python}
> from pyarrow import feather
> import pandas as pd
> col1 = pd.Series([0, None, 1, 23]).astype('Int64')
> col2 = pd.Series([1, 3, 2, 1]).astype('Int64')
> df = pd.DataFrame({'a': col1, 'b': col2})
> feather.write_feather(df, '/tmp/foo')
> {code}
> Gives the following error message:
> {code:java}
> ---
> ArrowTypeError                            Traceback (most recent call last)
>  in 
> ----> 1 feather.write_feather(df, '/tmp/foo')
> ~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/feather.py in 
> write_feather(df, dest)
> 181 writer = FeatherWriter(dest)
> 182 try:
> --> 183 writer.write(df)
> 184 except Exception:
> 185 # Try to make sure the resource is closed
> ~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/feather.py in 
> write(self, df)
>  92 # TODO(wesm): Remove this length check, see ARROW-1732
>  93 if len(df.columns) > 0:
> ---> 94 table = Table.from_pandas(df, preserve_index=False)
>  95 for i, name in enumerate(table.schema.names):
>  96 col = table[i]
> ~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/table.pxi in 
> pyarrow.lib.Table.from_pandas()
> ~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/pandas_compat.py 
> in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
> 551 if nthreads == 1:
> 552 arrays = [convert_column(c, f)
> --> 553   for c, f in zip(columns_to_convert, convert_fields)]
> 554 else:
> 555 from concurrent import futures
> 

[jira] [Commented] (ARROW-7002) Support pandas nullable integer type Int64

2019-10-28 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960902#comment-16960902
 ] 

Joris Van den Bossche commented on ARROW-7002:
--

Writing is already supported with pandas master and latest arrow (0.15), so it 
is waiting on the next pandas release to have it in a stable version.

{code}
In [1]: from pyarrow import feather 
   ...: import pandas as pd 
   ...:  
   ...: col1 = pd.Series([0, None, 1, 23]).astype('Int64') 
   ...: col2 = pd.Series([1, 3, 2, 1]).astype('Int64') 
   ...:  
   ...: df = pd.DataFrame({'a': col1, 'b': col2}) 
   ...:  
   ...: feather.write_feather(df, '/tmp/foo') 
   ...: 

   

In [2]: pd.read_feather('/tmp/foo') 

   
Out[2]: 
  a  b
0   0.0  1
1   NaN  3
2   1.0  2
3  23.0  1
{code}

Reading it back in will still give you a float array (if there were NaNs), as 
that is the default conversion of arrow integer to pandas. There is work going 
on to also preserve those specific pandas types in that case (see ARROW-2428).

> Support pandas nullable integer type Int64
> --
>
> Key: ARROW-7002
> URL: https://issues.apache.org/jira/browse/ARROW-7002
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Christian Roth
>Priority: Major
>
> Pandas has a nullable integer type Int64 which does not seem to be supported 
> by feather yet.
> {code:python}
> from pyarrow import feather
> import pandas as pd
> col1 = pd.Series([0, None, 1, 23]).astype('Int64')
> col2 = pd.Series([1, 3, 2, 1]).astype('Int64')
> df = pd.DataFrame({'a': col1, 'b': col2})
> feather.write_feather(df, '/tmp/foo')
> {code}
> Gives the following error message:
> {code:java}
> ---
> ArrowTypeError                            Traceback (most recent call last)
>  in 
> ----> 1 feather.write_feather(df, '/tmp/foo')
> ~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/feather.py in 
> write_feather(df, dest)
> 181 writer = FeatherWriter(dest)
> 182 try:
> --> 183 writer.write(df)
> 184 except Exception:
> 185 # Try to make sure the resource is closed
> ~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/feather.py in 
> write(self, df)
>  92 # TODO(wesm): Remove this length check, see ARROW-1732
>  93 if len(df.columns) > 0:
> ---> 94 table = Table.from_pandas(df, preserve_index=False)
>  95 for i, name in enumerate(table.schema.names):
>  96 col = table[i]
> ~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/table.pxi in 
> pyarrow.lib.Table.from_pandas()
> ~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/pandas_compat.py 
> in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
> 551 if nthreads == 1:
> 552 arrays = [convert_column(c, f)
> --> 553   for c, f in zip(columns_to_convert, convert_fields)]
> 554 else:
> 555 from concurrent import futures
> ~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/pandas_compat.py 
> in (.0)
> 551 if nthreads == 1:
> 552 arrays = [convert_column(c, f)
> --> 553   for c, f in zip(columns_to_convert, convert_fields)]
> 554 else:
> 555 from concurrent import futures
> ~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/pandas_compat.py 
> in convert_column(col, field)
> 542 e.args += ("Conversion failed for column {0!s} with type 
> {1!s}"
> 543.format(col.name, col.dtype),)
> --> 544 raise e
> 545 if not field_nullable and result.null_count > 0:
> 546 raise ValueError("Field {} was non-nullable but pandas 
> column "
> ~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/pandas_compat.py 
> in convert_column(col, field)
> 536 
> 537 try:
> --> 538 result = pa.array(col, type=type_, from_pandas=True, 
> safe=safe)
> 539 except (pa.ArrowInvalid,
> 540 pa.ArrowNotImplementedError,
> ArrowTypeError: ('Did not pass numpy.dtype object', 'Conversion failed for 
> column a with type Int64')
> {code}
> xref: 
> [https://stackoverflow.com/questions/58571419/exporting-dataframe-with-null-able-int64-from-pandas-to-r]
>   



--
This message was sent by Atlassian Jira

[jira] [Commented] (ARROW-6985) [Python] Steadily increasing time to load file using read_parquet

2019-10-28 Thread Casey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960896#comment-16960896
 ] 

Casey commented on ARROW-6985:
--

So it sounds like this is just a known use case for which parquet is not well 
suited. For my own knowledge, why exactly is the heap fragmenting? Shouldn't 
the heap allocation just grab the same memory that was used in the previous 
iteration?

 

Anyway, happy to have the issue closed as not needed and I'll restructure our 
data to work within these limitations.

> [Python] Steadily increasing time to load file using read_parquet
> -
>
> Key: ARROW-6985
> URL: https://issues.apache.org/jira/browse/ARROW-6985
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0, 0.14.0, 0.15.0
>Reporter: Casey
>Priority: Minor
> Attachments: image-2019-10-25-14-52-46-165.png, 
> image-2019-10-25-14-53-37-623.png, image-2019-10-25-14-54-32-583.png
>
>
> I've noticed that reading from parquet using pandas' read_parquet function is 
> taking steadily longer with each invocation. I've seen the other ticket about 
> memory usage, but I'm seeing no memory impact, just steadily increasing read 
> time until I restart the Python session.
> Below is some code to reproduce my results. I notice it's particularly bad on 
> wide matrices, especially using pyarrow==0.15.0
> {code:python}
> import pyarrow.parquet as pq
> import pyarrow as pa
> import pandas as pd
> import os
> import numpy as np
> import time
> file = "skinny_matrix.pq"
> if not os.path.isfile(file):
> mat = np.zeros((6000, 26000))
> mat.ravel()[::100] = np.random.randn(60 * 26000)
> df = pd.DataFrame(mat.T)
> table = pa.Table.from_pandas(df)
> pq.write_table(table, file)
> n_timings = 50
> timings = np.empty(n_timings)
> for i in range(n_timings):
> start = time.time()
> new_df = pd.read_parquet(file)
> end = time.time()
> timings[i] = end - start
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7004) [Plasma] Make it possible to bump up object in LRU cache

2019-10-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-7004:
--
Labels: pull-request-available  (was: )

> [Plasma] Make it possible to bump up object in LRU cache
> 
>
> Key: ARROW-7004
> URL: https://issues.apache.org/jira/browse/ARROW-7004
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Plasma
>Reporter: Philipp Moritz
>Assignee: Philipp Moritz
>Priority: Major
>  Labels: pull-request-available
>
> To avoid evicting objects too early, we sometimes want to bump a number of 
> objects up in the LRU cache. While it would be possible to call Get() on 
> these objects, this can be undesirable, since Get() blocks if the objects 
> are not available and makes it necessary to call Release() on them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7004) [Plasma] Make it possible to bump up object in LRU cache

2019-10-28 Thread Philipp Moritz (Jira)
Philipp Moritz created ARROW-7004:
-

 Summary: [Plasma] Make it possible to bump up object in LRU cache
 Key: ARROW-7004
 URL: https://issues.apache.org/jira/browse/ARROW-7004
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Plasma
Reporter: Philipp Moritz
Assignee: Philipp Moritz


To avoid evicting objects too early, we sometimes want to bump a number of 
objects up in the LRU cache. While it would be possible to call Get() on these 
objects, this can be undesirable, since Get() blocks if the objects are not 
available and makes it necessary to call Release() on them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-4223) [Python] Support scipy.sparse integration

2019-10-28 Thread Rok Mihevc (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rok Mihevc updated ARROW-4223:
--
Fix Version/s: 1.0.0

> [Python] Support scipy.sparse integration
> -
>
> Key: ARROW-4223
> URL: https://issues.apache.org/jira/browse/ARROW-4223
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Kenta Murata
>Assignee: Rok Mihevc
>Priority: Minor
>  Labels: pull-request-available, sparse
> Fix For: 1.0.0
>
>  Time Spent: 9h 10m
>  Remaining Estimate: 0h
>
> It would be great to support integration with scipy.sparse.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)