[jira] [Resolved] (ARROW-6155) [Java] Extract a super interface for vectors whose elements reside in continuous memory segments

2019-08-08 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-6155.

   Resolution: Fixed
Fix Version/s: 0.15.0

Issue resolved by pull request 5028
[https://github.com/apache/arrow/pull/5028]

> [Java] Extract a super interface for vectors whose elements reside in 
> continuous memory segments
> 
>
> Key: ARROW-6155
> URL: https://issues.apache.org/jira/browse/ARROW-6155
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Vectors whose data elements reside in continuous memory segments should 
> implement a common super interface. This will avoid unnecessary code 
> branches.
> For now, such vectors include fixed-width vectors and variable-width vectors. 
> More vector types can be included in the future.
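The idea can be sketched in a few lines. This is an illustrative Python model, not the Java API the issue proposes; the class and method names (`ContiguousVector`, `value_count`, `data_buffer`) are hypothetical:

```python
from abc import ABC, abstractmethod

# Hypothetical super interface: both fixed-width and variable-width vectors
# expose their values through one contiguous buffer, so callers can handle
# them uniformly instead of branching on the concrete subtype.
class ContiguousVector(ABC):
    @abstractmethod
    def value_count(self) -> int: ...

    @abstractmethod
    def data_buffer(self) -> bytes: ...

class FixedWidthVector(ContiguousVector):
    def __init__(self, width: int, data: bytes):
        self._width, self._data = width, data

    def value_count(self) -> int:
        # Element count follows directly from the fixed element width.
        return len(self._data) // self._width

    def data_buffer(self) -> bytes:
        return self._data

class VariableWidthVector(ContiguousVector):
    def __init__(self, offsets: list, data: bytes):
        self._offsets, self._data = offsets, data

    def value_count(self) -> int:
        # N+1 offsets delimit N variable-width elements.
        return len(self._offsets) - 1

    def data_buffer(self) -> bytes:
        return self._data

def total_bytes(v: ContiguousVector) -> int:
    # One code path for both kinds of vector, no instance checks needed.
    return len(v.data_buffer())
```

Callers such as `total_bytes` then work against the interface alone, which is the branch-avoidance benefit the issue describes.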



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-6155) [Java] Extract a super interface for vectors whose elements reside in continuous memory segments

2019-08-08 Thread Micah Kornfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield updated ARROW-6155:
---
Component/s: Java

> [Java] Extract a super interface for vectors whose elements reside in 
> continuous memory segments
> 
>
> Key: ARROW-6155
> URL: https://issues.apache.org/jira/browse/ARROW-6155
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Vectors whose data elements reside in continuous memory segments should 
> implement a common super interface. This will avoid unnecessary code 
> branches.
> For now, such vectors include fixed-width vectors and variable-width vectors. 
> More vector types can be included in the future.





[jira] [Commented] (ARROW-6183) [R] factor out tidyselect?

2019-08-08 Thread James Lamb (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16903540#comment-16903540
 ] 

James Lamb commented on ARROW-6183:
---

I'm also not trying to take sides or force anyone to take sides. In my opinion, 
saying "pass a vector of column names" would be preferable to the current state, 
which is "pass this object from the *tidyselect* package, but a vector of 
column names will also work".

And then outside of *feather::read_feather()*, I'm unsure why *arrow* needs to 
re-export all of these functions 
([https://github.com/apache/arrow/blob/master/r/R/reexports-tidyselect.R]) or 
have *tidyselect* as an "Imports" dependency.

> [R] factor out tidyselect?
> --
>
> Key: ARROW-6183
> URL: https://issues.apache.org/jira/browse/ARROW-6183
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: R
>Reporter: James Lamb
>Priority: Minor
>
> I noticed tonight that several functions from the *tidyselect* package are 
> re-exported by *arrow*. Why is this necessary? In my opinion, the *arrow* R 
> package should strive to have as few dependencies as possible and should have 
> no opinion about which parts of the R ecosystem ("tidy" or otherwise) are 
> used with it.
> I think it would be valuable to cut the *tidyselect* re-exports, and to make 
> *feather::read_feather()*'s argument *col_select* take a character vector of 
> column names instead of a *tidyselect::vars_select()* object. I think that 
> would be more natural and would be intuitive for a broader group of R users.
> Would you be open to removing *tidyselect* and changing 
> *feather::read_feather()* this way?





[jira] [Commented] (ARROW-6183) [R] factor out tidyselect?

2019-08-08 Thread Neal Richardson (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16903539#comment-16903539
 ] 

Neal Richardson commented on ARROW-6183:


The col_select argument already works with a character vector of column names:
{code:java}
library(arrow)
f <- tempfile() 
write.csv(iris, file=f)
df <- read_csv_arrow(f, col_select=c("Sepal.Length", "Species"))
> head(df)
   Sepal.Length Species
 1          5.1  setosa
 2          4.9  setosa
 3          4.7  setosa
 4          4.6  setosa
 5          5.0  setosa
 6          5.4  setosa
{code}
Perhaps we could improve the documentation to make that explicit.

I'm generally in favor of minimizing dependencies (I was fine with dropping 
tibble FWIW), but the tidyselect dependency is not heavy given what other 
dependencies arrow already requires. And since it doesn't require you to be 
"tidy", including it does not force anyone to take a side in an internecine 
language war. So personally I'm -0 on removing it.

> [R] factor out tidyselect?
> --
>
> Key: ARROW-6183
> URL: https://issues.apache.org/jira/browse/ARROW-6183
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: R
>Reporter: James Lamb
>Priority: Minor
>
> I noticed tonight that several functions from the *tidyselect* package are 
> re-exported by *arrow*. Why is this necessary? In my opinion, the *arrow* R 
> package should strive to have as few dependencies as possible and should have 
> no opinion about which parts of the R ecosystem ("tidy" or otherwise) are 
> used with it.
> I think it would be valuable to cut the *tidyselect* re-exports, and to make 
> *feather::read_feather()*'s argument *col_select* take a character vector of 
> column names instead of a *tidyselect::vars_select()* object. I think that 
> would be more natural and would be intuitive for a broader group of R users.
> Would you be open to removing *tidyselect* and changing 
> *feather::read_feather()* this way?





[jira] [Updated] (ARROW-6183) [R] factor out tidyselect?

2019-08-08 Thread James Lamb (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Lamb updated ARROW-6183:
--
Component/s: R

> [R] factor out tidyselect?
> --
>
> Key: ARROW-6183
> URL: https://issues.apache.org/jira/browse/ARROW-6183
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: R
>Reporter: James Lamb
>Priority: Minor
>
> I noticed tonight that several functions from the *tidyselect* package are 
> re-exported by *arrow*. Why is this necessary? In my opinion, the *arrow* R 
> package should strive to have as few dependencies as possible and should have 
> no opinion about which parts of the R ecosystem ("tidy" or otherwise) are 
> used with it.
> I think it would be valuable to cut the *tidyselect* re-exports, and to make 
> *feather::read_feather()*'s argument *col_select* take a character vector of 
> column names instead of a *tidyselect::vars_select()* object. I think that 
> would be more natural and would be intuitive for a broader group of R users.
> Would you be open to removing *tidyselect* and changing 
> *feather::read_feather()* this way?





[jira] [Created] (ARROW-6183) [R] factor out tidyselect?

2019-08-08 Thread James Lamb (JIRA)
James Lamb created ARROW-6183:
-

 Summary: [R] factor out tidyselect?
 Key: ARROW-6183
 URL: https://issues.apache.org/jira/browse/ARROW-6183
 Project: Apache Arrow
  Issue Type: Wish
Reporter: James Lamb


I noticed tonight that several functions from the *tidyselect* package are 
re-exported by *arrow*. Why is this necessary? In my opinion, the *arrow* R 
package should strive to have as few dependencies as possible and should have 
no opinion about which parts of the R ecosystem ("tidy" or otherwise) are used 
with it.

I think it would be valuable to cut the *tidyselect* re-exports, and to make 
*feather::read_feather()*'s argument *col_select* take a character vector of 
column names instead of a *tidyselect::vars_select()* object. I think that 
would be more natural and would be intuitive for a broader group of R users.

Would you be open to removing *tidyselect* and changing 
*feather::read_feather()* this way?





[jira] [Resolved] (ARROW-6160) [Java] AbstractStructVector#getPrimitiveVectors fails to work with complex child vectors

2019-08-08 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-6160.
-
   Resolution: Fixed
Fix Version/s: 0.15.0

Issue resolved by pull request 5031
[https://github.com/apache/arrow/pull/5031]

> [Java] AbstractStructVector#getPrimitiveVectors fails to work with complex 
> child vectors
> 
>
> Key: ARROW-6160
> URL: https://issues.apache.org/jira/browse/ARROW-6160
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Currently in {{AbstractStructVector#getPrimitiveVectors}}, only struct-type 
> child vectors are recursively searched for primitive vectors; other complex 
> types like {{ListVector}} and {{UnionVector}} are treated as primitive types 
> and returned directly.
> For example, for Struct(List(Int), Struct(Int, Varchar)), 
> {{getPrimitiveVectors}} should return {{[IntVector, IntVector, 
> VarCharVector]}} instead of {{[ListVector, IntVector, VarCharVector]}}.
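The intended recursion is easy to see in miniature. Below is an illustrative Python sketch, not the Arrow Java implementation; vectors are modeled as plain dicts with a `name` and optional `children`:

```python
# Illustrative sketch: getPrimitiveVectors should recurse into every nested
# container (Struct, List, Union alike), not only into struct children.
def get_primitive_vectors(vector):
    children = vector.get("children")
    if not children:
        # A vector with no children is a primitive (leaf) vector.
        return [vector["name"]]
    out = []
    for child in children:
        # Recurse uniformly, regardless of the container's concrete type.
        out.extend(get_primitive_vectors(child))
    return out

# Struct(List(Int), Struct(Int, Varchar)) from the example above.
struct = {"name": "StructVector", "children": [
    {"name": "ListVector", "children": [{"name": "IntVector"}]},
    {"name": "StructVector2", "children": [
        {"name": "IntVector"},
        {"name": "VarCharVector"},
    ]},
]}
```

Running `get_primitive_vectors(struct)` on the example yields the three leaf vectors, matching the expected result in the issue.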





[jira] [Assigned] (ARROW-6166) [Go] Slice of slice causes index out of range panic

2019-08-08 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-6166:
---

Assignee: Roshan Kumaraswamy

> [Go] Slice of slice causes index out of range panic
> ---
>
> Key: ARROW-6166
> URL: https://issues.apache.org/jira/browse/ARROW-6166
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Go
>Reporter: Roshan Kumaraswamy
>Assignee: Roshan Kumaraswamy
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> When slicing a slice, the offset of the underlying data will cause an index 
> out of range panic if the offset is greater than the slice length. See 
> [https://github.com/apache/arrow/issues/5033]
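The arithmetic behind the panic can be modeled in a few lines. This is a hedged Python sketch of the general pattern, not the Go Arrow `array` package; the class and method names are illustrative:

```python
# Minimal model of slicing a slice: a view must compose its offset with the
# parent's offset so that indices resolve against the shared backing data.
# Indexing against the child's own length while ignoring the accumulated
# offset is the kind of mistake that produces an out-of-range access.
class ArraySlice:
    def __init__(self, data, offset, length):
        self.data, self.offset, self.length = data, offset, length

    def value(self, i):
        # Every read goes through the composed offset into the backing data.
        return self.data[self.offset + i]

    def slice(self, i, j):
        # A slice of a slice adds its start to the parent's offset.
        return ArraySlice(self.data, self.offset + i, j - i)
```

With composed offsets, `a.slice(5, 10).slice(2, 4)` correctly reads from index 7 of the backing data even though both views are short.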





[jira] [Resolved] (ARROW-6166) [Go] Slice of slice causes index out of range panic

2019-08-08 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-6166.
-
   Resolution: Fixed
Fix Version/s: 0.15.0

Issue resolved by pull request 5035
[https://github.com/apache/arrow/pull/5035]

> [Go] Slice of slice causes index out of range panic
> ---
>
> Key: ARROW-6166
> URL: https://issues.apache.org/jira/browse/ARROW-6166
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Go
>Reporter: Roshan Kumaraswamy
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When slicing a slice, the offset of the underlying data will cause an index 
> out of range panic if the offset is greater than the slice length. See 
> [https://github.com/apache/arrow/issues/5033]





[jira] [Resolved] (ARROW-6082) [Python] create pa.dictionary() type with non-integer indices type crashes

2019-08-08 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-6082.
-
   Resolution: Fixed
Fix Version/s: 0.15.0

Issue resolved by pull request 5041
[https://github.com/apache/arrow/pull/5041]

> [Python] create pa.dictionary() type with non-integer indices type crashes
> --
>
> Key: ARROW-6082
> URL: https://issues.apache.org/jira/browse/ARROW-6082
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> For example if you mixed the order of the indices and values type:
> {code}
> In [1]: pa.dictionary(pa.int8(), pa.string()) 
>   
>
> Out[1]: DictionaryType(dictionary<values=string, indices=int8, ordered=0>)
> In [2]: pa.dictionary(pa.string(), pa.int8()) 
>   
>
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> F0731 14:40:42.748589 26310 type.cc:440]  Check failed: 
> is_integer(index_type->id()) dictionary index type should be signed integer
> *** Check failure stack trace: ***
> Aborted (core dumped)
> {code}
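The spirit of such a fix can be sketched as follows: validate the index type up front and raise a normal exception rather than tripping a C++ check that aborts the process. This is an illustrative Python stand-in, not pyarrow's actual implementation, and the names (`SIGNED_INTEGER_TYPES`, `dictionary`) are hypothetical:

```python
# Hypothetical validation sketch: reject non-integer index types with a
# catchable TypeError instead of crashing, mirroring the argument-order
# mistake shown above (values passed where the index type belongs).
SIGNED_INTEGER_TYPES = {"int8", "int16", "int32", "int64"}

def dictionary(index_type: str, value_type: str) -> str:
    if index_type not in SIGNED_INTEGER_TYPES:
        raise TypeError(
            "dictionary index type should be a signed integer, got "
            + index_type)
    return f"dictionary<values={value_type}, indices={index_type}>"
```

Swapping the arguments then produces a recoverable Python error instead of a core dump.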





[jira] [Resolved] (ARROW-6152) [C++][Parquet] Write arrow::Array directly into parquet::TypedColumnWriter

2019-08-08 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-6152.
-
Resolution: Fixed

Issue resolved by pull request 5036
[https://github.com/apache/arrow/pull/5036]

> [C++][Parquet] Write arrow::Array directly into parquet::TypedColumnWriter
> -
>
> Key: ARROW-6152
> URL: https://issues.apache.org/jira/browse/ARROW-6152
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> This is an initial refactoring task to enable the Arrow write layer to access 
> some of the internal implementation details of 
> {{parquet::TypedColumnWriter}}. See discussion in ARROW-3246





[jira] [Commented] (ARROW-6182) [R] Package fails to load with error `CXXABI_1.3.11' not found

2019-08-08 Thread Ian Cook (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16903400#comment-16903400
 ] 

Ian Cook commented on ARROW-6182:
-

Setting/unsetting {{ARROW_USE_OLD_CXXABI}} made no difference.

The R package on conda-forge worked for me with no problems. Thanks!

I'm still curious what the cause of this error might be and whether it will 
affect anyone else, but it's no longer blocking me.

> [R] Package fails to load with error `CXXABI_1.3.11' not found 
> ---
>
> Key: ARROW-6182
> URL: https://issues.apache.org/jira/browse/ARROW-6182
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 0.14.1
> Environment: Ubuntu 16.04.6
>Reporter: Ian Cook
>Priority: Major
>
> I'm able to successfully install the C++ and Python libraries from 
> conda-forge, then successfully install the R package from CRAN if I use 
> {{--no-test-load}}. But after installation, the R package fails to load 
> because {{dyn.load("arrow.so")}} fails. It throws this error when loading:
> {code:java}
> unable to load shared object '~/R/arrow/libs/arrow.so':
>  /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `CXXABI_1.3.11' not found 
> (required by ~/.conda/envs/python3.6/lib/libarrow.so.14)
> {code}
> Do the Arrow C++ libraries actually require GCC 7.1.0 / CXXABI_1.3.11? If 
> not, what might explain this error message? Thanks.





[jira] [Commented] (ARROW-6182) [R] Package fails to load with error `CXXABI_1.3.11' not found

2019-08-08 Thread Neal Richardson (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16903387#comment-16903387
 ] 

Neal Richardson commented on ARROW-6182:


The arrow R package is available on conda-forge as well; maybe that's better for 
you since you're already in conda territory? 
[https://anaconda.org/conda-forge/r-arrow]

I also notice this environment variable mentioned in the {{configure}} script, 
and maybe this is what it's there for: 
[https://github.com/apache/arrow/blob/13f5e92b87a87669a8fd15c457140dd098408fce/r/configure#L90-L93]

Let us know if that does turn out to solve it for you so we can add to the docs.

> [R] Package fails to load with error `CXXABI_1.3.11' not found 
> ---
>
> Key: ARROW-6182
> URL: https://issues.apache.org/jira/browse/ARROW-6182
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 0.14.1
> Environment: Ubuntu 16.04.6
>Reporter: Ian Cook
>Priority: Major
>
> I'm able to successfully install the C++ and Python libraries from 
> conda-forge, then successfully install the R package from CRAN if I use 
> {{--no-test-load}}. But after installation, the R package fails to load 
> because {{dyn.load("arrow.so")}} fails. It throws this error when loading:
> {code:java}
> unable to load shared object '~/R/arrow/libs/arrow.so':
>  /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `CXXABI_1.3.11' not found 
> (required by ~/.conda/envs/python3.6/lib/libarrow.so.14)
> {code}
> Do the Arrow C++ libraries actually require GCC 7.1.0 / CXXABI_1.3.11? If 
> not, what might explain this error message? Thanks.





[jira] [Created] (ARROW-6182) [R] Package fails to load with error `CXXABI_1.3.11' not found

2019-08-08 Thread Ian Cook (JIRA)
Ian Cook created ARROW-6182:
---

 Summary: [R] Package fails to load with error `CXXABI_1.3.11' not 
found 
 Key: ARROW-6182
 URL: https://issues.apache.org/jira/browse/ARROW-6182
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Affects Versions: 0.14.1
 Environment: Ubuntu 16.04.6
Reporter: Ian Cook


I'm able to successfully install the C++ and Python libraries from conda-forge, 
then successfully install the R package from CRAN if I use {{--no-test-load}}. 
But after installation, the R package fails to load because 
{{dyn.load("arrow.so")}} fails. It throws this error when loading:
{code:java}
unable to load shared object '~/R/arrow/libs/arrow.so':
 /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `CXXABI_1.3.11' not found 
(required by ~/.conda/envs/python3.6/lib/libarrow.so.14)
{code}
Do the Arrow C++ libraries actually require GCC 7.1.0 / CXXABI_1.3.11? If not, 
what might explain this error message? Thanks.





[jira] [Updated] (ARROW-6181) [R] Only allow R package to install without libarrow on linux

2019-08-08 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6181:
--
Labels: pull-request-available  (was: )

> [R] Only allow R package to install without libarrow on linux
> -
>
> Key: ARROW-6181
> URL: https://issues.apache.org/jira/browse/ARROW-6181
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
>
> See https://issues.apache.org/jira/browse/ARROW-6167 for backstory. Now that 
> we're on CRAN, we can be less paranoid about build failures getting the 
> package rejected, and we can focus on solidifying the CRAN binary package 
> experience. The macOS binaries for 0.14.1 were built without the C++ library, 
> which we did not expect and cannot reproduce. At this point, it would 
> probably be better to have a failed build than have binaries get made but be 
> useless. Plus, word has it that for macOS binary builds, CRAN will retry if 
> they fail for some reason. It's possible that whatever failed for 0.14.1 was 
> transient, and if the build had failed instead of carried on without 
> libarrow, on retry it may have built successfully.





[jira] [Created] (ARROW-6181) [R] Only allow R package to install without libarrow on linux

2019-08-08 Thread Neal Richardson (JIRA)
Neal Richardson created ARROW-6181:
--

 Summary: [R] Only allow R package to install without libarrow on 
linux
 Key: ARROW-6181
 URL: https://issues.apache.org/jira/browse/ARROW-6181
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Neal Richardson
Assignee: Neal Richardson


See https://issues.apache.org/jira/browse/ARROW-6167 for backstory. Now that 
we're on CRAN, we can be less paranoid about build failures getting the package 
rejected, and we can focus on solidifying the CRAN binary package experience. 
The macOS binaries for 0.14.1 were built without the C++ library, which we did 
not expect and cannot reproduce. At this point, it would probably be better to 
have a failed build than have binaries get made but be useless. Plus, word has 
it that for macOS binary builds, CRAN will retry if they fail for some reason. 
It's possible that whatever failed for 0.14.1 was transient, and if the build 
had failed instead of carried on without libarrow, on retry it may have built 
successfully.





[jira] [Assigned] (ARROW-6153) [R] Address parquet deprecation warning

2019-08-08 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-6153:
---

Assignee: Wes McKinney  (was: Romain François)

> [R] Address parquet deprecation warning
> ---
>
> Key: ARROW-6153
> URL: https://issues.apache.org/jira/browse/ARROW-6153
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Wes McKinney
>Priority: Major
>
> [~wesmckinn] has been refactoring the Parquet C++ library and there's now 
> this deprecation warning appearing when I build the R package locally: 
> {code:java}
> clang++ -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" 
> -DNDEBUG -DNDEBUG -I/usr/local/include -DARROW_R_WITH_ARROW 
> -I"/Users/enpiar/R/Rcpp/include" -isysroot 
> /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -I/usr/local/include  
> -fPIC  -Wall -g -O2  -c parquet.cpp -o parquet.o parquet.cpp:66:23: warning: 
> 'OpenFile' is deprecated: Deprecated since 0.15.0. Use FileReaderBuilder      
>  [-Wdeprecated-declarations]       parquet::arrow::OpenFile(file, 
> arrow::default_memory_pool(), *props, ));                       ^
> {code}





[jira] [Commented] (ARROW-6153) [R] Address parquet deprecation warning

2019-08-08 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16903306#comment-16903306
 ] 

Wes McKinney commented on ARROW-6153:
-

I am fixing this in my patch for ARROW-6152

> [R] Address parquet deprecation warning
> ---
>
> Key: ARROW-6153
> URL: https://issues.apache.org/jira/browse/ARROW-6153
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Wes McKinney
>Priority: Major
>
> [~wesmckinn] has been refactoring the Parquet C++ library and there's now 
> this deprecation warning appearing when I build the R package locally: 
> {code:java}
> clang++ -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" 
> -DNDEBUG -DNDEBUG -I/usr/local/include -DARROW_R_WITH_ARROW 
> -I"/Users/enpiar/R/Rcpp/include" -isysroot 
> /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -I/usr/local/include  
> -fPIC  -Wall -g -O2  -c parquet.cpp -o parquet.o parquet.cpp:66:23: warning: 
> 'OpenFile' is deprecated: Deprecated since 0.15.0. Use FileReaderBuilder      
>  [-Wdeprecated-declarations]       parquet::arrow::OpenFile(file, 
> arrow::default_memory_pool(), *props, ));                       ^
> {code}





[jira] [Assigned] (ARROW-6180) [C++] Create InputStream that references a segment of a RandomAccessFile

2019-08-08 Thread Zherui Cao (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zherui Cao reassigned ARROW-6180:
-

Assignee: Zherui Cao

> [C++] Create InputStream that references a segment of a RandomAccessFile
> 
>
> Key: ARROW-6180
> URL: https://issues.apache.org/jira/browse/ARROW-6180
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Zherui Cao
>Priority: Major
>
> If different threads want to do buffered reads over different portions of a 
> file (and they are unable to create their own separate file handles), they 
> may clobber each other. I would propose creating an object that keeps the 
> RandomAccessFile internally and implements the InputStream API in a way that 
> is safe from other threads changing the file position.
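One way to make such a stream safe is to issue positional reads that never touch the shared file cursor. The sketch below uses POSIX `os.pread` for that (an assumption of this example; it is not available on Windows), and the class name `SegmentInputStream` is hypothetical, not the C++ API being proposed:

```python
import os

# Hedged sketch: an InputStream over a byte range [offset, offset + length)
# of a shared file descriptor. os.pread takes an explicit offset, so this
# reader never seek()s the descriptor and cannot clobber, or be clobbered
# by, other threads reading through the same handle.
class SegmentInputStream:
    def __init__(self, fd: int, offset: int, length: int):
        self._fd = fd
        self._end = offset + length
        self._pos = offset  # private position, independent of the fd's cursor

    def read(self, nbytes: int) -> bytes:
        n = min(nbytes, self._end - self._pos)
        if n <= 0:
            return b""  # segment exhausted
        data = os.pread(self._fd, n, self._pos)
        self._pos += len(data)
        return data
```

Two `SegmentInputStream`s over disjoint ranges of the same descriptor can then be read from different threads without coordination, which is the isolation the issue asks for.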





[jira] [Commented] (ARROW-6179) [C++] ExtensionType subclass for "unknown" types?

2019-08-08 Thread Micah Kornfield (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16903265#comment-16903265
 ] 

Micah Kornfield commented on ARROW-6179:


How would the two options be chosen?

> [C++] ExtensionType subclass for "unknown" types?
> -
>
> Key: ARROW-6179
> URL: https://issues.apache.org/jira/browse/ARROW-6179
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Joris Van den Bossche
>Priority: Major
>
> In C++, when receiving IPC with extension type metadata for a type that is 
> unknown (the name is not registered), we currently fall back to returning the 
> "raw" storage array. The custom metadata (extension name and metadata) is 
> still available in the Field metadata.
> Alternatively, we could also have a generic {{ExtensionType}} class that can 
> hold such an "unknown" extension type (e.g. {{UnknownExtensionType}} or 
> {{GenericExtensionType}}), keeping the extension name and metadata in the 
> Array's type.
> This could be a single class where several instances can be created given a 
> storage type, extension name and optionally extension metadata. It would be a 
> way to have an unregistered extension type.
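The proposal can be sketched concretely. The Python below is purely illustrative (the actual proposal targets Arrow C++), and every name in it, including `GenericExtensionType` and `resolve_type`, is hypothetical:

```python
# Hedged sketch: a single generic class whose instances carry the storage
# type, extension name, and serialized metadata of an extension type that
# has no registered implementation.
class GenericExtensionType:
    def __init__(self, storage_type: str, extension_name: str,
                 metadata: bytes = b""):
        self.storage_type = storage_type
        self.extension_name = extension_name
        self.metadata = metadata

    def __repr__(self):
        return f"extension<{self.extension_name}, storage={self.storage_type}>"

def resolve_type(registry: dict, storage_type: str, name: str,
                 metadata: bytes):
    # Registered names resolve to their real extension type; unknown names
    # fall back to the generic wrapper instead of the raw storage type,
    # so the extension name and metadata stay attached to the array's type.
    if name in registry:
        return registry[name]
    return GenericExtensionType(storage_type, name, metadata)
```

The difference from the current behavior is the fallback branch: the name and metadata travel with the type object rather than only in the Field metadata.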





[jira] [Updated] (ARROW-6180) [C++] Create InputStream that references a segment of a RandomAccessFile

2019-08-08 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6180:

Summary: [C++] Create InputStream that references a segment of a 
RandomAccessFile  (was: [C++] Create InputStream that references a read-only 
segment of a RandomAccessFile)

> [C++] Create InputStream that references a segment of a RandomAccessFile
> 
>
> Key: ARROW-6180
> URL: https://issues.apache.org/jira/browse/ARROW-6180
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> If different threads want to do buffered reads over different portions of a 
> file (and they are unable to create their own separate file handles), they 
> may clobber each other. I would propose creating an object that keeps the 
> RandomAccessFile internally and implements the InputStream API in a way that 
> is safe from other threads changing the file position.





[jira] [Created] (ARROW-6180) [C++] Create InputStream that references a read-only segment of a RandomAccessFile

2019-08-08 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-6180:
---

 Summary: [C++] Create InputStream that references a read-only 
segment of a RandomAccessFile
 Key: ARROW-6180
 URL: https://issues.apache.org/jira/browse/ARROW-6180
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


If different threads want to do buffered reads over different portions of a 
file (and they are unable to create their own separate file handles), they may 
clobber each other. I would propose creating an object that keeps the 
RandomAccessFile internally and implements the InputStream API in a way that is 
safe from other threads changing the file position.





[jira] [Closed] (ARROW-6173) [Python] error loading csv submodule

2019-08-08 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-6173.
---

> [Python] error loading csv submodule
> 
>
> Key: ARROW-6173
> URL: https://issues.apache.org/jira/browse/ARROW-6173
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.12.0, 0.13.0, 0.14.0, 0.14.1
> Environment: Windows 7, conda 4.7.11
>Reporter: Igor Yastrebov
>Priority: Major
>  Labels: csv
>
> When I create a new environment in conda:
> {code:java}
> conda create -n pyarrow-test python=3.7 pyarrow=0.14.1
> {code}
> and try to read a csv file:
> {code:java}
> import pyarrow as pa
> pa.csv.read_csv('test.csv'){code}
> it fails with an error:
> {code:java}
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> AttributeError: module 'pyarrow' has no attribute 'csv'
> {code}
> However, loading it directly works:
> {code:java}
> import pyarrow.csv as pc
> table = pc.read_csv('test.csv')
> {code}
> and using pa.csv.read_csv() after loading it directly also works.
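The underlying mechanics: a submodule only becomes an attribute of its parent package once it has actually been imported, so `pa.csv` is undefined until `import pyarrow.csv` runs somewhere. One way a package can paper over this is a module-level `__getattr__` (PEP 562, Python 3.7+) that imports submodules on first attribute access. The demo package below is a stand-in built with `types.ModuleType`, not pyarrow itself:

```python
import importlib
import sys
import types

# Build a fake package and submodule to stand in for pyarrow / pyarrow.csv.
pkg = types.ModuleType("demo_pkg")
sub = types.ModuleType("demo_pkg.csv")
sub.read_csv = lambda path: f"read {path}"
sys.modules["demo_pkg"] = pkg
sys.modules["demo_pkg.csv"] = sub

def _pkg_getattr(name):
    # Called only when `name` is missing from the package namespace;
    # importing the submodule here makes `pkg.csv` work lazily.
    if name == "csv":
        return importlib.import_module("demo_pkg.csv")
    raise AttributeError(name)

# PEP 562: a __getattr__ in a module's namespace handles missing attributes.
pkg.__getattr__ = _pkg_getattr
```

With the hook installed, `pkg.csv.read_csv(...)` works without an explicit `import demo_pkg.csv`; without it, the attribute lookup raises `AttributeError`, which is exactly the failure mode reported above.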





[jira] [Resolved] (ARROW-5579) [Java] shade flatbuffer dependency

2019-08-08 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved ARROW-5579.
-
   Resolution: Fixed
Fix Version/s: (was: 1.0.0)
   0.15.0

Issue resolved by pull request 4701
[https://github.com/apache/arrow/pull/4701]

> [Java] shade flatbuffer dependency
> --
>
> Key: ARROW-5579
> URL: https://issues.apache.org/jira/browse/ARROW-5579
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Java
>Reporter: Pindikura Ravindra
>Assignee: Ji Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 20h 40m
>  Remaining Estimate: 0h
>
> Reported in a [github issue|https://github.com/apache/arrow/issues/4489] 
>  
> After some [discussion|https://github.com/google/flatbuffers/issues/5368] 
> with the Flatbuffers maintainer, it appears that FB generated code is not 
> guaranteed to be compatible with _any other_ version of the runtime library 
> other than the exact same version of the flatc used to compile it.
> This makes depending on flatbuffers in a library (like arrow) quite risky: if 
> an app depends on any other version of FB, either directly or 
> transitively, it's likely the versions will clash at some point and you'll 
> see undefined behaviour at runtime.
> Shading the dependency looks to me the best way to avoid this.





[jira] [Resolved] (ARROW-6041) [Website] Blog post announcing R package release

2019-08-08 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-6041.
-
   Resolution: Fixed
Fix Version/s: 0.15.0

Issue resolved by pull request 4948
[https://github.com/apache/arrow/pull/4948]

> [Website] Blog post announcing R package release
> 
>
> Key: ARROW-6041
> URL: https://issues.apache.org/jira/browse/ARROW-6041
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R, Website
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> After the R package makes it to CRAN, we should announce it.





[jira] [Created] (ARROW-6178) [Developer] Don't fail in merge script on bad primary author input in multi-author PRs

2019-08-08 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-6178:
---

 Summary: [Developer] Don't fail in merge script on bad primary 
author input in multi-author PRs
 Key: ARROW-6178
 URL: https://issues.apache.org/jira/browse/ARROW-6178
 Project: Apache Arrow
  Issue Type: Bug
  Components: Developer Tools
Reporter: Wes McKinney


I was going on autopilot in a multi-author PR and this happened:

{code}
Switched to branch 'PR_TOOL_MERGE_PR_5000_MASTER'
Automatic merge went well; stopped before committing as requested
Author 1: François Saint-Jacques 
Author 2: Wes McKinney 
Enter primary author in the format of "name " [François Saint-Jacques 
]: y
fatal: --author '"y"' is not 'Name ' and matches no existing author
Command failed: ['git', 'commit', '--no-verify', '--author="y"', '-m', 
'ARROW-6121: [Tools] Improve merge tool ergonomics', '-m', '- merge_arrow_pr.py 
now accepts the pull-request number as a single optional argument, e.g. 
`./merge_arrow_pr.py 4921`.\r\n- merge_arrow_pr.py can optionally read a 
configuration file located in   `~/.config/arrow/merge.conf` which contains 
options like jira credentials. See the `dev/merge.conf` file as example', '-m', 
'Closes #5000 from fsaintjacques/ARROW-6121-merge-ergonomic and squashes the 
following commits:', '-m', '5298308d7  Handle username/password 
separately (in case username is set but not password)\n581653735  Rename merge.conf to merge.conf.sample\n7c51ca8f0  Add license to config file\n1213946bd  
ARROW-6121:  Improve merge tool ergonomics', '-m', 'Lead-authored-by: 
y\nCo-authored-by: François Saint-Jacques 
\nCo-authored-by: Wes McKinney 
\nSigned-off-by: Wes McKinney ']
With output:
--
b''
--
Traceback (most recent call last):
  File "dev/merge_arrow_pr.py", line 530, in 
if pr.is_merged:
  File "dev/merge_arrow_pr.py", line 515, in cli
PROJECT_NAME = os.environ.get('ARROW_PROJECT_NAME') or 'arrow'
  File "dev/merge_arrow_pr.py", line 420, in merge
'--author="%s"' % primary_author] +
  File "dev/merge_arrow_pr.py", line 89, in run_cmd
print('--')
  File "dev/merge_arrow_pr.py", line 81, in run_cmd
try:
  File "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/subprocess.py", line 
395, in check_output
**kwargs).stdout
  File "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/subprocess.py", line 
487, in run
output=stdout, stderr=stderr)
{code}

If the input does not match the expected format, the script should loop and 
request input again rather than failing out (which leaves temporary branches 
behind that require messy manual cleanup).
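The reprompt loop being proposed might look like the following minimal sketch (pure Python; the `prompt_primary_author` helper and its validation regex are illustrative assumptions, not merge_arrow_pr.py's actual code):

```python
import re

# Loose check for 'Name <email>' -- illustrative only; git's own
# author parsing is the real authority here.
AUTHOR_RE = re.compile(r'^.+ <[^<>]+@[^<>]+>$')

def prompt_primary_author(default_author, input_fn=input):
    """Re-prompt until the input matches 'Name <email>' instead of
    failing out and leaving temporary branches behind."""
    while True:
        value = input_fn('Enter primary author in the format of '
                         '"name <email>" [%s]: ' % default_author).strip()
        if not value:
            return default_author  # empty input accepts the default
        if AUTHOR_RE.match(value):
            return value
        print('Bad author "%s", please try again' % value)

# Simulated session: the first answer is invalid ('y'), the second is valid.
answers = iter(['y', 'Jane Doe <jane@example.com>'])
author = prompt_primary_author('Wes McKinney <wes@example.com>',
                               input_fn=lambda prompt: next(answers))
print(author)  # Jane Doe <jane@example.com>
```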





[jira] [Created] (ARROW-6179) [C++] ExtensionType subclass for "unknown" types?

2019-08-08 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-6179:


 Summary: [C++] ExtensionType subclass for "unknown" types?
 Key: ARROW-6179
 URL: https://issues.apache.org/jira/browse/ARROW-6179
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Joris Van den Bossche


In C++, when receiving IPC with extension type metadata for a type that is 
unknown (the name is not registered), we currently fall back to returning the 
"raw" storage array. The custom metadata (extension name and metadata) is still 
available in the Field metadata.

Alternatively, we could have a generic {{ExtensionType}} class that can hold 
such an "unknown" extension type (e.g. {{UnknownExtensionType}} or 
{{GenericExtensionType}}), keeping the extension name and metadata in the 
Array's type.

This could be a single class where several instances can be created given a 
storage type, extension name and optionally extension metadata. It would be a 
way to have an unregistered extension type.
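As a rough illustration of the idea (the class and function names here are hypothetical, not Arrow's C++ API), the fallback could be a single generic type parameterized by storage type, extension name, and optional metadata:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class GenericExtensionType:
    """Stand-in for an unregistered extension type: keeps the extension
    name and metadata on the type instead of dropping to raw storage."""
    storage_type: str
    extension_name: str
    extension_metadata: Optional[bytes] = None

REGISTRY = {}  # extension name -> factory for a registered, known type

def type_from_ipc(storage_type, name, metadata):
    """On IPC read, use a registered type if one exists; otherwise fall
    back to a generic instance that still carries name and metadata."""
    factory = REGISTRY.get(name)
    if factory is not None:
        return factory(storage_type, metadata)
    return GenericExtensionType(storage_type, name, metadata)

t = type_from_ipc('binary(16)', 'my.uuid', b'')
print(t.extension_name)  # my.uuid
```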





[jira] [Resolved] (ARROW-6121) [Tools] Improve merge tool cli ergonomic

2019-08-08 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-6121.
-
   Resolution: Fixed
Fix Version/s: 0.15.0

Issue resolved by pull request 5000
[https://github.com/apache/arrow/pull/5000]

> [Tools] Improve merge tool cli ergonomic
> 
>
> Key: ARROW-6121
> URL: https://issues.apache.org/jira/browse/ARROW-6121
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> * Accepts the pull-request number as an optional (first) parameter to the 
> script
> * Supports reading the jira username/password from a file





[jira] [Resolved] (ARROW-6005) [C++] parquet::arrow::FileReader::GetRecordBatchReader() does not behave as documented since ARROW-1012

2019-08-08 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-6005.
-
   Resolution: Fixed
Fix Version/s: 0.15.0

Issue resolved by pull request 5012
[https://github.com/apache/arrow/pull/5012]

> [C++] parquet::arrow::FileReader::GetRecordBatchReader() does not behave as 
> documented since ARROW-1012
> ---
>
> Key: ARROW-6005
> URL: https://issues.apache.org/jira/browse/ARROW-6005
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.14.0, 0.14.1
>Reporter: Martin
>Assignee: Hatem Helal
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> GetRecordBatchReader() should
> "Return a RecordBatchReader of row groups selected from row_group_indices, the
> ordering in row_group_indices matters." (that is what the doxygen string 
> says),
> *but:*
> Since ARROW-1012, it ignores the {{row_group_indices}} argument: the 
> {{row_group_indices_}} stored in the created {{RowGroupRecordBatchReader}} 
> are never used.
> Either the documentation should be changed, or the behavior should be 
> reverted. I would prefer the latter, as I no longer know how to make sure 
> specific row groups are read...





[jira] [Created] (ARROW-6177) [C++] Add Arrow::Validate()

2019-08-08 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-6177:
-

 Summary: [C++] Add Arrow::Validate()
 Key: ARROW-6177
 URL: https://issues.apache.org/jira/browse/ARROW-6177
 Project: Apache Arrow
  Issue Type: Task
  Components: C++
Affects Versions: 0.14.1
Reporter: Antoine Pitrou


It's a bit weird to have {{ChunkedArray::Validate()}} and {{Table::Validate()}} 
methods but only a standalone {{ValidateArray}} function for arrays.





[jira] [Updated] (ARROW-6177) [C++] Add Arrow::Validate()

2019-08-08 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-6177:
--
Labels: easy  (was: )

> [C++] Add Arrow::Validate()
> ---
>
> Key: ARROW-6177
> URL: https://issues.apache.org/jira/browse/ARROW-6177
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++
>Affects Versions: 0.14.1
>Reporter: Antoine Pitrou
>Priority: Trivial
>  Labels: easy
>
> It's a bit weird to have {{ChunkedArray::Validate()}} and 
> {{Table::Validate()}} methods but only a standalone {{ValidateArray}} 
> function for arrays.





[jira] [Resolved] (ARROW-6132) [Python] ListArray.from_arrays does not check validity of input arrays

2019-08-08 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-6132.
---
   Resolution: Fixed
Fix Version/s: 0.15.0

Issue resolved by pull request 5029
[https://github.com/apache/arrow/pull/5029]

> [Python] ListArray.from_arrays does not check validity of input arrays
> --
>
> Key: ARROW-6132
> URL: https://issues.apache.org/jira/browse/ARROW-6132
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> From https://github.com/apache/arrow/pull/4979#issuecomment-517593918.
> When creating a ListArray from offsets and values in Python, there is no 
> validation that the offsets start with 0 and end with the length of the 
> values array (but is that required? the docs seem to indicate so: 
> https://github.com/apache/arrow/blob/master/docs/source/format/Layout.rst#list-type
>  - "The first value in the offsets array is 0, and the last element is the 
> length of the values array.").
> The array you get "seems" ok (the repr), but on conversion to python or 
> flattened arrays, things go wrong:
> {code}
> In [61]: a = pa.ListArray.from_arrays([1,3,10], np.arange(5)) 
> In [62]: a
> Out[62]: 
> 
> [
>   [
> 1,
> 2
>   ],
>   [
> 3,
> 4
>   ]
> ]
> In [63]: a.flatten()
> Out[63]: 
> 
> [
>   0,   # <--- includes the 0
>   1,
>   2,
>   3,
>   4
> ]
> In [64]: a.to_pylist()
> Out[64]: [[1, 2], [3, 4, 1121, 1, 64, 93969433636432, 13]]  # <--includes 
> more elements as garbage
> {code}
> Calling {{validate}} manually correctly raises:
> {code}
> In [65]: a.validate()
> ...
> ArrowInvalid: Final offset invariant not equal to values length: 10!=5
> {code}
> In C++ the main constructors are not safe, and as the caller you need to 
> ensure that the data is correct or call a safe (slower) constructor. But do 
> we want to use the unsafe / fast constructors without validation in Python as 
> default as well? Or should we do a call to {{validate}} here?
> A quick search seems to indicate that `pa.Array.from_buffers` does 
> validation, but other `from_arrays` methods don't seem to do this explicitly.
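The offset invariants under discussion are easy to state directly; the following is a pure-Python sketch of such checks (illustrative only, not pyarrow's actual {{validate}} implementation):

```python
def check_list_offsets(offsets, values):
    """List-layout invariants per the Arrow format docs: offsets start
    at 0, end at len(values), and are non-decreasing."""
    if offsets[0] != 0:
        raise ValueError("First offset must be 0, got %d" % offsets[0])
    if offsets[-1] != len(values):
        raise ValueError("Final offset invariant not equal to values "
                         "length: %d!=%d" % (offsets[-1], len(values)))
    if any(a > b for a, b in zip(offsets, offsets[1:])):
        raise ValueError("Offsets must be non-decreasing")

values = list(range(5))
check_list_offsets([0, 2, 5], values)       # valid: lists [[0, 1], [2, 3, 4]]
try:
    check_list_offsets([1, 3, 10], values)  # the offsets from this report
except ValueError as exc:
    print(exc)  # First offset must be 0, got 1
```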





[jira] [Resolved] (ARROW-6173) [Python] error loading csv submodule

2019-08-08 Thread Igor Yastrebov (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Igor Yastrebov resolved ARROW-6173.
---
Resolution: Not A Problem

> [Python] error loading csv submodule
> 
>
> Key: ARROW-6173
> URL: https://issues.apache.org/jira/browse/ARROW-6173
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.12.0, 0.13.0, 0.14.0, 0.14.1
> Environment: Windows 7, conda 4.7.11
>Reporter: Igor Yastrebov
>Priority: Major
>  Labels: csv
>
> When I create a new environment in conda:
> {code:java}
> conda create -n pyarrow-test python=3.7 pyarrow=0.14.1
> {code}
> and try to read a csv file:
> {code:java}
> import pyarrow as pa
> pa.csv.read_csv('test.csv'){code}
> it fails with an error:
> {code:java}
> Traceback (most recent call last):
> File "", line 1, in 
> AttributeError: module 'pyarrow' has no attribute 'csv'
> {code}
> However, loading it directly works:
> {code:java}
> import pyarrow.csv as pc
> table = pc.read_csv('test.csv')
> {code}
> and using pa.csv.read_csv() after loading it directly also works.





[jira] [Commented] (ARROW-6173) [Python] error loading csv submodule

2019-08-08 Thread Igor Yastrebov (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16903072#comment-16903072
 ] 

Igor Yastrebov commented on ARROW-6173:
---

Oh, I see. It is the default behaviour for all submodules - feather, json, 
plasma, orc (some of which aren't built for conda) - it makes sense to load 
only what you need.

> [Python] error loading csv submodule
> 
>
> Key: ARROW-6173
> URL: https://issues.apache.org/jira/browse/ARROW-6173
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.12.0, 0.13.0, 0.14.0, 0.14.1
> Environment: Windows 7, conda 4.7.11
>Reporter: Igor Yastrebov
>Priority: Major
>  Labels: csv
>
> When I create a new environment in conda:
> {code:java}
> conda create -n pyarrow-test python=3.7 pyarrow=0.14.1
> {code}
> and try to read a csv file:
> {code:java}
> import pyarrow as pa
> pa.csv.read_csv('test.csv'){code}
> it fails with an error:
> {code:java}
> Traceback (most recent call last):
> File "", line 1, in 
> AttributeError: module 'pyarrow' has no attribute 'csv'
> {code}
> However, loading it directly works:
> {code:java}
> import pyarrow.csv as pc
> table = pc.read_csv('test.csv')
> {code}
> and using pa.csv.read_csv() after loading it directly also works.





[jira] [Commented] (ARROW-6176) [Python] Allow to subclass ExtensionArray to attach to custom extension type

2019-08-08 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16903064#comment-16903064
 ] 

Joris Van den Bossche commented on ARROW-6176:
--

This might be done by adding a {{ExtenstionType.__arrow_ext_class__}} that 
points to the subclass? In that way {{pyarrow_wrap_array}} can use that class 
instead of one of the predefined ones in {{_array_classes}}.

> [Python] Allow to subclass ExtensionArray to attach to custom extension type
> 
>
> Key: ARROW-6176
> URL: https://issues.apache.org/jira/browse/ARROW-6176
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Major
>
> Currently, you can define a custom extension type in Python with 
> {code}
> class UuidType(pa.ExtensionType):
> def __init__(self):
> pa.ExtensionType.__init__(self, pa.binary(16))
> def __reduce__(self):
> return UuidType, ()
> {code}
> but the array you can create with this is always a plain ExtensionArray. We 
> should provide a way to define a subclass (e.g. `UuidArray` in this case) 
> that can hold custom logic.
> For example, a user might want to define `UuidArray` such that `arr[i]` 
> returns an instance of Python's `uuid.UUID`
> From https://github.com/apache/arrow/pull/4532#pullrequestreview-249396691





[jira] [Created] (ARROW-6176) [Python] Allow to subclass ExtensionArray to attach to custom extension type

2019-08-08 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-6176:


 Summary: [Python] Allow to subclass ExtensionArray to attach to 
custom extension type
 Key: ARROW-6176
 URL: https://issues.apache.org/jira/browse/ARROW-6176
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


Currently, you can define a custom extension type in Python with 

{code}
class UuidType(pa.ExtensionType):

def __init__(self):
pa.ExtensionType.__init__(self, pa.binary(16))

def __reduce__(self):
return UuidType, ()
{code}

but the array you can create with this is always a plain ExtensionArray. We 
should provide a way to define a subclass (e.g. `UuidArray` in this case) that 
can hold custom logic.

For example, a user might want to define `UuidArray` such that `arr[i]` returns 
an instance of Python's `uuid.UUID`

From https://github.com/apache/arrow/pull/4532#pullrequestreview-249396691
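One way the `__arrow_ext_class__` idea raised in the comments could fit together, sketched in pure Python (the hook name and the `wrap_array` helper are hypothetical stand-ins for `pyarrow_wrap_array`, not pyarrow's actual API):

```python
import uuid

class ExtensionArray:
    """Minimal stand-in for pyarrow's ExtensionArray."""
    def __init__(self, storage):
        self.storage = storage
    def __getitem__(self, i):
        return self.storage[i]

class UuidArray(ExtensionArray):
    def __getitem__(self, i):
        # Custom logic: return a uuid.UUID instead of 16 raw bytes.
        return uuid.UUID(bytes=self.storage[i])

class UuidType:
    # Hypothetical hook: tells the wrap step which array class to use.
    __arrow_ext_class__ = UuidArray

def wrap_array(ext_type, storage):
    """Stand-in for pyarrow_wrap_array: dispatch on the type's hook,
    falling back to the generic ExtensionArray."""
    cls = getattr(ext_type, '__arrow_ext_class__', None) or ExtensionArray
    return cls(storage)

arr = wrap_array(UuidType(), [b'\x00' * 16])
print(type(arr[0]))  # <class 'uuid.UUID'>
```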





[jira] [Updated] (ARROW-6175) [Java] Fix MapVector#getMinorType and extend AbstractContainerVector addOrGet complex vector API

2019-08-08 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6175:
--
Labels: pull-request-available  (was: )

> [Java] Fix MapVector#getMinorType and extend AbstractContainerVector addOrGet 
> complex vector API
> 
>
> Key: ARROW-6175
> URL: https://issues.apache.org/jira/browse/ARROW-6175
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Minor
>  Labels: pull-request-available
>
> i. Currently {{MapVector}} extends {{ListVector}}, so the inherited 
> {{MapVector#getMinorType}} returns the wrong {{MinorType}}.
> ii. {{AbstractContainerVector}} currently only has {{addOrGetList}}, 
> {{addOrGetUnion}} and {{addOrGetStruct}}, which do not cover all complex 
> types such as {{MapVector}} and {{FixedSizeListVector}}.





[jira] [Comment Edited] (ARROW-6173) [Python] error loading csv submodule

2019-08-08 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16903046#comment-16903046
 ] 

Joris Van den Bossche edited comment on ARROW-6173 at 8/8/19 2:54 PM:
--

The IO modules are not imported by default in the main namespace, so you need 
to import them explicitly (the same is true for `pyarrow.parquet`), exactly as 
you did, or with 

{code}
import pyarrow.csv
{code}



was (Author: jorisvandenbossche):
The IO modules are not imported by default, so you need to import it explicitly 
(the same is true for `pyarrow.parquet`), exactly as you did, or with 

{code}
import pyarrow.csv
{code}


> [Python] error loading csv submodule
> 
>
> Key: ARROW-6173
> URL: https://issues.apache.org/jira/browse/ARROW-6173
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.12.0, 0.13.0, 0.14.0, 0.14.1
> Environment: Windows 7, conda 4.7.11
>Reporter: Igor Yastrebov
>Priority: Major
>  Labels: csv
>
> When I create a new environment in conda:
> {code:java}
> conda create -n pyarrow-test python=3.7 pyarrow=0.14.1
> {code}
> and try to read a csv file:
> {code:java}
> import pyarrow as pa
> pa.csv.read_csv('test.csv'){code}
> it fails with an error:
> {code:java}
> Traceback (most recent call last):
> File "", line 1, in 
> AttributeError: module 'pyarrow' has no attribute 'csv'
> {code}
> However, loading it directly works:
> {code:java}
> import pyarrow.csv as pc
> table = pc.read_csv('test.csv')
> {code}
> and using pa.csv.read_csv() after loading it directly also works.





[jira] [Commented] (ARROW-6173) [Python] error loading csv submodule

2019-08-08 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16903046#comment-16903046
 ] 

Joris Van den Bossche commented on ARROW-6173:
--

The IO modules are not imported by default, so you need to import them 
explicitly (the same is true for `pyarrow.parquet`), exactly as you did, or with 

{code}
import pyarrow.csv
{code}


> [Python] error loading csv submodule
> 
>
> Key: ARROW-6173
> URL: https://issues.apache.org/jira/browse/ARROW-6173
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.12.0, 0.13.0, 0.14.0, 0.14.1
> Environment: Windows 7, conda 4.7.11
>Reporter: Igor Yastrebov
>Priority: Major
>  Labels: csv
>
> When I create a new environment in conda:
> {code:java}
> conda create -n pyarrow-test python=3.7 pyarrow=0.14.1
> {code}
> and try to read a csv file:
> {code:java}
> import pyarrow as pa
> pa.csv.read_csv('test.csv'){code}
> it fails with an error:
> {code:java}
> Traceback (most recent call last):
> File "", line 1, in 
> AttributeError: module 'pyarrow' has no attribute 'csv'
> {code}
> However, loading it directly works:
> {code:java}
> import pyarrow.csv as pc
> table = pc.read_csv('test.csv')
> {code}
> and using pa.csv.read_csv() after loading it directly also works.





[jira] [Commented] (ARROW-6174) [C++] Parquet tests produce invalid array

2019-08-08 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16903045#comment-16903045
 ] 

Antoine Pitrou commented on ARROW-6174:
---

Ah, I also noticed that my patch doesn't validate chunk #0 of a chunked array. 
It should :)

> [C++] Parquet tests produce invalid array
> -
>
> Key: ARROW-6174
> URL: https://issues.apache.org/jira/browse/ARROW-6174
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Major
>  Labels: parquet
> Fix For: 0.15.0
>
>
> If I patch {{Table::Validate()}} to also validate the underlying arrays:
> {code:c++}
> diff --git a/cpp/src/arrow/table.cc b/cpp/src/arrow/table.cc
> index 446010f93..e617470b5 100644
> --- a/cpp/src/arrow/table.cc
> +++ b/cpp/src/arrow/table.cc
> @@ -21,6 +21,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  
>  #include "arrow/array.h"
> @@ -184,10 +185,18 @@ Status ChunkedArray::Validate() const {
>}
>  
>const auto& type = *chunks_[0]->type();
> +  // Make sure chunks all have the same type, and validate them
>for (size_t i = 1; i < chunks_.size(); ++i) {
> -if (!chunks_[i]->type()->Equals(type)) {
> +const Array& chunk = *chunks_[i];
> +if (!chunk.type()->Equals(type)) {
>return Status::Invalid("In chunk ", i, " expected type ", 
> type.ToString(),
> - " but saw ", chunks_[i]->type()->ToString());
> + " but saw ", chunk.type()->ToString());
> +}
> +Status st = ValidateArray(chunk);
> +if (!st.ok()) {
> +  std::stringstream ss;
> +  ss << "Chunk " << i << ": " << st.message();
> +  return st.WithMessage(ss.str());
>  }
>}
>return Status::OK();
> @@ -343,7 +352,7 @@ class SimpleTable : public Table {
>}
>  }
>  
> -// Make sure columns are all the same length
> +// Make sure columns are all the same length, and validate them
>  for (int i = 0; i < num_columns(); ++i) {
>const ChunkedArray* col = columns_[i].get();
>if (col->length() != num_rows_) {
> @@ -351,6 +360,12 @@ class SimpleTable : public Table {
> " expected length ", num_rows_, " but got 
> length ",
> col->length());
>}
> +  Status st = col->Validate();
> +  if (!st.ok()) {
> +std::stringstream ss;
> +ss << "Column " << i << ": " << st.message();
> +return st.WithMessage(ss.str());
> +  }
>  }
>  return Status::OK();
>}
> {code}
> ... then {{parquet-arrow-test}} fails and then crashes:
> {code}
> [...]
> [ RUN  ] TestArrowReadWrite.TableWithChunkedColumns
> ../src/parquet/arrow/arrow-reader-writer-test.cc:347: Failure
> Failed
> 'WriteTable(*table, ::arrow::default_memory_pool(), sink, row_group_size, 
> default_writer_properties(), arrow_properties)' failed with Invalid: Column 
> 0: Chunk 1: Final offset invariant not equal to values length: 210!=733
> In ../src/arrow/array.cc, line 1229, code: ValidateListArray(array)
> In ../src/parquet/arrow/writer.cc, line 1210, code: table.Validate()
> In ../src/parquet/arrow/writer.cc, line 1252, code: writer->WriteTable(table, 
> chunk_size)
> ../src/parquet/arrow/arrow-reader-writer-test.cc:419: Failure
> Expected: WriteTableToBuffer(table, row_group_size, arrow_properties, 
> ) doesn't generate new fatal failures in the current thread.
>   Actual: it does.
> /home/antoine/arrow/dev/cpp/build-support/run-test.sh : ligne 97 : 28927 
> Erreur de segmentation  $TEST_EXECUTABLE "$@" 2>&1
>  28930 Fini| $ROOT/build-support/asan_symbolize.py
>  28933 Fini| ${CXXFILT:-c++filt}
>  28936 Fini| 
> $ROOT/build-support/stacktrace_addr2line.pl $TEST_EXECUTABLE
>  28939 Fini| $pipe_cmd 2>&1
>  28941 Fini| tee $LOGFILE
> ~/arrow/dev/cpp/build-test/src/parquet
> {code}





[jira] [Closed] (ARROW-6150) [Python] Intermittent HDFS error

2019-08-08 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-6150.
---
   Resolution: Not A Problem
Fix Version/s: (was: 0.14.1)

> [Python] Intermittent HDFS error
> 
>
> Key: ARROW-6150
> URL: https://issues.apache.org/jira/browse/ARROW-6150
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.1
>Reporter: Saurabh Bajaj
>Priority: Minor
>
> I'm running a Dask-YARN job that dumps a results dictionary into HDFS (code 
> shown in traceback below) using PyArrow's HDFS IO library. However, the job 
> intermittently runs into the error shown below, not every run, only 
> sometimes. I'm unable to determine the root cause of this issue.
>  
> {{ File "/extractor.py", line 87, in __call__ json.dump(results_dict, 
> fp=_UTF8Encoder(f), indent=4) File "pyarrow/io.pxi", line 72, in 
> pyarrow.lib.NativeFile.__exit__ File "pyarrow/io.pxi", line 130, in 
> pyarrow.lib.NativeFile.close File "pyarrow/error.pxi", line 87, in 
> pyarrow.lib.check_status pyarrow.lib.ArrowIOError: HDFS CloseFile failed, 
> errno: 255 (Unknown error 255) Please check that you are connecting to the 
> correct HDFS RPC port}}





[jira] [Reopened] (ARROW-6150) [Python] Intermittent HDFS error

2019-08-08 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reopened ARROW-6150:
-

> [Python] Intermittent HDFS error
> 
>
> Key: ARROW-6150
> URL: https://issues.apache.org/jira/browse/ARROW-6150
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.1
>Reporter: Saurabh Bajaj
>Priority: Minor
> Fix For: 0.14.1
>
>
> I'm running a Dask-YARN job that dumps a results dictionary into HDFS (code 
> shown in traceback below) using PyArrow's HDFS IO library. However, the job 
> intermittently runs into the error shown below, not every run, only 
> sometimes. I'm unable to determine the root cause of this issue.
>  
> {{ File "/extractor.py", line 87, in __call__ json.dump(results_dict, 
> fp=_UTF8Encoder(f), indent=4) File "pyarrow/io.pxi", line 72, in 
> pyarrow.lib.NativeFile.__exit__ File "pyarrow/io.pxi", line 130, in 
> pyarrow.lib.NativeFile.close File "pyarrow/error.pxi", line 87, in 
> pyarrow.lib.check_status pyarrow.lib.ArrowIOError: HDFS CloseFile failed, 
> errno: 255 (Unknown error 255) Please check that you are connecting to the 
> correct HDFS RPC port}}





[jira] [Closed] (ARROW-6169) A bit confused by arrow::install_arrow() in R

2019-08-08 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-6169.
---
Resolution: Duplicate

Duplicate of ARROW-6167

> A bit confused by arrow::install_arrow() in R
> 
>
> Key: ARROW-6169
> URL: https://issues.apache.org/jira/browse/ARROW-6169
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Michael Chirico
>Priority: Minor
>
> Am trying to get up and running on arrow from R for the first time (macOS 
> Mojave 10.14.6)
> Started with
> {code:r}
> install.packages('arrow') #success!
> write_parquet(iris, 'tmp.parquet') #oh no!{code}
> and hit the error:
> > Error in {{Table\_\_from\_dots(dots, schema)}} : Cannot call 
> >{{Table\_\_from\_dots()}}. Please use {{arrow::install_arrow()}} to install 
> >required runtime libraries. 
> OK, easy enough:
> {code:r}
> arrow::install_arrow() 
> {code}
> With output:
> {code:java}
> You may be able to get a development version of the Arrow C++ library using 
> Homebrew: `brew install apache-arrow --HEAD` Or, see the Arrow C++ developer 
> guide  for instructions on 
> building the library from source.
> After you've installed the C++ library, you'll need to reinstall the R 
> package from source to find it.
> Refer to the R package README 
>  for further details.
> If you have other trouble, or if you think this message could be improved, 
> please report an issue here: 
> 
> {code}
> A few points of confusion for me as a first time user:
> A bit surprised I'm being directed to install the development version? If the 
> current CRAN version of {{arrow}} is only compatible with the dev version, I 
> guess that could be made more clear in this message. But on the other hand, 
> the linked GH README suggests the opposite: "On macOS and Windows, installing 
> a binary package from CRAN will handle Arrow’s C++ dependencies for you." 
> However, that doesn't appear to have been the case for me.
> Oh well, let's just try installing the normal version & see if that works:
> {code:r}
> $brew install apache-arrow
> >install.packages('arrow') #reinstall in fresh session
> >arrow::write_parquet(iris, 'tmp.parquet') # same error{code}
> Now I try the dev version:
> {code:r}
> brew install apache-arrow --HEAD
> # Error: apache-arrow 0.14.1 is already installed
> # To install HEAD, first run `brew unlink apache-arrow`.
> brew unlink apache-arrow
> brew install apache-arrow --HEAD
> # Error: An exception occurred within a child process:
> # RuntimeError: /usr/local/opt/autoconf not present or broken
> # Please reinstall autoconf. Sorry :(
> brew install autoconf
> brew install apache-arrow --HEAD
> # Error: An exception occurred within a child process:
> # RuntimeError: /usr/local/opt/cmake not present or broken
> # Please reinstall cmake. Sorry :(
> brew install cmake
> brew install apache-arrow --HEAD
> # cmake ../cpp -DCMAKE_C_FLAGS_RELEASE=-DNDEBUG 
> -DCMAKE_CXX_FLAGS_RELEASE=-DNDEBUG 
> -DCMAKE_INSTALL_PREFIX=/usr/local/Cellar/apache-arrow/HEAD-908b058 
> -DCMAKE_BUILD_TYPE=Release -DCMAKE_FIND_FRAMEWORK=LAST -DCMAKE_VERBOSE_MAKEFIL
> # Last 15 lines from 
> /Users/michael.chirico/Library/Logs/Homebrew/apache-arrow/01.cmake:
> # 
> dlopen(/usr/local/lib/python3.7/site-packages/numpy/core/_multiarray_umath.cpython-37m-darwin.so,
> # 2): Symbol not found: ___addtf3
> # Referenced from: /usr/local/opt/gcc/lib/gcc/9/libquadmath.0.dylib
> # Expected in: /usr/lib/libSystem.B.dylib
> # in /usr/local/opt/gcc/lib/gcc/9/libquadmath.0.dylib
> # Call Stack (most recent call first):
> # src/arrow/python/CMakeLists.txt:23 (find_package){code}
> Poked around a bit about that error and what I see suggests re-installing 
> {{scipy}} but that didn't work ({{pip install scipy}} nor {{pip3 install 
> scipy}}, though the traceback does suggest it's a Python 3 thing)
> So now I'm stuck & not sure how to proceed.
> I'll also add that I'm not sure what to make of this:
> > After you've installed the C++ library, you'll need to reinstall the R 
> >package from source to find it.
> What is "find it" referring to exactly? And installing from source here means 
> {{R CMD build && R CMD INSTALL}} on the cloned repo?





[jira] [Created] (ARROW-6175) [Java] Fix MapVector#getMinorType and extend AbstractContainerVector addOrGet complex vector API

2019-08-08 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6175:
-

 Summary: [Java] Fix MapVector#getMinorType and extend 
AbstractContainerVector addOrGet complex vector API
 Key: ARROW-6175
 URL: https://issues.apache.org/jira/browse/ARROW-6175
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


i. Currently {{MapVector}} extends {{ListVector}}, so {{MapVector#getMinorType}} 
returns the wrong {{MinorType}}.

ii. {{AbstractContainerVector}} currently provides only {{addOrGetList}}, 
{{addOrGetUnion}} and {{addOrGetStruct}}, which do not cover all complex types, 
e.g. {{MapVector}} and {{FixedSizeListVector}}.
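The inheritance pitfall in (i) can be sketched in a few lines. This is a hypothetical Python analogue, not Arrow's actual Java classes: a subclass that relies on its parent's type-reporting method announces the wrong kind until it overrides it.

```python
class ListVector:
    def get_minor_type(self):
        return "LIST"

class MapVector(ListVector):
    # Without this override, the method inherited from ListVector
    # would report "LIST" for a map vector, which is the bug in (i).
    def get_minor_type(self):
        return "MAP"

assert ListVector().get_minor_type() == "LIST"
assert MapVector().get_minor_type() == "MAP"
```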



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Closed] (ARROW-6150) [Python] Intermittent HDFS error

2019-08-08 Thread Saurabh Bajaj (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saurabh Bajaj closed ARROW-6150.

   Resolution: Fixed
Fix Version/s: 0.14.1

> [Python] Intermittent HDFS error
> 
>
> Key: ARROW-6150
> URL: https://issues.apache.org/jira/browse/ARROW-6150
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.1
>Reporter: Saurabh Bajaj
>Priority: Minor
> Fix For: 0.14.1
>
>
> I'm running a Dask-YARN job that dumps a results dictionary into HDFS (code 
> shown in the traceback below) using PyArrow's HDFS IO library. However, the 
> job intermittently fails with the error shown below; it happens on some runs 
> but not others, and I'm unable to determine the root cause.
>  
> {{ File "/extractor.py", line 87, in __call__ json.dump(results_dict, 
> fp=_UTF8Encoder(f), indent=4) File "pyarrow/io.pxi", line 72, in 
> pyarrow.lib.NativeFile.__exit__ File "pyarrow/io.pxi", line 130, in 
> pyarrow.lib.NativeFile.close File "pyarrow/error.pxi", line 87, in 
> pyarrow.lib.check_status pyarrow.lib.ArrowIOError: HDFS CloseFile failed, 
> errno: 255 (Unknown error 255) Please check that you are connecting to the 
> correct HDFS RPC port}}
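The {{_UTF8Encoder}} in the traceback is the reporter's own helper and its implementation is not shown. A minimal stand-in, assuming it simply encodes the text chunks {{json.dump}} produces before writing them to a binary file object, could look like:

```python
import io
import json

class UTF8Encoder:
    # Wraps a binary file object and encodes the str chunks that
    # json.dump writes. The real _UTF8Encoder may differ.
    def __init__(self, binary_fp):
        self._fp = binary_fp

    def write(self, text):
        self._fp.write(text.encode("utf-8"))

buf = io.BytesIO()  # stands in for the pyarrow NativeFile in the traceback
json.dump({"rows": 3}, fp=UTF8Encoder(buf), indent=4)
data = buf.getvalue().decode("utf-8")
assert json.loads(data) == {"rows": 3}
```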





[jira] [Commented] (ARROW-6169) A bit confused by arrow::install_arrow() in R

2019-08-08 Thread Neal Richardson (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16903034#comment-16903034
 ] 

Neal Richardson commented on ARROW-6169:


This is a side effect of https://issues.apache.org/jira/browse/ARROW-6167. 
macOS users shouldn't ever get a build that tells them to {{install_arrow()}}.

[~michaelchirico] see [https://github.com/apache/arrow/blob/master/r/Dockerfile]

> A bit confused by arrow::install_arrow() in R
> 
>
> Key: ARROW-6169
> URL: https://issues.apache.org/jira/browse/ARROW-6169
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Michael Chirico
>Priority: Minor
>
> Am trying to get up and running on arrow from R for the first time (macOS 
> Mojave 10.14.6)
> Started with
> {code:r}
> install.packages('arrow') #success!
> write_parquet(iris, 'tmp.parquet') #oh no!{code}
> and hit the error:
> > Error in {{Table\_\_from\_dots(dots, schema)}} : Cannot call 
> >{{Table\_\_from\_dots()}}. Please use {{arrow::install_arrow()}} to install 
> >required runtime libraries. 
> OK, easy enough:
> {code:r}
> arrow::install_arrow() 
> {code}
> With output:
> {code:java}
> You may be able to get a development version of the Arrow C++ library using 
> Homebrew: `brew install apache-arrow --HEAD` Or, see the Arrow C++ developer 
> guide  for instructions on 
> building the library from source.
> After you've installed the C++ library, you'll need to reinstall the R 
> package from source to find it.
> Refer to the R package README 
>  for further details.
> If you have other trouble, or if you think this message could be improved, 
> please report an issue here: 
> 
> {code}
> A few points of confusion for me as a first time user:
> A bit surprised I'm being directed to install the development version? If the 
> current CRAN version of {{arrow}} is only compatible with the dev version, I 
> guess that could be made more clear in this message. But on the other hand, 
> the linked GH README suggests the opposite: "On macOS and Windows, installing 
> a binary package from CRAN will handle Arrow’s C++ dependencies for you." 
> However, that doesn't appear to have been the case for me.
> Oh well, let's just try installing the normal version & see if that works:
> {code:r}
> $brew install apache-arrow
> >install.packages('arrow') #reinstall in fresh session
> >arrow::write_parquet(iris, 'tmp.parquet') # same error{code}
> Now I try the dev version:
> {code:r}
> brew install apache-arrow --HEAD
> # Error: apache-arrow 0.14.1 is already installed
> # To install HEAD, first run `brew unlink apache-arrow`.
> brew unlink apache-arrow
> brew install apache-arrow --HEAD
> # Error: An exception occurred within a child process:
> # RuntimeError: /usr/local/opt/autoconf not present or broken
> # Please reinstall autoconf. Sorry :(
> brew install autoconf
> brew install apache-arrow --HEAD
> # Error: An exception occurred within a child process:
> # RuntimeError: /usr/local/opt/cmake not present or broken
> # Please reinstall cmake. Sorry :(
> brew install cmake
> brew install apache-arrow --HEAD
> # cmake ../cpp -DCMAKE_C_FLAGS_RELEASE=-DNDEBUG 
> -DCMAKE_CXX_FLAGS_RELEASE=-DNDEBUG 
> -DCMAKE_INSTALL_PREFIX=/usr/local/Cellar/apache-arrow/HEAD-908b058 
> -DCMAKE_BUILD_TYPE=Release -DCMAKE_FIND_FRAMEWORK=LAST -DCMAKE_VERBOSE_MAKEFIL
> # Last 15 lines from 
> /Users/michael.chirico/Library/Logs/Homebrew/apache-arrow/01.cmake:
> # 
> dlopen(/usr/local/lib/python3.7/site-packages/numpy/core/_multiarray_umath.cpython-37m-darwin.so,
> # 2): Symbol not found: ___addtf3
> # Referenced from: /usr/local/opt/gcc/lib/gcc/9/libquadmath.0.dylib
> # Expected in: /usr/lib/libSystem.B.dylib
> # in /usr/local/opt/gcc/lib/gcc/9/libquadmath.0.dylib
> # Call Stack (most recent call first):
> # src/arrow/python/CMakeLists.txt:23 (find_package){code}
> Poked around a bit about that error and what I see suggests re-installing 
> {{scipy}} but that didn't work ({{pip install scipy}} nor {{pip3 install 
> scipy}}, though the traceback does suggest it's a Python 3 thing)
> So now I'm stuck & not sure how to proceed.
> I'll also add that I'm not sure what to make of this:
> > After you've installed the C++ library, you'll need to reinstall the R 
> >package from source to find it.
> What is "find it" referring to exactly? And installing from source here means 
> {{R CMD build && R CMD INSTALL}} on the cloned repo?





[jira] [Commented] (ARROW-6150) [Python] Intermittent HDFS error

2019-08-08 Thread Saurabh Bajaj (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16903033#comment-16903033
 ] 

Saurabh Bajaj commented on ARROW-6150:
--

Turns out this was caused by duplicated computation of "dask.get" tasks on 
Delayed objects, i.e. a user error. Closing this ticket. 
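A stdlib-only sketch of that failure mode, with a plain function standing in for Dask's {{Delayed}} task: wiring the same task into the graph twice runs it twice, and caching (deduplicating) it runs it once.

```python
import functools

calls = []

def dump_results(path):
    # Stand-in for the Delayed task that writes results to HDFS.
    calls.append(path)
    return path

# Duplicated computation: the same task executed twice, so the
# second writer can then trip over the first.
dump_results("/results.json")
dump_results("/results.json")
assert len(calls) == 2

calls.clear()

@functools.lru_cache(maxsize=None)
def dump_once(path):
    calls.append(path)
    return path

dump_once("/results.json")
dump_once("/results.json")
assert len(calls) == 1  # the duplicate call is served from the cache
```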

> [Python] Intermittent HDFS error
> 
>
> Key: ARROW-6150
> URL: https://issues.apache.org/jira/browse/ARROW-6150
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.1
>Reporter: Saurabh Bajaj
>Priority: Minor
>
> I'm running a Dask-YARN job that dumps a results dictionary into HDFS (code 
> shown in the traceback below) using PyArrow's HDFS IO library. However, the 
> job intermittently fails with the error shown below; it happens on some runs 
> but not others, and I'm unable to determine the root cause.
>  
> {{ File "/extractor.py", line 87, in __call__ json.dump(results_dict, 
> fp=_UTF8Encoder(f), indent=4) File "pyarrow/io.pxi", line 72, in 
> pyarrow.lib.NativeFile.__exit__ File "pyarrow/io.pxi", line 130, in 
> pyarrow.lib.NativeFile.close File "pyarrow/error.pxi", line 87, in 
> pyarrow.lib.check_status pyarrow.lib.ArrowIOError: HDFS CloseFile failed, 
> errno: 255 (Unknown error 255) Please check that you are connecting to the 
> correct HDFS RPC port}}





[jira] [Updated] (ARROW-6169) A bit confused by arrow::install_arrow() in R

2019-08-08 Thread Michael Chirico (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Chirico updated ARROW-6169:
---
Description: 
Am trying to get up and running on arrow from R for the first time (macOS 
Mojave 10.14.6)

Started with
{code:r}
install.packages('arrow') #success!
write_parquet(iris, 'tmp.parquet') #oh no!{code}
and hit the error:

> Error in {{Table\_\_from\_dots(dots, schema)}} : Cannot call 
>{{Table__from_dots()}}. Please use {{arrow::install_arrow()}} to install 
>required runtime libraries. 

OK, easy enough:
{code:r}
arrow::install_arrow() 
{code}
With output:
{code:java}
You may be able to get a development version of the Arrow C++ library using 
Homebrew: `brew install apache-arrow --HEAD` Or, see the Arrow C++ developer 
guide  for instructions on 
building the library from source.

After you've installed the C++ library, you'll need to reinstall the R package 
from source to find it.

Refer to the R package README 
 for further details.

If you have other trouble, or if you think this message could be improved, 
please report an issue here: 

{code}
A few points of confusion for me as a first time user:

A bit surprised I'm being directed to install the development version? If the 
current CRAN version of {{arrow}} is only compatible with the dev version, I 
guess that could be made more clear in this message. But on the other hand, the 
linked GH README suggests the opposite: "On macOS and Windows, installing a 
binary package from CRAN will handle Arrow’s C++ dependencies for you." 
However, that doesn't appear to have been the case for me.

Oh well, let's just try installing the normal version & see if that works:
{code:r}
$brew install apache-arrow
>install.packages('arrow') #reinstall in fresh session
>arrow::write_parquet(iris, 'tmp.parquet') # same error{code}
Now I try the dev version:
{code:r}
brew install apache-arrow --HEAD
# Error: apache-arrow 0.14.1 is already installed
# To install HEAD, first run `brew unlink apache-arrow`.
brew unlink apache-arrow
brew install apache-arrow --HEAD
# Error: An exception occurred within a child process:
# RuntimeError: /usr/local/opt/autoconf not present or broken
# Please reinstall autoconf. Sorry :(
brew install autoconf
brew install apache-arrow --HEAD
# Error: An exception occurred within a child process:
# RuntimeError: /usr/local/opt/cmake not present or broken
# Please reinstall cmake. Sorry :(
brew install cmake
brew install apache-arrow --HEAD
# cmake ../cpp -DCMAKE_C_FLAGS_RELEASE=-DNDEBUG 
-DCMAKE_CXX_FLAGS_RELEASE=-DNDEBUG 
-DCMAKE_INSTALL_PREFIX=/usr/local/Cellar/apache-arrow/HEAD-908b058 
-DCMAKE_BUILD_TYPE=Release -DCMAKE_FIND_FRAMEWORK=LAST -DCMAKE_VERBOSE_MAKEFIL
# Last 15 lines from 
/Users/michael.chirico/Library/Logs/Homebrew/apache-arrow/01.cmake:
# 
dlopen(/usr/local/lib/python3.7/site-packages/numpy/core/_multiarray_umath.cpython-37m-darwin.so,
# 2): Symbol not found: ___addtf3
# Referenced from: /usr/local/opt/gcc/lib/gcc/9/libquadmath.0.dylib
# Expected in: /usr/lib/libSystem.B.dylib
# in /usr/local/opt/gcc/lib/gcc/9/libquadmath.0.dylib
# Call Stack (most recent call first):
# src/arrow/python/CMakeLists.txt:23 (find_package){code}
Poked around a bit about that error and what I see suggests re-installing 
{{scipy}} but that didn't work ({{pip install scipy}} nor {{pip3 install 
scipy}}, though the traceback does suggest it's a Python 3 thing)

So now I'm stuck & not sure how to proceed.

I'll also add that I'm not sure what to make of this:

> After you've installed the C++ library, you'll need to reinstall the R 
>package from source to find it.

What is "find it" referring to exactly? And installing from source here means 
{{R CMD build && R CMD INSTALL}} on the cloned repo?

  was:
Am trying to get up and running on arrow from R for the first time (macOS 
Mojave 10.14.6)

Started with
{code:r}
install.packages('arrow') #success!
write_parquet(iris, 'tmp.parquet') #oh no!{code}
and hit the error:

> Error in {{Table__from_dots(dots, schema)}} : Cannot call 
>{{Table__from_dots()}}. Please use {{arrow::install_arrow()}} to install 
>required runtime libraries. 

OK, easy enough:
{code:r}
arrow::install_arrow() 
{code}
With output:
{code:java}
You may be able to get a development version of the Arrow C++ library using 
Homebrew: `brew install apache-arrow --HEAD` Or, see the Arrow C++ developer 
guide  for instructions on 
building the library from source.

After you've installed the C++ library, you'll need to reinstall the R package 
from source to find it.

Refer to the R package README 
 for further details.

If you have other trouble, or if you think 

[jira] [Updated] (ARROW-6169) A bit confused by arrow::install_arrow() in R

2019-08-08 Thread Michael Chirico (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Chirico updated ARROW-6169:
---
Description: 
Am trying to get up and running on arrow from R for the first time (macOS 
Mojave 10.14.6)

Started with
{code:r}
install.packages('arrow') #success!
write_parquet(iris, 'tmp.parquet') #oh no!{code}
and hit the error:

> Error in {{Table\_\_from\_dots(dots, schema)}} : Cannot call 
>{{Table\_\_from\_dots()}}. Please use {{arrow::install_arrow()}} to install 
>required runtime libraries. 

OK, easy enough:
{code:r}
arrow::install_arrow() 
{code}
With output:
{code:java}
You may be able to get a development version of the Arrow C++ library using 
Homebrew: `brew install apache-arrow --HEAD` Or, see the Arrow C++ developer 
guide  for instructions on 
building the library from source.

After you've installed the C++ library, you'll need to reinstall the R package 
from source to find it.

Refer to the R package README 
 for further details.

If you have other trouble, or if you think this message could be improved, 
please report an issue here: 

{code}
A few points of confusion for me as a first time user:

A bit surprised I'm being directed to install the development version? If the 
current CRAN version of {{arrow}} is only compatible with the dev version, I 
guess that could be made more clear in this message. But on the other hand, the 
linked GH README suggests the opposite: "On macOS and Windows, installing a 
binary package from CRAN will handle Arrow’s C++ dependencies for you." 
However, that doesn't appear to have been the case for me.

Oh well, let's just try installing the normal version & see if that works:
{code:r}
$brew install apache-arrow
>install.packages('arrow') #reinstall in fresh session
>arrow::write_parquet(iris, 'tmp.parquet') # same error{code}
Now I try the dev version:
{code:r}
brew install apache-arrow --HEAD
# Error: apache-arrow 0.14.1 is already installed
# To install HEAD, first run `brew unlink apache-arrow`.
brew unlink apache-arrow
brew install apache-arrow --HEAD
# Error: An exception occurred within a child process:
# RuntimeError: /usr/local/opt/autoconf not present or broken
# Please reinstall autoconf. Sorry :(
brew install autoconf
brew install apache-arrow --HEAD
# Error: An exception occurred within a child process:
# RuntimeError: /usr/local/opt/cmake not present or broken
# Please reinstall cmake. Sorry :(
brew install cmake
brew install apache-arrow --HEAD
# cmake ../cpp -DCMAKE_C_FLAGS_RELEASE=-DNDEBUG 
-DCMAKE_CXX_FLAGS_RELEASE=-DNDEBUG 
-DCMAKE_INSTALL_PREFIX=/usr/local/Cellar/apache-arrow/HEAD-908b058 
-DCMAKE_BUILD_TYPE=Release -DCMAKE_FIND_FRAMEWORK=LAST -DCMAKE_VERBOSE_MAKEFIL
# Last 15 lines from 
/Users/michael.chirico/Library/Logs/Homebrew/apache-arrow/01.cmake:
# 
dlopen(/usr/local/lib/python3.7/site-packages/numpy/core/_multiarray_umath.cpython-37m-darwin.so,
# 2): Symbol not found: ___addtf3
# Referenced from: /usr/local/opt/gcc/lib/gcc/9/libquadmath.0.dylib
# Expected in: /usr/lib/libSystem.B.dylib
# in /usr/local/opt/gcc/lib/gcc/9/libquadmath.0.dylib
# Call Stack (most recent call first):
# src/arrow/python/CMakeLists.txt:23 (find_package){code}
Poked around a bit about that error and what I see suggests re-installing 
{{scipy}} but that didn't work ({{pip install scipy}} nor {{pip3 install 
scipy}}, though the traceback does suggest it's a Python 3 thing)

So now I'm stuck & not sure how to proceed.

I'll also add that I'm not sure what to make of this:

> After you've installed the C++ library, you'll need to reinstall the R 
>package from source to find it.

What is "find it" referring to exactly? And installing from source here means 
{{R CMD build && R CMD INSTALL}} on the cloned repo?

  was:
Am trying to get up and running on arrow from R for the first time (macOS 
Mojave 10.14.6)

Started with
{code:r}
install.packages('arrow') #success!
write_parquet(iris, 'tmp.parquet') #oh no!{code}
and hit the error:

> Error in {{Table\_\_from\_dots(dots, schema)}} : Cannot call 
>{{Table__from_dots()}}. Please use {{arrow::install_arrow()}} to install 
>required runtime libraries. 

OK, easy enough:
{code:r}
arrow::install_arrow() 
{code}
With output:
{code:java}
You may be able to get a development version of the Arrow C++ library using 
Homebrew: `brew install apache-arrow --HEAD` Or, see the Arrow C++ developer 
guide  for instructions on 
building the library from source.

After you've installed the C++ library, you'll need to reinstall the R package 
from source to find it.

Refer to the R package README 
 for further details.

If you have other trouble, or if you 

[jira] [Updated] (ARROW-6169) A bit confused by arrow::install_arrow() in R

2019-08-08 Thread Michael Chirico (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Chirico updated ARROW-6169:
---
Description: 
Am trying to get up and running on arrow from R for the first time (macOS 
Mojave 10.14.6)

Started with
{code:r}
install.packages('arrow') #success!
write_parquet(iris, 'tmp.parquet') #oh no!{code}
and hit the error:

> Error in {{Table__from_dots(dots, schema)}} : Cannot call 
>{{Table__from_dots()}}. Please use {{arrow::install_arrow()}} to install 
>required runtime libraries. 

OK, easy enough:
{code:r}
arrow::install_arrow() 
{code}
With output:
{code:java}
You may be able to get a development version of the Arrow C++ library using 
Homebrew: `brew install apache-arrow --HEAD` Or, see the Arrow C++ developer 
guide  for instructions on 
building the library from source.

After you've installed the C++ library, you'll need to reinstall the R package 
from source to find it.

Refer to the R package README 
 for further details.

If you have other trouble, or if you think this message could be improved, 
please report an issue here: 

{code}
A few points of confusion for me as a first time user:

A bit surprised I'm being directed to install the development version? If the 
current CRAN version of {{arrow}} is only compatible with the dev version, I 
guess that could be made more clear in this message. But on the other hand, the 
linked GH README suggests the opposite: "On macOS and Windows, installing a 
binary package from CRAN will handle Arrow’s C++ dependencies for you." 
However, that doesn't appear to have been the case for me.

Oh well, let's just try installing the normal version & see if that works:
{code:r}
$brew install apache-arrow
>install.packages('arrow') #reinstall in fresh session
>arrow::write_parquet(iris, 'tmp.parquet') # same error{code}
Now I try the dev version:
{code:r}
brew install apache-arrow --HEAD
# Error: apache-arrow 0.14.1 is already installed
# To install HEAD, first run `brew unlink apache-arrow`.
brew unlink apache-arrow
brew install apache-arrow --HEAD
# Error: An exception occurred within a child process:
# RuntimeError: /usr/local/opt/autoconf not present or broken
# Please reinstall autoconf. Sorry :(
brew install autoconf
brew install apache-arrow --HEAD
# Error: An exception occurred within a child process:
# RuntimeError: /usr/local/opt/cmake not present or broken
# Please reinstall cmake. Sorry :(
brew install cmake
brew install apache-arrow --HEAD
# cmake ../cpp -DCMAKE_C_FLAGS_RELEASE=-DNDEBUG 
-DCMAKE_CXX_FLAGS_RELEASE=-DNDEBUG 
-DCMAKE_INSTALL_PREFIX=/usr/local/Cellar/apache-arrow/HEAD-908b058 
-DCMAKE_BUILD_TYPE=Release -DCMAKE_FIND_FRAMEWORK=LAST -DCMAKE_VERBOSE_MAKEFIL
# Last 15 lines from 
/Users/michael.chirico/Library/Logs/Homebrew/apache-arrow/01.cmake:
# 
dlopen(/usr/local/lib/python3.7/site-packages/numpy/core/_multiarray_umath.cpython-37m-darwin.so,
# 2): Symbol not found: ___addtf3
# Referenced from: /usr/local/opt/gcc/lib/gcc/9/libquadmath.0.dylib
# Expected in: /usr/lib/libSystem.B.dylib
# in /usr/local/opt/gcc/lib/gcc/9/libquadmath.0.dylib
# Call Stack (most recent call first):
# src/arrow/python/CMakeLists.txt:23 (find_package){code}
Poked around a bit about that error and what I see suggests re-installing 
{{scipy}} but that didn't work ({{pip install scipy}} nor {{pip3 install 
scipy}}, though the traceback does suggest it's a Python 3 thing)

So now I'm stuck & not sure how to proceed.

I'll also add that I'm not sure what to make of this:

> After you've installed the C++ library, you'll need to reinstall the R 
>package from source to find it.

What is "find it" referring to exactly? And installing from source here means 
{{R CMD build && R CMD INSTALL}} on the cloned repo?

  was:
Am trying to get up and running on arrow from R for the first time (macOS 
Mojave 10.14.6)

Started with
{code:java}
install.packages('arrow') #success!
write_parquet(iris, 'tmp.parquet') #oh no!{code}
and hit the error:

> Error in Table__from_dots(dots, schema) : Cannot call Table__from_dots(). 
>Please use arrow::install_arrow() to install required runtime libraries. 

OK, easy enough:
{code:java}
arrow::install_arrow() 
{code}
With output:
{code:java}
You may be able to get a development version of the Arrow C++ library using 
Homebrew: `brew install apache-arrow --HEAD` Or, see the Arrow C++ developer 
guide  for instructions on 
building the library from source.

After you've installed the C++ library, you'll need to reinstall the R package 
from source to find it.

Refer to the R package README 
 for further details.

If you have other trouble, or if you think this 

[jira] [Updated] (ARROW-6174) [C++] Parquet tests produce invalid array

2019-08-08 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6174:

Fix Version/s: 0.15.0

> [C++] Parquet tests produce invalid array
> -
>
> Key: ARROW-6174
> URL: https://issues.apache.org/jira/browse/ARROW-6174
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 0.15.0
>
>
> If I patch {{Table::Validate()}} to also validate the underlying arrays:
> {code:c++}
> diff --git a/cpp/src/arrow/table.cc b/cpp/src/arrow/table.cc
> index 446010f93..e617470b5 100644
> --- a/cpp/src/arrow/table.cc
> +++ b/cpp/src/arrow/table.cc
> @@ -21,6 +21,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  
>  #include "arrow/array.h"
> @@ -184,10 +185,18 @@ Status ChunkedArray::Validate() const {
>}
>  
>const auto& type = *chunks_[0]->type();
> +  // Make sure chunks all have the same type, and validate them
>for (size_t i = 1; i < chunks_.size(); ++i) {
> -if (!chunks_[i]->type()->Equals(type)) {
> +const Array& chunk = *chunks_[i];
> +if (!chunk.type()->Equals(type)) {
>return Status::Invalid("In chunk ", i, " expected type ", 
> type.ToString(),
> - " but saw ", chunks_[i]->type()->ToString());
> + " but saw ", chunk.type()->ToString());
> +}
> +Status st = ValidateArray(chunk);
> +if (!st.ok()) {
> +  std::stringstream ss;
> +  ss << "Chunk " << i << ": " << st.message();
> +  return st.WithMessage(ss.str());
>  }
>}
>return Status::OK();
> @@ -343,7 +352,7 @@ class SimpleTable : public Table {
>}
>  }
>  
> -// Make sure columns are all the same length
> +// Make sure columns are all the same length, and validate them
>  for (int i = 0; i < num_columns(); ++i) {
>const ChunkedArray* col = columns_[i].get();
>if (col->length() != num_rows_) {
> @@ -351,6 +360,12 @@ class SimpleTable : public Table {
> " expected length ", num_rows_, " but got 
> length ",
> col->length());
>}
> +  Status st = col->Validate();
> +  if (!st.ok()) {
> +std::stringstream ss;
> +ss << "Column " << i << ": " << st.message();
> +return st.WithMessage(ss.str());
> +  }
>  }
>  return Status::OK();
>}
> {code}
> ... then {{parquet-arrow-test}} fails and then crashes:
> {code}
> [...]
> [ RUN  ] TestArrowReadWrite.TableWithChunkedColumns
> ../src/parquet/arrow/arrow-reader-writer-test.cc:347: Failure
> Failed
> 'WriteTable(*table, ::arrow::default_memory_pool(), sink, row_group_size, 
> default_writer_properties(), arrow_properties)' failed with Invalid: Column 
> 0: Chunk 1: Final offset invariant not equal to values length: 210!=733
> In ../src/arrow/array.cc, line 1229, code: ValidateListArray(array)
> In ../src/parquet/arrow/writer.cc, line 1210, code: table.Validate()
> In ../src/parquet/arrow/writer.cc, line 1252, code: writer->WriteTable(table, 
> chunk_size)
> ../src/parquet/arrow/arrow-reader-writer-test.cc:419: Failure
> Expected: WriteTableToBuffer(table, row_group_size, arrow_properties, 
> ) doesn't generate new fatal failures in the current thread.
>   Actual: it does.
> /home/antoine/arrow/dev/cpp/build-support/run-test.sh : ligne 97 : 28927 
> Erreur de segmentation  $TEST_EXECUTABLE "$@" 2>&1
>  28930 Fini| $ROOT/build-support/asan_symbolize.py
>  28933 Fini| ${CXXFILT:-c++filt}
>  28936 Fini| 
> $ROOT/build-support/stacktrace_addr2line.pl $TEST_EXECUTABLE
>  28939 Fini| $pipe_cmd 2>&1
>  28941 Fini| tee $LOGFILE
> ~/arrow/dev/cpp/build-test/src/parquet
> {code}
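The invariant behind the failure message ("Final offset invariant not equal to values length: 210!=733") can be checked in a few lines. This is a simplified Python sketch of the variable-size list layout rule, not Arrow's actual {{ValidateArray}}: offsets start at 0, are non-decreasing, and the final offset equals the length of the values buffer.

```python
def validate_list_offsets(offsets, values):
    # Returns None if the list-array offsets are valid,
    # otherwise a message describing the violated invariant.
    if offsets[0] != 0:
        return f"first offset is {offsets[0]}, expected 0"
    if any(a > b for a, b in zip(offsets, offsets[1:])):
        return "offsets are not non-decreasing"
    if offsets[-1] != len(values):
        return (f"Final offset invariant not equal to values length: "
                f"{offsets[-1]}!={len(values)}")
    return None

# A well-formed list array: 2 lists covering 5 values.
assert validate_list_offsets([0, 2, 5], ["a"] * 5) is None
# The shape of the failure reported above: final offset 210, 733 values.
assert "210!=733" in validate_list_offsets([0, 100, 210], ["x"] * 733)
```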





[jira] [Updated] (ARROW-6174) [C++] Parquet tests produce invalid array

2019-08-08 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6174:

Labels: parquet  (was: )

> [C++] Parquet tests produce invalid array
> -
>
> Key: ARROW-6174
> URL: https://issues.apache.org/jira/browse/ARROW-6174
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Major
>  Labels: parquet
> Fix For: 0.15.0
>
>
> If I patch {{Table::Validate()}} to also validate the underlying arrays:
> {code:c++}
> diff --git a/cpp/src/arrow/table.cc b/cpp/src/arrow/table.cc
> index 446010f93..e617470b5 100644
> --- a/cpp/src/arrow/table.cc
> +++ b/cpp/src/arrow/table.cc
> @@ -21,6 +21,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  
>  #include "arrow/array.h"
> @@ -184,10 +185,18 @@ Status ChunkedArray::Validate() const {
>}
>  
>const auto& type = *chunks_[0]->type();
> +  // Make sure chunks all have the same type, and validate them
>for (size_t i = 1; i < chunks_.size(); ++i) {
> -if (!chunks_[i]->type()->Equals(type)) {
> +const Array& chunk = *chunks_[i];
> +if (!chunk.type()->Equals(type)) {
>return Status::Invalid("In chunk ", i, " expected type ", 
> type.ToString(),
> - " but saw ", chunks_[i]->type()->ToString());
> + " but saw ", chunk.type()->ToString());
> +}
> +Status st = ValidateArray(chunk);
> +if (!st.ok()) {
> +  std::stringstream ss;
> +  ss << "Chunk " << i << ": " << st.message();
> +  return st.WithMessage(ss.str());
>  }
>}
>return Status::OK();
> @@ -343,7 +352,7 @@ class SimpleTable : public Table {
>}
>  }
>  
> -// Make sure columns are all the same length
> +// Make sure columns are all the same length, and validate them
>  for (int i = 0; i < num_columns(); ++i) {
>const ChunkedArray* col = columns_[i].get();
>if (col->length() != num_rows_) {
> @@ -351,6 +360,12 @@ class SimpleTable : public Table {
> " expected length ", num_rows_, " but got 
> length ",
> col->length());
>}
> +  Status st = col->Validate();
> +  if (!st.ok()) {
> +std::stringstream ss;
> +ss << "Column " << i << ": " << st.message();
> +return st.WithMessage(ss.str());
> +  }
>  }
>  return Status::OK();
>}
> {code}
> ... then {{parquet-arrow-test}} fails and then crashes:
> {code}
> [...]
> [ RUN  ] TestArrowReadWrite.TableWithChunkedColumns
> ../src/parquet/arrow/arrow-reader-writer-test.cc:347: Failure
> Failed
> 'WriteTable(*table, ::arrow::default_memory_pool(), sink, row_group_size, 
> default_writer_properties(), arrow_properties)' failed with Invalid: Column 
> 0: Chunk 1: Final offset invariant not equal to values length: 210!=733
> In ../src/arrow/array.cc, line 1229, code: ValidateListArray(array)
> In ../src/parquet/arrow/writer.cc, line 1210, code: table.Validate()
> In ../src/parquet/arrow/writer.cc, line 1252, code: writer->WriteTable(table, 
> chunk_size)
> ../src/parquet/arrow/arrow-reader-writer-test.cc:419: Failure
> Expected: WriteTableToBuffer(table, row_group_size, arrow_properties, 
> ) doesn't generate new fatal failures in the current thread.
>   Actual: it does.
> /home/antoine/arrow/dev/cpp/build-support/run-test.sh : ligne 97 : 28927 
> Erreur de segmentation  $TEST_EXECUTABLE "$@" 2>&1
>  28930 Fini| $ROOT/build-support/asan_symbolize.py
>  28933 Fini| ${CXXFILT:-c++filt}
>  28936 Fini| 
> $ROOT/build-support/stacktrace_addr2line.pl $TEST_EXECUTABLE
>  28939 Fini| $pipe_cmd 2>&1
>  28941 Fini| tee $LOGFILE
> ~/arrow/dev/cpp/build-test/src/parquet
> {code}





[jira] [Commented] (ARROW-6174) [C++] Parquet tests produce invalid array

2019-08-08 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16903009#comment-16903009
 ] 

Antoine Pitrou commented on ARROW-6174:
---

[~wesmckinn]

> [C++] Parquet tests produce invalid array
> -
>
> Key: ARROW-6174
> URL: https://issues.apache.org/jira/browse/ARROW-6174
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Major
>
> If I patch {{Table::Validate()}} to also validate the underlying arrays:
> {code:c++}
> diff --git a/cpp/src/arrow/table.cc b/cpp/src/arrow/table.cc
> index 446010f93..e617470b5 100644
> --- a/cpp/src/arrow/table.cc
> +++ b/cpp/src/arrow/table.cc
> @@ -21,6 +21,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  
>  #include "arrow/array.h"
> @@ -184,10 +185,18 @@ Status ChunkedArray::Validate() const {
>}
>  
>const auto& type = *chunks_[0]->type();
> +  // Make sure chunks all have the same type, and validate them
>for (size_t i = 1; i < chunks_.size(); ++i) {
> -if (!chunks_[i]->type()->Equals(type)) {
> +const Array& chunk = *chunks_[i];
> +if (!chunk.type()->Equals(type)) {
>return Status::Invalid("In chunk ", i, " expected type ", 
> type.ToString(),
> - " but saw ", chunks_[i]->type()->ToString());
> + " but saw ", chunk.type()->ToString());
> +}
> +Status st = ValidateArray(chunk);
> +if (!st.ok()) {
> +  std::stringstream ss;
> +  ss << "Chunk " << i << ": " << st.message();
> +  return st.WithMessage(ss.str());
>  }
>}
>return Status::OK();
> @@ -343,7 +352,7 @@ class SimpleTable : public Table {
>}
>  }
>  
> -// Make sure columns are all the same length
> +// Make sure columns are all the same length, and validate them
>  for (int i = 0; i < num_columns(); ++i) {
>const ChunkedArray* col = columns_[i].get();
>if (col->length() != num_rows_) {
> @@ -351,6 +360,12 @@ class SimpleTable : public Table {
> " expected length ", num_rows_, " but got 
> length ",
> col->length());
>}
> +  Status st = col->Validate();
> +  if (!st.ok()) {
> +std::stringstream ss;
> +ss << "Column " << i << ": " << st.message();
> +return st.WithMessage(ss.str());
> +  }
>  }
>  return Status::OK();
>}
> {code}
> ... then {{parquet-arrow-test}} fails and then crashes:
> {code}
> [...]
> [ RUN  ] TestArrowReadWrite.TableWithChunkedColumns
> ../src/parquet/arrow/arrow-reader-writer-test.cc:347: Failure
> Failed
> 'WriteTable(*table, ::arrow::default_memory_pool(), sink, row_group_size, 
> default_writer_properties(), arrow_properties)' failed with Invalid: Column 
> 0: Chunk 1: Final offset invariant not equal to values length: 210!=733
> In ../src/arrow/array.cc, line 1229, code: ValidateListArray(array)
> In ../src/parquet/arrow/writer.cc, line 1210, code: table.Validate()
> In ../src/parquet/arrow/writer.cc, line 1252, code: writer->WriteTable(table, 
> chunk_size)
> ../src/parquet/arrow/arrow-reader-writer-test.cc:419: Failure
> Expected: WriteTableToBuffer(table, row_group_size, arrow_properties, 
> ) doesn't generate new fatal failures in the current thread.
>   Actual: it does.
> /home/antoine/arrow/dev/cpp/build-support/run-test.sh : ligne 97 : 28927 
> Erreur de segmentation  $TEST_EXECUTABLE "$@" 2>&1
>  28930 Fini| $ROOT/build-support/asan_symbolize.py
>  28933 Fini| ${CXXFILT:-c++filt}
>  28936 Fini| 
> $ROOT/build-support/stacktrace_addr2line.pl $TEST_EXECUTABLE
>  28939 Fini| $pipe_cmd 2>&1
>  28941 Fini| tee $LOGFILE
> ~/arrow/dev/cpp/build-test/src/parquet
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6174) [C++] Parquet tests produce invalid array

2019-08-08 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-6174:
-

 Summary: [C++] Parquet tests produce invalid array
 Key: ARROW-6174
 URL: https://issues.apache.org/jira/browse/ARROW-6174
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Antoine Pitrou


If I patch {{Table::Validate()}} to also validate the underlying arrays:
{code:c++}
diff --git a/cpp/src/arrow/table.cc b/cpp/src/arrow/table.cc
index 446010f93..e617470b5 100644
--- a/cpp/src/arrow/table.cc
+++ b/cpp/src/arrow/table.cc
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include <sstream>
 #include 
 
 #include "arrow/array.h"
@@ -184,10 +185,18 @@ Status ChunkedArray::Validate() const {
   }
 
   const auto& type = *chunks_[0]->type();
+  // Make sure chunks all have the same type, and validate them
   for (size_t i = 1; i < chunks_.size(); ++i) {
-if (!chunks_[i]->type()->Equals(type)) {
+const Array& chunk = *chunks_[i];
+if (!chunk.type()->Equals(type)) {
   return Status::Invalid("In chunk ", i, " expected type ", type.ToString(),
- " but saw ", chunks_[i]->type()->ToString());
+ " but saw ", chunk.type()->ToString());
+}
+Status st = ValidateArray(chunk);
+if (!st.ok()) {
+  std::stringstream ss;
+  ss << "Chunk " << i << ": " << st.message();
+  return st.WithMessage(ss.str());
 }
   }
   return Status::OK();
@@ -343,7 +352,7 @@ class SimpleTable : public Table {
   }
 }
 
-// Make sure columns are all the same length
+// Make sure columns are all the same length, and validate them
 for (int i = 0; i < num_columns(); ++i) {
   const ChunkedArray* col = columns_[i].get();
   if (col->length() != num_rows_) {
@@ -351,6 +360,12 @@ class SimpleTable : public Table {
" expected length ", num_rows_, " but got 
length ",
col->length());
   }
+  Status st = col->Validate();
+  if (!st.ok()) {
+std::stringstream ss;
+ss << "Column " << i << ": " << st.message();
+return st.WithMessage(ss.str());
+  }
 }
 return Status::OK();
   }
{code}

... then {{parquet-arrow-test}} fails and then crashes:
{code}
[...]
[ RUN  ] TestArrowReadWrite.TableWithChunkedColumns
../src/parquet/arrow/arrow-reader-writer-test.cc:347: Failure
Failed
'WriteTable(*table, ::arrow::default_memory_pool(), sink, row_group_size, 
default_writer_properties(), arrow_properties)' failed with Invalid: Column 0: 
Chunk 1: Final offset invariant not equal to values length: 210!=733
In ../src/arrow/array.cc, line 1229, code: ValidateListArray(array)
In ../src/parquet/arrow/writer.cc, line 1210, code: table.Validate()
In ../src/parquet/arrow/writer.cc, line 1252, code: writer->WriteTable(table, 
chunk_size)
../src/parquet/arrow/arrow-reader-writer-test.cc:419: Failure
Expected: WriteTableToBuffer(table, row_group_size, arrow_properties, ) 
doesn't generate new fatal failures in the current thread.
  Actual: it does.
/home/antoine/arrow/dev/cpp/build-support/run-test.sh : ligne 97 : 28927 Erreur 
de segmentation  $TEST_EXECUTABLE "$@" 2>&1
 28930 Fini| $ROOT/build-support/asan_symbolize.py
 28933 Fini| ${CXXFILT:-c++filt}
 28936 Fini| 
$ROOT/build-support/stacktrace_addr2line.pl $TEST_EXECUTABLE
 28939 Fini| $pipe_cmd 2>&1
 28941 Fini| tee $LOGFILE
~/arrow/dev/cpp/build-test/src/parquet

{code}
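The "final offset invariant" named in the failure is simple to state. Below is a minimal stdlib-Python sketch of the check, for illustration only (it is not Arrow's actual `ValidateListArray` implementation): a list array's offsets buffer must end at exactly the length of its child values array, and a sliced chunk that keeps the full values buffer violates this (210 != 733 in the log above).

```python
def validate_list_offsets(offsets, values_length):
    # The invariant the failing test reports: the final entry of the
    # offsets buffer must equal the length of the child values array.
    if offsets[-1] != values_length:
        raise ValueError(
            "Final offset invariant not equal to values length: "
            f"{offsets[-1]}!={values_length}")

validate_list_offsets([0, 3, 5], 5)            # valid: last offset == 5
try:
    validate_list_offsets([0, 100, 210], 733)  # invalid, like the log above
    failed = False
except ValueError:
    failed = True
```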





[jira] [Commented] (ARROW-6169) A bit confused by arrow::install_arrow() in R

2019-08-08 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16903008#comment-16903008
 ] 

Wes McKinney commented on ARROW-6169:
-

{{install_arrow()}} should only be directing users to install versions on 
Homebrew corresponding to releases

[~romainfrancois] [~npr]

> A bit confused by arrow::install_arrow() in R
> 
>
> Key: ARROW-6169
> URL: https://issues.apache.org/jira/browse/ARROW-6169
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Michael Chirico
>Priority: Minor
>
> Am trying to get up and running on arrow from R for the first time (macOS 
> Mojave 10.14.6)
> Started with
> {code:java}
> install.packages('arrow') #success!
> write_parquet(iris, 'tmp.parquet') #oh no!{code}
> and hit the error:
> > Error in Table__from_dots(dots, schema) : Cannot call Table__from_dots(). 
> >Please use arrow::install_arrow() to install required runtime libraries. 
> OK, easy enough:
> {code:java}
> arrow::install_arrow() 
> {code}
> With output:
> {code:java}
> You may be able to get a development version of the Arrow C++ library using 
> Homebrew: `brew install apache-arrow --HEAD` Or, see the Arrow C++ developer 
> guide  for instructions on 
> building the library from source.
> After you've installed the C++ library, you'll need to reinstall the R 
> package from source to find it.
> Refer to the R package README 
>  for further details.
> If you have other trouble, or if you think this message could be improved, 
> please report an issue here: 
> 
> {code}
> A few points of confusion for me as a first time user:
> A bit surprised I'm being directed to install the development version? If the 
> current CRAN version of {{arrow}} is only compatible with the dev version, I 
> guess that could be made more clear in this message. But on the other hand, 
> the linked GH README suggests the opposite: "On macOS and Windows, installing 
> a binary package from CRAN will handle Arrow’s C++ dependencies for you." 
> However, that doesn't appear to have been the case for me.
> Oh well, let's just try installing the normal version & see if that works:
> {code:java}
> $brew install apache-arrow
> >install.packages('arrow') #reinstall in fresh session
> >arrow::write_parquet(iris, 'tmp.parquet') # same error{code}
> Now I try the dev version:
> {code:java}
> brew install apache-arrow --HEAD
> # Error: apache-arrow 0.14.1 is already installed
> # To install HEAD, first run `brew unlink apache-arrow`.
> brew unlink apache-arrow
> brew install apache-arrow --HEAD
> # Error: An exception occurred within a child process:
> # RuntimeError: /usr/local/opt/autoconf not present or broken
> # Please reinstall autoconf. Sorry :(
> brew install autoconf
> brew install apache-arrow --HEAD
> # Error: An exception occurred within a child process:
> # RuntimeError: /usr/local/opt/cmake not present or broken
> # Please reinstall cmake. Sorry :(
> brew install cmake
> brew install apache-arrow --HEAD
> # cmake ../cpp -DCMAKE_C_FLAGS_RELEASE=-DNDEBUG 
> -DCMAKE_CXX_FLAGS_RELEASE=-DNDEBUG 
> -DCMAKE_INSTALL_PREFIX=/usr/local/Cellar/apache-arrow/HEAD-908b058 
> -DCMAKE_BUILD_TYPE=Release -DCMAKE_FIND_FRAMEWORK=LAST -DCMAKE_VERBOSE_MAKEFIL
> # Last 15 lines from 
> /Users/michael.chirico/Library/Logs/Homebrew/apache-arrow/01.cmake:
> # 
> dlopen(/usr/local/lib/python3.7/site-packages/numpy/core/_multiarray_umath.cpython-37m-darwin.so,
> # 2): Symbol not found: ___addtf3
> # Referenced from: /usr/local/opt/gcc/lib/gcc/9/libquadmath.0.dylib
> # Expected in: /usr/lib/libSystem.B.dylib
> # in /usr/local/opt/gcc/lib/gcc/9/libquadmath.0.dylib
> # Call Stack (most recent call first):
> # src/arrow/python/CMakeLists.txt:23 (find_package){code}
> Poked around a bit about that error and what I see suggests re-installing 
> {{scipy}} but that didn't work ({{pip install scipy}} nor {{pip3 install 
> scipy}}, though the traceback does suggest it's a Python 3 thing)
> So now I'm stuck & not sure how to proceed.
> I'll also add that I'm not sure what to make of this:
> > After you've installed the C++ library, you'll need to reinstall the R 
> >package from source to find it.
> What is "find it" referring to exactly? And installing from source here means 
> {{R CMD build && R CMD INSTALL}} on the cloned repo?





[jira] [Updated] (ARROW-6173) [Python] error loading csv submodule

2019-08-08 Thread Igor Yastrebov (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Igor Yastrebov updated ARROW-6173:
--
Affects Version/s: 0.12.0
   0.13.0
   0.14.0

> [Python] error loading csv submodule
> 
>
> Key: ARROW-6173
> URL: https://issues.apache.org/jira/browse/ARROW-6173
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.12.0, 0.13.0, 0.14.0, 0.14.1
> Environment: Windows 7, conda 4.7.11
>Reporter: Igor Yastrebov
>Priority: Major
>  Labels: csv
>
> When I create a new environment in conda:
> {code:java}
> conda create -n pyarrow-test python=3.7 pyarrow=0.14.1
> {code}
> and try to read a csv file:
> {code:java}
> import pyarrow as pa
> pa.csv.read_csv('test.csv'){code}
> it fails with an error:
> {code:java}
> Traceback (most recent call last):
> File "", line 1, in 
> AttributeError: module 'pyarrow' has no attribute 'csv'
> {code}
> However, loading it directly works:
> {code:java}
> import pyarrow.csv as pc
> table = pc.read_csv('test.csv')
> {code}
> and using pa.csv.read_csv() after loading it directly also works.
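The behaviour described is standard Python package semantics: importing a package does not bind its submodules as attributes unless the package's {{__init__}} imports them. A stand-alone demonstration with an invented throwaway package (the name {{pkgdemo}} is hypothetical, standing in for {{pyarrow}}/{{pyarrow.csv}}):

```python
import os
import sys
import tempfile

# Build a throwaway package "pkgdemo" with a submodule "sub".
tmp = tempfile.mkdtemp()
pkg = os.path.join(tmp, "pkgdemo")
os.makedirs(pkg)
open(os.path.join(pkg, "__init__.py"), "w").close()
with open(os.path.join(pkg, "sub.py"), "w") as f:
    f.write("VALUE = 42\n")
sys.path.insert(0, tmp)

import pkgdemo
bound_before = hasattr(pkgdemo, "sub")  # False: the pa.csv AttributeError

import pkgdemo.sub                      # explicit import binds the attribute
bound_after = hasattr(pkgdemo, "sub")   # True: pkgdemo.sub now resolves
```

This also explains why {{pa.csv.read_csv()}} works after {{import pyarrow.csv}}: the explicit import sets {{csv}} as an attribute of the already-imported {{pyarrow}} module object.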





[jira] [Commented] (ARROW-3246) [Python][Parquet] direct reading/writing of pandas categoricals in parquet

2019-08-08 Thread Hatem Helal (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16902896#comment-16902896
 ] 

Hatem Helal commented on ARROW-3246:


> If the dictionary is written all at once then this property can be 
> circumvented; that would be my plan.

I like that plan.
 
 

> [Python][Parquet] direct reading/writing of pandas categoricals in parquet
> --
>
> Key: ARROW-3246
> URL: https://issues.apache.org/jira/browse/ARROW-3246
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Martin Durant
>Assignee: Wes McKinney
>Priority: Minor
>  Labels: parquet
> Fix For: 1.0.0
>
>
> Parquet supports "dictionary encoding" of column data in a manner very 
> similar to the concept of Categoricals in pandas. It is natural to use this 
> encoding for a column which originated as a categorical. Conversely, when 
> loading, if the file metadata says that a given column came from a pandas (or 
> arrow) categorical, then we can trust that the whole of the column is 
> dictionary-encoded and load the data directly into a categorical column, 
> rather than expanding the labels upon load and recategorising later.
> If the data does not have the pandas metadata, then the guarantee cannot 
> hold, and we cannot assume either that the whole column is dictionary encoded 
> or that the labels are the same throughout. In this case, the current 
> behaviour is fine.
>  
> (please forgive that some of this has already been mentioned elsewhere; this 
> is one of the entries in the list at 
> [https://github.com/dask/fastparquet/issues/374] as a feature that is useful 
> in fastparquet)
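For readers unfamiliar with the encoding under discussion, here is a minimal sketch of dictionary encoding in plain Python (illustrative only; Arrow and Parquet implement this at the buffer level with typed arrays):

```python
def dictionary_encode(values):
    # Store each distinct value once and represent the column as integer
    # indices into that dictionary -- the same idea as a pandas
    # Categorical or a Parquet dictionary-encoded column chunk.
    dictionary = []
    index_of = {}
    indices = []
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        indices.append(index_of[v])
    return dictionary, indices

dictionary, indices = dictionary_encode(["a", "b", "a", "c", "b"])
```

The round trip discussed above amounts to preserving {{dictionary}} and {{indices}} as-is on write, instead of expanding {{indices}} back into repeated labels and re-deriving the dictionary on read.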





[jira] [Updated] (ARROW-6173) [Python] error loading csv submodule

2019-08-08 Thread Igor Yastrebov (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Igor Yastrebov updated ARROW-6173:
--
Description: 
When I create a new environment in conda:
{code:java}
conda create -n pyarrow-test python=3.7 pyarrow=0.14.1
{code}
and try to read a csv file:
{code:java}
import pyarrow as pa
pa.csv.read_csv('test.csv'){code}
it fails with an error:
{code:java}
Traceback (most recent call last):
File "", line 1, in 
AttributeError: module 'pyarrow' has no attribute 'csv'
{code}
However, loading it directly works:
{code:java}
import pyarrow.csv as pc
table = pc.read_csv('test.csv')
{code}
and using pa.csv.read_csv() after loading it directly also works.

  was:
When I create a new environment in conda:
{code:java}
conda create -n pyarrow-test python=3.7 pyarrow=0.14.1
{code}
and try to read a csv file:
{code:java}
import pyarrow as pa
pa.read_csv('test.csv'){code}
it fails with an error:
{code:java}
Traceback (most recent call last):
File "", line 1, in 
AttributeError: module 'pyarrow' has no attribute 'csv'
{code}
However, loading it directly works:
{code:java}
import pyarrow.csv as pc
table = pc.read_csv('test.csv')
{code}
and using pa.csv.read_csv() after loading it directly also works.


> [Python] error loading csv submodule
> 
>
> Key: ARROW-6173
> URL: https://issues.apache.org/jira/browse/ARROW-6173
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.1
> Environment: Windows 7, conda 4.7.11
>Reporter: Igor Yastrebov
>Priority: Major
>  Labels: csv
>
> When I create a new environment in conda:
> {code:java}
> conda create -n pyarrow-test python=3.7 pyarrow=0.14.1
> {code}
> and try to read a csv file:
> {code:java}
> import pyarrow as pa
> pa.csv.read_csv('test.csv'){code}
> it fails with an error:
> {code:java}
> Traceback (most recent call last):
> File "", line 1, in 
> AttributeError: module 'pyarrow' has no attribute 'csv'
> {code}
> However, loading it directly works:
> {code:java}
> import pyarrow.csv as pc
> table = pc.read_csv('test.csv')
> {code}
> and using pa.csv.read_csv() after loading it directly also works.





[jira] [Updated] (ARROW-6173) [Python] error loading csv submodule

2019-08-08 Thread Igor Yastrebov (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Igor Yastrebov updated ARROW-6173:
--
Description: 
When I create a new environment in conda:
{code:java}
conda create -n pyarrow-test python=3.7 pyarrow=0.14.1
{code}
and try to read a csv file:
{code:java}
import pyarrow as pa
pa.read_csv('test.csv'){code}
it fails with an error:
{code:java}
Traceback (most recent call last):
File "", line 1, in 
AttributeError: module 'pyarrow' has no attribute 'csv'
{code}
However, loading it directly works:
{code:java}
import pyarrow.csv as pc
table = pc.read_csv('test.csv')
{code}
and using pa.csv.read_csv() after loading it directly also works.

  was:
When I create a new environment in conda:

 
{code:java}
conda create -n pyarrow-test python=3.7 pyarrow=0.14.1
{code}
 

and try to read a csv file:

 
{code:java}
import pyarrow as pa
pa.read_csv('test.csv'){code}
it fails with an error:

 

 
{code:java}
Traceback (most recent call last):
File "", line 1, in 
AttributeError: module 'pyarrow' has no attribute 'csv'
{code}
However, loading it directly works:

 

 
{code:java}
import pyarrow.csv as pc
table = pc.read_csv('test.csv')
{code}
 

and using pa.csv.read_csv() after loading it directly also works.


> [Python] error loading csv submodule
> 
>
> Key: ARROW-6173
> URL: https://issues.apache.org/jira/browse/ARROW-6173
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.1
> Environment: Windows 7, conda 4.7.11
>Reporter: Igor Yastrebov
>Priority: Major
>  Labels: csv
>
> When I create a new environment in conda:
> {code:java}
> conda create -n pyarrow-test python=3.7 pyarrow=0.14.1
> {code}
> and try to read a csv file:
> {code:java}
> import pyarrow as pa
> pa.read_csv('test.csv'){code}
> it fails with an error:
> {code:java}
> Traceback (most recent call last):
> File "", line 1, in 
> AttributeError: module 'pyarrow' has no attribute 'csv'
> {code}
> However, loading it directly works:
> {code:java}
> import pyarrow.csv as pc
> table = pc.read_csv('test.csv')
> {code}
> and using pa.csv.read_csv() after loading it directly also works.





[jira] [Created] (ARROW-6173) [Python] error loading csv submodule

2019-08-08 Thread Igor Yastrebov (JIRA)
Igor Yastrebov created ARROW-6173:
-

 Summary: [Python] error loading csv submodule
 Key: ARROW-6173
 URL: https://issues.apache.org/jira/browse/ARROW-6173
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.14.1
 Environment: Windows 7, conda 4.7.11
Reporter: Igor Yastrebov


When I create a new environment in conda:

 
{code:java}
conda create -n pyarrow-test python=3.7 pyarrow=0.14.1
{code}
 

and try to read a csv file:

 
{code:java}
import pyarrow as pa
pa.read_csv('test.csv'){code}
it fails with an error:

 

 
{code:java}
Traceback (most recent call last):
File "", line 1, in 
AttributeError: module 'pyarrow' has no attribute 'csv'
{code}
However, loading it directly works:

 

 
{code:java}
import pyarrow.csv as pc
table = pc.read_csv('test.csv')
{code}
 

and using pa.csv.read_csv() after loading it directly also works.





[jira] [Updated] (ARROW-6172) [Java] Avoid creating value holders repeatedly when reading data from JDBC

2019-08-08 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6172:
--
Labels: pull-request-available  (was: )

> [Java] Avoid creating value holders repeatedly when reading data from JDBC
> --
>
> Key: ARROW-6172
> URL: https://issues.apache.org/jira/browse/ARROW-6172
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Minor
>  Labels: pull-request-available
>
> When converting JDBC data to Arrow data, a value holder is created for each 
> single value. The following code snippet gives an example:
> {code:java}
> NullableSmallIntHolder holder = new NullableSmallIntHolder();
> holder.isSet = isNonNull ? 1 : 0;
> if (isNonNull) {
>   holder.value = (short) value;
> }
> smallIntVector.setSafe(rowCount, holder);
> smallIntVector.setValueCount(rowCount + 1);
> {code}
>  
> This is inefficient, both in terms of memory usage, and computational 
> efficiency. 
> For most types, we can improve the performance by directly setting the value.
> For example, the benchmarks on IntVector show that a 20% performance 
> improvement can be achieved by directly setting the int value:
>  
> {code}
> Benchmark                         Mode  Cnt   Score   Error  Units
> IntBenchmarks.setIntDirectly      avgt    5  15.397 ± 0.018  us/op
> IntBenchmarks.setWithValueHolder  avgt    5  19.198 ± 0.789  us/op
> {code}
>  





[jira] [Created] (ARROW-6172) [Java] Avoid creating value holders repeatedly when reading data from JDBC

2019-08-08 Thread Liya Fan (JIRA)
Liya Fan created ARROW-6172:
---

 Summary: [Java] Avoid creating value holders repeatedly when 
reading data from JDBC
 Key: ARROW-6172
 URL: https://issues.apache.org/jira/browse/ARROW-6172
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


When converting JDBC data to Arrow data, a value holder is created for each 
single value. The following code snippet gives an example:

{code:java}
NullableSmallIntHolder holder = new NullableSmallIntHolder();
holder.isSet = isNonNull ? 1 : 0;
if (isNonNull) {
  holder.value = (short) value;
}
smallIntVector.setSafe(rowCount, holder);
smallIntVector.setValueCount(rowCount + 1);
{code}

 

This is inefficient, both in terms of memory usage, and computational 
efficiency. 

For most types, we can improve the performance by directly setting the value.

For example, the benchmarks on IntVector show that a 20% performance 
improvement can be achieved by directly setting the int value:

 

{code}
Benchmark                         Mode  Cnt   Score   Error  Units
IntBenchmarks.setIntDirectly      avgt    5  15.397 ± 0.018  us/op
IntBenchmarks.setWithValueHolder  avgt    5  19.198 ± 0.789  us/op
{code}
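A Python analogue of the two code paths being benchmarked, to make the difference concrete (the issue itself concerns Java; all names here are illustrative, not the actual Arrow API):

```python
class Holder:
    # Stand-in for NullableSmallIntHolder: one temporary object per value.
    __slots__ = ("is_set", "value")

def set_with_holder(out, i, value):
    h = Holder()                      # per-element allocation (the waste)
    h.is_set = value is not None
    if h.is_set:
        h.value = value
    out[i] = h.value if h.is_set else None

def set_directly(out, i, value):
    out[i] = value                    # no intermediate holder object

a, b = [None] * 3, [None] * 3
for i, v in enumerate([7, None, 42]):
    set_with_holder(a, i, v)
    set_directly(b, i, v)
# both paths produce the same column contents; only the allocations differ
```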

 





[jira] [Updated] (ARROW-6082) [Python] create pa.dictionary() type with non-integer indices type crashes

2019-08-08 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6082:
--
Labels: pull-request-available  (was: )

> [Python] create pa.dictionary() type with non-integer indices type crashes
> --
>
> Key: ARROW-6082
> URL: https://issues.apache.org/jira/browse/ARROW-6082
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
>
> For example if you mixed the order of the indices and values type:
> {code}
> In [1]: pa.dictionary(pa.int8(), pa.string())
> Out[1]: DictionaryType(dictionary)
>
> In [2]: pa.dictionary(pa.string(), pa.int8())
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> F0731 14:40:42.748589 26310 type.cc:440]  Check failed: 
> is_integer(index_type->id()) dictionary index type should be signed integer
> *** Check failure stack trace: ***
> Aborted (core dumped)
> {code}
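The crash comes from a fatal {{Check}} in C++ rather than a recoverable error. A sketch of the fix direction in plain Python (type names are strings here purely for illustration; this is not pyarrow's implementation):

```python
SIGNED_INTEGER_TYPES = {"int8", "int16", "int32", "int64"}

def make_dictionary_type(index_type, value_type):
    # Validate up front and raise a catchable error, instead of
    # aborting the whole process on a bad argument order.
    if index_type not in SIGNED_INTEGER_TYPES:
        raise ValueError(
            f"dictionary index type should be signed integer, got {index_type}")
    return ("dictionary", index_type, value_type)

make_dictionary_type("int8", "string")       # fine
try:
    make_dictionary_type("string", "int8")   # swapped order: error, not abort
    raised = False
except ValueError:
    raised = True
```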





[jira] [Resolved] (ARROW-6134) [C++][Gandiva] Add concat function in Gandiva

2019-08-08 Thread Pindikura Ravindra (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pindikura Ravindra resolved ARROW-6134.
---
   Resolution: Fixed
Fix Version/s: 0.15.0

Issue resolved by pull request 5008
[https://github.com/apache/arrow/pull/5008]

> [C++][Gandiva] Add concat function in Gandiva
> -
>
> Key: ARROW-6134
> URL: https://issues.apache.org/jira/browse/ARROW-6134
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++ - Gandiva
>Reporter: Prudhvi Porandla
>Assignee: Prudhvi Porandla
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> * remove concat alias for concatOperator
>  * add concat(utf8, utf8) function. The difference between concat and 
> concatOperator is in null input handling. concatOperator returns null if one 
> of the inputs is null; concat treats null input as empty string





[jira] [Commented] (ARROW-5651) [Python] Incorrect conversion from strided Numpy array when other type is specified

2019-08-08 Thread Fabian Höring (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16902842#comment-16902842
 ] 

Fabian Höring commented on ARROW-5651:
--

Thanks

> [Python] Incorrect conversion from strided Numpy array when other type is 
> specified
> ---
>
> Key: ARROW-5651
> URL: https://issues.apache.org/jira/browse/ARROW-5651
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.12.0
>Reporter: Fabian Höring
>Assignee: Takuya Kato
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> In the example below the PyArrow array gives wrong results for strided numpy 
> arrays when the type is different from the initial Numpy type:
> {code}
> >> import pyarrow as pa
> >> import numpy as np
> >> np_array = np.arange(0, 10, dtype=np.float32)[1:-1:2]
> >> pa.array(np_array, type=pa.float64())
> 
> [
>   1,
>   2,
>   3,
>   4
> ]
> {code}
> When copying the Numpy array to a new location it gives the expected output:
> {code}
> >> import pyarrow as pa
> >> import numpy as np
> >> np_array = np.array(np.arange(0, 10, dtype=np.float32)[1:-1:2])
> >> pa.array(np_array, type=pa.float64())
> 
>[
>  1,
>  3,
>  5,
>  7 
> ]  
> {code}
> Looking at the 
> [code|https://github.com/apache/arrow/blob/7a5562174cffb21b16f990f64d114c1a94a30556/cpp/src/arrow/python/numpy_to_arrow.cc#L407]
>  it seems that to determine the number of elements, the target type is used 
> instead of the initial numpy type.
> In this case the stride is 8 bytes which corresponds to 2 elements in float32 
> whereas the codes tries to determine the number of elements with the target 
> type which gives 1 element of float64 and therefore it reads the array one by 
> one instead of every 2 elements until reaching the total number of elements.
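The effect can be reproduced with a small stdlib model (a simplification: real buffers are raw bytes, here they are element lists). The per-element step is the byte stride divided by an itemsize; using the target type's itemsize (8 for float64) instead of the source's (4 for float32) collapses the step to 1, which matches the wrong output shown above:

```python
def strided_read(values, start, stride_bytes, itemsize, count):
    # Step between elements, derived from the byte stride. Dividing by
    # the wrong itemsize (the target type's) collapses the stride.
    step = stride_bytes // itemsize
    return [values[start + k * step] for k in range(count)]

src = list(range(10))   # models np.arange(0, 10, dtype=np.float32)
# The view src[1:-1:2]: starts at element 1, byte stride 8, 4 elements.
correct = strided_read(src, 1, 8, itemsize=4, count=4)  # float32 itemsize
buggy = strided_read(src, 1, 8, itemsize=8, count=4)    # float64 itemsize
```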





[jira] [Updated] (ARROW-6162) [C++][Gandiva] Do not truncate string in castVARCHAR_varchar when out_len parameter is zero

2019-08-08 Thread Prudhvi Porandla (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prudhvi Porandla updated ARROW-6162:

Description: Do not truncate string if length parameter is 0 in 
castVARCHAR_utf8_int64 function.

> [C++][Gandiva] Do not truncate string in castVARCHAR_varchar when out_len 
> parameter is zero
> ---
>
> Key: ARROW-6162
> URL: https://issues.apache.org/jira/browse/ARROW-6162
> Project: Apache Arrow
>  Issue Type: Task
>Reporter: Prudhvi Porandla
>Assignee: Prudhvi Porandla
>Priority: Minor
>
> Do not truncate string if length parameter is 0 in castVARCHAR_utf8_int64 
> function.
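A sketch of the proposed semantics, based on the description above (an assumption: {{out_len == 0}} is treated as "no truncation" rather than "truncate to empty string"; this is not Gandiva's actual implementation):

```python
def cast_varchar(s, out_len):
    # out_len == 0 means "do not truncate"; any positive value caps
    # the string length, matching the castVARCHAR change described.
    if out_len == 0:
        return s
    return s[:out_len]

cast_varchar("hello", 0)   # returned unchanged
cast_varchar("hello", 3)   # truncated to the first 3 characters
```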





[jira] [Created] (ARROW-6171) [R] "docker-compose run r" fails

2019-08-08 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-6171:
-

 Summary: [R] "docker-compose run r" fails
 Key: ARROW-6171
 URL: https://issues.apache.org/jira/browse/ARROW-6171
 Project: Apache Arrow
  Issue Type: Bug
  Components: Developer Tools, R
Reporter: Antoine Pitrou


I get the following failure:
{code}
** testing if installed package can be loaded from temporary location
Error: package or namespace load failed for 'arrow' in dyn.load(file, DLLpath = 
DLLpath, ...):
 unable to load shared object 
'/usr/local/lib/R/site-library/00LOCK-arrow/00new/arrow/libs/arrow.so':
  /opt/conda/lib/libarrow.so.100: undefined symbol: 
LZ4F_resetDecompressionContext
Error: loading failed
Execution halted
ERROR: loading failed
* removing '/usr/local/lib/R/site-library/arrow'
{code}






[jira] [Updated] (ARROW-6170) [R] "docker-compose build r" is slow

2019-08-08 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6170:
--
Labels: pull-request-available  (was: )

> [R] "docker-compose build r" is slow
> 
>
> Key: ARROW-6170
> URL: https://issues.apache.org/jira/browse/ARROW-6170
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Developer Tools, R
>Reporter: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>
> Apparently it installs and compiles all packages in single-thread mode.





[jira] [Created] (ARROW-6170) [R] "docker-compose build r" is slow

2019-08-08 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-6170:
-

 Summary: [R] "docker-compose build r" is slow
 Key: ARROW-6170
 URL: https://issues.apache.org/jira/browse/ARROW-6170
 Project: Apache Arrow
  Issue Type: Bug
  Components: Developer Tools, R
Reporter: Antoine Pitrou


Apparently it installs and compiles all packages in single-thread mode.





[jira] [Resolved] (ARROW-6167) [R] macOS binary R packages on CRAN don't have arrow_available

2019-08-08 Thread Sutou Kouhei (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sutou Kouhei resolved ARROW-6167.
-
   Resolution: Fixed
Fix Version/s: 0.15.0

Issue resolved by pull request 5034
[https://github.com/apache/arrow/pull/5034]

> [R] macOS binary R packages on CRAN don't have arrow_available
> --
>
> Key: ARROW-6167
> URL: https://issues.apache.org/jira/browse/ARROW-6167
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.14.1
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> The {{configure}} script in the R package has some 
> [magic|https://github.com/apache/arrow/blob/master/r/configure#L66-L86] that 
> should ensure that on macOS, you're guaranteed a successful library 
> installation even (especially) if you don't have libarrow installed on your 
> system. This magic also is designed so that when CRAN builds a binary package 
> for macOS, the C++ libraries are bundled and "just work" when a user installs 
> it, no compilation required. 
> However, the magic appeared to fail on CRAN this time, as the binaries linked 
> on [https://cran.r-project.org/web/packages/arrow/index.html] were built 
> without libarrow ({{arrow::arrow_available()}} returns {{FALSE}}). 
> I've identified three vectors by which you can get an arrow package 
> installation on macOS in this state:
>  # The [check|https://github.com/apache/arrow/blob/master/r/configure#L71] to 
> see if you've already installed {{apache-arrow}} via Homebrew always passes, 
> so if you have Homebrew installed but haven't done {{brew install 
> apache-arrow}}, the script won't do it for you like it looks like it intends. 
> (This is not suspected to be the problem on CRAN because they don't have 
> Homebrew installed.)
>  # If the 
> "[autobrew|https://github.com/apache/arrow/blob/master/r/configure#L80-L81]; 
> installation fails, then the [test on 
> L102|https://github.com/apache/arrow/blob/master/r/configure#L102] will 
> correctly fail. I managed to trigger this (by luck?) on the [R-hub testing 
> service|https://builder.r-hub.io/status/arrow_0.14.1.tar.gz-da083126612b46e28854b95156b87b31#L533].
>  This is possibly what happened on CRAN, though the only [build 
> logs|https://www.r-project.org/nosvn/R.check/r-release-osx-x86_64/arrow-00check.html]
>  we have from CRAN are terse because it believes the build was successful. 
>  # Some idiosyncrasy in the compiler on the CRAN macOS system such that the 
> autobrew script would successfully download the arrow libraries but the L102 
> check would error. I've been unable to reproduce this using the [version of 
> clang7 that CRAN provides|https://cran.r-project.org/bin/macosx/tools/].
> I have a fix for the first one and will provide workaround documentation for 
> the README and announcement blog post. Unfortunately, I don't know that 
> there's anything we can do about the useless binaries on CRAN at this time, 
> particularly since CRAN is going down for maintenance August 9-18.
> cc [~jeroenooms] [~romainfrancois] [~wesmckinn]





[jira] [Updated] (ARROW-6167) [R] macOS binary R packages on CRAN don't have arrow_available

2019-08-08 Thread Sutou Kouhei (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sutou Kouhei updated ARROW-6167:

Component/s: R

> [R] macOS binary R packages on CRAN don't have arrow_available
> --
>
> Key: ARROW-6167
> URL: https://issues.apache.org/jira/browse/ARROW-6167
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 0.14.1
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> The {{configure}} script in the R package has some 
> [magic|https://github.com/apache/arrow/blob/master/r/configure#L66-L86] that 
> should ensure that on macOS, you're guaranteed a successful library 
> installation even (especially) if you don't have libarrow installed on your 
> system. This magic also is designed so that when CRAN builds a binary package 
> for macOS, the C++ libraries are bundled and "just work" when a user installs 
> it, no compilation required. 
> However, the magic appeared to fail on CRAN this time, as the binaries linked 
> on [https://cran.r-project.org/web/packages/arrow/index.html] were built 
> without libarrow ({{arrow::arrow_available()}} returns {{FALSE}}). 
> I've identified three vectors by which you can get an arrow package 
> installation on macOS in this state:
>  # The [check|https://github.com/apache/arrow/blob/master/r/configure#L71] to 
> see if you've already installed {{apache-arrow}} via Homebrew always passes, 
> so if you have Homebrew installed but haven't done {{brew install 
> apache-arrow}}, the script won't do it for you like it looks like it intends. 
> (This is not suspected to be the problem on CRAN because they don't have 
> Homebrew installed.)
>  # If the 
> "[autobrew|https://github.com/apache/arrow/blob/master/r/configure#L80-L81]" 
> installation fails, then the [test on 
> L102|https://github.com/apache/arrow/blob/master/r/configure#L102] will 
> correctly fail. I managed to trigger this (by luck?) on the [R-hub testing 
> service|https://builder.r-hub.io/status/arrow_0.14.1.tar.gz-da083126612b46e28854b95156b87b31#L533].
>  This is possibly what happened on CRAN, though the only [build 
> logs|https://www.r-project.org/nosvn/R.check/r-release-osx-x86_64/arrow-00check.html]
>  we have from CRAN are terse because it believes the build was successful. 
>  # Some idiosyncrasy in the compiler on the CRAN macOS system such that the 
> autobrew script would successfully download the arrow libraries but the L102 
> check would error. I've been unable to reproduce this using the [version of 
> clang7 that CRAN provides|https://cran.r-project.org/bin/macosx/tools/].
> I have a fix for the first one and will provide workaround documentation for 
> the README and announcement blog post. Unfortunately, I don't know that 
> there's anything we can do about the useless binaries on CRAN at this time, 
> particularly since CRAN is going down for maintenance August 9-18.
> cc [~jeroenooms] [~romainfrancois] [~wesmckinn]
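The first vector above (a Homebrew check that always passes) can be illustrated with a stricter detection step. This is a hypothetical sketch, not the actual apache/arrow {{r/configure}} code; the function and variable names ({{have_brew_arrow}}, {{PKG_LIBS}}) are illustrative, though {{brew ls --versions}} and {{brew --prefix}} are real Homebrew commands.

```shell
# Hypothetical sketch of a stricter Homebrew detection step for an R
# package configure script. Succeeds only when brew exists AND the
# apache-arrow formula is actually installed, rather than assuming so.
have_brew_arrow() {
  command -v brew >/dev/null 2>&1 || return 1
  brew ls --versions apache-arrow >/dev/null 2>&1
}

if have_brew_arrow; then
  # Link against the brewed libarrow.
  PKG_LIBS="-L$(brew --prefix apache-arrow)/lib -larrow"
  echo "using Homebrew apache-arrow: $PKG_LIBS"
else
  # Fall back to the autobrew bundling path (or a source build).
  echo "apache-arrow not installed via Homebrew; falling back to autobrew"
fi
```

The key difference from an always-true check is that {{brew ls --versions}} exits nonzero when the formula is not installed, so the fallback branch is actually reachable.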





[jira] [Updated] (ARROW-6131) [C++] Optimize the Arrow UTF-8-string-validation

2019-08-08 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6131:
--
Labels: pull-request-available  (was: )

> [C++]  Optimize the Arrow UTF-8-string-validation
> -
>
> Key: ARROW-6131
> URL: https://issues.apache.org/jira/browse/ARROW-6131
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Yuqi Gu
>Assignee: Yuqi Gu
>Priority: Major
>  Labels: pull-request-available
>
> The new algorithm comes from https://github.com/cyb70289/utf8 (MIT license).
> Range-based algorithm:
>   1. Map each byte of the input string to a range table.
>   2. Leverage the NEON 'tbl' instruction to look up the table.
>   3. Find the pattern and set the correct table index for each input byte.
>   4. Validate the input string.
> The algorithm gives a ~1.6x speedup for UTF-8 validation in the LargeNonAscii 
> and SmallNonAscii cases, but it slows down the All-Ascii cases 
> (where the input data consists entirely of ASCII strings).
> The benchmark API is
> {code:java}
> ValidateUTF8
> {code}
> As far as I know, all-ASCII data is unusual on the internet.
> Could you please tell me what the typical use case scenarios for Apache Arrow 
> are? Is the Arrow data that needs to be validated all-ASCII strings?
> If not, I'd like to submit the patch to accelerate the non-ASCII validation.
> As for all-ASCII validation, I would like to propose another optimization 
> solution with SIMD in a separate JIRA.
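The byte-range idea in the quoted description can be sketched in scalar form. The following is a plain Python illustration of lead-byte classification plus continuation-range checks (the same ranges the lookup table encodes); it is an assumption-laden sketch, not the NEON 'tbl' implementation from cyb70289/utf8 or Arrow's actual {{ValidateUTF8}}.

```python
def validate_utf8(data: bytes) -> bool:
    """Return True iff `data` is well-formed UTF-8 (scalar byte-range check)."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                      # ASCII fast path: single byte
            i += 1
            continue
        # Classify the lead byte: `need` continuation bytes, and the
        # allowed (lo, hi) range for the FIRST continuation byte.
        if 0xC2 <= b <= 0xDF:
            need, lo, hi = 1, 0x80, 0xBF
        elif b == 0xE0:
            need, lo, hi = 2, 0xA0, 0xBF  # reject overlong 3-byte forms
        elif b == 0xED:
            need, lo, hi = 2, 0x80, 0x9F  # reject UTF-16 surrogates
        elif 0xE1 <= b <= 0xEF:
            need, lo, hi = 2, 0x80, 0xBF
        elif b == 0xF0:
            need, lo, hi = 3, 0x90, 0xBF  # reject overlong 4-byte forms
        elif 0xF1 <= b <= 0xF3:
            need, lo, hi = 3, 0x80, 0xBF
        elif b == 0xF4:
            need, lo, hi = 3, 0x80, 0x8F  # cap at U+10FFFF
        else:
            return False                  # 0x80-0xC1, 0xF5-0xFF: invalid lead
        if i + need >= n:
            return False                  # truncated sequence at end of input
        if not lo <= data[i + 1] <= hi:
            return False
        for j in range(2, need + 1):      # remaining continuations: 0x80-0xBF
            if not 0x80 <= data[i + j] <= 0xBF:
                return False
        i += need + 1
    return True
```

The SIMD version replaces the per-byte branching with a table lookup over 16 bytes at a time, which is why the branch-free all-ASCII input (where the scalar fast path wins) can regress.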





[jira] [Comment Edited] (ARROW-6169) A bit confused by arrow::install_arrow() in R

2019-08-08 Thread Michael Chirico (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902757#comment-16902757
 ] 

Michael Chirico edited comment on ARROW-6169 at 8/8/19 7:34 AM:


Have finally managed to get things working by installing from the right tag on 
GH:
{code:java}
$brew link apache-arrow #undoing earlier unlink
>remotes::install_github("apache/arrow", subdir = "r", ref = 
>"apache-arrow-0.14.1"){code}
Next step is to try and get this to play well with GitLab CI on Ubuntu... any 
chance there's a Dockerfile laying around already? :)


was (Author: michaelchirico):
Have finally managed to get things working by installing from the right tag on 
GH:
{code:java}
$brew link apache-arrow
>remotes::install_github("apache/arrow", subdir = "r", ref = 
>"apache-arrow-0.14.1"){code}
Next step is to try and get this to play well with GitLab CI on Ubuntu... any 
chance there's a Dockerfile laying around already? :)

> A bit confused by arrow::install_arrow() in R
> 
>
> Key: ARROW-6169
> URL: https://issues.apache.org/jira/browse/ARROW-6169
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Michael Chirico
>Priority: Minor
>
> Am trying to get up and running on arrow from R for the first time (macOS 
> Mojave 10.14.6)
> Started with
> {code:java}
> install.packages('arrow') #success!
> write_parquet(iris, 'tmp.parquet') #oh no!{code}
> and hit the error:
> > Error in Table__from_dots(dots, schema) : Cannot call Table__from_dots(). 
> >Please use arrow::install_arrow() to install required runtime libraries. 
> OK, easy enough:
> {code:java}
> arrow::install_arrow() 
> {code}
> With output:
> {code:java}
> You may be able to get a development version of the Arrow C++ library using 
> Homebrew: `brew install apache-arrow --HEAD` Or, see the Arrow C++ developer 
> guide  for instructions on 
> building the library from source.
> After you've installed the C++ library, you'll need to reinstall the R 
> package from source to find it.
> Refer to the R package README 
>  for further details.
> If you have other trouble, or if you think this message could be improved, 
> please report an issue here: 
> 
> {code}
> A few points of confusion for me as a first time user:
> A bit surprised I'm being directed to install the development version? If the 
> current CRAN version of {{arrow}} is only compatible with the dev version, I 
> guess that could be made more clear in this message. But on the other hand, 
> the linked GH README suggests the opposite: "On macOS and Windows, installing 
> a binary package from CRAN will handle Arrow’s C++ dependencies for you." 
> However, that doesn't appear to have been the case for me.
> Oh well, let's just try installing the normal version & see if that works:
> {code:java}
> $brew install apache-arrow
> >install.packages('arrow') #reinstall in fresh session
> >arrow::write_parquet(iris, 'tmp.parquet') # same error{code}
> Now I try the dev version:
> {code:java}
> brew install apache-arrow --HEAD
> # Error: apache-arrow 0.14.1 is already installed
> # To install HEAD, first run `brew unlink apache-arrow`.
> brew unlink apache-arrow
> brew install apache-arrow --HEAD
> # Error: An exception occurred within a child process:
> # RuntimeError: /usr/local/opt/autoconf not present or broken
> # Please reinstall autoconf. Sorry :(
> brew install autoconf
> brew install apache-arrow --HEAD
> # Error: An exception occurred within a child process:
> # RuntimeError: /usr/local/opt/cmake not present or broken
> # Please reinstall cmake. Sorry :(
> brew install cmake
> brew install apache-arrow --HEAD
> # cmake ../cpp -DCMAKE_C_FLAGS_RELEASE=-DNDEBUG 
> -DCMAKE_CXX_FLAGS_RELEASE=-DNDEBUG 
> -DCMAKE_INSTALL_PREFIX=/usr/local/Cellar/apache-arrow/HEAD-908b058 
> -DCMAKE_BUILD_TYPE=Release -DCMAKE_FIND_FRAMEWORK=LAST -DCMAKE_VERBOSE_MAKEFIL
> # Last 15 lines from 
> /Users/michael.chirico/Library/Logs/Homebrew/apache-arrow/01.cmake:
> # 
> dlopen(/usr/local/lib/python3.7/site-packages/numpy/core/_multiarray_umath.cpython-37m-darwin.so,
> # 2): Symbol not found: ___addtf3
> # Referenced from: /usr/local/opt/gcc/lib/gcc/9/libquadmath.0.dylib
> # Expected in: /usr/lib/libSystem.B.dylib
> # in /usr/local/opt/gcc/lib/gcc/9/libquadmath.0.dylib
> # Call Stack (most recent call first):
> # src/arrow/python/CMakeLists.txt:23 (find_package){code}
> Poked around a bit about that error and what I see suggests re-installing 
> {{scipy}} but that didn't work ({{pip install scipy}} nor {{pip3 install 
> scipy}}, though the traceback does suggest it's a Python 3 thing)
> So now I'm stuck & not 

[jira] [Commented] (ARROW-6169) A bit confused by arrow::install_arrow() in R

2019-08-08 Thread Michael Chirico (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902757#comment-16902757
 ] 

Michael Chirico commented on ARROW-6169:


Have finally managed to get things working by installing from the right tag on 
GH:
{code:java}
$brew link apache-arrow
>remotes::install_github("apache/arrow", subdir = "r", ref = 
>"apache-arrow-0.14.1"){code}
Next step is to try and get this to play well with GitLab CI on Ubuntu... any 
chance there's a Dockerfile laying around already? :)

> A bit confused by arrow::install_arrow() in R
> 
>
> Key: ARROW-6169
> URL: https://issues.apache.org/jira/browse/ARROW-6169
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Michael Chirico
>Priority: Minor
>
> Am trying to get up and running on arrow from R for the first time (macOS 
> Mojave 10.14.6)
> Started with
> {code:java}
> install.packages('arrow') #success!
> write_parquet(iris, 'tmp.parquet') #oh no!{code}
> and hit the error:
> > Error in Table__from_dots(dots, schema) : Cannot call Table__from_dots(). 
> >Please use arrow::install_arrow() to install required runtime libraries. 
> OK, easy enough:
> {code:java}
> arrow::install_arrow() 
> {code}
> With output:
> {code:java}
> You may be able to get a development version of the Arrow C++ library using 
> Homebrew: `brew install apache-arrow --HEAD` Or, see the Arrow C++ developer 
> guide  for instructions on 
> building the library from source.
> After you've installed the C++ library, you'll need to reinstall the R 
> package from source to find it.
> Refer to the R package README 
>  for further details.
> If you have other trouble, or if you think this message could be improved, 
> please report an issue here: 
> 
> {code}
> A few points of confusion for me as a first time user:
> A bit surprised I'm being directed to install the development version? If the 
> current CRAN version of {{arrow}} is only compatible with the dev version, I 
> guess that could be made more clear in this message. But on the other hand, 
> the linked GH README suggests the opposite: "On macOS and Windows, installing 
> a binary package from CRAN will handle Arrow’s C++ dependencies for you." 
> However, that doesn't appear to have been the case for me.
> Oh well, let's just try installing the normal version & see if that works:
> {code:java}
> $brew install apache-arrow
> >install.packages('arrow') #reinstall in fresh session
> >arrow::write_parquet(iris, 'tmp.parquet') # same error{code}
> Now I try the dev version:
> {code:java}
> brew install apache-arrow --HEAD
> # Error: apache-arrow 0.14.1 is already installed
> # To install HEAD, first run `brew unlink apache-arrow`.
> brew unlink apache-arrow
> brew install apache-arrow --HEAD
> # Error: An exception occurred within a child process:
> # RuntimeError: /usr/local/opt/autoconf not present or broken
> # Please reinstall autoconf. Sorry :(
> brew install autoconf
> brew install apache-arrow --HEAD
> # Error: An exception occurred within a child process:
> # RuntimeError: /usr/local/opt/cmake not present or broken
> # Please reinstall cmake. Sorry :(
> brew install cmake
> brew install apache-arrow --HEAD
> # cmake ../cpp -DCMAKE_C_FLAGS_RELEASE=-DNDEBUG 
> -DCMAKE_CXX_FLAGS_RELEASE=-DNDEBUG 
> -DCMAKE_INSTALL_PREFIX=/usr/local/Cellar/apache-arrow/HEAD-908b058 
> -DCMAKE_BUILD_TYPE=Release -DCMAKE_FIND_FRAMEWORK=LAST -DCMAKE_VERBOSE_MAKEFIL
> # Last 15 lines from 
> /Users/michael.chirico/Library/Logs/Homebrew/apache-arrow/01.cmake:
> # 
> dlopen(/usr/local/lib/python3.7/site-packages/numpy/core/_multiarray_umath.cpython-37m-darwin.so,
> # 2): Symbol not found: ___addtf3
> # Referenced from: /usr/local/opt/gcc/lib/gcc/9/libquadmath.0.dylib
> # Expected in: /usr/lib/libSystem.B.dylib
> # in /usr/local/opt/gcc/lib/gcc/9/libquadmath.0.dylib
> # Call Stack (most recent call first):
> # src/arrow/python/CMakeLists.txt:23 (find_package){code}
> Poked around a bit about that error and what I see suggests re-installing 
> {{scipy}} but that didn't work ({{pip install scipy}} nor {{pip3 install 
> scipy}}, though the traceback does suggest it's a Python 3 thing)
> So now I'm stuck & not sure how to proceed.
> I'll also add that I'm not sure what to make of this:
> > After you've installed the C++ library, you'll need to reinstall the R 
> >package from source to find it.
> What is "find it" referring to exactly? And installing from source here means 
> {{R CMD build && R CMD INSTALL}} on the cloned repo?





[jira] [Created] (ARROW-6169) A bit confused by arrow::install_arrow() in R

2019-08-08 Thread Michael Chirico (JIRA)
Michael Chirico created ARROW-6169:
--

 Summary: A bit confused by arrow::install_arrow() in R
 Key: ARROW-6169
 URL: https://issues.apache.org/jira/browse/ARROW-6169
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Michael Chirico


Am trying to get up and running on arrow from R for the first time (macOS 
Mojave 10.14.6)

Started with
{code:java}
install.packages('arrow') #success!
write_parquet(iris, 'tmp.parquet') #oh no!{code}
and hit the error:

> Error in Table__from_dots(dots, schema) : Cannot call Table__from_dots(). 
>Please use arrow::install_arrow() to install required runtime libraries. 

OK, easy enough:
{code:java}
arrow::install_arrow() 
{code}
With output:
{code:java}
You may be able to get a development version of the Arrow C++ library using 
Homebrew: `brew install apache-arrow --HEAD` Or, see the Arrow C++ developer 
guide  for instructions on 
building the library from source.

After you've installed the C++ library, you'll need to reinstall the R package 
from source to find it.

Refer to the R package README 
 for further details.

If you have other trouble, or if you think this message could be improved, 
please report an issue here: 

{code}
A few points of confusion for me as a first time user:

A bit surprised I'm being directed to install the development version? If the 
current CRAN version of {{arrow}} is only compatible with the dev version, I 
guess that could be made more clear in this message. But on the other hand, the 
linked GH README suggests the opposite: "On macOS and Windows, installing a 
binary package from CRAN will handle Arrow’s C++ dependencies for you." 
However, that doesn't appear to have been the case for me.

Oh well, let's just try installing the normal version & see if that works:
{code:java}
$brew install apache-arrow
>install.packages('arrow') #reinstall in fresh session
>arrow::write_parquet(iris, 'tmp.parquet') # same error{code}
Now I try the dev version:
{code:java}
brew install apache-arrow --HEAD
# Error: apache-arrow 0.14.1 is already installed
# To install HEAD, first run `brew unlink apache-arrow`.
brew unlink apache-arrow
brew install apache-arrow --HEAD
# Error: An exception occurred within a child process:
# RuntimeError: /usr/local/opt/autoconf not present or broken
# Please reinstall autoconf. Sorry :(
brew install autoconf
brew install apache-arrow --HEAD
# Error: An exception occurred within a child process:
# RuntimeError: /usr/local/opt/cmake not present or broken
# Please reinstall cmake. Sorry :(
brew install cmake
brew install apache-arrow --HEAD
# cmake ../cpp -DCMAKE_C_FLAGS_RELEASE=-DNDEBUG 
-DCMAKE_CXX_FLAGS_RELEASE=-DNDEBUG 
-DCMAKE_INSTALL_PREFIX=/usr/local/Cellar/apache-arrow/HEAD-908b058 
-DCMAKE_BUILD_TYPE=Release -DCMAKE_FIND_FRAMEWORK=LAST -DCMAKE_VERBOSE_MAKEFIL
# Last 15 lines from 
/Users/michael.chirico/Library/Logs/Homebrew/apache-arrow/01.cmake:
# 
dlopen(/usr/local/lib/python3.7/site-packages/numpy/core/_multiarray_umath.cpython-37m-darwin.so,
# 2): Symbol not found: ___addtf3
# Referenced from: /usr/local/opt/gcc/lib/gcc/9/libquadmath.0.dylib
# Expected in: /usr/lib/libSystem.B.dylib
# in /usr/local/opt/gcc/lib/gcc/9/libquadmath.0.dylib
# Call Stack (most recent call first):
# src/arrow/python/CMakeLists.txt:23 (find_package){code}
Poked around a bit about that error and what I see suggests re-installing 
{{scipy}} but that didn't work ({{pip install scipy}} nor {{pip3 install 
scipy}}, though the traceback does suggest it's a Python 3 thing)

So now I'm stuck & not sure how to proceed.

I'll also add that I'm not sure what to make of this:

> After you've installed the C++ library, you'll need to reinstall the R 
>package from source to find it.

What is "find it" referring to exactly? And installing from source here means 
{{R CMD build && R CMD INSTALL}} on the cloned repo?





[jira] [Updated] (ARROW-6169) A bit confused by arrow::install_arrow() in R

2019-08-08 Thread Michael Chirico (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Chirico updated ARROW-6169:
---
Description: 
Am trying to get up and running on arrow from R for the first time (macOS 
Mojave 10.14.6)

Started with
{code:java}
install.packages('arrow') #success!
write_parquet(iris, 'tmp.parquet') #oh no!{code}
and hit the error:

> Error in Table__from_dots(dots, schema) : Cannot call Table__from_dots(). 
>Please use arrow::install_arrow() to install required runtime libraries. 

OK, easy enough:
{code:java}
arrow::install_arrow() 
{code}
With output:
{code:java}
You may be able to get a development version of the Arrow C++ library using 
Homebrew: `brew install apache-arrow --HEAD` Or, see the Arrow C++ developer 
guide  for instructions on 
building the library from source.

After you've installed the C++ library, you'll need to reinstall the R package 
from source to find it.

Refer to the R package README 
 for further details.

If you have other trouble, or if you think this message could be improved, 
please report an issue here: 

{code}
A few points of confusion for me as a first time user:

A bit surprised I'm being directed to install the development version? If the 
current CRAN version of {{arrow}} is only compatible with the dev version, I 
guess that could be made more clear in this message. But on the other hand, the 
linked GH README suggests the opposite: "On macOS and Windows, installing a 
binary package from CRAN will handle Arrow’s C++ dependencies for you." 
However, that doesn't appear to have been the case for me.

Oh well, let's just try installing the normal version & see if that works:
{code:java}
$brew install apache-arrow
>install.packages('arrow') #reinstall in fresh session
>arrow::write_parquet(iris, 'tmp.parquet') # same error{code}
Now I try the dev version:
{code:java}
brew install apache-arrow --HEAD
# Error: apache-arrow 0.14.1 is already installed
# To install HEAD, first run `brew unlink apache-arrow`.
brew unlink apache-arrow
brew install apache-arrow --HEAD
# Error: An exception occurred within a child process:
# RuntimeError: /usr/local/opt/autoconf not present or broken
# Please reinstall autoconf. Sorry :(
brew install autoconf
brew install apache-arrow --HEAD
# Error: An exception occurred within a child process:
# RuntimeError: /usr/local/opt/cmake not present or broken
# Please reinstall cmake. Sorry :(
brew install cmake
brew install apache-arrow --HEAD
# cmake ../cpp -DCMAKE_C_FLAGS_RELEASE=-DNDEBUG 
-DCMAKE_CXX_FLAGS_RELEASE=-DNDEBUG 
-DCMAKE_INSTALL_PREFIX=/usr/local/Cellar/apache-arrow/HEAD-908b058 
-DCMAKE_BUILD_TYPE=Release -DCMAKE_FIND_FRAMEWORK=LAST -DCMAKE_VERBOSE_MAKEFIL
# Last 15 lines from 
/Users/michael.chirico/Library/Logs/Homebrew/apache-arrow/01.cmake:
# 
dlopen(/usr/local/lib/python3.7/site-packages/numpy/core/_multiarray_umath.cpython-37m-darwin.so,
# 2): Symbol not found: ___addtf3
# Referenced from: /usr/local/opt/gcc/lib/gcc/9/libquadmath.0.dylib
# Expected in: /usr/lib/libSystem.B.dylib
# in /usr/local/opt/gcc/lib/gcc/9/libquadmath.0.dylib
# Call Stack (most recent call first):
# src/arrow/python/CMakeLists.txt:23 (find_package){code}
Poked around a bit about that error and what I see suggests re-installing 
{{scipy}} but that didn't work ({{pip install scipy}} nor {{pip3 install 
scipy}}, though the traceback does suggest it's a Python 3 thing)

So now I'm stuck & not sure how to proceed.

I'll also add that I'm not sure what to make of this:

> After you've installed the C++ library, you'll need to reinstall the R 
>package from source to find it.

What is "find it" referring to exactly? And installing from source here means 
{{R CMD build && R CMD INSTALL}} on the cloned repo?

  was:
Am trying to get up and running on arrow from R for the first time (macOS 
Mojave 10.14.6)

Started with
{code:java}
install.packages('arrow') #success!
write_parquet(iris, 'tmp.parquet') #oh no!{code}
and hit the error:

> Error in Table__from_dots(dots, schema) : Cannot call Table__from_dots(). 
>Please use arrow::install_arrow() to install required runtime libraries. 

OK, easy enough:
{code:java}
arrow::install_arrow() 
{code}
With output:
{code:java}
You may be able to get a development version of the Arrow C++ library using 
Homebrew: `brew install apache-arrow --HEAD` Or, see the Arrow C++ developer 
guide  for instructions on 
building the library from source.

After you've installed the C++ library, you'll need to reinstall the R package 
from source to find it.

Refer to the R package README 
 for further details.

If you have other trouble, or if you think this message