[jira] [Created] (ARROW-8216) filter method for Dataset doesn't distinguish between empty strings and NAs

2020-03-25 Thread Sam Albers (Jira)
Sam Albers created ARROW-8216:
-

 Summary: filter method for Dataset doesn't distinguish between 
empty strings and NAs
 Key: ARROW-8216
 URL: https://issues.apache.org/jira/browse/ARROW-8216
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Affects Versions: 0.16.0
 Environment: R 3.6.3, Windows 10
Reporter: Sam Albers


 

I have just noticed some slightly odd behaviour with the filter method for 
Dataset. 
{code:java}
library(arrow)
library(dplyr)
packageVersion("arrow")
#> [1] '0.16.0.20200323'
## Make sample parquet
starwars$hair_color[starwars$hair_color == "brown"] <- ""
dir <- tempdir()
fpath <- file.path(dir, 'data.parquet')
write_parquet(starwars, fpath)
## df in memory
df_mem <- starwars %>% 
 filter(hair_color == "")
## reading from the parquet
df_parquet <- read_parquet(fpath) %>% 
 filter(hair_color == "")
## using open_dataset
df_dataset <- open_dataset(dir) %>% 
 filter(hair_color == "") %>% 
 collect()
{code}
I'm pretty sure all these should return the same data.frame. Am I missing 
something?
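
One way to compare the three results is to count, for each filtered data frame, how many `hair_color` values are empty strings and how many are NA. The sketch below is a minimal illustration in base R only: `count_empty_and_na` and the `toy` data frame are hypothetical names introduced here, not part of arrow or dplyr, and the toy data merely stands in for `starwars`.

```r
# Hypothetical helper: summarise a filtered result by how many
# hair_color values are empty strings and how many are NA.
count_empty_and_na <- function(df) {
  c(
    rows  = nrow(df),
    empty = sum(df$hair_color == "", na.rm = TRUE),
    na    = sum(is.na(df$hair_color))
  )
}

# Toy data standing in for starwars: two empty strings, one NA.
toy <- data.frame(
  name = c("a", "b", "c", "d"),
  hair_color = c("", NA, "blond", ""),
  stringsAsFactors = FALSE
)

# Base R analogue of filter(hair_color == ""): keep only rows where
# the comparison is TRUE; rows where it evaluates to NA are dropped.
expected <- toy[which(toy$hair_color == ""), , drop = FALSE]

count_empty_and_na(expected)
```

Applied to `df_mem`, `df_parquet` and `df_dataset` from the example above, a non-zero `na` count in only one of them would suggest that NA values are being matched as empty strings on that code path.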

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8118) dim method for FileSystemDataset

2020-03-13 Thread Sam Albers (Jira)
Sam Albers created ARROW-8118:
-

 Summary: dim method for FileSystemDataset
 Key: ARROW-8118
 URL: https://issues.apache.org/jira/browse/ARROW-8118
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Sam Albers


I have been using this function enough that I wonder a) whether it would be 
useful in the package and b) whether you think it is worth working on. The 
basic problem is that if you have a hierarchical file structure that 
works with open_dataset, it is useful to know how much data you are 
dealing with. My idea is that 'FileSystemDataset' would have dim, 
nrow and ncol methods. Here is how I've been using it:
{code:java}
library(arrow)
x <- open_dataset("data/rivers-data/", partitioning = c("prov", "month"))
dim_arrow <- function(x) {
 rows <- sum(purrr::map_dbl(
   x$files,
   ~ParquetFileReader$create(.x)$ReadTable()$num_rows
 ))
 cols <- x$schema$num_fields
 
 c(rows, cols)
}
dim_arrow(x)
#> [1] 426929 7
{code}
 

Ideally this would work on arrow_dplyr_query objects as well but I haven't 
quite figured out how that filters based on the partitioning variables.
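
As a rough illustration of the proposal, the sketch below shows that in R an S3 `dim` method is enough to make `nrow()` and `ncol()` work as well, since the base versions are defined as `dim(x)[1L]` and `dim(x)[2L]`. The `mock_dataset` class, its fields and the row counts are hypothetical stand-ins, not arrow objects; a real method would need to obtain row counts from the dataset's files.

```r
# Hypothetical stand-in for a FileSystemDataset: per-file row counts
# plus the column names of a shared schema. Not an arrow class.
mock_dataset <- structure(
  list(
    rows_per_file = c(100L, 250L, 50L),
    fields = c("prov", "month", "flow")
  ),
  class = "mock_dataset"
)

# S3 dim method: total rows across files, number of schema fields.
dim.mock_dataset <- function(x) {
  c(sum(x$rows_per_file), length(x$fields))
}

dim(mock_dataset)
nrow(mock_dataset)  # base nrow() is dim(x)[1L]
ncol(mock_dataset)  # base ncol() is dim(x)[2L]
```

With this design only one method has to be written and maintained, and the row and column counts cannot drift apart.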





[jira] [Created] (ARROW-8075) Loading R.utils after arrow breaks some arrow functions

2020-03-11 Thread Sam Albers (Jira)
Sam Albers created ARROW-8075:
-

 Summary: Loading R.utils after arrow breaks some arrow functions
 Key: ARROW-8075
 URL: https://issues.apache.org/jira/browse/ARROW-8075
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Affects Versions: 0.16.0
 Environment: - Session info --
 setting  value
 version  R version 3.6.3 (2020-02-29)
 os       Windows 10 x64
 system   x86_64, mingw32
 ui       RStudio
 language (EN)
 collate  English_Canada.1252
 ctype    English_Canada.1252
 tz       America/Los_Angeles
 date     2020-03-11

- Packages --
 package     * version    date       lib source
 arrow       * 0.16.0.2   2020-02-14 [1] CRAN (R 3.6.2)
 assertthat    0.2.1      2019-03-21 [1] CRAN (R 3.6.0)
 backports     1.1.5      2019-10-02 [1] CRAN (R 3.6.1)
 bit           1.1-15.2   2020-02-10 [1] CRAN (R 3.6.2)
 bit64         0.9-7      2017-05-08 [1] CRAN (R 3.6.0)
 callr         3.4.2      2020-02-12 [1] CRAN (R 3.6.2)
 cli           2.0.2      2020-02-28 [1] CRAN (R 3.6.2)
 crayon        1.3.4      2017-09-16 [1] CRAN (R 3.6.0)
 desc          1.2.0      2018-05-01 [1] CRAN (R 3.6.0)
 devtools      2.2.2      2020-02-17 [1] CRAN (R 3.6.2)
 digest        0.6.25     2020-02-23 [1] CRAN (R 3.6.2)
 ellipsis      0.3.0      2019-09-20 [1] CRAN (R 3.6.1)
 fansi         0.4.1      2020-01-08 [1] CRAN (R 3.6.2)
 fs            1.3.1      2019-05-06 [1] CRAN (R 3.6.0)
 glue          1.3.1      2019-03-12 [1] CRAN (R 3.6.0)
 magrittr      1.5        2014-11-22 [1] CRAN (R 3.6.0)
 memoise       1.1.0      2017-04-21 [1] CRAN (R 3.6.0)
 packrat       0.5.0      2018-11-14 [1] CRAN (R 3.6.0)
 pkgbuild      1.0.6      2019-10-09 [1] CRAN (R 3.6.1)
 pkgload       1.0.2      2018-10-29 [1] CRAN (R 3.6.0)
 prettyunits   1.1.1      2020-01-24 [1] CRAN (R 3.6.2)
 processx      3.4.2      2020-02-09 [1] CRAN (R 3.6.2)
 ps            1.3.2      2020-02-13 [1] CRAN (R 3.6.2)
 purrr         0.3.3      2019-10-18 [1] CRAN (R 3.6.1)
 R.methodsS3 * 1.8.0      2020-02-14 [1] CRAN (R 3.6.2)
 R.oo        * 1.23.0     2019-11-03 [1] CRAN (R 3.6.1)
 R.utils     * 2.9.2      2019-12-08 [1] CRAN (R 3.6.1)
 R6            2.4.1      2019-11-12 [1] CRAN (R 3.6.1)
 Rcpp          1.0.3      2019-11-08 [1] CRAN (R 3.6.1)
 remotes       2.1.1      2020-02-15 [1] CRAN (R 3.6.2)
 rlang         0.4.4      2020-01-28 [1] CRAN (R 3.6.2)
 rprojroot     1.3-2      2018-01-03 [1] CRAN (R 3.6.0)
 rstudioapi    0.11       2020-02-07 [1] CRAN (R 3.6.2)
 sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 3.6.0)
 testthat      2.3.1      2019-12-01 [1] CRAN (R 3.6.2)
 tidyselect    1.0.0      2020-01-27 [1] CRAN (R 3.6.2)
 usethis       1.5.1.9000 2020-01-31 [1] Github (r-lib/usethis@c31336d)
 vctrs         0.2.3      2020-02-20 [1] CRAN (R 3.6.2)
 withr         2.1.2      2018-03-15 [1] CRAN (R 3.6.0)

[1] C:/Users/salbers/R/win-library/3.6
[2] C:/Program Files/R/R-3.6.3/library
Reporter: Sam Albers


I am writing this as an FYI because it caught me today. My hope is that this 
will one day help solve a bug or act as a clue if/when you encounter this 
behaviour. I don't have time at the moment to track down exactly what is 
happening, so unfortunately I am just sharing it as is. The issue occurs when 
one loads the R.utils package after loading arrow. This is likely an issue 
with R.utils and therefore not strictly a bug in arrow, but I still thought it 
would be useful to share:

 
{code:java}
library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#> timestamp
write_parquet(iris, 'iris.parquet')
pq <- ParquetFileReader$create('iris.parquet')
library(R.utils)
#> Loading required package: R.oo
#> Loading required package: R.methodsS3
#> 

[jira] [Created] (ARROW-7796) arrow::write_* functions should return their inputs

2020-02-07 Thread Sam Albers (Jira)
Sam Albers created ARROW-7796:
-

 Summary: arrow::write_* functions should return their inputs
 Key: ARROW-7796
 URL: https://issues.apache.org/jira/browse/ARROW-7796
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Affects Versions: 0.15.1
 Environment: Windows 10, R 3.6.2
Reporter: Sam Albers


 

I am wondering if you'd consider a slight change to what is returned by the 
write_* functions. In {readr}, the write functions return their input, which is 
very useful for saving intermediate objects within a pipeline. I'd be happy to 
take this on and submit it as a pull request. 
{code:java}

 library(arrow)
 #> 
 #> Attaching package: 'arrow'
 #> The following object is masked from 'package:utils':
 #> 
 #> timestamp
 library(readr)
 #> 
 #> Attaching package: 'readr'
 #> The following object is masked from 'package:arrow':
 #> 
 #> read_table
iris_arrow <- write_parquet(iris, "iris.parquet")
 iris_arrow
 #> NULL
iris_readr <- write_csv(iris, "iris.csv")
head(iris_readr)
 #> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
 #> 1 5.1 3.5 1.4 0.2 setosa
 #> 2 4.9 3.0 1.4 0.2 setosa
 #> 3 4.7 3.2 1.3 0.2 setosa
 #> 4 4.6 3.1 1.5 0.2 setosa
 #> 5 5.0 3.6 1.4 0.2 setosa
 #> 6 5.4 3.9 1.7 0.4 setosa
devtools::session_info()
 #> - Session info ---
 #> setting value 
 #> version R version 3.6.2 (2019-12-12)
 #> os Windows 10 x64 
 #> system x86_64, mingw32 
 #> ui RTerm 
 #> language (EN) 
 #> collate English_Canada.1252 
 #> ctype English_Canada.1252 
 #> tz America/Los_Angeles 
 #> date 2020-02-07 
 #> 
 #> - Packages ---
 #> package * version date lib source 
 #> arrow * 0.15.1.20200207 2020-02-07 [1] local 
 #> assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.6.0) 
 #> backports 1.1.5 2019-10-02 [1] CRAN (R 3.6.1) 
 #> bit 1.1-15.1 2020-01-14 [1] CRAN (R 3.6.2) 
 #> bit64 0.9-7 2017-05-08 [1] CRAN (R 3.6.0) 
 #> callr 3.4.1 2020-01-24 [1] CRAN (R 3.6.2) 
 #> cli 2.0.1 2020-01-08 [1] CRAN (R 3.6.2) 
 #> crayon 1.3.4 2017-09-16 [1] CRAN (R 3.6.0) 
 #> desc 1.2.0 2018-05-01 [1] CRAN (R 3.6.0) 
 #> devtools 2.2.1 2019-09-24 [1] CRAN (R 3.6.2) 
 #> digest 0.6.23 2019-11-23 [1] CRAN (R 3.6.1) 
 #> ellipsis 0.3.0 2019-09-20 [1] CRAN (R 3.6.1) 
 #> evaluate 0.14 2019-05-28 [1] CRAN (R 3.6.0) 
 #> fansi 0.4.1 2020-01-08 [1] CRAN (R 3.6.2) 
 #> fs 1.3.1 2019-05-06 [1] CRAN (R 3.6.0) 
 #> glue 1.3.1 2019-03-12 [1] CRAN (R 3.6.0) 
 #> highr 0.8 2019-03-20 [1] CRAN (R 3.6.0) 
 #> hms 0.5.3 2020-01-08 [1] CRAN (R 3.6.2) 
 #> htmltools 0.4.0 2019-10-04 [1] CRAN (R 3.6.1) 
 #> knitr 1.27 2020-01-16 [1] CRAN (R 3.6.2) 
 #> magrittr 1.5 2014-11-22 [1] CRAN (R 3.6.0) 
 #> memoise 1.1.0 2017-04-21 [1] CRAN (R 3.6.0) 
 #> pillar 1.4.3 2019-12-20 [1] CRAN (R 3.6.2) 
 #> pkgbuild 1.0.6 2019-10-09 [1] CRAN (R 3.6.1) 
 #> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 3.6.1) 
 #> pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.6.0) 
 #> prettyunits 1.1.1 2020-01-24 [1] CRAN (R 3.6.2) 
 #> processx 3.4.1 2019-07-18 [1] CRAN (R 3.6.1) 
 #> ps 1.3.0 2018-12-21 [1] CRAN (R 3.6.0) 
 #> purrr 0.3.3 2019-10-18 [1] CRAN (R 3.6.1) 
 #> R6 2.4.1 2019-11-12 [1] CRAN (R 3.6.1) 
 #> Rcpp 1.0.3 2019-11-08 [1] CRAN (R 3.6.1) 
 #> readr * 1.3.1 2018-12-21 [1] CRAN (R 3.6.1) 
 #> remotes 2.1.0 2019-06-24 [1] CRAN (R 3.6.1) 
 #> rlang 0.4.3 2020-01-24 [1] CRAN (R 3.6.2) 
 #> rmarkdown 2.1 2020-01-20 [1] CRAN (R 3.6.2) 
 #> rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.6.0) 
 #> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.6.0) 
 #> stringi 1.4.4 2020-01-09 [1] CRAN (R 3.6.2) 
 #> stringr 1.4.0 2019-02-10 [1] CRAN (R 3.6.2) 
 #> testthat 2.3.1 2019-12-01 [1] CRAN (R 3.6.1) 
 #> tibble 2.1.3 2019-06-06 [1] CRAN (R 3.6.2) 
 #> tidyselect 0.2.5 2018-10-11 [1] CRAN (R 3.6.2) 
 #> usethis 1.5.1.9000 2020-01-31 [1] Github (r-lib/usethis@c31336d)
 #> vctrs 0.2.2 2020-01-24 [1] CRAN (R 3.6.2) 
 #> withr 2.1.2 2018-03-15 [1] CRAN (R 3.6.0) 
 #> xfun 0.12 2020-01-13 [1] CRAN (R 3.6.2) 
 #> yaml 2.2.0 2018-07-25 [1] CRAN (R 3.6.2) 
 #> 
 #> [1] C:/Users/salbers/R/win-library/3.6
 #> [2] C:/Program Files/R/R-3.6.2/library
{code}
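
The requested behaviour can be sketched in base R as a wrapper that calls a writer and then returns its input invisibly. `write_and_return` is a hypothetical name used only for illustration, and `utils::write.csv` stands in for the arrow writers; this is not how arrow or readr implement it.

```r
# Hypothetical wrapper: call any writer function, then return the
# input invisibly so it can be assigned or passed along a pipeline.
write_and_return <- function(x, path, writer, ...) {
  writer(x, path, ...)
  invisible(x)
}

csv_path <- tempfile(fileext = ".csv")
iris_out <- write_and_return(iris, csv_path, utils::write.csv,
                             row.names = FALSE)

identical(iris_out, iris)  # the input comes back unchanged
file.exists(csv_path)      # and the file has been written
```

Returning the value with `invisible()` keeps the console quiet when the function is called only for its side effect, while still allowing the result to be assigned or piped onward.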
 


