[ 
https://issues.apache.org/jira/browse/ARROW-13616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-13616:
------------------------------------
    Component/s: Documentation

> [R] Cheat Sheet Structure
> -------------------------
>
>                 Key: ARROW-13616
>                 URL: https://issues.apache.org/jira/browse/ARROW-13616
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Documentation, R
>    Affects Versions: 5.0.0
>            Reporter: Mauricio 'Pachá' Vargas Sepúlveda
>            Priority: Major
>
> h1. *Front page*
> h2. About
> Apache Arrow is a development platform for in-memory analytics. It contains a 
> set of technologies that enable big data systems to process and move data 
> fast.
> The arrow R package integrates with dplyr and lets you work with multiple 
> storage formats, as well as with data in AWS S3 and similar cloud storage 
> systems.
> h2. Installation
> Our goal is to make the package just work on Windows, Mac and Linux.
> *On Windows and Mac:*
> {{install.packages("arrow")}}
> *On Linux:*
> {{Sys.setenv(NOT_CRAN = TRUE)}}
> {{install.packages("arrow")}}
> To update the package, repeat the installation steps above.
> h2. Import
> To read Parquet/Feather data from a directory, you can specify a 
> partitioning for efficient filtering:
> {{d <- open_dataset("nyc-taxi",}}
> {{  partitioning = c("year", "month"))}}
> For *single files* you can use either:
> {{read_parquet("gapminder.parquet")}}
> {{read_feather("gapminder.feather")}}
> Arrow can also read large CSV and JSON files with excellent speed and 
> efficiency: 
> {{read_csv_arrow("gapminder.csv")}}
> {{read_json_arrow("gapminder.json")}}
> h2. Dplyr compatibility
> Combining Arrow with dplyr allows efficient reading, since dplyr filters 
> "know" which files to read and which to skip based on the partitioning:
> {{d %>%}}
> {{  filter(year == 2009, month == 1) %>%}}
> {{  collect() %>%}}
> {{  group_by(year, month) %>%}}
> {{  summarise(mean_amount = mean(total_amount))}}
> collect() converts Arrow objects into regular tibbles. This then allows you 
> to use your data with your existing visualisation and analysis workflow.
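> As a sketch of that workflow, assuming the {{d}} dataset opened above and 
> that ggplot2 is installed (column names follow the taxi example and may 
> differ in your data):
> {{library(ggplot2)}}
> {{d %>%}}
> {{  filter(year == 2009) %>%}}
> {{  collect() %>%  # from here on, regular dplyr on a tibble}}
> {{  group_by(month) %>%}}
> {{  summarise(mean_amount = mean(total_amount)) %>%}}
> {{  ggplot(aes(month, mean_amount)) +}}
> {{  geom_col()}}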
> Working with Arrow in R shares most of the characteristics of working with 
> SQL in R through RPostgres and other packages.
> Hint: if an operation is not implemented (yet) in Arrow, you can collect() 
> first and then apply the operation. For example, mutate() is implemented, 
> while summarise() and distinct() will come in a later release.
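> A minimal sketch of this pattern, using the {{d}} dataset opened above 
> ({{distinct()}} stands in here for any not-yet-supported verb):
> {{d %>%}}
> {{  filter(year == 2009) %>%}}
> {{  collect() %>%   # pull the filtered data into a tibble first}}
> {{  distinct(month) # then apply the operation in regular dplyr}}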
> h2. Export
> When saving data stored in a tibble to Parquet format, the default 
> partitioning is based on any groups in the tibble. To save with partitioning:
> {{d2 %>%}}
> {{  write_dataset("nyc-summary",}}
> {{    hive_style = FALSE)}}
> This will create folders such as 2015/01, 2015/02, etc. Hint: experiment 
> with setting hive_style to TRUE.
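> For comparison, a sketch of the Hive-style layout (the {{nyc-summary-hive}} 
> path is just an illustrative name):
> {{d2 %>%}}
> {{  write_dataset("nyc-summary-hive",}}
> {{    hive_style = TRUE)  # folders are named like year=2015/month=01}}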
> You can also save without partitioning:
> {{write_parquet(d2, "d2.parquet")}}
> {{write_feather(d2, "d2.feather")}}
> h2. S3 support
> You can read files from S3 filesystems without having to download them first:
> {{d2 <- open_dataset(}}
> {{  "s3://ursa-labs-taxi-data",}}
> {{  partitioning = c("year", "month"))}}
> You can also copy the data to your computer:
> {{copy_files(}}
> {{  "s3://ursa-labs-taxi-data",}}
> {{  "~/nyc-taxi")}}
> h1. Back page
> h2. Generic S3 filesystems?
> h2. Specific writing operations?
> h2. More on dplyr compatibility?
> h2. Mention something you would like to see here



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
