[
https://issues.apache.org/jira/browse/ARROW-13616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Neal Richardson updated ARROW-13616:
------------------------------------
Component/s: Documentation
> [R] Cheat Sheet Structure
> -------------------------
>
> Key: ARROW-13616
> URL: https://issues.apache.org/jira/browse/ARROW-13616
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Documentation, R
> Affects Versions: 5.0.0
> Reporter: Mauricio 'Pachá' Vargas Sepúlveda
> Priority: Major
>
> h1. Front page
> h2. About
> Apache Arrow is a development platform for in-memory analytics. It contains a
> set of technologies that enable big data systems to process and move data
> fast.
> The arrow R package integrates with dplyr and allows you to work with
> multiple storage formats as well as data in AWS S3 and other similar cloud
> storage systems.
> h2. Installation
> Our goal is to make the package just work on Windows, Mac and Linux.
> *On Windows and Mac:*
> {{install.packages("arrow")}}
> *On Linux:*
> {{Sys.setenv(NOT_CRAN = TRUE)}}
> {{install.packages("arrow")}}
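> After installing, a quick sanity check is to print {{arrow_info()}}, which
> reports the version and which optional capabilities (such as S3 support)
> were built in:
> {{library(arrow)}}
> {{arrow_info()}}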
> To update the package, follow the same steps.
> h2. Import
> To read Parquet/Feather data from a directory, you can specify a partitioning
> for efficient filtering:
> {{d <- open_dataset("nyc-taxi",}}
> {{ partitioning = c("year",}}
> {{ "month"))}}
> For *single files* you can use either:
> {{read_parquet("gapminder.parquet")}}
> {{read_feather("gapminder.feather")}}
> Arrow can also read large CSV and JSON files with excellent speed and
> efficiency:
> {{read_csv_arrow("gapminder.csv")}}
> {{read_json_arrow("gapminder.json")}}
> By default, these functions read the data in as an R data.frame.
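> If you prefer to keep the data in Arrow memory rather than materialising a
> data.frame, the readers take an {{as_data_frame}} argument (a minimal sketch):
> {{t <- read_csv_arrow("gapminder.csv",}}
> {{ as_data_frame = FALSE)}}
> {{t  # an Arrow Table instead of a data.frame}}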
> h2. dplyr compatibility
> Combining Arrow with dplyr allows efficient reading, since dplyr filters
> "know" which files to read and which to skip based on the partitioning:
> {{d %>%}}
> {{ filter(year == 2009,}}
> {{ month == 1) %>%}}
> {{ collect() %>%}}
> {{ group_by(year, month) %>%}}
> {{ summarise(mean_amount =}}
> {{ mean(total_amount))}}
> collect() converts Arrow objects into regular tibbles, which lets you feed the
> data into your existing visualisation and analysis workflow.
> Working with Arrow in R feels much like working with SQL in R through RPostgres
> and similar packages.
> Hint: if an operation is not implemented (yet) in Arrow, you can collect() and
> then apply the operation to the resulting tibble. For example, mutate() is
> implemented, but summarise() and distinct() are not supported yet.
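> For instance, to de-duplicate you can run everything Arrow supports first, then
> collect() and finish in plain dplyr (a sketch; {{payment_type}} stands in for
> whichever column you care about):
> {{d %>%}}
> {{ select(payment_type) %>%}}
> {{ collect() %>%}}
> {{ distinct()}}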
> h2. Export
> When saving a tibble to Parquet format, the default partitioning is based on
> any groups in the tibble. To save with partitioning:
> {{d2 %>%}}
> {{ write_dataset("nyc-summary",}}
> {{ hive_style = FALSE)}}
> This creates nested folders like 2015/01, 2015/02, etc. Hint: try
> {{hive_style = TRUE}} to get self-describing names like year=2015/month=1.
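> If the tibble is not grouped, you can instead name the partition columns
> explicitly (a sketch assuming d2 has year and month columns):
> {{d2 %>%}}
> {{ write_dataset("nyc-summary",}}
> {{ partitioning = c("year", "month"))}}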
> You can also save to a single file without partitioning:
> {{write_parquet(d2, "d2.parquet")}}
> {{write_feather(d2, "d2.feather")}}
> {{write_csv_arrow(d2, "d2.csv")}}
> h2. S3 support
> You can read files from S3 without having to download them first:
> {{d2 <- open_dataset(}}
> {{ "s3://ursa-labs-taxi-data",}}
> {{ partitioning = c("year",}}
> {{ "month"))}}
> You can also copy the data to your computer:
> {{copy_files(}}
> {{ "s3://ursa-labs-taxi-data", }}
> {{ "~/nyc-taxi")}}
> h1. Back page
> h2. Generic S3 filesystems?
> h2. Specific writing operations?
> h2. More on dplyr compatibility?
> h2. Mention something you would like to see here
--
This message was sent by Atlassian Jira
(v8.3.4#803005)