[
https://issues.apache.org/jira/browse/ARROW-13616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mauricio 'Pachá' Vargas Sepúlveda updated ARROW-13616:
------------------------------------------------------
Description:
Hi
I've created a folder on Google Drive that contains:
* SVG (Inkscape) drafts for the cheat sheet
* Arrow hex icon (SVG)
* *A document with the proposed text, please feel free to comment here*
Link:
[https://drive.google.com/drive/folders/1YEJdPuhLCwkl8r3hSBxbnP1hYq04fW13?usp=sharing]
Please open it and I'll give access to Voltron and Community collaborators.
was:
h1. *Front page*
h2. About
Apache Arrow is a development platform for in-memory analytics. It contains a
set of technologies that enable big data systems to process and move data fast.
The arrow R package integrates with dplyr and allows you to work with multiple
storage formats as well as data in AWS S3 and other similar cloud storage
systems.
h2. Installation
Our goal is to make the package just work on Windows, Mac and Linux.
*On Windows and Mac:*
{{install.packages("arrow")}}
*On Linux:*
{{Sys.setenv(NOT_CRAN = TRUE)}}
{{install.packages("arrow")}}
Follow the same steps to update.
h2. Import
To read Parquet/Feather data from a directory, you can specify a partitioning for
efficient filtering:
{{d <- open_dataset("nyc-taxi",}}
{{ partitioning = c("year",}}
{{ "month"))}}
For *single files* you can do either:
{{read_parquet("gapminder.parquet")}}
{{read_feather("gapminder.feather")}}
Arrow can also read large CSV and JSON files with excellent speed and
efficiency:
{{read_csv_arrow("gapminder.csv")}}
{{read_json_arrow("gapminder.json")}}
-This reads data as data.frame.-
h2. Dplyr compatibility
Combining Arrow and dplyr allows efficient reading, since dplyr filters "know"
which files to read and which to skip based on the partitioning:
{{d %>%}}
{{ filter(year == 2009,}}
{{ month == 1) %>%}}
{{ collect() %>%}}
{{ group_by(year, month) %>%}}
{{ summarise(mean_amount =}}
{{ mean(total_amount))}}
{{collect()}} converts Arrow objects into regular tibbles, so you can use your
data with your existing visualisation and analysis workflow.
Working with Arrow in R is similar to working with SQL databases in R through
RPostgres and other packages.
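To illustrate the point about {{collect()}}, here is a minimal sketch. The {{trips}} data frame and its values are made up for this example, and it assumes the arrow and dplyr packages are installed. It builds a query on an Arrow Table and only materialises it as a tibble when {{collect()}} is called:

```r
library(arrow)
library(dplyr)

# Hypothetical toy data standing in for the taxi dataset
trips <- data.frame(
  year = c(2009L, 2009L, 2010L),
  month = c(1L, 1L, 2L),
  total_amount = c(10.5, 7.25, 12)
)

# Convert the data frame to an Arrow Table
tab <- Table$create(trips)

# dplyr verbs on an Arrow Table build a query; nothing is pulled into R yet
query <- tab %>%
  filter(year == 2009, month == 1)

# collect() evaluates the query and returns a regular tibble
result <- collect(query)
```

Until {{collect()}} is called, {{query}} is an Arrow query object rather than a data frame, which is broadly how dplyr behaves with database backends.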
Hint: If an operation is not (yet) implemented in Arrow, you can collect first
and then apply the operation. For example, mutate is implemented, but support
for summarise and distinct will be announced later.
h2. Export
When saving data stored in a tibble to Parquet format, the default partitioning
is based on any groups in the tibble. To save with partitioning:
{{d2 %>%}}
{{ write_dataset("nyc-summary",}}
{{ hive_style = FALSE)}}
This creates one folder per partition, such as 2015/01, 2015/02, etc. Hint:
experiment with setting {{hive_style}} to TRUE.
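As a rough sketch of what the {{hive_style}} hint changes, the example below writes the same grouped data twice and lists the resulting files. The {{trips}} data frame and the output folder names are invented for illustration, and the code assumes the arrow and dplyr packages are installed:

```r
library(arrow)
library(dplyr)

# Hypothetical toy data; the real cheat sheet uses the NYC taxi data
trips <- data.frame(
  year = c(2015L, 2015L, 2016L),
  month = c(1L, 2L, 1L),
  total_amount = c(10.5, 7.25, 12)
)

# Write into fresh folders under the session's temporary directory
plain_dir <- file.path(tempdir(), "nyc-plain")
hive_dir <- file.path(tempdir(), "nyc-hive")

# The grouping columns are used as the partitioning by default
trips %>%
  group_by(year, month) %>%
  write_dataset(plain_dir, hive_style = FALSE)

trips %>%
  group_by(year, month) %>%
  write_dataset(hive_dir, hive_style = TRUE)

# Compare the two folder layouts
plain_files <- list.files(plain_dir, recursive = TRUE)
hive_files <- list.files(hive_dir, recursive = TRUE)
print(plain_files)
print(hive_files)
```

With {{hive_style = TRUE}} the folder names include the column name (key=value), while with FALSE they contain only the values, so the partition columns have to be named again when the dataset is opened.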
You can also save without partitioning:
{{write_parquet(d2, "d2.parquet")}}
{{write_feather(d2, "d2.feather")}}
-To save without partitioning, you can use:-
{{-write_parquet(d2, "mydata.parquet")-}}
{{-write_feather(d2, "mydata.feather")-}}
{{-write_csv_arrow(d2, "mydata.csv")-}}
-The read_ counterparts of these functions work exactly like read_csv.-
h2. S3 support
You can read files from S3 filesystems without having to download them first:
{{d2 <- open_dataset(}}
{{ "s3://ursa-labs-taxi-data",}}
{{ partitioning = c("year",}}
{{ "month"))}}
You can also copy the data to your computer:
{{copy_files(}}
{{ "s3://ursa-labs-taxi-data",}}
{{ "~/nyc-taxi")}}
h1. Back page
h2. Generic S3 filesystems?
h2. Specific writing operations?
h2. More on dplyr compatibility?
h2. Mention something you would like to see here
> [R] Cheat Sheet Structure
> -------------------------
>
> Key: ARROW-13616
> URL: https://issues.apache.org/jira/browse/ARROW-13616
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Documentation, R
> Affects Versions: 5.0.0
> Reporter: Mauricio 'Pachá' Vargas Sepúlveda
> Assignee: Mauricio 'Pachá' Vargas Sepúlveda
> Priority: Major
>