[jira] [Updated] (ARROW-13616) [R] Cheat Sheet Structure

Jira Thu, 12 Aug 2021 09:35:06 -0700


     [ 
https://issues.apache.org/jira/browse/ARROW-13616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Mauricio 'Pachá' Vargas Sepúlveda updated ARROW-13616:
------------------------------------------------------
    Description: 
Hi

I've created a folder on Google Drive that contains:
 * SVG (Inkscape) drafts for the cheat sheet
 * Arrow hex icon (SVG)
 * *A document with the proposed text, please feel free to comment here*

Link: 
[https://drive.google.com/drive/folders/1YEJdPuhLCwkl8r3hSBxbnP1hYq04fW13?usp=sharing]

Please open it and I'll give access to Voltron and Community collaborators.

 

  was:
h1. *Front page*
h2. About

Apache Arrow is a development platform for in-memory analytics. It contains a 
set of technologies that enable big data systems to process and move data fast.

The arrow R package integrates with dplyr and allows you to work with multiple 
storage formats as well as data in AWS S3 and other similar cloud storage 
systems.
h2. Installation

Our goal is to make the package just work on Windows, Mac and Linux.

*On Windows and Mac:*

{{install.packages("arrow")}}

*On Linux:*

{{Sys.setenv(NOT_CRAN = TRUE)}}

{{install.packages("arrow")}}

Follow the same steps to update.
h2. Import

To read Parquet/Feather data from a directory, you can specify a partitioning for 
efficient filtering:

{{d <- open_dataset("nyc-taxi",}}
{{  partitioning = c("year", "month"))}}
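
As a minimal sketch of how a partitioned directory and {{open_dataset()}} fit together, the example below writes a tiny data frame to a temporary folder and opens it again. The {{trips}} data and the {{toy-taxi}} folder name are made up for illustration; they are not the real NYC taxi data.

```r
library(arrow)
library(dplyr)

# Hypothetical toy data standing in for the taxi dataset
trips <- data.frame(
  year = c(2009L, 2009L, 2010L, 2010L),
  month = c(1L, 2L, 1L, 2L),
  total_amount = c(10.5, 12.0, 9.75, 20.0)
)

# Write one folder level per partitioning column (non-hive style,
# so the folder names are the bare partition values)
path <- file.path(tempdir(), "toy-taxi")
write_dataset(trips, path,
  partitioning = c("year", "month"),
  hive_style = FALSE
)

# Name the folder levels again when opening, since they are not
# recorded in the paths themselves
d <- open_dataset(path, partitioning = c("year", "month"))

# Keep only the rows for January 2009
jan_2009 <- d %>%
  filter(year == 2009, month == 1) %>%
  collect()
```

Because the folder names carry no column names in this layout, the {{partitioning}} argument tells Arrow what each directory level means.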

For *single files* you can do either:
 {{read_parquet("gapminder.parquet")}}
 {{read_feather("gapminder.feather")}}

Arrow can also read large CSV and JSON files with excellent speed and 
efficiency: 
 {{read_csv_arrow("gapminder.csv")}}
 {{read_json_arrow("gapminder.json")}}

-This reads data as data.frame.-
h2. Dplyr compatibility

Combining Arrow and dplyr allows efficient reading, since dplyr filters "know" 
which files to read and which to skip based on the partitioning:

{{d %>%}}
{{  filter(year == 2009, month == 1) %>%}}
{{  collect() %>%}}
{{  group_by(year, month) %>%}}
{{  summarise(mean_amount = mean(total_amount))}}

{{collect()}} converts Arrow-type objects into regular tibbles. You can then 
use your data with your existing visualisation and analysis workflow.

Arrow in R shares most of the characteristics of SQL in R through RPostgres 
and other packages.

Hint: If an operation is not implemented (yet) in Arrow, you can call 
{{collect()}} first and then apply the operation to the resulting tibble. For 
example, {{mutate}} is implemented, but {{summarise}} and {{distinct}} will be 
announced later.
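
A small sketch of this collect-then-operate pattern, assuming an in-memory Arrow Table built from made-up values (the {{trips}} object is hypothetical):

```r
library(arrow)
library(dplyr)

# Hypothetical in-memory Arrow Table
trips <- Table$create(data.frame(
  year = c(2009L, 2009L, 2009L, 2010L),
  month = c(1L, 1L, 2L, 1L),
  total_amount = c(10, 20, 30, 40)
))

result <- trips %>%
  filter(year == 2009) %>% # evaluated by Arrow
  collect() %>% # from here on, an in-memory R data frame
  group_by(year, month) %>% # plain dplyr
  summarise(mean_amount = mean(total_amount), .groups = "drop")
```

Filtering before {{collect()}} keeps the amount of data pulled into R small; everything after it runs as ordinary dplyr code.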
h2. Export

When saving data stored in a tibble to Parquet format, the default partitioning 
is based on any groups in the tibble. To save with partitioning:

{{d2 %>%}}
{{  write_dataset("nyc-summary",}}
{{  hive_style = FALSE)}}

This creates folders such as 2015/01, 2015/02, etc. Hint: experiment with 
setting {{hive_style}} to TRUE.

You can also save without partitioning:

{{write_parquet(d2, "d2.parquet")}}
 {{write_feather(d2, "d2.feather")}}
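
To illustrate, here is a hedged round-trip sketch that writes a small made-up data frame to single Parquet and Feather files in a temporary folder and reads both back. The contents of {{d2}} here are hypothetical.

```r
library(arrow)

# Hypothetical data frame to save
d2 <- data.frame(
  country = c("Chile", "Peru"),
  pop = c(19.1, 33.0)
)

parquet_path <- file.path(tempdir(), "d2.parquet")
feather_path <- file.path(tempdir(), "d2.feather")

# Save without partitioning: one file per call
write_parquet(d2, parquet_path)
write_feather(d2, feather_path)

# Read the files back into R
from_parquet <- read_parquet(parquet_path)
from_feather <- read_feather(feather_path)
```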

-To save without partitioning, you can use:-
 {{-write_parquet(d2, "mydata.parquet")-}}
 {{-write_feather(d2, "mydata.feather")-}}
 {{-write_csv_arrow(d2, "mydata.csv")-}}
 -The read_ counterparts of these functions work exactly like read_csv.-
h2. S3 support

You can read files from S3 filesystems without having to download them first:

{{d2 <- open_dataset(}}
{{  "s3://ursa-labs-taxi-data",}}
{{  partitioning = c("year", "month"))}}

You can also copy the data to your computer:

{{copy_files(}}
{{  "s3://ursa-labs-taxi-data",}}
{{  "~/nyc-taxi")}}
h1. Back page
h2. Generic S3 filesystems?
h2. Specific writing operations?
h2. More on dplyr compatibility?
h2. Mention something you would like to see here

> [R] Cheat Sheet Structure
> -------------------------
>
>                 Key: ARROW-13616
>                 URL: https://issues.apache.org/jira/browse/ARROW-13616
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Documentation, R
>    Affects Versions: 5.0.0
>            Reporter: Mauricio 'Pachá' Vargas Sepúlveda
>            Assignee: Mauricio 'Pachá' Vargas Sepúlveda
>            Priority: Major
>
> Hi
> I've created a folder on Google Drive that contains:
>  * SVG (Inkscape) drafts for the cheat sheet
>  * Arrow hex icon (SVG)
>  * *A document with the proposed text, please feel free to comment here*
> Link: 
> [https://drive.google.com/drive/folders/1YEJdPuhLCwkl8r3hSBxbnP1hYq04fW13?usp=sharing]
> Please open it and I'll give access to Voltron and Community collaborators.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
