djnavarro commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1029911635


##########
r/vignettes/dataset.Rmd:
##########
@@ -1,157 +1,100 @@
 ---
-title: "Working with Arrow Datasets and dplyr"
+title: "Working with multi-file data sets"
+description: >
+  Learn how to use Datasets to read, write, and analyze 
+  multi-file larger-than-memory data
 output: rmarkdown::html_vignette
-vignette: >
-  %\VignetteIndexEntry{Working with Arrow Datasets and dplyr}
-  %\VignetteEngine{knitr::rmarkdown}
-  %\VignetteEncoding{UTF-8}
 ---
 
-Apache Arrow lets you work efficiently with large, multi-file datasets.
-The arrow R package provides a [dplyr](https://dplyr.tidyverse.org/) interface to Arrow Datasets,
-and other tools for interactive exploration of Arrow data.
+Apache Arrow lets you work efficiently with multi-file data sets even when that data set is too large to be loaded into memory. With the help of Arrow Dataset objects you can analyze this kind of data using familiar [dplyr](https://dplyr.tidyverse.org/) syntax. This article introduces Datasets and shows you how to analyze them with dplyr and arrow: we'll start by ensuring both packages are loaded
 
-This vignette introduces Datasets and shows how to use dplyr to analyze them.
+```{r}
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
+```
 
 ## Example: NYC taxi data
 
-The [New York City taxi trip record data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
-is widely used in big data exercises and competitions.
-For demonstration purposes, we have hosted a Parquet-formatted version
-of about ten years of the trip data in a public Amazon S3 bucket.
+The primary motivation for multi-file Datasets is to allow users to analyze extremely large data sets. As an example, consider the [New York City taxi trip record data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) that is widely used in big data exercises and competitions. To demonstrate the capabilities of Apache Arrow we host a Parquet-formatted version of this data in a public Amazon S3 bucket: in its full form, our version of the data set is one very large table with about 1.7 billion rows and 24 columns, where each row corresponds to a single taxi ride sometime between 2009 and 2022. A [data dictionary](https://arrow-user2022.netlify.app/packages-and-data.html#data) for this version of the NYC taxi data is also available.
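As a minimal sketch of the workflow the new text describes, querying the hosted data set with arrow and dplyr might look like the following. Note that the S3 URI and the `year` partition column are assumptions for illustration; they are not confirmed by this diff:

```r
# Sketch only: the bucket URI and the `year` partition column below are
# illustrative assumptions, not confirmed by this PR.
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

# open_dataset() indexes the multi-file Parquet data set lazily;
# nothing is pulled into memory at this point.
nyc_taxi <- open_dataset("s3://voltrondata-labs-datasets/nyc-taxi")

# dplyr verbs build up a query that Arrow evaluates out of memory;
# collect() brings only the (small) summarized result into R.
nyc_taxi |>
  filter(year == 2019) |>
  summarise(n_rides = n()) |>
  collect()
```

Because evaluation is deferred until `collect()`, this pattern works even though the full table (about 1.7 billion rows) is far larger than memory.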

Review Comment:
   These are all great thank you!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
