[GitHub] [arrow] thisisnic commented on a change in pull request #10765: ARROW-13399: [R] Update dataset.Rmd vignette

GitBox Thu, 29 Jul 2021 04:41:36 -0700


thisisnic commented on a change in pull request #10765:
URL: https://github.com/apache/arrow/pull/10765#discussion_r679073277




##########
File path: r/vignettes/dataset.Rmd
##########
@@ -77,39 +79,44 @@ feel free to grab only a year or two of data.
 
 If you don't have the taxi data downloaded, the vignette will still run and will
 yield previously cached output for reference. To be explicit about which version
-is running, let's check whether we're running with live data:
+is running, let's check whether you're running with live data:
 
 ```{r}
 dir.exists("nyc-taxi")
 ```
 
-## Getting started
+## Opening the dataset
 
-Because `dplyr` is not necessary for many Arrow workflows,
+Because dplyr is not necessary for many Arrow workflows,
 it is an optional (`Suggests`) dependency. So, to work with Datasets,
-we need to load both `arrow` and `dplyr`.
+you need to load both arrow and dplyr.
 
 ```{r}
 library(arrow, warn.conflicts = FALSE)
 library(dplyr, warn.conflicts = FALSE)
 ```
 
-The first step is to create our Dataset object, pointing at the directory of data.
+The first step is to create a Dataset object, pointing at the directory of data.
 
 ```{r, eval = file.exists("nyc-taxi")}
 ds <- open_dataset("nyc-taxi", partitioning = c("year", "month"))
 ```
 
-The default file format for `open_dataset()` is Parquet; if we had a directory
-of Arrow format files, we could include `format = "arrow"` in the call.
-Other supported formats include: `"feather"` (an alias for `"arrow"`, as Feather
-v2 is the Arrow file format), `"csv"`, `"tsv"` (for tab-delimited), and `"text"`
-for generic text-delimited files. For text files, you can pass any parsing
-options (`delim`, `quote`, etc.) to `open_dataset()` that you would otherwise
-pass to `read_csv_arrow()`.
+The file format for `open_dataset()` is controlled by the `format` parameter, 
+which has a default value of `"parquet"`.  If you had a directory
+of Arrow format files, you could instead specify `format = "arrow"` in the call.
+
+Other supported formats include: 
+
+* `"feather"` or `"ipc"` (aliases for `"arrow"`, as Feather v2 is the Arrow file format)
+* `"csv"` (comma-delimited files) and `"tsv"` (tab-delimited files)
+* `"text"` (generic text-delimited files - use the `delimiter` argument to specify which to use)
 
-The `partitioning` argument lets us specify how the file paths provide information
-about how the dataset is chunked into different files. Our files in this example
+For text files, you can pass any parsing options (`delim`, `quote`, etc.) to 
+`open_dataset()` that you would otherwise pass to `read_csv_arrow()`.

Review comment:
       I have no recollection of that but I'll have a search on JIRA and see what I can find



