oliviermeslin commented on code in PR #40982:
URL: https://github.com/apache/arrow/pull/40982#discussion_r1567018761


##########
r/vignettes/informal_introduction.Rmd:
##########
@@ -0,0 +1,297 @@
+---
+title: Getting started with Apache Arrow and R  
+description: >
+  An informal introduction to how Apache Arrow works, for R users
+output: rmarkdown::html_vignette
+---
+
+
+## What is this vignette and is it for you?
+
+This vignette provides an overview of how Arrow works in plain, 
non-technical language. It aims to give you some simple and useful intuitions 
before you dive deeper into the documentation. It is specifically intended for 
newcomers with a limited background in computer science and hence avoids most 
technical terms. This vignette assumes that you have some experience with R, 
that you are familiar with the `dplyr` syntax and that you know how to use a 
Parquet file.
+
+Some technical points are deliberately simplified, so if you find an apparent 
contradiction between this vignette and the rest of the documentation, please 
trust the documentation rather than this vignette.
+
+## Introducing Apache Arrow
+
+This section introduces the Apache Arrow project.
+
+### What is Apache Arrow?
+
+[Apache `Arrow`](https://arrow.apache.org/) is an *open-source* project that 
offers two things: 
+
+- Apache Arrow defines a standardized way to organize data in memory (called 
_Apache Arrow Columnar Format_). You do not need to know much about this 
_Columnar Format_ to use Arrow with R, except that it is very efficient 
(processing is fast) and interoperable (meaning for instance that both R and 
Python can access the same data, without converting the data from one format to 
another).
+- Apache Arrow offers a `C++` implementation of this _Columnar Format_: the 
`C++` library called `libarrow`.
+
+What about the Arrow R package then? This package simply makes it possible to 
use the `libarrow` library within `R`. Keep in mind that there are other 
similar interfaces for using `libarrow` with other programming languages: 
Python, Java, JavaScript, Julia, and so on. But __no matter what programming 
language you choose for using Arrow, remember that under the hood you are using 
exactly the same tool: the C++ libarrow library__.
+
+### What's so special about Arrow?
+
+Arrow has five distinctive features:
+
+- __Columnar Memory Format__: in Arrow, data is organized in columns rather 
than in rows (hence the "columnar format"). In practice, it means that all 
values of the first column are stored contiguously in memory, followed by all 
values of the second column, and so on. This columnar format speeds up data 
processing. Imagine that you want to calculate the mean of a variable: you can 
directly access the block of memory containing the entire column and get the 
result, no matter how many columns you have in your dataset.
+
+- __Easy use with Parquet files__: Arrow is optimized to work well with data 
stored in Parquet files.
+
+- __Ability to process very large datasets__: Arrow is able to process very 
large amounts of data, even datasets that are too large to fit in the memory of 
your computer.
+
+- __Interoperability__: Arrow is designed to be interoperable between several 
programming languages such as `R`, Python, Java, C++, etc. This means that data 
can be exchanged between different programming languages without converting the 
data from one format to another, resulting in significant performance gains.
+
+- __*Lazy Evaluation*__: when you give instructions to Arrow, Arrow stores 
them but does not run them, unless you explicitly ask it to do so (more on this 
below).

Review Comment:
   @eitsupi : you are right. This vignette reflects my own, imperfect 
understanding of how arrow works, and clearly I do not completely understand 
how all components interact. I tried to reformulate. Let me know if you want to 
reformulate it yourself.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.