oliviermeslin commented on code in PR #40982: URL: https://github.com/apache/arrow/pull/40982#discussion_r1567018761
##########
r/vignettes/informal_introduction.Rmd:
##########
@@ -0,0 +1,297 @@
+---
+title: Getting started with Apache Arrow and R
+description: >
+  An informal introduction of the functioning of Apache Arrow for R users
+output: rmarkdown::html_vignette
+---
+
+
+## What is this vignette and is it for you?
+
+This vignette provides an overview of how Arrow works in a plain, non-technical language. It aims at giving some simple and useful intuitions, before diving deeper in the documentation. It is specifically intended for newcomers with a limited background in computer science and hence avoids most technical terms. This vignette assumes that you have some experience with R, that you are familiar with the `dplyr` syntax and that you know how to use a Parquet file.
+
+Some technical points are deliberately simplified to keep things simple, so in case you find an apparent contradiction between this vignette and the rest of the documentation, please trust the documentation rather than this vignette.
+
+## Introducing Apache Arrow
+
+This section introduces the Apache Arrow project.
+
+### What is Apache Arrow?
+
+[Apache `Arrow`](https://arrow.apache.org/) is an *open-source* project that offers two things:
+
+- Apache Arrow defines a standardized way to organize data in memory (called _Apache Arrow Columnar Format_). You do not need to know much about this _Columnar Format_ to use Arrow with R, except that it is very efficient (processing is fast) and interoperable (meaning for instance that both R and Python can access the same data, without converting the data from one format to another).
+- Apache Arrow offers a `C++` implementation of this _Columnar Format_: the `C++` library called `libarrow`.
+
+What about the Arrow R package then? This package simply makes it possible to use the `libarrow` library within `R`. Keep in mind that there are other similar interfaces for using `libarrow` with other programming languages: in Python, Java, Javascript, Julia, and so on. But __no matter what programming language you choose for using Arrow, remember that under the hood you are using exactly the same tool: the C++ libarrow library__.
+
+### What's so special about Arrow?
+
+Arrow has five distinctive features:
+
+- __Columnar Memory Format__: in Arrow, data is organized in columns rather than in rows (hence the "columnar format"). In practice, it means that all values of the first column are stored contiguously in memory, followed by all values of the second column, and so on. This columnar format speeds up data processing. Imagine that you want to calculate the mean of a variable: you can directly access the block of memory containing the entire column and get the result, no matter how many columns you have in your dataset.
+
+- __Easy use with Parquet files__: Arrow is optimized to work well with data stored in Parquet files.
+
+- __Ability to process very large datasets__: Arrow is able to process very large amounts of data, even datasets that are too large to fit in the memory of your computer.
+
+- __Interoperability__: Arrow is designed to be interoperable between several programming languages such as `R`, Python, Java, C++, etc. This means that data can be exchanged between different programming languages without converting the data from one format to another, resulting in significant performance gains.
+
+- __*Lazy Evaluation*__: when you give instructions to Arrow, Arrow stores them but does not run them, unless you explicitly ask it to do so (more on this below).

Review Comment:
   @eitsupi: you are right. This vignette reflects my own, imperfect understanding of how arrow works, and clearly I do not completely understand how all components interact. I tried to reformulate. Let me know if you would like to reformulate it yourself.
