fsaintjacques commented on a change in pull request #63:
URL: https://github.com/apache/arrow-site/pull/63#discussion_r439406667
##########
File path: faq.md
##########
@@ -24,32 +24,155 @@ limitations under the License.
 # Frequently Asked Questions
 
+## General
+
+### What *is* Arrow?
+
+Arrow is an open standard for how to represent columnar data in memory, along
+with libraries in many languages that implement that standard. The Arrow format
+allows different programs and runtimes, perhaps written in different languages,
+to share data efficiently using a set of rich data types (including nested
+and user-defined data types). The Arrow libraries make it easy to write such
+programs by sparing the programmer from implementing low-level details of the
+Arrow format.
+
+Arrow additionally defines a streaming format and a file format for
+inter-process communication (IPC), based on the in-memory format. It also
+defines a generic client-server RPC mechanism (Arrow Flight), based on the
+IPC format and implemented on top of the gRPC framework. <!-- TODO links -->
+
+### Why create a new standard?
+
+<!-- Fill this in -->
+
+## Project status
+
 ### How stable is the Arrow format? Is it safe to use in my application?
 
+<!-- Revise this -->
+
+The Arrow *in-memory format* is considered stable, and we intend to make only
+backwards-compatible changes, such as additional data types. It is already
+used by many applications, and you can trust that compatibility will not be
+broken.
+
+The Arrow *file format* (based on the Arrow IPC mechanism) is not recommended
+for long-term disk persistence of data; that said, it is perfectly acceptable
+to write Arrow memory to disk for purposes of memory mapping and caching.
+
+We encourage people to start building Arrow-based in-memory computing
+applications now, and to choose a suitable file format for disk storage
+if necessary. The Arrow libraries include adapters for several file formats,
+including Parquet, ORC, CSV, and JSON.
+
+### How stable are the Arrow libraries?
+
+We refer you to the [implementation matrix](https://github.com/apache/arrow/blob/master/docs/source/status.rst).
+
+## Getting started
+
+### Where can I get Arrow libraries?
+
+Arrow libraries for many languages are available through the usual package
+managers. See the [install]({{ site.baseurl }}/install/) page for specifics.
 
-The Arrow in-memory format is considered stable, and we intend to make only backwards-compatible changes, such as additional data types. We do not yet recommend the Arrow file format for long-term disk persistence of data; that said, it is perfectly acceptable to write Arrow memory to disk for purposes of memory mapping and caching.
+## Getting involved
 
-We encourage people to start building Arrow-based in-memory computing applications now, and choose a suitable file format for disk storage if necessary. The Arrow libraries include adapters for several file formats, including Parquet, ORC, CSV, and JSON.
+### I have some questions. How can I get help?
+
+The [Arrow mailing lists]({{ site.baseurl }}/community/) are the best place
+to ask questions. Don't be shy--we're here to help.
+
+### I tried to use Arrow and it didn't work. Can you fix it?
+
+Hopefully! Please make a detailed bug report--that is itself a valuable
+contribution to the project.
+See the [contribution guidelines]({{ site.baseurl }}/docs/developers/contributing.html)
+for how to make a report.
+
+### Arrow looks great and I'd totally use it if it only did X. When will it be done?
+
+We use [JIRA](https://issues.apache.org/jira/browse/ARROW) as our issue tracker.
+Search for an issue that matches your need. If you find one, feel free to
+comment on it and describe your use case--that will help whoever picks up
+the task. If you don't find one, create it.
+
+Ultimately, Arrow is software written by and for the community. If you don't
+see someone else in the community working on your issue, the best way to get
+it done is to pitch in yourself. We're more than willing to help you contribute
+successfully to the project.
+
+### How can I report a security vulnerability?
+
+Please send an email to [[email protected]](mailto:[email protected]).
+See the [security]({{ site.baseurl }}/security/) page for more.
+
+## Relation to other projects
 
 ### What is the difference between Apache Arrow and Apache Parquet?
 
+<!-- Revise this -->
+
+Parquet is a storage format designed for maximum space efficiency, using
+advanced compression and encoding techniques. It is ideal when you want to
+minimize disk usage while storing gigabytes of data, or perhaps more.
+This efficiency comes at the cost of relatively expensive reading into memory,
+as Parquet data cannot be operated on directly but must be decoded in
+large chunks.
+
+Conversely, Arrow is an in-memory format meant for direct and efficient use
+in computations. Arrow data is not compressed (or only lightly so, when
+using dictionary encoding) but laid out in a natural format for the CPU,
+so that data can be accessed at arbitrary places at full speed.
+
+Therefore, Arrow and Parquet are not competitors: they complement each other
+and are commonly used together in applications. Storing your data on disk
+using Parquet, and reading it into memory in the Arrow format, will allow
+you to make the most of your computing hardware.
 
-In short, Parquet files are designed for disk storage, while Arrow is designed for in-memory use, but you can put it on disk and then memory-map later. Arrow and Parquet are intended to be compatible with each other and used together in applications.
+### What about "Arrow files" then?
 
-Parquet is a columnar file format for data serialization. Reading a Parquet file requires decompressing and decoding its contents into some kind of in-memory data structure. It is designed to be space/IO-efficient at the expensive CPU utilization for decoding. It does not provide any data structures for in-memory computing. Parquet is a streaming format which must be decoded from start-to-end; while some "index page" facilities have been added to the storage format recently, random access operations are generally costly.
+Apache Arrow defines an inter-process communication (IPC) mechanism to
+transfer a collection of Arrow columnar arrays (called a "record batch").
+It can be used synchronously between processes using the Arrow "stream format",
+or asynchronously by first persisting data on storage using the Arrow "file format".
 
-Arrow on the other hand is first and foremost a library providing columnar data structures for *in-memory computing*. When you read a Parquet file, you can decompress and decode the data *into* Arrow columnar data structures so that you can then perform analytics in-memory on the decoded data. The Arrow columnar format has some nice properties: random access is O(1) and each value cell is next to the previous and following one in memory, so it's efficient to iterate over.
+The Arrow IPC mechanism is based on the Arrow in-memory format, such that
+there is no translation necessary between the on-disk representation and
+the in-memory representation. Therefore, performing analytics on an Arrow
+IPC file can use memory-mapping and pay effectively zero cost.

Review comment:

```suggestion
   IPC file can use memory-mapping, avoiding any deserialization cost and extra copies.
```

You still need to copy from disk to memory.
