wesm commented on a change in pull request #63:
URL: https://github.com/apache/arrow-site/pull/63#discussion_r445134010
##########
File path: _includes/header.html
##########
@@ -50,22 +33,44 @@
</a>
<div class="dropdown-menu"
aria-labelledby="navbarDropdownDocumentation">
<a class="dropdown-item" href="{{ site.baseurl }}/docs">Project Docs</a>
- <a class="dropdown-item" href="{{ site.baseurl }}/docs/python">Python</a>
+ <a class="dropdown-item" href="{{ site.baseurl }}/docs/format/Columnar.html">Specification</a>
+ <hr/>
+ <a class="dropdown-item" href="{{ site.baseurl }}/docs/c_glib">C GLib</a>
<a class="dropdown-item" href="{{ site.baseurl }}/docs/cpp">C++</a>
+ <a class="dropdown-item" href="https://github.com/apache/arrow/blob/master/csharp/README.md">C#</a>
+ <a class="dropdown-item" href="https://godoc.org/github.com/apache/arrow/go/arrow">Go</a>
<a class="dropdown-item" href="{{ site.baseurl }}/docs/java">Java</a>
- <a class="dropdown-item" href="{{ site.baseurl }}/docs/c_glib">C GLib</a>
<a class="dropdown-item" href="{{ site.baseurl }}/docs/js">JavaScript</a>
+ <a class="dropdown-item" href="https://github.com/apache/arrow/blob/master/matlab/README.md">MATLAB</a>
+ <a class="dropdown-item" href="{{ site.baseurl }}/docs/python">Python</a>
<a class="dropdown-item" href="{{ site.baseurl }}/docs/r">R</a>
+ <a class="dropdown-item" href="https://github.com/apache/arrow/blob/master/ruby/README.md">Ruby</a>
+ <a class="dropdown-item" href="https://docs.rs/crate/arrow/">Rust</a>
+ </div>
+ </li>
+ <li class="nav-item dropdown">
+ <a class="nav-link dropdown-toggle" href="#"
+ id="navbarDropdownCommunity" role="button" data-toggle="dropdown"
+ aria-haspopup="true" aria-expanded="false">
+ Community
+ </a>
+ <div class="dropdown-menu" aria-labelledby="navbarDropdownCommunity">
+ <a class="dropdown-item" href="{{ site.baseurl }}/community/">Mailing Lists</a>
Review comment:
"Communications"?
##########
File path: _includes/header.html
##########
@@ -50,22 +33,44 @@
</a>
<div class="dropdown-menu"
aria-labelledby="navbarDropdownDocumentation">
<a class="dropdown-item" href="{{ site.baseurl }}/docs">Project Docs</a>
- <a class="dropdown-item" href="{{ site.baseurl }}/docs/python">Python</a>
+ <a class="dropdown-item" href="{{ site.baseurl }}/docs/format/Columnar.html">Specification</a>
Review comment:
"Columnar Format"?
##########
File path: community.md
##########
@@ -0,0 +1,73 @@
+---
+layout: default
+title: Apache Arrow Community
+description: Links and resources for participating in Apache Arrow
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Apache Arrow Community
+
+We welcome participation from everyone and encourage you to join us, ask
questions, and get involved.
+
+All participation in the Apache Arrow project is governed by the Apache
Software Foundation's [code of
conduct](https://www.apache.org/foundation/policies/conduct.html).
+
+## Questions?
+
+### Mailing lists
+
+These arrow.apache.org mailing lists are for project discussion:
+
+<ul>
+ <li> <code>user@</code> is for questions on using Apache Arrow libraries {%
include mailing_list_links.html list="user" %} </li>
+ <li> <code>dev@</code> is for discussions about contributing to the project
development {% include mailing_list_links.html list="dev" %} </li>
+</ul>
+
+When emailing one of the lists, you may want to prefix the subject line with
one or more tags, like `[C++] why did this segfault?`, `[Python] trouble with
wheels`, etc., so that the appropriate people in the community notice the
message.
+
+You may also wish to subscript to these lists, which capture some activity
streams:
Review comment:
subscribe
##########
File path: faq.md
##########
@@ -24,32 +24,155 @@ limitations under the License.
# Frequently Asked Questions
+## General
+
+### What *is* Arrow?
+
+Arrow is an open standard for how to represent columnar data in memory, along
+with libraries in many languages that implement that standard. The Arrow
format
+allows different programs and runtimes, perhaps written in different languages,
+to share data efficiently using a set of rich data types (included nested
+and user-defined data types). The Arrow libraries make it easy to write such
+programs, by sparing the programmer from implementing low-level details of the
+Arrow format.
+
+Arrow additionally defines a streaming format and a file format for
+inter-process communication (IPC), based on the in-memory format. It also
+defines a generic client-server RPC mechanism (Arrow Flight), based on the
+IPC format, and implemented on top of the gRPC framework. <!-- TODO links -->
+
+### Why create a new standard?
+
+<!-- Fill this in -->
+
+## Project status
+
### How stable is the Arrow format? Is it safe to use in my application?
+<!-- Revise this -->
+
+The Arrow *in-memory format* is considered stable, and we intend to make only
+backwards-compatible changes, such as additional data types. It is used by
+many applications already, and you can trust that compatibility will not be
+broken.
Review comment:
Let's link to the versioning backward/forward compatibility guarantees
in the docs
##########
File path: faq.md
##########
@@ -24,32 +24,155 @@ limitations under the License.
# Frequently Asked Questions
+## General
+
+### What *is* Arrow?
+
+Arrow is an open standard for how to represent columnar data in memory, along
+with libraries in many languages that implement that standard. The Arrow
format
+allows different programs and runtimes, perhaps written in different languages,
+to share data efficiently using a set of rich data types (included nested
+and user-defined data types). The Arrow libraries make it easy to write such
+programs, by sparing the programmer from implementing low-level details of the
+Arrow format.
Review comment:
Here's a reframing -- I have been encouraging us to move away from
creating a false equivalence between "Apache Arrow The Project" and the "Arrow
Columnar Format". So anyplace where someone might say "Arrow _is_ the columnar
format" we should correct them to say that "Arrow _contains_ a columnar
format". Please edit / wordsmith as desired
Apache Arrow is a software development platform for building high
performance applications that process and transport large data sets. It is
designed to improve both the performance of analytical algorithms and the
efficiency of moving data from one system (or programming language) to another.

A critical component of Apache Arrow is its **in-memory columnar format**, a
standardized, language-agnostic data structure specification for representing
structured, table-like datasets in memory. This data format has a rich data
type system (including nested and user-defined data types) designed to support
the needs of analytic database systems, data frame libraries, and more. The
project contains many implementations of the Arrow columnar format, along with
utilities for reading and writing it to many common storage formats.

We do not anticipate that many third-party projects will choose to implement
the Arrow columnar format themselves, instead choosing to depend on one of the
official libraries. For projects that want to implement a small subset of the
format, we have created some tools (like a C data interface) to assist with
interoperability with the official Arrow libraries.

The Arrow libraries contain many software components that assist with
systems problems related to getting data in and out of remote storage systems
and moving Arrow-formatted data over network interfaces. Some of these
components can be used even in scenarios where the columnar format is not used
at all.

Lastly, alongside software that helps with data access and IO-related
issues, there are libraries of algorithms for performing analytical operations
or queries against Arrow datasets.
##########
File path: faq.md
##########
@@ -24,32 +24,155 @@ limitations under the License.
# Frequently Asked Questions
+## General
+
+### What *is* Arrow?
+
+Arrow is an open standard for how to represent columnar data in memory, along
+with libraries in many languages that implement that standard. The Arrow
format
+allows different programs and runtimes, perhaps written in different languages,
+to share data efficiently using a set of rich data types (included nested
+and user-defined data types). The Arrow libraries make it easy to write such
+programs, by sparing the programmer from implementing low-level details of the
+Arrow format.
+
+Arrow additionally defines a streaming format and a file format for
+inter-process communication (IPC), based on the in-memory format. It also
+defines a generic client-server RPC mechanism (Arrow Flight), based on the
+IPC format, and implemented on top of the gRPC framework. <!-- TODO links -->
+
+### Why create a new standard?
+
+<!-- Fill this in -->
+
+## Project status
+
### How stable is the Arrow format? Is it safe to use in my application?
+<!-- Revise this -->
+
+The Arrow *in-memory format* is considered stable, and we intend to make only
+backwards-compatible changes, such as additional data types. It is used by
+many applications already, and you can trust that compatibility will not be
+broken.
+
+The Arrow *file format* (based on the Arrow IPC mechanism) is not recommended
+for long-term disk persistence of data; that said, it is perfectly acceptable
+to write Arrow memory to disk for purposes of memory mapping and caching.
Review comment:
I think this para can be removed as of 1.0.0
##########
File path: _layouts/home.html
##########
@@ -0,0 +1,21 @@
+{% include top.html %}
+
+<body class="wrap">
+ <header>
+ {% include header.html %}
+ </header>
+ <div class="big-arrow-bg">
+ <div class="container p-lg-4 centered">
+ <img src="{{ site.baseurl }}/img/arrow-inverse.png" style="max-width:
80%;"/>
Review comment:
Smaller is also better imho
##########
File path: faq.md
##########
@@ -24,32 +24,155 @@ limitations under the License.
# Frequently Asked Questions
+## General
+
+### What *is* Arrow?
+
+Arrow is an open standard for how to represent columnar data in memory, along
+with libraries in many languages that implement that standard. The Arrow
format
+allows different programs and runtimes, perhaps written in different languages,
+to share data efficiently using a set of rich data types (included nested
+and user-defined data types). The Arrow libraries make it easy to write such
+programs, by sparing the programmer from implementing low-level details of the
+Arrow format.
+
+Arrow additionally defines a streaming format and a file format for
+inter-process communication (IPC), based on the in-memory format. It also
+defines a generic client-server RPC mechanism (Arrow Flight), based on the
+IPC format, and implemented on top of the gRPC framework. <!-- TODO links -->
+
+### Why create a new standard?
+
+<!-- Fill this in -->
+
+## Project status
+
### How stable is the Arrow format? Is it safe to use in my application?
+<!-- Revise this -->
+
+The Arrow *in-memory format* is considered stable, and we intend to make only
+backwards-compatible changes, such as additional data types. It is used by
+many applications already, and you can trust that compatibility will not be
+broken.
+
+The Arrow *file format* (based on the Arrow IPC mechanism) is not recommended
+for long-term disk persistence of data; that said, it is perfectly acceptable
+to write Arrow memory to disk for purposes of memory mapping and caching.
+
+We encourage people to start building Arrow-based in-memory computing
+applications now, and choose a suitable file format for disk storage
+if necessary. The Arrow libraries include adapters for several file formats,
+including Parquet, ORC, CSV, and JSON.
Review comment:
I don't think we need to hedge regarding people storing Arrow data on
disk starting with 1.0.0. We should state explicitly here, however, that we
don't intend for Arrow to be a replacement for Parquet (an exceedingly common
question) and, where relevant, that the columnar format makes trade-offs to
support the performance requirements of in-memory analytics over purely
file-storage considerations. Parquet is not a "runtime in-memory format":
file formats almost always have to be deserialized into some in-memory data
structure for processing, and we intend for Arrow to be that in-memory data
structure
##########
File path: faq.md
##########
@@ -24,32 +24,155 @@ limitations under the License.
# Frequently Asked Questions
+## General
+
+### What *is* Arrow?
+
+Arrow is an open standard for how to represent columnar data in memory, along
+with libraries in many languages that implement that standard. The Arrow
format
+allows different programs and runtimes, perhaps written in different languages,
+to share data efficiently using a set of rich data types (included nested
+and user-defined data types). The Arrow libraries make it easy to write such
+programs, by sparing the programmer from implementing low-level details of the
+Arrow format.
+
+Arrow additionally defines a streaming format and a file format for
+inter-process communication (IPC), based on the in-memory format. It also
+defines a generic client-server RPC mechanism (Arrow Flight), based on the
+IPC format, and implemented on top of the gRPC framework. <!-- TODO links -->
+
+### Why create a new standard?
+
+<!-- Fill this in -->
+
+## Project status
+
### How stable is the Arrow format? Is it safe to use in my application?
+<!-- Revise this -->
+
+The Arrow *in-memory format* is considered stable, and we intend to make only
+backwards-compatible changes, such as additional data types. It is used by
+many applications already, and you can trust that compatibility will not be
+broken.
+
+The Arrow *file format* (based on the Arrow IPC mechanism) is not recommended
+for long-term disk persistence of data; that said, it is perfectly acceptable
+to write Arrow memory to disk for purposes of memory mapping and caching.
+
+We encourage people to start building Arrow-based in-memory computing
+applications now, and choose a suitable file format for disk storage
+if necessary. The Arrow libraries include adapters for several file formats,
+including Parquet, ORC, CSV, and JSON.
+
+### How stable are the Arrow libraries?
+
+We refer you to the [implementation
matrix](https://github.com/apache/arrow/blob/master/docs/source/status.rst).
+
+## Getting started
+
+### Where can I get Arrow libraries?
+
+Arrow libraries for many languages are available through the usual package
+managers. See the [install]({{ site.baseurl }}/install/) page for specifics.
-The Arrow in-memory format is considered stable, and we intend to make only
backwards-compatible changes, such as additional data types. We do not yet
recommend the Arrow file format for long-term disk persistence of data; that
said, it is perfectly acceptable to write Arrow memory to disk for purposes of
memory mapping and caching.
+## Getting involved
-We encourage people to start building Arrow-based in-memory computing
applications now, and choose a suitable file format for disk storage if
necessary. The Arrow libraries include adapters for several file formats,
including Parquet, ORC, CSV, and JSON.
+### I have some questions. How can I get help?
+
+The [Arrow mailing lists]({{ site.baseurl }}/community/) are the best place
+to ask questions. Don't be shy--we're here to help.
+
+### I tried to use Arrow and it didn't work. Can you fix it?
+
+Hopefully! Please make a detailed bug report--that's a valuable contribution
+to the project itself.
+See the [contribution guidelines]({{ site.baseurl
}}/docs/developers/contributing.html)
+for how to make a report.
+
+### Arrow looks great and I'd totally use it if it only did X. When will it be
done?
+
+We use [JIRA](https://issues.apache.org/jira/browse/ARROW) for our issue
tracker.
+Search for an issue that matches your need. If you find one, feel free to
+comment on it and describe your use case--that will help whoever picks up
+the task. If you don't find one, make it.
+
+Ultimately, Arrow is software written by and for the community. If you don't
+see someone else in the community working on your issue, the best way to get
+it done is to pitch in yourself. We're more than willing to help you contribute
+successfully to the project.
+
+### How can I report a security vulnerability?
+
+Please send an email to
[[email protected]](mailto:[email protected]).
+See the [security]({{ site.baseurl }}/security/) page for more.
+
+## Relation to other projects
### What is the difference between Apache Arrow and Apache Parquet?
+<!-- Revise this -->
+
+Parquet is a storage format designed for maximum space efficiency, using
+advanced compression and encoding techniques. It is ideal when wanting to
+minimize disk usage while storing gigabytes of data, or perhaps more.
+This efficiency comes at the cost of relatively expensive reading into memory,
+as Parquet data cannot be directly operated on but must be decoded in
+large chunks.
+
+Conversely, Arrow is an in-memory format meant for direct and efficient use
+for computational purposes. Arrow data is not compressed (or only lightly so,
+when using dictionary encoding) but laid out in natural format for the CPU,
+so that data can be accessed at arbitrary places at full speed.
+
+Therefore, Arrow and Parquet are not competitors: they complement each other
+and are commonly used together in applications. Storing your data on disk
+using Parquet, and reading it into memory in the Arrow format, will allow
+you to make the most of your computing hardware.
-In short, Parquet files are designed for disk storage, while Arrow is designed
for in-memory use, but you can put it on disk and then memory-map later. Arrow
and Parquet are intended to be compatible with each other and used together in
applications.
+### What about "Arrow files" then?
-Parquet is a columnar file format for data serialization. Reading a Parquet
file requires decompressing and decoding its contents into some kind of
in-memory data structure. It is designed to be space/IO-efficient at the
expensive CPU utilization for decoding. It does not provide any data structures
for in-memory computing. Parquet is a streaming format which must be decoded
from start-to-end; while some "index page" facilities have been added to the
storage format recently, random access operations are generally costly.
+Apache Arrow defines an inter-process communication (IPC) mechanism to
+transfer a collection of Arrow columnar arrays (called a "record batch").
+It can be used synchronously between processes using the Arrow "stream format",
+or asynchronously by first persisting data on storage using the Arrow "file
format".
-Arrow on the other hand is first and foremost a library providing columnar
data structures for *in-memory computing*. When you read a Parquet file, you
can decompress and decode the data *into* Arrow columnar data structures so
that you can then perform analytics in-memory on the decoded data. The Arrow
columnar format has some nice properties: random access is O(1) and each value
cell is next to the previous and following one in memory, so it's efficient to
iterate over.
+The Arrow IPC mechanism is based on the Arrow in-memory format, such that
+there is no translation necessary between the on-disk representation and
+the in-memory representation. Therefore, performing analytics on an Arrow
+IPC file can use memory-mapping and pay effectively zero cost.
-What about "Arrow files" then? Apache Arrow defines a binary "serialization"
protocol for arranging a collection of Arrow columnar arrays (called a "record
batch") that can be used for messaging and interprocess communication. You can
put the protocol anywhere, including on disk, which can later be memory-mapped
or read into memory and sent elsewhere.
+Some things to keep in mind when comparing the Arrow IPC file format and the
+Parquet format:
-This Arrow protocol is designed so that you can "map" a blob of Arrow data
without doing any deserialization, so performing analytics on Arrow protocol
data on disk can use memory-mapping and pay effectively zero cost. The protocol
is used for many other things as well, such as streaming data between Spark SQL
and Python for running pandas functions against chunks of Spark SQL data (these
are called "pandas udfs").
+* Parquet is safe for long-term storage and archival purposes, meaning if
+ you write a file today, you can expect that any system that says they can
+ "read Parquet" will be able to read the file in 5 years or 10 years.
+ We are not yet making this assertion about long-term stability of the Arrow
+ format.
-In some applications, Parquet and Arrow can be used interchangeably for
on-disk data serialization. Some things to keep in mind:
+* Reading Parquet files generally requires expensive decoding, while reading
Review comment:
"expensive" is in the eye of the beholder. How about "requires
efficient, but relatively complex decoding"
##########
File path: faq.md
##########
@@ -24,32 +24,155 @@ limitations under the License.
# Frequently Asked Questions
+## General
+
+### What *is* Arrow?
+
+Arrow is an open standard for how to represent columnar data in memory, along
+with libraries in many languages that implement that standard. The Arrow
format
+allows different programs and runtimes, perhaps written in different languages,
+to share data efficiently using a set of rich data types (included nested
+and user-defined data types). The Arrow libraries make it easy to write such
+programs, by sparing the programmer from implementing low-level details of the
+Arrow format.
+
+Arrow additionally defines a streaming format and a file format for
+inter-process communication (IPC), based on the in-memory format. It also
+defines a generic client-server RPC mechanism (Arrow Flight), based on the
+IPC format, and implemented on top of the gRPC framework. <!-- TODO links -->
+
+### Why create a new standard?
+
+<!-- Fill this in -->
+
+## Project status
+
### How stable is the Arrow format? Is it safe to use in my application?
+<!-- Revise this -->
+
+The Arrow *in-memory format* is considered stable, and we intend to make only
+backwards-compatible changes, such as additional data types. It is used by
+many applications already, and you can trust that compatibility will not be
+broken.
+
+The Arrow *file format* (based on the Arrow IPC mechanism) is not recommended
+for long-term disk persistence of data; that said, it is perfectly acceptable
+to write Arrow memory to disk for purposes of memory mapping and caching.
+
+We encourage people to start building Arrow-based in-memory computing
+applications now, and choose a suitable file format for disk storage
+if necessary. The Arrow libraries include adapters for several file formats,
+including Parquet, ORC, CSV, and JSON.
+
+### How stable are the Arrow libraries?
+
+We refer you to the [implementation
matrix](https://github.com/apache/arrow/blob/master/docs/source/status.rst).
+
+## Getting started
+
+### Where can I get Arrow libraries?
+
+Arrow libraries for many languages are available through the usual package
+managers. See the [install]({{ site.baseurl }}/install/) page for specifics.
-The Arrow in-memory format is considered stable, and we intend to make only
backwards-compatible changes, such as additional data types. We do not yet
recommend the Arrow file format for long-term disk persistence of data; that
said, it is perfectly acceptable to write Arrow memory to disk for purposes of
memory mapping and caching.
+## Getting involved
-We encourage people to start building Arrow-based in-memory computing
applications now, and choose a suitable file format for disk storage if
necessary. The Arrow libraries include adapters for several file formats,
including Parquet, ORC, CSV, and JSON.
+### I have some questions. How can I get help?
+
+The [Arrow mailing lists]({{ site.baseurl }}/community/) are the best place
+to ask questions. Don't be shy--we're here to help.
+
+### I tried to use Arrow and it didn't work. Can you fix it?
+
+Hopefully! Please make a detailed bug report--that's a valuable contribution
+to the project itself.
+See the [contribution guidelines]({{ site.baseurl
}}/docs/developers/contributing.html)
+for how to make a report.
+
+### Arrow looks great and I'd totally use it if it only did X. When will it be
done?
+
+We use [JIRA](https://issues.apache.org/jira/browse/ARROW) for our issue
tracker.
+Search for an issue that matches your need. If you find one, feel free to
+comment on it and describe your use case--that will help whoever picks up
+the task. If you don't find one, make it.
+
+Ultimately, Arrow is software written by and for the community. If you don't
+see someone else in the community working on your issue, the best way to get
+it done is to pitch in yourself. We're more than willing to help you contribute
+successfully to the project.
+
+### How can I report a security vulnerability?
+
+Please send an email to
[[email protected]](mailto:[email protected]).
+See the [security]({{ site.baseurl }}/security/) page for more.
+
+## Relation to other projects
### What is the difference between Apache Arrow and Apache Parquet?
+<!-- Revise this -->
+
+Parquet is a storage format designed for maximum space efficiency, using
+advanced compression and encoding techniques. It is ideal when wanting to
+minimize disk usage while storing gigabytes of data, or perhaps more.
+This efficiency comes at the cost of relatively expensive reading into memory,
+as Parquet data cannot be directly operated on but must be decoded in
+large chunks.
+
+Conversely, Arrow is an in-memory format meant for direct and efficient use
+for computational purposes. Arrow data is not compressed (or only lightly so,
+when using dictionary encoding) but laid out in natural format for the CPU,
+so that data can be accessed at arbitrary places at full speed.
+
+Therefore, Arrow and Parquet are not competitors: they complement each other
+and are commonly used together in applications. Storing your data on disk
+using Parquet, and reading it into memory in the Arrow format, will allow
+you to make the most of your computing hardware.
-In short, Parquet files are designed for disk storage, while Arrow is designed
for in-memory use, but you can put it on disk and then memory-map later. Arrow
and Parquet are intended to be compatible with each other and used together in
applications.
+### What about "Arrow files" then?
-Parquet is a columnar file format for data serialization. Reading a Parquet
file requires decompressing and decoding its contents into some kind of
in-memory data structure. It is designed to be space/IO-efficient at the
expensive CPU utilization for decoding. It does not provide any data structures
for in-memory computing. Parquet is a streaming format which must be decoded
from start-to-end; while some "index page" facilities have been added to the
storage format recently, random access operations are generally costly.
+Apache Arrow defines an inter-process communication (IPC) mechanism to
+transfer a collection of Arrow columnar arrays (called a "record batch").
+It can be used synchronously between processes using the Arrow "stream format",
+or asynchronously by first persisting data on storage using the Arrow "file
format".
-Arrow on the other hand is first and foremost a library providing columnar
data structures for *in-memory computing*. When you read a Parquet file, you
can decompress and decode the data *into* Arrow columnar data structures so
that you can then perform analytics in-memory on the decoded data. The Arrow
columnar format has some nice properties: random access is O(1) and each value
cell is next to the previous and following one in memory, so it's efficient to
iterate over.
+The Arrow IPC mechanism is based on the Arrow in-memory format, such that
+there is no translation necessary between the on-disk representation and
+the in-memory representation. Therefore, performing analytics on an Arrow
+IPC file can use memory-mapping and pay effectively zero cost.
-What about "Arrow files" then? Apache Arrow defines a binary "serialization"
protocol for arranging a collection of Arrow columnar arrays (called a "record
batch") that can be used for messaging and interprocess communication. You can
put the protocol anywhere, including on disk, which can later be memory-mapped
or read into memory and sent elsewhere.
+Some things to keep in mind when comparing the Arrow IPC file format and the
+Parquet format:
-This Arrow protocol is designed so that you can "map" a blob of Arrow data
without doing any deserialization, so performing analytics on Arrow protocol
data on disk can use memory-mapping and pay effectively zero cost. The protocol
is used for many other things as well, such as streaming data between Spark SQL
and Python for running pandas functions against chunks of Spark SQL data (these
are called "pandas udfs").
+* Parquet is safe for long-term storage and archival purposes, meaning if
+ you write a file today, you can expect that any system that says they can
+ "read Parquet" will be able to read the file in 5 years or 10 years.
+ We are not yet making this assertion about long-term stability of the Arrow
+ format.
Review comment:
"We are not yet making this assertion about long-term stability of the
Arrow format."
--> "While the Arrow on-disk format is stable and will be readable by future
versions of the libraries, it is not intended for long-term archival storage."
##########
File path: faq.md
##########
@@ -24,32 +24,155 @@ limitations under the License.
# Frequently Asked Questions
+## General
+
+### What *is* Arrow?
+
+Arrow is an open standard for how to represent columnar data in memory, along
+with libraries in many languages that implement that standard. The Arrow
format
+allows different programs and runtimes, perhaps written in different languages,
+to share data efficiently using a set of rich data types (included nested
+and user-defined data types). The Arrow libraries make it easy to write such
+programs, by sparing the programmer from implementing low-level details of the
+Arrow format.
+
+Arrow additionally defines a streaming format and a file format for
+inter-process communication (IPC), based on the in-memory format. It also
+defines a generic client-server RPC mechanism (Arrow Flight), based on the
+IPC format, and implemented on top of the gRPC framework. <!-- TODO links -->
+
+### Why create a new standard?
+
+<!-- Fill this in -->
+
+## Project status
+
### How stable is the Arrow format? Is it safe to use in my application?
+<!-- Revise this -->
+
+The Arrow *in-memory format* is considered stable, and we intend to make only
+backwards-compatible changes, such as additional data types. It is used by
+many applications already, and you can trust that compatibility will not be
+broken.
+
+The Arrow *file format* (based on the Arrow IPC mechanism) is not recommended
+for long-term disk persistence of data; that said, it is perfectly acceptable
+to write Arrow memory to disk for purposes of memory mapping and caching.
+
+We encourage people to start building Arrow-based in-memory computing
+applications now, and choose a suitable file format for disk storage
+if necessary. The Arrow libraries include adapters for several file formats,
+including Parquet, ORC, CSV, and JSON.
+
+### How stable are the Arrow libraries?
+
+We refer you to the [implementation
matrix](https://github.com/apache/arrow/blob/master/docs/source/status.rst).
+
+## Getting started
+
+### Where can I get Arrow libraries?
+
+Arrow libraries for many languages are available through the usual package
+managers. See the [install]({{ site.baseurl }}/install/) page for specifics.
-The Arrow in-memory format is considered stable, and we intend to make only
backwards-compatible changes, such as additional data types. We do not yet
recommend the Arrow file format for long-term disk persistence of data; that
said, it is perfectly acceptable to write Arrow memory to disk for purposes of
memory mapping and caching.
+## Getting involved
-We encourage people to start building Arrow-based in-memory computing
applications now, and choose a suitable file format for disk storage if
necessary. The Arrow libraries include adapters for several file formats,
including Parquet, ORC, CSV, and JSON.
+### I have some questions. How can I get help?
+
+The [Arrow mailing lists]({{ site.baseurl }}/community/) are the best place
+to ask questions. Don't be shy--we're here to help.
+
+### I tried to use Arrow and it didn't work. Can you fix it?
+
+Hopefully! Please make a detailed bug report--that's a valuable contribution
+to the project itself.
+See the [contribution guidelines]({{ site.baseurl
}}/docs/developers/contributing.html)
+for how to make a report.
+
+### Arrow looks great and I'd totally use it if it only did X. When will it be
done?
+
+We use [JIRA](https://issues.apache.org/jira/browse/ARROW) for our issue
tracker.
+Search for an issue that matches your need. If you find one, feel free to
+comment on it and describe your use case--that will help whoever picks up
+the task. If you don't find one, make it.
+
+Ultimately, Arrow is software written by and for the community. If you don't
+see someone else in the community working on your issue, the best way to get
+it done is to pitch in yourself. We're more than willing to help you contribute
+successfully to the project.
+
+### How can I report a security vulnerability?
+
+Please send an email to
[[email protected]](mailto:[email protected]).
+See the [security]({{ site.baseurl }}/security/) page for more.
+
+## Relation to other projects
### What is the difference between Apache Arrow and Apache Parquet?
+<!-- Revise this -->
+
+Parquet is a storage format designed for maximum space efficiency, using
+advanced compression and encoding techniques. It is ideal when you want to
+minimize disk usage while storing gigabytes of data, or perhaps more.
+This efficiency comes at the cost of relatively expensive reading into memory,
+as Parquet data cannot be directly operated on but must be decoded in
+large chunks.
+
+Conversely, Arrow is an in-memory format meant for direct and efficient use
+for computational purposes. Arrow data is not compressed (or only lightly so,
+when using dictionary encoding) but laid out in a natural format for the CPU,
+so that data can be accessed at arbitrary places at full speed.
+
+Therefore, Arrow and Parquet are not competitors: they complement each other
+and are commonly used together in applications. Storing your data on disk
+using Parquet, and reading it into memory in the Arrow format, will allow
+you to make the most of your computing hardware.
-In short, Parquet files are designed for disk storage, while Arrow is designed
for in-memory use, but you can put it on disk and then memory-map later. Arrow
and Parquet are intended to be compatible with each other and used together in
applications.
+### What about "Arrow files" then?
-Parquet is a columnar file format for data serialization. Reading a Parquet
file requires decompressing and decoding its contents into some kind of
in-memory data structure. It is designed to be space/IO-efficient at the
expensive CPU utilization for decoding. It does not provide any data structures
for in-memory computing. Parquet is a streaming format which must be decoded
from start-to-end; while some "index page" facilities have been added to the
storage format recently, random access operations are generally costly.
+Apache Arrow defines an inter-process communication (IPC) mechanism to
+transfer a collection of Arrow columnar arrays (called a "record batch").
+It can be used synchronously between processes using the Arrow "stream format",
+or asynchronously by first persisting data on storage using the Arrow "file
format".
-Arrow on the other hand is first and foremost a library providing columnar
data structures for *in-memory computing*. When you read a Parquet file, you
can decompress and decode the data *into* Arrow columnar data structures so
that you can then perform analytics in-memory on the decoded data. The Arrow
columnar format has some nice properties: random access is O(1) and each value
cell is next to the previous and following one in memory, so it's efficient to
iterate over.
+The Arrow IPC mechanism is based on the Arrow in-memory format, such that
+there is no translation necessary between the on-disk representation and
+the in-memory representation. Therefore, performing analytics on an Arrow
+IPC file can use memory-mapping and pay effectively zero cost.
-What about "Arrow files" then? Apache Arrow defines a binary "serialization"
protocol for arranging a collection of Arrow columnar arrays (called a "record
batch") that can be used for messaging and interprocess communication. You can
put the protocol anywhere, including on disk, which can later be memory-mapped
or read into memory and sent elsewhere.
+Some things to keep in mind when comparing the Arrow IPC file format and the
+Parquet format:
-This Arrow protocol is designed so that you can "map" a blob of Arrow data
without doing any deserialization, so performing analytics on Arrow protocol
data on disk can use memory-mapping and pay effectively zero cost. The protocol
is used for many other things as well, such as streaming data between Spark SQL
and Python for running pandas functions against chunks of Spark SQL data (these
are called "pandas udfs").
+* Parquet is safe for long-term storage and archival purposes, meaning if
+  you write a file today, you can expect that any system that says it can
+ "read Parquet" will be able to read the file in 5 years or 10 years.
+ We are not yet making this assertion about long-term stability of the Arrow
+ format.
-In some applications, Parquet and Arrow can be used interchangeably for
on-disk data serialization. Some things to keep in mind:
+* Reading Parquet files generally requires expensive decoding, while reading
+ Arrow IPC files is just a matter of transferring raw bytes from the storage
+ hardware.
Review comment:
Instead of "just a matter of transferring raw bytes from the storage
hardware." how about the more precise statement "reading Arrow IPC files does
not involve any decoding because the on-disk representation is the same as the
in-memory representation."
##########
File path: faq.md
##########
@@ -24,32 +24,155 @@ limitations under the License.
+Arrow additionally defines a streaming format and a file format for
+inter-process communication (IPC), based on the in-memory format. It also
+defines a generic client-server RPC mechanism (Arrow Flight), based on the
+IPC format, and implemented on top of the gRPC framework. <!-- TODO links -->
Review comment:
perhaps merge this with some of the thoughts above
##########
File path: faq.md
##########
@@ -24,32 +24,155 @@ limitations under the License.
-* Parquet is intended for "archival" purposes, meaning if you write a file
today, we expect that any system that says they can "read Parquet" will be able
to read the file in 5 years or 7 years. We are not yet making this assertion
about long-term stability of the Arrow format.
-* Parquet is generally a lot more expensive to read because it must be decoded
into some other data structure. Arrow protocol data can simply be memory-mapped.
-* Parquet files are often much smaller than Arrow-protocol-on-disk because of
the data encoding schemes that Parquet uses. If your disk storage or network is
slow, Parquet may be a better choice.
+* Parquet files are often much smaller than Arrow IPC files because of the
+ elaborate encoding schemes that Parquet uses. If your disk storage or network
+ is slow, Parquet may be a better choice even for short-term storage or
caching.
+
+### What about the "Feather" file format?
+
+The Feather v1 format started as a separate specification, but the Feather v2
+format is just another, easier-to-remember name for the Arrow IPC file format.
### How does Arrow relate to Flatbuffers?
-Flatbuffers is a domain-agnostic low-level building block for binary data
formats. It cannot be used directly for data analysis tasks without a lot of
manual scaffolding. Arrow is a data layer aimed directly at the needs of data
analysis, providing elaborate data types (including extensible logical types),
built-in support for "null" values (a.k.a "N/A"), and an expanding toolbox of
I/O and computing facilities.
+Flatbuffers is a low-level building block for binary data serialization.
+It is not suited to the representation of large, structured, homogeneous
+data, and does not sit at the right abstraction layer for data analysis tasks.
+
+Arrow is a data layer aimed directly at the needs of data analysis, providing
+elaborate data types (including extensible logical types), built-in support
Review comment:
Use a more neutral word than "elaborate". How about, "providing a
comprehensive collection of data types required for analytics" or something
similar
##########
File path: index.html
##########
@@ -1,72 +1,62 @@
---
-layout: default
+layout: home
---
-<div class="jumbotron">
- <h1>Apache Arrow</h1>
- <p class="lead">A cross-language development platform for in-memory
data</p>
- <p>
- <a class="btn btn-lg btn-success" style="white-space: normal;"
href="mailto:[email protected]" role="button">Join Mailing List</a>
- <a class="btn btn-lg btn-primary" style="white-space: normal;"
href="{{ site.baseurl }}/install/" role="button">Install
({{site.data.versions['current'].number}} Release -
{{site.data.versions['current'].date}})</a>
- </p>
-</div>
-<h5>
- Interested in contributing?
- <small class="text-muted">Join the <a
href="http://mail-archives.apache.org/mod_mbox/arrow-dev/"><strong>mailing
list</strong></a> or check out the <a
href="https://cwiki.apache.org/confluence/display/ARROW"><strong>developer
wiki</strong></a>.</small>
-</h5>
-<h5>
- <a href="{{ site.baseurl }}/blog/"><strong>See Latest News</strong></a>
-</h5>
-<p>
- {{ site.description }}
-</p>
-<hr />
+<h1>What is Arrow?</h1>
<div class="row">
<div class="col-lg-4">
- <h2 class="mt-3">Fast</h2>
- <p>Apache Arrow™ enables execution engines to take advantage of
the latest SIMD (Single instruction, multiple data) operations included in
modern processors, for native vectorized optimization of analytical data
processing. Columnar layout is optimized for data locality for better
performance on modern hardware like CPUs and GPUs.</p>
- <p>The Arrow memory format supports <strong>zero-copy reads</strong> for
lightning-fast data access without serialization overhead.</p>
+ <h2 class="mt-3">Format</h2>
+ <p>Apache Arrow defines a language-independent columnar memory format
for flat and hierarchical data, organized for efficient analytic operations on
modern hardware like CPUs and GPUs. The Arrow memory format also supports
<strong>zero-copy reads</strong> for lightning-fast data access without
serialization overhead.</p>
+ <p><a href="{{ site.baseurl }}/overview/">Learn more</a> about the
design or
+ <a href="{{ site.baseurl }}/docs/format/Columnar.html">read the
specification</a>.</p>
</div>
<div class="col-lg-4">
- <h2 class="mt-3">Flexible</h2>
- <p>Arrow acts as a new high-performance interface between various
systems. It is also focused on supporting a wide variety of industry-standard
programming languages. C, C++, C#, Go, Java, JavaScript, MATLAB, Python, R,
Ruby, and Rust implementations are in progress and more languages are welcome.
+ <h2 class="mt-3">Libraries</h2>
+ <p>The Arrow project includes libraries that implement the memory
specification in many languages. They enable you to use the Arrow format as an
efficient means of sharing data across languages and processes. Libraries are
available for <a href="{{ site.baseurl }}/docs/c_glib/">C</a>, <a href="{{
site.baseurl }}/docs/cpp/">C++</a>, <a
href="https://github.com/apache/arrow/blob/master/csharp/README.md">C#</a>, <a
href="https://godoc.org/github.com/apache/arrow/go/arrow">Go</a>, <a href="{{
site.baseurl }}/docs/java/">Java</a>, <a href="{{ site.baseurl
}}/docs/js/">JavaScript</a>, <a
href="https://github.com/apache/arrow/blob/master/matlab/README.md">MATLAB</a>,
<a href="{{ site.baseurl }}/docs/python/">Python</a>, <a href="{{ site.baseurl
}}/docs/r/">R</a>, <a
href="https://github.com/apache/arrow/blob/master/ruby/README.md">Ruby</a>, and
<a href="https://docs.rs/crate/arrow/">Rust</a>.
Review comment:
Arrow's libraries provide building blocks for creating high performance
analytics applications. The libraries implement the Arrow columnar format and
address a wide spectrum of problems related to data access, in-memory data
management, and analytical query processing.
##########
File path: index.html
##########
@@ -1,72 +1,62 @@
</p>
+ See <a href="{{ site.baseurl }}/install/">how to install</a> and <a
href="{{ site.baseurl }}/getting_started/">get started</a>.
</div>
<div class="col-lg-4">
- <h2 class="mt-3">Standard</h2>
- <p>Apache Arrow is backed by key developers of 13 major open source
projects, including Calcite, Cassandra, Drill, Hadoop, HBase, Ibis, Impala,
Kudu, Pandas, Parquet, Phoenix, Spark, and Storm making it the de-facto
standard for columnar in-memory analytics.</p>
- <p>Learn more about projects that are <a href="{{ site.baseurl
}}/powered_by/">Powered By Apache Arrow</a></p>
+ <h2 class="mt-3">Applications</h2>
+ <p>Arrow libraries provide a foundation for developers to build fast
analytics applications. <a href="{{ site.baseurl }}/powered_by/">Many popular
projects</a> use Arrow to ship columnar data efficiently or as the basis for
analytic engines.
+ <p>The libraries also include built-in features for working with data
directly, including Parquet file reading and querying large datasets. See more
Arrow <a href="{{ site.baseurl }}/use_cases/">use cases</a>.</p>
</div>
</div>
-<hr />
+
+<h1>Why Arrow?</h1>
Review comment:
"Why use the Arrow Columnar Format?"
##########
File path: index.html
##########
@@ -1,72 +1,62 @@
- <p>Learn more about projects that are <a href="{{ site.baseurl
}}/powered_by/">Powered By Apache Arrow</a></p>
+ <h2 class="mt-3">Applications</h2>
Review comment:
Ecosystem?
##########
File path: faq.md
##########
@@ -24,32 +24,155 @@ limitations under the License.
-The Arrow in-memory format is considered stable, and we intend to make only
backwards-compatible changes, such as additional data types. We do not yet
recommend the Arrow file format for long-term disk persistence of data; that
said, it is perfectly acceptable to write Arrow memory to disk for purposes of
memory mapping and caching.
+## Getting involved
-We encourage people to start building Arrow-based in-memory computing
applications now, and choose a suitable file format for disk storage if
necessary. The Arrow libraries include adapters for several file formats,
including Parquet, ORC, CSV, and JSON.
+### I have some questions. How can I get help?
+
+The [Arrow mailing lists]({{ site.baseurl }}/community/) are the best place
+to ask questions. Don't be shy--we're here to help.
+
+### I tried to use Arrow and it didn't work. Can you fix it?
+
+Hopefully! Please make a detailed bug report--that's a valuable contribution
+to the project itself.
+See the [contribution guidelines]({{ site.baseurl
}}/docs/developers/contributing.html)
+for how to make a report.
+
+### Arrow looks great and I'd totally use it if it only did X. When will it be
done?
+
+We use [JIRA](https://issues.apache.org/jira/browse/ARROW) for our issue
tracker.
+Search for an issue that matches your need. If you find one, feel free to
+comment on it and describe your use case--that will help whoever picks up
+the task. If you don't find one, create one.
+
+Ultimately, Arrow is software written by and for the community. If you don't
+see someone else in the community working on your issue, the best way to get
+it done is to pitch in yourself. We're more than willing to help you contribute
+successfully to the project.
+
+### How can I report a security vulnerability?
+
+Please send an email to
[[email protected]](mailto:[email protected]).
+See the [security]({{ site.baseurl }}/security/) page for more.
+
+## Relation to other projects
### What is the difference between Apache Arrow and Apache Parquet?
+<!-- Revise this -->
+
+Parquet is a storage format designed for maximum space efficiency, using
+advanced compression and encoding techniques. It is ideal when you want to
+minimize disk usage while storing gigabytes of data or more.
+This efficiency comes at the cost of relatively expensive reading into memory,
+as Parquet data cannot be directly operated on but must be decoded in
+large chunks.
+
+Conversely, Arrow is an in-memory format meant for direct and efficient use
+for computational purposes. Arrow data is not compressed (or only lightly so,
+when using dictionary encoding) but laid out in a natural format for the CPU,
+so that data can be accessed at arbitrary places at full speed.
+
+Therefore, Arrow and Parquet are not competitors: they complement each other
+and are commonly used together in applications. Storing your data on disk
+using Parquet, and reading it into memory in the Arrow format, will allow
+you to make the most of your computing hardware.
-In short, Parquet files are designed for disk storage, while Arrow is designed
for in-memory use, but you can put it on disk and then memory-map later. Arrow
and Parquet are intended to be compatible with each other and used together in
applications.
+### What about "Arrow files" then?
-Parquet is a columnar file format for data serialization. Reading a Parquet
file requires decompressing and decoding its contents into some kind of
in-memory data structure. It is designed to be space/IO-efficient at the
expensive CPU utilization for decoding. It does not provide any data structures
for in-memory computing. Parquet is a streaming format which must be decoded
from start-to-end; while some "index page" facilities have been added to the
storage format recently, random access operations are generally costly.
+Apache Arrow defines an inter-process communication (IPC) mechanism to
+transfer a collection of Arrow columnar arrays (called a "record batch").
+It can be used synchronously between processes using the Arrow "stream format",
+or asynchronously by first persisting data on storage using the Arrow "file
format".
-Arrow on the other hand is first and foremost a library providing columnar
data structures for *in-memory computing*. When you read a Parquet file, you
can decompress and decode the data *into* Arrow columnar data structures so
that you can then perform analytics in-memory on the decoded data. The Arrow
columnar format has some nice properties: random access is O(1) and each value
cell is next to the previous and following one in memory, so it's efficient to
iterate over.
+The Arrow IPC mechanism is based on the Arrow in-memory format, such that
+there is no translation necessary between the on-disk representation and
+the in-memory representation. Therefore, performing analytics on an Arrow
+IPC file can use memory-mapping and pay effectively zero cost.
Review comment:
+1
##########
File path: faq.md
##########
@@ -24,32 +24,155 @@ limitations under the License.
# Frequently Asked Questions
+### Why create a new standard?
+
+<!-- Fill this in -->
Review comment:
       Traditionally, data processing engine developers have created custom
data structures to represent datasets in memory while they are being processed.
Given the "custom" nature of these data structures, they must also develop
serialization interfaces to convert between these data structures and different
file formats, network wire protocols, database clients, and other data
transport interfaces. The net result is an incredible amount of waste,
both in developer time and in CPU cycles spent serializing data from one format
to another.
Therefore, the rationale for Arrow's in-memory columnar data format is to
provide an out-of-the-box solution to several interrelated problems:
* A general purpose tabular data representation that is highly efficient to
process on modern hardware while also being suitable for a wide spectrum of use
cases. We believe that fewer and fewer systems will create their own data
structures and will instead simply use Arrow.
* Supports both random access and streaming / scan-based workloads.
* A standardized memory format facilitates reuse of libraries of algorithms.
When custom in-memory data formats are used, common algorithms must often be
rewritten to target those custom data formats.
   * Systems that use or support Arrow can transfer data between one another at
little-to-no cost. This results in a radical reduction of serialization
overhead in analytical workloads, which can often represent 80-90%
of computing costs.
* The language-agnostic design of the Arrow format enables systems written
in different programming languages (even running on the JVM) to communicate
datasets without serialization overhead. For example, a Java application can
call a C or C++ algorithm on data that originated in the JVM.
... probably some other stuff can be added here
##########
File path: faq.md
##########
@@ -24,32 +24,155 @@ limitations under the License.
# Frequently Asked Questions
+## General
+
+### What *is* Arrow?
Review comment:
"Apache Arrow"
##########
File path: faq.md
##########
@@ -24,32 +24,155 @@ limitations under the License.
# Frequently Asked Questions
-What about "Arrow files" then? Apache Arrow defines a binary "serialization"
protocol for arranging a collection of Arrow columnar arrays (called a "record
batch") that can be used for messaging and interprocess communication. You can
put the protocol anywhere, including on disk, which can later be memory-mapped
or read into memory and sent elsewhere.
+Some things to keep in mind when comparing the Arrow IPC file format and the
+Parquet format:
-This Arrow protocol is designed so that you can "map" a blob of Arrow data
without doing any deserialization, so performing analytics on Arrow protocol
data on disk can use memory-mapping and pay effectively zero cost. The protocol
is used for many other things as well, such as streaming data between Spark SQL
and Python for running pandas functions against chunks of Spark SQL data (these
are called "pandas udfs").
+* Parquet is safe for long-term storage and archival purposes, meaning if
+  you write a file today, you can expect that any system that says it can
+ "read Parquet" will be able to read the file in 5 years or 10 years.
+ We are not yet making this assertion about long-term stability of the Arrow
+ format.
-In some applications, Parquet and Arrow can be used interchangeably for
on-disk data serialization. Some things to keep in mind:
+* Reading Parquet files generally requires expensive decoding, while reading
+ Arrow IPC files is just a matter of transferring raw bytes from the storage
+ hardware.
-* Parquet is intended for "archival" purposes, meaning if you write a file
today, we expect that any system that says they can "read Parquet" will be able
to read the file in 5 years or 7 years. We are not yet making this assertion
about long-term stability of the Arrow format.
-* Parquet is generally a lot more expensive to read because it must be decoded
into some other data structure. Arrow protocol data can simply be memory-mapped.
-* Parquet files are often much smaller than Arrow-protocol-on-disk because of
the data encoding schemes that Parquet uses. If your disk storage or network is
slow, Parquet may be a better choice.
+* Parquet files are often much smaller than Arrow IPC files because of the
+ elaborate encoding schemes that Parquet uses. If your disk storage or network
+ is slow, Parquet may be a better choice even for short-term storage or
caching.
+
+### What about the "Feather" file format?
+
+The Feather v1 format started as a separate specification, but the Feather v2
+format is just another, easier-to-remember name for the Arrow IPC file format.
### How does Arrow relate to Flatbuffers?
-Flatbuffers is a domain-agnostic low-level building block for binary data
formats. It cannot be used directly for data analysis tasks without a lot of
manual scaffolding. Arrow is a data layer aimed directly at the needs of data
analysis, providing elaborate data types (including extensible logical types),
built-in support for "null" values (a.k.a "N/A"), and an expanding toolbox of
I/O and computing facilities.
+Flatbuffers is a low-level building block for binary data serialization.
+It is not adapted to the representation of large, structured, homogeneous
+data, and does not sit at the right abstraction layer for data analysis tasks.
+
+Arrow is a data layer aimed directly at the needs of data analysis, providing
+elaborate data types (including extensible logical types), built-in support
+for "null" values (representing missing data), and an expanding toolbox of I/O
+and computing facilities.
-The Arrow file format does use Flatbuffers under the hood to facilitate
low-level metadata serialization. However, Arrow data has much richer semantics
than Flatbuffers data.
+The Arrow file format does use Flatbuffers under the hood to facilitate
low-level
+metadata serialization, but the Arrow data format uses its own representation
Review comment:
maybe "to serialize schemas and other metadata needed to implement the
Arrow binary IPC protocol"
##########
File path: getting_started.md
##########
@@ -0,0 +1,74 @@
+---
+layout: default
+title: Getting started
+description: Links to user guides to help you start using Arrow
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Getting started
+
+This page collects resources and guides for using Arrow in all of the
project's languages.
+For reference on official release packages, see the
+[install page]({{ site.baseurl }}/install/).
+
+## C
+
+Glib
Review comment:
TODO
##########
File path: index.html
##########
@@ -1,72 +1,62 @@
---
-layout: default
+layout: home
---
-<div class="jumbotron">
- <h1>Apache Arrow</h1>
- <p class="lead">A cross-language development platform for in-memory
data</p>
- <p>
- <a class="btn btn-lg btn-success" style="white-space: normal;"
href="mailto:[email protected]" role="button">Join Mailing List</a>
- <a class="btn btn-lg btn-primary" style="white-space: normal;"
href="{{ site.baseurl }}/install/" role="button">Install
({{site.data.versions['current'].number}} Release -
{{site.data.versions['current'].date}})</a>
- </p>
-</div>
-<h5>
- Interested in contributing?
- <small class="text-muted">Join the <a
href="http://mail-archives.apache.org/mod_mbox/arrow-dev/"><strong>mailing
list</strong></a> or check out the <a
href="https://cwiki.apache.org/confluence/display/ARROW"><strong>developer
wiki</strong></a>.</small>
-</h5>
-<h5>
- <a href="{{ site.baseurl }}/blog/"><strong>See Latest News</strong></a>
-</h5>
-<p>
- {{ site.description }}
-</p>
-<hr />
+<h1>What is Arrow?</h1>
<div class="row">
<div class="col-lg-4">
- <h2 class="mt-3">Fast</h2>
- <p>Apache Arrow™ enables execution engines to take advantage of
the latest SIMD (Single instruction, multiple data) operations included in
modern processors, for native vectorized optimization of analytical data
processing. Columnar layout is optimized for data locality for better
performance on modern hardware like CPUs and GPUs.</p>
- <p>The Arrow memory format supports <strong>zero-copy reads</strong> for
lightning-fast data access without serialization overhead.</p>
+ <h2 class="mt-3">Format</h2>
+ <p>Apache Arrow defines a language-independent columnar memory format
for flat and hierarchical data, organized for efficient analytic operations on
modern hardware like CPUs and GPUs. The Arrow memory format also supports
<strong>zero-copy reads</strong> for lightning-fast data access without
serialization overhead.</p>
+ <p><a href="{{ site.baseurl }}/overview/">Learn more</a> about the
design or
+ <a href="{{ site.baseurl }}/docs/format/Columnar.html">read the
specification</a>.</p>
</div>
<div class="col-lg-4">
- <h2 class="mt-3">Flexible</h2>
- <p>Arrow acts as a new high-performance interface between various
systems. It is also focused on supporting a wide variety of industry-standard
programming languages. C, C++, C#, Go, Java, JavaScript, MATLAB, Python, R,
Ruby, and Rust implementations are in progress and more languages are welcome.
+ <h2 class="mt-3">Libraries</h2>
+ <p>The Arrow project includes libraries that implement the memory
specification in many languages. They enable you to use the Arrow format as an
efficient means of sharing data across languages and processes. Libraries are
available for <a href="{{ site.baseurl }}/docs/c_glib/">C</a>, <a href="{{
site.baseurl }}/docs/cpp/">C++</a>, <a
href="https://github.com/apache/arrow/blob/master/csharp/README.md">C#</a>, <a
href="https://godoc.org/github.com/apache/arrow/go/arrow">Go</a>, <a href="{{
site.baseurl }}/docs/java/">Java</a>, <a href="{{ site.baseurl
}}/docs/js/">JavaScript</a>, <a
href="https://github.com/apache/arrow/blob/master/matlab/README.md">MATLAB</a>,
<a href="{{ site.baseurl }}/docs/python/">Python</a>, <a href="{{ site.baseurl
}}/docs/r/">R</a>, <a
href="https://github.com/apache/arrow/blob/master/ruby/README.md">Ruby</a>, and
<a href="https://docs.rs/crate/arrow/">Rust</a>.
</p>
+ See <a href="{{ site.baseurl }}/install/">how to install</a> and <a
href="{{ site.baseurl }}/getting_started/">get started</a>.
</div>
<div class="col-lg-4">
- <h2 class="mt-3">Standard</h2>
- <p>Apache Arrow is backed by key developers of 13 major open source
projects, including Calcite, Cassandra, Drill, Hadoop, HBase, Ibis, Impala,
Kudu, Pandas, Parquet, Phoenix, Spark, and Storm making it the de-facto
standard for columnar in-memory analytics.</p>
- <p>Learn more about projects that are <a href="{{ site.baseurl
}}/powered_by/">Powered By Apache Arrow</a></p>
+ <h2 class="mt-3">Applications</h2>
+ <p>Arrow libraries provide a foundation for developers to build fast
analytics applications. <a href="{{ site.baseurl }}/powered_by/">Many popular
projects</a> use Arrow to ship columnar data efficiently or as the basis for
analytic engines.</p>
+ <p>The libraries also include built-in features for working with data
directly, including reading Parquet files and querying large datasets. See more
Arrow <a href="{{ site.baseurl }}/use_cases/">use cases</a>.</p>
Review comment:
I would say to condense the 2nd and 3rd points here and change this 3rd
one to be about the ecosystem/community
##########
File path: faq.md
##########
@@ -24,32 +24,155 @@ limitations under the License.
# Frequently Asked Questions
+## General
+
+### What *is* Arrow?
+
+Arrow is an open standard for how to represent columnar data in memory, along
+with libraries in many languages that implement that standard. The Arrow
format
+allows different programs and runtimes, perhaps written in different languages,
+to share data efficiently using a set of rich data types (included nested
+and user-defined data types). The Arrow libraries make it easy to write such
+programs, by sparing the programmer from implementing low-level details of the
+Arrow format.
+
+Arrow additionally defines a streaming format and a file format for
+inter-process communication (IPC), based on the in-memory format. It also
+defines a generic client-server RPC mechanism (Arrow Flight), based on the
+IPC format, and implemented on top of the gRPC framework. <!-- TODO links -->
+
+### Why create a new standard?
+
+<!-- Fill this in -->
+
+## Project status
+
### How stable is the Arrow format? Is it safe to use in my application?
+<!-- Revise this -->
+
+The Arrow *in-memory format* is considered stable, and we intend to make only
+backwards-compatible changes, such as additional data types. It is used by
+many applications already, and you can trust that compatibility will not be
+broken.
+
+The Arrow *file format* (based on the Arrow IPC mechanism) is not recommended
+for long-term disk persistence of data; that said, it is perfectly acceptable
+to write Arrow memory to disk for purposes of memory mapping and caching.
+
+We encourage people to start building Arrow-based in-memory computing
+applications now, and choose a suitable file format for disk storage
+if necessary. The Arrow libraries include adapters for several file formats,
+including Parquet, ORC, CSV, and JSON.
+
+### How stable are the Arrow libraries?
+
+We refer you to the [implementation
matrix](https://github.com/apache/arrow/blob/master/docs/source/status.rst).
+
+## Getting started
+
+### Where can I get Arrow libraries?
+
+Arrow libraries for many languages are available through the usual package
+managers. See the [install]({{ site.baseurl }}/install/) page for specifics.
-The Arrow in-memory format is considered stable, and we intend to make only
backwards-compatible changes, such as additional data types. We do not yet
recommend the Arrow file format for long-term disk persistence of data; that
said, it is perfectly acceptable to write Arrow memory to disk for purposes of
memory mapping and caching.
+## Getting involved
-We encourage people to start building Arrow-based in-memory computing
applications now, and choose a suitable file format for disk storage if
necessary. The Arrow libraries include adapters for several file formats,
including Parquet, ORC, CSV, and JSON.
+### I have some questions. How can I get help?
+
+The [Arrow mailing lists]({{ site.baseurl }}/community/) are the best place
+to ask questions. Don't be shy--we're here to help.
+
+### I tried to use Arrow and it didn't work. Can you fix it?
+
+Hopefully! Please make a detailed bug report--that's a valuable contribution
+to the project itself.
+See the [contribution guidelines]({{ site.baseurl
}}/docs/developers/contributing.html)
+for how to make a report.
+
+### Arrow looks great and I'd totally use it if it only did X. When will it be
done?
+
+We use [JIRA](https://issues.apache.org/jira/browse/ARROW) for our issue
tracker.
+Search for an issue that matches your need. If you find one, feel free to
+comment on it and describe your use case--that will help whoever picks up
+the task. If you don't find one, make it.
+
+Ultimately, Arrow is software written by and for the community. If you don't
+see someone else in the community working on your issue, the best way to get
+it done is to pitch in yourself. We're more than willing to help you contribute
+successfully to the project.
+
+### How can I report a security vulnerability?
+
+Please send an email to
[[email protected]](mailto:[email protected]).
+See the [security]({{ site.baseurl }}/security/) page for more.
+
+## Relation to other projects
### What is the difference between Apache Arrow and Apache Parquet?
+<!-- Revise this -->
+
+Parquet is a storage format designed for maximum space efficiency, using
+advanced compression and encoding techniques. It is ideal when wanting to
+minimize disk usage while storing gigabytes of data, or perhaps more.
+This efficiency comes at the cost of relatively expensive reading into memory,
+as Parquet data cannot be directly operated on but must be decoded in
+large chunks.
+
+Conversely, Arrow is an in-memory format meant for direct and efficient use
+for computational purposes. Arrow data is not compressed (or only lightly so,
+when using dictionary encoding) but laid out in natural format for the CPU,
+so that data can be accessed at arbitrary places at full speed.
+
+Therefore, Arrow and Parquet are not competitors: they complement each other
+and are commonly used together in applications. Storing your data on disk
+using Parquet, and reading it into memory in the Arrow format, will allow
+you to make the most of your computing hardware.
-In short, Parquet files are designed for disk storage, while Arrow is designed
for in-memory use, but you can put it on disk and then memory-map later. Arrow
and Parquet are intended to be compatible with each other and used together in
applications.
+### What about "Arrow files" then?
-Parquet is a columnar file format for data serialization. Reading a Parquet
file requires decompressing and decoding its contents into some kind of
in-memory data structure. It is designed to be space/IO-efficient at the
expensive CPU utilization for decoding. It does not provide any data structures
for in-memory computing. Parquet is a streaming format which must be decoded
from start-to-end; while some "index page" facilities have been added to the
storage format recently, random access operations are generally costly.
+Apache Arrow defines an inter-process communication (IPC) mechanism to
+transfer a collection of Arrow columnar arrays (called a "record batch").
+It can be used synchronously between processes using the Arrow "stream format",
+or asynchronously by first persisting data on storage using the Arrow "file format".
-Arrow on the other hand is first and foremost a library providing columnar
data structures for *in-memory computing*. When you read a Parquet file, you
can decompress and decode the data *into* Arrow columnar data structures so
that you can then perform analytics in-memory on the decoded data. The Arrow
columnar format has some nice properties: random access is O(1) and each value
cell is next to the previous and following one in memory, so it's efficient to
iterate over.
+The Arrow IPC mechanism is based on the Arrow in-memory format, such that
+there is no translation necessary between the on-disk representation and
+the in-memory representation. Therefore, performing analytics on an Arrow
+IPC file can use memory-mapping and pay effectively zero cost.
-What about "Arrow files" then? Apache Arrow defines a binary "serialization"
protocol for arranging a collection of Arrow columnar arrays (called a "record
batch") that can be used for messaging and interprocess communication. You can
put the protocol anywhere, including on disk, which can later be memory-mapped
or read into memory and sent elsewhere.
+Some things to keep in mind when comparing the Arrow IPC file format and the
+Parquet format:
-This Arrow protocol is designed so that you can "map" a blob of Arrow data
without doing any deserialization, so performing analytics on Arrow protocol
data on disk can use memory-mapping and pay effectively zero cost. The protocol
is used for many other things as well, such as streaming data between Spark SQL
and Python for running pandas functions against chunks of Spark SQL data (these
are called "pandas udfs").
+* Parquet is safe for long-term storage and archival purposes, meaning if
+ you write a file today, you can expect that any system that says it can
+ "read Parquet" will be able to read the file in 5 years or 10 years.
+ We are not yet making this assertion about long-term stability of the Arrow
+ format.
-In some applications, Parquet and Arrow can be used interchangeably for
on-disk data serialization. Some things to keep in mind:
+* Reading Parquet files generally requires expensive decoding, while reading
+ Arrow IPC files is just a matter of transferring raw bytes from the storage
+ hardware.
-* Parquet is intended for "archival" purposes, meaning if you write a file
today, we expect that any system that says they can "read Parquet" will be able
to read the file in 5 years or 7 years. We are not yet making this assertion
about long-term stability of the Arrow format.
-* Parquet is generally a lot more expensive to read because it must be decoded
into some other data structure. Arrow protocol data can simply be memory-mapped.
-* Parquet files are often much smaller than Arrow-protocol-on-disk because of
the data encoding schemes that Parquet uses. If your disk storage or network is
slow, Parquet may be a better choice.
+* Parquet files are often much smaller than Arrow IPC files because of the
+ elaborate encoding schemes that Parquet uses. If your disk storage or network
+ is slow, Parquet may be a better choice even for short-term storage or
+ caching.
+
+### What about the "Feather" file format?
+
+The Feather v1 format started as a separate specification, but the Feather v2
+format is just another, easier to remember name for the Arrow IPC file format.
Review comment:
"started as a separate specification" -> "was a simplified custom
container for writing a subset of the Arrow format to disk prior to the
development of the Arrow IPC file format. "Feather version 2" is now exactly
the Arrow IPC file format and we have retained the "Feather" name and APIs for
backwards compatibility."
##########
File path: faq.md
##########
@@ -24,32 +24,155 @@ limitations under the License.
# Frequently Asked Questions
+## General
+
+### What *is* Arrow?
+
+Arrow is an open standard for how to represent columnar data in memory, along
+with libraries in many languages that implement that standard. The Arrow format
+allows different programs and runtimes, perhaps written in different languages,
+to share data efficiently using a set of rich data types (including nested
+and user-defined data types). The Arrow libraries make it easy to write such
+programs, by sparing the programmer from implementing low-level details of the
+Arrow format.
+
+Arrow additionally defines a streaming format and a file format for
+inter-process communication (IPC), based on the in-memory format. It also
+defines a generic client-server RPC mechanism (Arrow Flight), based on the
+IPC format, and implemented on top of the gRPC framework. <!-- TODO links -->
+
+### Why create a new standard?
+
+<!-- Fill this in -->
+
+## Project status
+
### How stable is the Arrow format? Is it safe to use in my application?
+<!-- Revise this -->
+
+The Arrow *in-memory format* is considered stable, and we intend to make only
+backwards-compatible changes, such as additional data types. It is used by
+many applications already, and you can trust that compatibility will not be
+broken.
+
+The Arrow *file format* (based on the Arrow IPC mechanism) is not recommended
+for long-term disk persistence of data; that said, it is perfectly acceptable
+to write Arrow memory to disk for purposes of memory mapping and caching.
+
+We encourage people to start building Arrow-based in-memory computing
+applications now, and choose a suitable file format for disk storage
+if necessary. The Arrow libraries include adapters for several file formats,
+including Parquet, ORC, CSV, and JSON.
+
+### How stable are the Arrow libraries?
+
+We refer you to the [implementation matrix](https://github.com/apache/arrow/blob/master/docs/source/status.rst).
+
+## Getting started
+
+### Where can I get Arrow libraries?
+
+Arrow libraries for many languages are available through the usual package
+managers. See the [install]({{ site.baseurl }}/install/) page for specifics.
-The Arrow in-memory format is considered stable, and we intend to make only
backwards-compatible changes, such as additional data types. We do not yet
recommend the Arrow file format for long-term disk persistence of data; that
said, it is perfectly acceptable to write Arrow memory to disk for purposes of
memory mapping and caching.
+## Getting involved
-We encourage people to start building Arrow-based in-memory computing
applications now, and choose a suitable file format for disk storage if
necessary. The Arrow libraries include adapters for several file formats,
including Parquet, ORC, CSV, and JSON.
+### I have some questions. How can I get help?
+
+The [Arrow mailing lists]({{ site.baseurl }}/community/) are the best place
+to ask questions. Don't be shy--we're here to help.
+
+### I tried to use Arrow and it didn't work. Can you fix it?
+
+Hopefully! Please make a detailed bug report--that's a valuable contribution
+to the project itself.
+See the [contribution guidelines]({{ site.baseurl }}/docs/developers/contributing.html)
+for how to make a report.
+
+### Arrow looks great and I'd totally use it if it only did X. When will it be done?
+
+We use [JIRA](https://issues.apache.org/jira/browse/ARROW) for our issue tracker.
+Search for an issue that matches your need. If you find one, feel free to
+comment on it and describe your use case--that will help whoever picks up
+the task. If you don't find one, create it.
+
+Ultimately, Arrow is software written by and for the community. If you don't
+see someone else in the community working on your issue, the best way to get
+it done is to pitch in yourself. We're more than willing to help you contribute
+successfully to the project.
+
+### How can I report a security vulnerability?
+
+Please send an email to [[email protected]](mailto:[email protected]).
+See the [security]({{ site.baseurl }}/security/) page for more.
+
+## Relation to other projects
### What is the difference between Apache Arrow and Apache Parquet?
+<!-- Revise this -->
+
+Parquet is a storage format designed for maximum space efficiency, using
+advanced compression and encoding techniques. It is ideal when you want to
+minimize disk usage while storing gigabytes of data or more.
+This efficiency comes at the cost of relatively expensive reading into memory,
+as Parquet data cannot be directly operated on but must be decoded in
+large chunks.
+
+Conversely, Arrow is an in-memory format meant for direct and efficient use
+for computational purposes. Arrow data is not compressed (or only lightly so,
+when using dictionary encoding) but laid out in a natural format for the CPU,
+so that data can be accessed at arbitrary places at full speed.
+
+Therefore, Arrow and Parquet are not competitors: they complement each other
+and are commonly used together in applications. Storing your data on disk
+using Parquet, and reading it into memory in the Arrow format, will allow
+you to make the most of your computing hardware.
-In short, Parquet files are designed for disk storage, while Arrow is designed
for in-memory use, but you can put it on disk and then memory-map later. Arrow
and Parquet are intended to be compatible with each other and used together in
applications.
+### What about "Arrow files" then?
-Parquet is a columnar file format for data serialization. Reading a Parquet
file requires decompressing and decoding its contents into some kind of
in-memory data structure. It is designed to be space/IO-efficient at the
expensive CPU utilization for decoding. It does not provide any data structures
for in-memory computing. Parquet is a streaming format which must be decoded
from start-to-end; while some "index page" facilities have been added to the
storage format recently, random access operations are generally costly.
+Apache Arrow defines an inter-process communication (IPC) mechanism to
+transfer a collection of Arrow columnar arrays (called a "record batch").
+It can be used synchronously between processes using the Arrow "stream format",
+or asynchronously by first persisting data on storage using the Arrow "file format".
-Arrow on the other hand is first and foremost a library providing columnar
data structures for *in-memory computing*. When you read a Parquet file, you
can decompress and decode the data *into* Arrow columnar data structures so
that you can then perform analytics in-memory on the decoded data. The Arrow
columnar format has some nice properties: random access is O(1) and each value
cell is next to the previous and following one in memory, so it's efficient to
iterate over.
+The Arrow IPC mechanism is based on the Arrow in-memory format, such that
+there is no translation necessary between the on-disk representation and
+the in-memory representation. Therefore, performing analytics on an Arrow
+IPC file can use memory-mapping and pay effectively zero cost.
-What about "Arrow files" then? Apache Arrow defines a binary "serialization"
protocol for arranging a collection of Arrow columnar arrays (called a "record
batch") that can be used for messaging and interprocess communication. You can
put the protocol anywhere, including on disk, which can later be memory-mapped
or read into memory and sent elsewhere.
+Some things to keep in mind when comparing the Arrow IPC file format and the
+Parquet format:
-This Arrow protocol is designed so that you can "map" a blob of Arrow data
without doing any deserialization, so performing analytics on Arrow protocol
data on disk can use memory-mapping and pay effectively zero cost. The protocol
is used for many other things as well, such as streaming data between Spark SQL
and Python for running pandas functions against chunks of Spark SQL data (these
are called "pandas udfs").
+* Parquet is safe for long-term storage and archival purposes, meaning if
+ you write a file today, you can expect that any system that says it can
+ "read Parquet" will be able to read the file in 5 years or 10 years.
+ We are not yet making this assertion about long-term stability of the Arrow
+ format.
-In some applications, Parquet and Arrow can be used interchangeably for
on-disk data serialization. Some things to keep in mind:
+* Reading Parquet files generally requires expensive decoding, while reading
+ Arrow IPC files is just a matter of transferring raw bytes from the storage
+ hardware.
-* Parquet is intended for "archival" purposes, meaning if you write a file
today, we expect that any system that says they can "read Parquet" will be able
to read the file in 5 years or 7 years. We are not yet making this assertion
about long-term stability of the Arrow format.
-* Parquet is generally a lot more expensive to read because it must be decoded
into some other data structure. Arrow protocol data can simply be memory-mapped.
-* Parquet files are often much smaller than Arrow-protocol-on-disk because of
the data encoding schemes that Parquet uses. If your disk storage or network is
slow, Parquet may be a better choice.
+* Parquet files are often much smaller than Arrow IPC files because of the
+ elaborate encoding schemes that Parquet uses. If your disk storage or network
Review comment:
"elaborate" seems a bit emotionally charged to me, let's use something
more neutral and precise
"elaborate encoding schemes" -> "columnar data compression strategies"
##########
File path: faq.md
##########
@@ -24,32 +24,155 @@ limitations under the License.
# Frequently Asked Questions
+## General
+
+### What *is* Arrow?
+
+Arrow is an open standard for how to represent columnar data in memory, along
+with libraries in many languages that implement that standard. The Arrow format
+allows different programs and runtimes, perhaps written in different languages,
+to share data efficiently using a set of rich data types (including nested
+and user-defined data types). The Arrow libraries make it easy to write such
+programs, by sparing the programmer from implementing low-level details of the
+Arrow format.
+
+Arrow additionally defines a streaming format and a file format for
+inter-process communication (IPC), based on the in-memory format. It also
+defines a generic client-server RPC mechanism (Arrow Flight), based on the
+IPC format, and implemented on top of the gRPC framework. <!-- TODO links -->
+
+### Why create a new standard?
+
+<!-- Fill this in -->
+
+## Project status
+
### How stable is the Arrow format? Is it safe to use in my application?
+<!-- Revise this -->
+
+The Arrow *in-memory format* is considered stable, and we intend to make only
Review comment:
Maybe "columnar format and protocol"
##########
File path: faq.md
##########
@@ -24,32 +24,155 @@ limitations under the License.
# Frequently Asked Questions
+## General
+
+### What *is* Arrow?
+
+Arrow is an open standard for how to represent columnar data in memory, along
+with libraries in many languages that implement that standard. The Arrow format
+allows different programs and runtimes, perhaps written in different languages,
+to share data efficiently using a set of rich data types (including nested
+and user-defined data types). The Arrow libraries make it easy to write such
+programs, by sparing the programmer from implementing low-level details of the
+Arrow format.
+
+Arrow additionally defines a streaming format and a file format for
+inter-process communication (IPC), based on the in-memory format. It also
+defines a generic client-server RPC mechanism (Arrow Flight), based on the
+IPC format, and implemented on top of the gRPC framework. <!-- TODO links -->
+
+### Why create a new standard?
Review comment:
"Why define a standard for columnar in-memory?"
There can't be a new standard if there isn't an old one. There never was
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]