Repository: arrow-site Updated Branches: refs/heads/asf-site 6a8b4465c -> 0a7dc4187
http://git-wip-us.apache.org/repos/asf/arrow-site/blob/0a7dc418/docs/ipc.html ---------------------------------------------------------------------- diff --git a/docs/ipc.html b/docs/ipc.html index 6d96632..c480fea 100644 --- a/docs/ipc.html +++ b/docs/ipc.html @@ -145,7 +145,7 @@ <ul> <li>A length prefix indicating the metadata size</li> - <li>The message metadata as a <a href="https://github.com/google/flatbuffers">Flatbuffer</a></li> + <li>The message metadata as a <a href="https://github.com/google/flatbuffers">Flatbuffer</a></li> <li>Padding bytes to an 8-byte boundary</li> <li>The message body, which must be a multiple of 8 bytes</li> </ul> @@ -190,9 +190,7 @@ flatbuffer union), and the size of the message body:</p> of encapsulated messages, each of which follows the format above. The schema comes first in the stream, and it is the same for all of the record batches that follow. If any fields in the schema are dictionary-encoded, one or more -<code class="highlighter-rouge">DictionaryBatch</code> messages will be included. <code class="highlighter-rouge">DictionaryBatch</code> and -<code class="highlighter-rouge">RecordBatch</code> messages may be interleaved, but before any dictionary key is used -in a <code class="highlighter-rouge">RecordBatch</code> it should be defined in a <code class="highlighter-rouge">DictionaryBatch</code>.</p> +<code class="highlighter-rouge">DictionaryBatch</code> messages will follow the schema.</p> <div class="highlighter-rouge"><pre class="highlight"><code><SCHEMA> <DICTIONARY 0> @@ -200,10 +198,6 @@ in a <code class="highlighter-rouge">RecordBatch</code> it should be defined in <DICTIONARY k - 1> <RECORD BATCH 0> ... -<DICTIONARY x DELTA> -... -<DICTIONARY y DELTA> -... 
<RECORD BATCH n - 1> <EOS [optional]: int32> </code></pre> @@ -238,10 +232,6 @@ footer.</p> </code></pre> </div> -<p>In the file format, there is no requirement that dictionary keys should be -defined in a <code class="highlighter-rouge">DictionaryBatch</code> before they are used in a <code class="highlighter-rouge">RecordBatch</code>, as long -as the keys are defined somewhere in the file.</p> - <h3 id="recordbatch-body-structure">RecordBatch body structure</h3> <p>The <code class="highlighter-rouge">RecordBatch</code> metadata contains a depth-first (pre-order) flattened set of @@ -315,7 +305,6 @@ the dictionaries can be properly interpreted.</p> <div class="highlighter-rouge"><pre class="highlight"><code>table DictionaryBatch { id: long; data: RecordBatch; - isDelta: boolean = false; } </code></pre> </div> @@ -325,38 +314,6 @@ in the schema, so that dictionaries can even be used for multiple fields. See the <a href="https://github.com/apache/arrow/blob/master/format/Layout.md">Physical Layout</a> document for more about the semantics of dictionary-encoded data.</p> -<p>The dictionary <code class="highlighter-rouge">isDelta</code> flag allows dictionary batches to be modified -mid-stream. A dictionary batch with <code class="highlighter-rouge">isDelta</code> set indicates that its vector -should be concatenated with those of any previous batches with the same <code class="highlighter-rouge">id</code>. 
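The encapsulated message framing described in this document (an int32 length prefix, Flatbuffer metadata, padding to an 8-byte boundary, then a body that is a multiple of 8 bytes) can be sketched in a few lines of Python. This is only an illustration of the padding arithmetic, not a real Arrow writer: the metadata bytes below are a placeholder, and the precise semantics of the length prefix are defined by the Flatbuffer schema in Message.fbs.

```python
import struct

def frame_message(metadata: bytes, body: bytes) -> bytes:
    """Illustrate the framing: int32 length prefix, metadata bytes,
    zero-padding to an 8-byte boundary, then the message body."""
    if len(body) % 8 != 0:
        raise ValueError("message body must be a multiple of 8 bytes")
    prefix_and_metadata = struct.pack("<i", len(metadata)) + metadata
    padding = (-len(prefix_and_metadata)) % 8  # pad to the next 8-byte boundary
    return prefix_and_metadata + b"\x00" * padding + body

framed = frame_message(b"fake-flatbuffer", b"8bytes!!")
assert len(framed) % 8 == 0
```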
A -stream which encodes one column, the list of strings -<code class="highlighter-rouge">["A", "B", "C", "B", "D", "C", "E", "A"]</code>, with a delta dictionary batch could -take the form:</p> - -<div class="highlighter-rouge"><pre class="highlight"><code><SCHEMA> -<DICTIONARY 0> -(0) "A" -(1) "B" -(2) "C" - -<RECORD BATCH 0> -0 -1 -2 -1 - -<DICTIONARY 0 DELTA> -(3) "D" -(4) "E" - -<RECORD BATCH 1> -3 -2 -4 -0 -EOS -</code></pre> -</div> - <h3 id="tensor-multi-dimensional-array-message-format">Tensor (Multi-dimensional Array) Message Format</h3> <p>The <code class="highlighter-rouge">Tensor</code> message type provides a way to write a multidimensional array of http://git-wip-us.apache.org/repos/asf/arrow-site/blob/0a7dc418/docs/memory_layout.html ---------------------------------------------------------------------- diff --git a/docs/memory_layout.html b/docs/memory_layout.html index 0eb8d03..16a43ea 100644 --- a/docs/memory_layout.html +++ b/docs/memory_layout.html @@ -161,8 +161,9 @@ from <code class="highlighter-rouge">List<V></code> iff U and V are differ or a fully-specified nested type. When we say slot we mean a relative type value, not necessarily any physical storage region.</li> <li>Logical type: A data type that is implemented using some relative (physical) -type. For example, Decimal values are stored as 16 bytes in a fixed byte -size array. Similarly, strings can be stored as <code class="highlighter-rouge">List<1-byte></code>.</li> +type. For example, a Decimal value stored in 16 bytes could be stored in a +primitive array with slot size 16 bytes. Similarly, strings can be stored as +<code class="highlighter-rouge">List<1-byte></code>.</li> <li>Parent and child arrays: names to express relationships between physical value arrays in a nested type structure. 
For example, a <code class="highlighter-rouge">List<T></code>-type parent array has a T-type array as its child (see more on lists below).</li> @@ -751,9 +752,9 @@ the types array indicates that a slot contains a different type at the index <h2 id="dictionary-encoding">Dictionary encoding</h2> <p>When a field is dictionary encoded, the values are represented by an array of Int32 representing the index of the value in the dictionary. -The Dictionary is received as one or more DictionaryBatches with the id referenced by a dictionary attribute defined in the metadata (<a href="https://github.com/apache/arrow/blob/master/format/Message.fbs">Message.fbs</a>) in the Field table. -The dictionary has the same layout as the type of the field would dictate. Each entry in the dictionary can be accessed by its index in the DictionaryBatches. -When a Schema references a Dictionary id, it must send at least one DictionaryBatch for this id.</p> +The Dictionary is received as a DictionaryBatch whose id is referenced by a dictionary attribute defined in the metadata (<a href="https://github.com/apache/arrow/blob/master/format/Message.fbs">Message.fbs</a>) in the Field table. +The dictionary has the same layout as the type of the field would dictate. Each entry in the dictionary can be accessed by its index in the DictionaryBatch. 
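The index/dictionary split described here is easy to illustrate in plain Python. The following is a conceptual sketch of what dictionary encoding produces (integer indices plus a dictionary of distinct values), not the pyarrow API; it reuses the example column of strings from the IPC document above.

```python
def dictionary_encode(values):
    """Split a column into (indices, dictionary): each value is replaced
    by the index of its first occurrence in the dictionary."""
    dictionary, index_of, indices = [], {}, []
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        indices.append(index_of[v])
    return indices, dictionary

indices, dictionary = dictionary_encode(["A", "B", "C", "B", "D", "C", "E", "A"])
# The record batch would carry the indices; the DictionaryBatch carries the values.
```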
+When a Schema references a Dictionary id, it must send a DictionaryBatch for this id before any RecordBatch.</p> <p>As an example, you could have the following data:</p> <div class="highlighter-rouge"><pre class="highlight"><code>type: List<String> http://git-wip-us.apache.org/repos/asf/arrow-site/blob/0a7dc418/docs/metadata.html ---------------------------------------------------------------------- diff --git a/docs/metadata.html b/docs/metadata.html index 9b12883..9e25689 100644 --- a/docs/metadata.html +++ b/docs/metadata.html @@ -530,8 +530,7 @@ logical type, which have no children) and 3 buffers:</p> <h3 id="decimal">Decimal</h3> -<p>Decimals are represented as a 2’s complement 128-bit (16 byte) signed integer -in little-endian byte order.</p> +<p>TBD</p> <h3 id="timestamp">Timestamp</h3> http://git-wip-us.apache.org/repos/asf/arrow-site/blob/0a7dc418/feed.xml ---------------------------------------------------------------------- diff --git a/feed.xml b/feed.xml index 27aeb5d..ea204d8 100644 --- a/feed.xml +++ b/feed.xml @@ -1,4 +1,238 @@ -<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.4.3">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2017-12-18T19:07:25-08:00</updated><id>/</id><entry><title type="html">Fast Python Serialization with Ray and Apache Arrow</title><link href="/blog/2017/10/15/fast-python-serialization-with-ray-and-arrow/" rel="alternate" type="text/html" title="Fast Python Serialization with Ray and Apache Arrow" /><published>2017-10-15T07:00:00-07:00</published><updated>2017-10-15T07:00:00-07:00</updated><id>/blog/2017/10/15/fast-python-serialization-with-ray-and-arrow</id><content type="html" xml:base="/blog/2017/10/15/fast-python-serialization-with-ray-and-arrow/"><!-- +<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator 
uri="https://jekyllrb.com/" version="3.4.3">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2017-12-19T10:30:45-05:00</updated><id>/</id><entry><title type="html">Apache Arrow 0.8.0 Release</title><link href="/blog/2017/12/18/0.8.0-release/" rel="alternate" type="text/html" title="Apache Arrow 0.8.0 Release" /><published>2017-12-18T23:01:00-05:00</published><updated>2017-12-18T23:01:00-05:00</updated><id>/blog/2017/12/18/0.8.0-release</id><content type="html" xml:base="/blog/2017/12/18/0.8.0-release/"><!-- + +--> + +<p>The Apache Arrow team is pleased to announce the 0.8.0 release. It is the +product of 10 weeks of development and includes <a href="https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Resolved%2C%20Closed)%20AND%20fixVersion%20%3D%200.8.0"><strong>286 resolved JIRAs</strong></a> with +many new features and bug fixes to the various language implementations. This +is the largest release since 0.3.0 earlier this year.</p> + +<p>As part of work towards stabilizing the Arrow format and making a 1.0.0 +release sometime in 2018, we made a series of backwards-incompatible changes to +the serialized Arrow metadata that require Arrow readers and writers (0.7.1 +and earlier) to upgrade in order to be compatible with 0.8.0 and higher. We +expect backwards-incompatible changes to be rare going forward.</p> + +<p>See the <a href="https://arrow.apache.org/install">Install Page</a> to learn how to get the libraries for your +platform. The <a href="https://github.com/kou">complete changelog</a> is also available.</p> + +<p>We discuss some highlights from the release and other project news in this +post.</p> + +<h2 id="projects-powered-by-apache-arrow">Projects “Powered By” Apache Arrow</h2> + +<p>A growing ecosystem of projects is using Arrow to solve in-memory analytics +and data interchange problems. 
We have added a new <a href="http://arrow.apache.org/powered_by/">Powered By</a> page to the +Arrow website where we can acknowledge open source projects and companies which +are using Arrow. If you would like to add your project to the list as an Arrow +user, please let us know.</p> + +<h2 id="new-arrow-committers">New Arrow committers</h2> + +<p>Since the last release, we have added 5 new Apache committers:</p> + +<ul> + <li><a href="https://github.com/cpcloud">Phillip Cloud</a>, who has mainly contributed to C++ and Python</li> + <li><a href="https://github.com/BryanCutler">Bryan Cutler</a>, who has mainly contributed to Java and Spark integration</li> + <li><a href="https://github.com/icexelloss">Li Jin</a>, who has mainly contributed to Java and Spark integration</li> + <li><a href="https://github.com/trxcllnt">Paul Taylor</a>, who has mainly contributed to JavaScript</li> + <li><a href="https://github.com/siddharthteotia">Siddharth Teotia</a>, who has mainly contributed to Java</li> +</ul> + +<p>Welcome to the Arrow team, and thank you for your contributions!</p> + +<h2 id="improved-java-vector-api-performance-improvements">Improved Java vector API, performance improvements</h2> + +<p>Siddharth Teotia led efforts to revamp the Java vector API to make things +simpler and faster. 
As part of this, we removed the dichotomy between nullable +and non-nullable vectors.</p> + +<p>See <a href="https://arrow.apache.org/blog/2017/12/19/java-vector-improvements/">Sidd’s blog post</a> for more about these changes.</p> + +<h2 id="decimal-support-in-c-python-consistency-with-java">Decimal support in C++, Python, consistency with Java</h2> + +<p><a href="https://github.com/cpcloud">Phillip Cloud</a> led efforts this release to harden details about exact +decimal values in the Arrow specification and ensure a consistent +implementation across Java, C++, and Python.</p> + +<p>Arrow now supports decimals represented internally as a 128-bit little-endian +integer, with a set precision and scale (as defined in many SQL-based +systems). As part of this work, we needed to change Java’s internal +representation from big- to little-endian.</p> + +<p>We are now integration testing decimals between Java, C++, and Python, which +will facilitate Arrow adoption in Apache Spark and other systems that use both +Java and Python.</p> + +<p>Decimal data can now be read and written by the <a href="https://github.com/apache/parquet-cpp">Apache Parquet C++ +library</a>, including via pyarrow.</p> + +<p>In the future, we may implement support for smaller-precision decimals +represented by 32- or 64-bit integers.</p> + +<h2 id="c-improvements-expanded-kernels-library-and-more">C++ improvements: expanded kernels library and more</h2> + +<p>In C++, we have continued developing the new <code class="highlighter-rouge">arrow::compute</code> submodule +consisting of native computation functions for Arrow data. New contributor +<a href="https://github.com/licht-t">Licht Takeuchi</a> helped expand the supported types for type casting in +<code class="highlighter-rouge">compute::Cast</code>. 
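The 128-bit decimal representation described above can be sketched with standard-library integers: the value is scaled to an unscaled integer using the type's scale, then stored as a 16-byte two's-complement little-endian integer. This is a conceptual sketch of the layout, not the pyarrow or C++ API.

```python
from decimal import Decimal

def encode_decimal128(value: Decimal, scale: int) -> bytes:
    """Scale the decimal to an integer, then store it as a 128-bit
    (16-byte) little-endian two's-complement signed integer."""
    unscaled = int(value.scaleb(scale))  # e.g. -123.45 with scale 2 -> -12345
    return unscaled.to_bytes(16, "little", signed=True)

raw = encode_decimal128(Decimal("-123.45"), scale=2)
assert int.from_bytes(raw, "little", signed=True) == -12345
```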
We have also implemented new kernels <code class="highlighter-rouge">Unique</code> and +<code class="highlighter-rouge">DictionaryEncode</code> for computing the distinct elements of an array and +dictionary encoding (conversion to categorical), respectively.</p> + +<p>We expect the C++ computation “kernel” library to be a major expansion area for +the project over the next year and beyond. Here, we can also implement SIMD- +and GPU-accelerated versions of basic in-memory analytics functionality.</p> + +<p>As a minor breaking API change in C++, we have made the <code class="highlighter-rouge">RecordBatch</code> and <code class="highlighter-rouge">Table</code> +APIs “virtual” or abstract interfaces, to enable different implementations of a +record batch or table which conform to the standard interface. This will help +enable features like lazy IO or column loading.</p> + +<p>There was significant work improving the C++ library generally and supporting +work happening in Python and C. See the change log for full details.</p> + +<h2 id="glib-c-improvements-meson-build-gpu-support">GLib C improvements: Meson build, GPU support</h2> + +<p>Development of the GLib-based C bindings has generally tracked work happening in +the C++ library. 
These bindings are being used to develop <a href="https://github.com/red-data-tools">data science tools +for Ruby users</a> and elsewhere.</p> + +<p>The C bindings now support the <a href="https://mesonbuild.com">Meson build system</a> in addition to +autotools, which enables them to be built on Windows.</p> + +<p>The Arrow GPU extension library is now also supported in the C bindings.</p> + +<h2 id="javascript-first-independent-release-on-npm">JavaScript: first independent release on NPM</h2> + +<p><a href="https://github.com/TheNeuralBit">Brian Hulette</a> and <a href="https://github.com/trxcllnt">Paul Taylor</a> have been continuing to drive efforts +on the TypeScript-based JavaScript implementation.</p> + +<p>Since the last release, we made a first JavaScript-only Apache release, version +0.2.0, which is <a href="http://npmjs.org/package/apache-arrow">now available on NPM</a>. We decided to make separate +JavaScript releases to enable the JS library to release more frequently than +the rest of the project.</p> + +<h2 id="python-improvements">Python improvements</h2> + +<p>In addition to some of the new features mentioned above, we have made a variety +of usability and performance improvements for integrations with pandas, NumPy, +Dask, and other Python projects which may make use of pyarrow, the Arrow Python +library.</p> + +<p>Some of these improvements include:</p> + +<ul> + <li><a href="http://arrow.apache.org/docs/python/ipc.html">Component-based serialization</a> for more flexible and memory-efficient +transport of large or complex Python objects</li> + <li>Substantially improved serialization performance for pandas objects when +using <code class="highlighter-rouge">pyarrow.serialize</code> and <code class="highlighter-rouge">pyarrow.deserialize</code>. 
This includes a special +<code class="highlighter-rouge">pyarrow.pandas_serialization_context</code> which further accelerates certain +internal details of pandas serialization * Support zero-copy reads for</li> + <li><code class="highlighter-rouge">pandas.DataFrame</code> using <code class="highlighter-rouge">pyarrow.deserialize</code> for objects without Python +objects</li> + <li>Multithreaded conversions from <code class="highlighter-rouge">pandas.DataFrame</code> to <code class="highlighter-rouge">pyarrow.Table</code> (we +already supported multithreaded conversions from Arrow back to pandas)</li> + <li>More efficient conversion from 1-dimensional NumPy arrays to Arrow format</li> + <li>New generic buffer compression and decompression APIs <code class="highlighter-rouge">pyarrow.compress</code> and +<code class="highlighter-rouge">pyarrow.decompress</code></li> + <li>Enhanced Parquet cross-compatibility with <a href="https://github.com/dask/fastparquet">fastparquet</a> and improved Dask +support</li> + <li>Python support for accessing Parquet row group column statistics</li> +</ul> + +<h2 id="upcoming-roadmap">Upcoming Roadmap</h2> + +<p>The 0.8.0 release includes some API and format changes, but upcoming releases +will focus on ompleting and stabilizing critical functionality to move the +project closer to a 1.0.0 release.</p> + +<p>With the ecosystem of projects using Arrow expanding rapidly, we will be +working to improve and expand the libraries in support of downstream use cases.</p> + +<p>We continue to look for more JavaScript, Julia, R, Rust, and other programming +language developers to join the project and expand the available +implementations and bindings to more languages.</p></content><author><name>wesm</name></author></entry><entry><title type="html">Improvements to Java Vector API in Apache Arrow 0.8.0</title><link href="/blog/2017/12/19/java-vector-improvements/" rel="alternate" type="text/html" title="Improvements to Java Vector API in Apache 
Arrow 0.8.0" /><published>2017-12-18T19:00:00-05:00</published><updated>2017-12-18T19:00:00-05:00</updated><id>/blog/2017/12/19/java-vector-improvements</id><content type="html" xml:base="/blog/2017/12/19/java-vector-improvements/"><!-- + +--> + +<p>This post gives insight into the major improvements in the Java implementation +of vectors. We undertook this work over the last 10 weeks since the last Arrow +release.</p> + +<h2 id="design-goals">Design Goals</h2> + +<ol> + <li>Improved maintainability and extensibility</li> + <li>Improved heap memory usage</li> + <li>No performance overhead on hot code paths</li> +</ol> + +<h2 id="background">Background</h2> + +<h3 id="improved-maintainability-and-extensibility">Improved maintainability and extensibility</h3> + +<p>We use templates in several places for compile time Java code generation for +different vector classes, readers, writers etc. Templates are helpful as the +developers don’t have to write a lot of duplicate code.</p> + +<p>However, we realized that over a period of time some specific Java +templates became extremely complex with giant if-else blocks, poor code indentation +and documentation. All this impacted the ability to easily extend these templates +for adding new functionality or improving the existing infrastructure.</p> + +<p>So we evaluated the usage of templates for compile time code generation and +decided not to use complex templates in some places by writing a small amount of +duplicate code which is elegant, well documented and extensible.</p> + +<h3 id="improved-heap-usage">Improved heap usage</h3> + +<p>We did extensive memory analysis downstream in <a href="https://www.dremio.com/">Dremio</a> where Arrow is used +heavily for in-memory query execution on columnar data. The general conclusion +was that Arrow’s Java vector classes have non-negligible heap overhead and +that the volume of objects was too high. 
There were places in the code where we were +creating objects unnecessarily and using structures that could be substituted +with better alternatives.</p> + +<h3 id="no-performance-overhead-on-hot-code-paths">No performance overhead on hot code paths</h3> + +<p>Java vectors used delegation and abstraction heavily throughout the object +hierarchy. The performance critical get/set methods of vectors went through a +chain of function calls back and forth between different objects before doing +meaningful work. We also evaluated the usage of branches in vector APIs and +reimplemented some of them by avoiding branches completely.</p> + +<p>We took inspiration from how the Java memory code in <code class="highlighter-rouge">ArrowBuf</code> works. For all +the performance critical methods, <code class="highlighter-rouge">ArrowBuf</code> bypasses all the netty object +hierarchy, grabs the target virtual address and directly interacts with the +memory.</p> + +<p>There were cases where branches could be avoided altogether.</p> + +<p>In the case of nullable vectors, we were doing multiple checks to confirm if +the value at a given position in the vector is null or not.</p> + +<h2 id="our-implementation-approach">Our implementation approach</h2> + +<ul> + <li>For scalars, the inheritance tree was simplified by writing different +abstract base classes for fixed and variable width scalars.</li> + <li>The base classes contained all the common functionality across different +types.</li> + <li>The individual subclasses implemented type specific APIs for fixed and +variable width scalar vectors.</li> + <li>For the performance critical methods, all the work is done either in +the vector class or corresponding ArrowBuf. There is no delegation to any +internal object.</li> + <li>The mutator and accessor based access to vector APIs is removed. 
These +objects led to unnecessary heap overhead and complicated the use of APIs.</li> + <li>Both scalar and complex vectors directly interact with underlying buffers +that manage the offsets, data and validity. Earlier we were creating different +inner vectors for each vector and delegating all the functionality to inner +vectors. This introduced a lot of bugs in memory management, excessive heap +overhead and performance penalty due to chain of delegations.</li> + <li>We reduced the number of vector classes by removing non-nullable vectors. +In the new implementation, all vectors in Java are nullable in nature.</li> +</ul></content><author><name>Siddharth Teotia</name></author><summary type="html">This post describes the recent improvements in Java Vector code</summary></entry><entry><title type="html">Fast Python Serialization with Ray and Apache Arrow</title><link href="/blog/2017/10/15/fast-python-serialization-with-ray-and-arrow/" rel="alternate" type="text/html" title="Fast Python Serialization with Ray and Apache Arrow" /><published>2017-10-15T10:00:00-04:00</published><updated>2017-10-15T10:00:00-04:00</updated><id>/blog/2017/10/15/fast-python-serialization-with-ray-and-arrow</id><content type="html" xml:base="/blog/2017/10/15/fast-python-serialization-with-ray-and-arrow/"><!-- --> @@ -275,7 +509,7 @@ Benchmarking <code class="highlighter-rouge">ray.put</code> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">test_objects</span><span class="p">)):</span> <span class="n">plot</span><span class="p">(</span><span class="o">*</span><span class="n">benchmark_object</span><span class="p">(</span><span class="n">test_objects</span><span class="p">[</span><span class="n">i</span><span class="p">]),</span> <span class="n">titles</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span 
class="n">i</span><span class="p">)</span> </code></pre> -</div></content><author><name>Philipp Moritz, Robert Nishihara</name></author><summary type="html">This post describes how serialization works in Ray.</summary></entry><entry><title type="html">Apache Arrow 0.7.0 Release</title><link href="/blog/2017/09/18/0.7.0-release/" rel="alternate" type="text/html" title="Apache Arrow 0.7.0 Release" /><published>2017-09-18T21:00:00-07:00</published><updated>2017-09-18T21:00:00-07:00</updated><id>/blog/2017/09/18/0.7.0-release</id><content type="html" xml:base="/blog/2017/09/18/0.7.0-release/"><!-- +</div></content><author><name>Philipp Moritz, Robert Nishihara</name></author><summary type="html">This post describes how serialization works in Ray.</summary></entry><entry><title type="html">Apache Arrow 0.7.0 Release</title><link href="/blog/2017/09/19/0.7.0-release/" rel="alternate" type="text/html" title="Apache Arrow 0.7.0 Release" /><published>2017-09-19T00:00:00-04:00</published><updated>2017-09-19T00:00:00-04:00</updated><id>/blog/2017/09/19/0.7.0-release</id><content type="html" xml:base="/blog/2017/09/19/0.7.0-release/"><!-- --> @@ -434,7 +668,7 @@ analytics libraries.</p> <p>We are looking for more JavaScript, R, and other programming language developers to join the project and expand the available implementations and -bindings to more languages.</p></content><author><name>wesm</name></author></entry><entry><title type="html">Apache Arrow 0.6.0 Release</title><link href="/blog/2017/08/15/0.6.0-release/" rel="alternate" type="text/html" title="Apache Arrow 0.6.0 Release" /><published>2017-08-15T21:00:00-07:00</published><updated>2017-08-15T21:00:00-07:00</updated><id>/blog/2017/08/15/0.6.0-release</id><content type="html" xml:base="/blog/2017/08/15/0.6.0-release/"><!-- +bindings to more languages.</p></content><author><name>wesm</name></author></entry><entry><title type="html">Apache Arrow 0.6.0 Release</title><link href="/blog/2017/08/16/0.6.0-release/" 
rel="alternate" type="text/html" title="Apache Arrow 0.6.0 Release" /><published>2017-08-16T00:00:00-04:00</published><updated>2017-08-16T00:00:00-04:00</updated><id>/blog/2017/08/16/0.6.0-release</id><content type="html" xml:base="/blog/2017/08/16/0.6.0-release/"><!-- --> @@ -516,7 +750,7 @@ milliseconds, or <code class="highlighter-rouge">'us'</code&g <p>We are still discussing the roadmap to 1.0.0 release on the <a href="http://mail-archives.apache.org/mod_mbox/arrow-dev/">developer mailing list</a>. The focus of the 1.0.0 release will likely be memory format stability and hardening integration tests across the remaining data types implemented in -Java and C++. Please join the discussion there.</p></content><author><name>wesm</name></author></entry><entry><title type="html">Plasma In-Memory Object Store</title><link href="/blog/2017/08/07/plasma-in-memory-object-store/" rel="alternate" type="text/html" title="Plasma In-Memory Object Store" /><published>2017-08-07T21:00:00-07:00</published><updated>2017-08-07T21:00:00-07:00</updated><id>/blog/2017/08/07/plasma-in-memory-object-store</id><content type="html" xml:base="/blog/2017/08/07/plasma-in-memory-object-store/"><!-- +Java and C++. Please join the discussion there.</p></content><author><name>wesm</name></author></entry><entry><title type="html">Plasma In-Memory Object Store</title><link href="/blog/2017/08/08/plasma-in-memory-object-store/" rel="alternate" type="text/html" title="Plasma In-Memory Object Store" /><published>2017-08-08T00:00:00-04:00</published><updated>2017-08-08T00:00:00-04:00</updated><id>/blog/2017/08/08/plasma-in-memory-object-store</id><content type="html" xml:base="/blog/2017/08/08/plasma-in-memory-object-store/"><!-- --> @@ -637,7 +871,7 @@ primarily used in <a href="https://github.com/ray-project/ray">R We are looking for a broader set of use cases to help refine Plasma’s API. 
In addition, we are looking for contributions in a variety of areas including improving performance and building other language bindings. Please let us know -if you are interested in getting involved with the project.</p></content><author><name>Philipp Moritz and Robert Nishihara</name></author></entry><entry><title type="html">Speeding up PySpark with Apache Arrow</title><link href="/blog/2017/07/26/spark-arrow/" rel="alternate" type="text/html" title="Speeding up PySpark with Apache Arrow" /><published>2017-07-26T09:00:00-07:00</published><updated>2017-07-26T09:00:00-07:00</updated><id>/blog/2017/07/26/spark-arrow</id><content type="html" xml:base="/blog/2017/07/26/spark-arrow/"><!-- +if you are interested in getting involved with the project.</p></content><author><name>Philipp Moritz and Robert Nishihara</name></author></entry><entry><title type="html">Speeding up PySpark with Apache Arrow</title><link href="/blog/2017/07/26/spark-arrow/" rel="alternate" type="text/html" title="Speeding up PySpark with Apache Arrow" /><published>2017-07-26T12:00:00-04:00</published><updated>2017-07-26T12:00:00-04:00</updated><id>/blog/2017/07/26/spark-arrow</id><content type="html" xml:base="/blog/2017/07/26/spark-arrow/"><!-- --> @@ -756,7 +990,7 @@ DataFrame (<a href="https://issues.apache.org/jira/browse/SPARK-20791&qu <p>Reaching this first milestone was a group effort from both the Apache Arrow and Spark communities. 
Thanks to the hard work of <a href="https://github.com/wesm">Wes McKinney</a>, <a href="https://github.com/icexelloss">Li Jin</a>, <a href="https://github.com/holdenk">Holden Karau</a>, Reynold Xin, Wenchen Fan, Shane Knapp and many others that -helped push this effort forwards.</p></content><author><name>BryanCutler</name></author></entry><entry><title type="html">Apache Arrow 0.5.0 Release</title><link href="/blog/2017/07/24/0.5.0-release/" rel="alternate" type="text/html" title="Apache Arrow 0.5.0 Release" /><published>2017-07-24T21:00:00-07:00</published><updated>2017-07-24T21:00:00-07:00</updated><id>/blog/2017/07/24/0.5.0-release</id><content type="html" xml:base="/blog/2017/07/24/0.5.0-release/"><!-- +helped push this effort forwards.</p></content><author><name>BryanCutler</name></author></entry><entry><title type="html">Apache Arrow 0.5.0 Release</title><link href="/blog/2017/07/25/0.5.0-release/" rel="alternate" type="text/html" title="Apache Arrow 0.5.0 Release" /><published>2017-07-25T00:00:00-04:00</published><updated>2017-07-25T00:00:00-04:00</updated><id>/blog/2017/07/25/0.5.0-release</id><content type="html" xml:base="/blog/2017/07/25/0.5.0-release/"><!-- --> @@ -839,7 +1073,7 @@ systems to improve their processing performance and interoperability with other systems.</p> <p>We are discussing the roadmap to a future 1.0.0 release on the <a href="http://mail-archives.apache.org/mod_mbox/arrow-dev/">developer -mailing list</a>. 
Please join the discussion there.</p></content><author><name>wesm</name></author></entry><entry><title type="html">Connecting Relational Databases to the Apache Arrow World with turbodbc</title><link href="/blog/2017/06/16/turbodbc-arrow/" rel="alternate" type="text/html" title="Connecting Relational Databases to the Apache Arrow World with turbodbc" /><published>2017-06-16T01:00:00-07:00</published><updated>2017-06-16T01:00:00-07:00</updated><id>/blog/2017/06/16/turbodbc-arrow</id><content type="html" xml:base="/blog/2017/06/16/turbodbc-arrow/"><!-- +mailing list</a>. Please join the discussion there.</p></content><author><name>wesm</name></author></entry><entry><title type="html">Connecting Relational Databases to the Apache Arrow World with turbodbc</title><link href="/blog/2017/06/16/turbodbc-arrow/" rel="alternate" type="text/html" title="Connecting Relational Databases to the Apache Arrow World with turbodbc" /><published>2017-06-16T04:00:00-04:00</published><updated>2017-06-16T04:00:00-04:00</updated><id>/blog/2017/06/16/turbodbc-arrow</id><content type="html" xml:base="/blog/2017/06/16/turbodbc-arrow/"><!-- --> @@ -918,7 +1152,7 @@ databases.</p> <p>If you would like to learn more about turbodbc, check out the <a href="https://github.com/blue-yonder/turbodbc">GitHub project</a> and the <a href="http://turbodbc.readthedocs.io/">project documentation</a>. 
If you want to learn more about how turbodbc implements the nitty-gritty details, check out parts <a href="https://tech.blue-yonder.com/making-of-turbodbc-part-1-wrestling-with-the-side-effects-of-a-c-api/">one</a> and <a href="https://tech.blue-yonder.com/making-of-turbodbc-part-2-c-to-python/">two</a> of the
-<a href="https://tech.blue-yonder.com/making-of-turbodbc-part-1-wrestling-with-the-side-effects-of-a-c-api/">“Making of turbodbc”</a> series at <a href="https://tech.blue-yonder.com/">Blue Yonder’s technology blog</a>.</p></content><author><name>MathMagique</name></author></entry><entry><title type="html">Apache Arrow 0.4.1 Release</title><link href="/blog/2017/06/14/0.4.1-release/" rel="alternate" type="text/html" title="Apache Arrow 0.4.1 Release" /><published>2017-06-14T07:00:00-07:00</published><updated>2017-06-14T07:00:00-07:00</updated><id>/blog/2017/06/14/0.4.1-release</id><content type="html" xml:base="/blog/2017/06/14/0.4.1-release/"><!--
+<a href="https://tech.blue-yonder.com/making-of-turbodbc-part-1-wrestling-with-the-side-effects-of-a-c-api/">“Making of turbodbc”</a> series at <a href="https://tech.blue-yonder.com/">Blue Yonder’s technology blog</a>.</p></content><author><name>MathMagique</name></author></entry><entry><title type="html">Apache Arrow 0.4.1 Release</title><link href="/blog/2017/06/14/0.4.1-release/" rel="alternate" type="text/html" title="Apache Arrow 0.4.1 Release" /><published>2017-06-14T10:00:00-04:00</published><updated>2017-06-14T10:00:00-04:00</updated><id>/blog/2017/06/14/0.4.1-release</id><content type="html" xml:base="/blog/2017/06/14/0.4.1-release/"><!--
--> @@ -953,289 +1187,4 @@ team used the PyArrow C++ API introduced in version 0.4.0 to construct <div class="highlighter-rouge"><pre class="highlight"><code>pip install turbodbc conda install turbodbc -c conda-forge </code></pre> -</div></content><author><name>wesm</name></author></entry><entry><title type="html">Apache Arrow 0.4.0 Release</title><link
href="/blog/2017/05/22/0.4.0-release/" rel="alternate" type="text/html" title="Apache Arrow 0.4.0 Release" /><published>2017-05-22T21:00:00-07:00</published><updated>2017-05-22T21:00:00-07:00</updated><id>/blog/2017/05/22/0.4.0-release</id><content type="html" xml:base="/blog/2017/05/22/0.4.0-release/"><!-- - ---> - -<p>The Apache Arrow team is pleased to announce the 0.4.0 release of the -project. Though only 17 days have passed since the previous release, it includes <a href="https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Resolved%2C%20Closed)%20AND%20fixVersion%20%3D%200.4.0"><strong>77 resolved -JIRAs</strong></a> with some important new features and bug fixes.</p> - -<p>See the <a href="http://arrow.apache.org/install">Install Page</a> to learn how to get the libraries for your platform.</p> - -<h3 id="expanded-javascript-implementation">Expanded JavaScript Implementation</h3> - -<p>The TypeScript Arrow implementation has undergone some work since 0.3.0 and can -now read a substantial portion of the Arrow streaming binary format. As this -implementation develops, we will eventually want to include JS in the -integration test suite along with Java and C++ to ensure wire -cross-compatibility.</p> - -<h3 id="python-support-for-apache-parquet-on-windows">Python Support for Apache Parquet on Windows</h3> - -<p>With the <a href="https://github.com/apache/parquet-cpp/releases/tag/apache-parquet-cpp-1.1.0">1.1.0 C++ release</a> of <a href="http://parquet.apache.org">Apache Parquet</a>, we have enabled the -<code class="highlighter-rouge">pyarrow.parquet</code> extension on Windows for Python 3.5 and 3.6. This should -appear in conda-forge packages and PyPI in the near future.
Developers can -follow the <a href="http://arrow.apache.org/docs/python/development.html">source build instructions</a>.</p> - -<h3 id="generalizing-arrow-streams">Generalizing Arrow Streams</h3> - -<p>In the 0.2.0 release, we defined the first version of the Arrow streaming -binary format for low-cost messaging with columnar data. These streams presume -that the message components are written as a continuous byte stream over a -socket or file.</p> - -<p>We would like to be able to support other transport protocols, like -<a href="http://grpc.io/">gRPC</a>, for the message components of Arrow streams. To that end, in C++ we -defined an abstract stream reader interface, for which the current contiguous -streaming format is one implementation:</p> - -<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="k">class</span> <span class="nc">RecordBatchReader</span> <span class="p">{</span> - <span class="k">public</span><span class="o">:</span> - <span class="k">virtual</span> <span class="n">std</span><span class="o">::</span><span class="n">shared_ptr</span><span class="o">&lt;</span><span class="n">Schema</span><span class="o">&gt;</span> <span class="n">schema</span><span class="p">()</span> <span class="k">const</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> - <span class="k">virtual</span> <span class="n">Status</span> <span class="n">GetNextRecordBatch</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">shared_ptr</span><span class="o">&lt;</span><span class="n">RecordBatch</span><span class="o">&gt;*</span> <span class="n">batch</span><span class="p">)</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> -<span class="p">};</span></code></pre></figure> - -<p>It would also be good to define abstract stream reader and writer interfaces in -the Java implementation.</p> - -<p>In an upcoming blog post, we will explain in
more depth how Arrow streams work, -but you can learn more about them by reading the <a href="http://arrow.apache.org/docs/ipc.html">IPC specification</a>.</p> - -<h3 id="c-and-cython-api-for-python-extensions">C++ and Cython API for Python Extensions</h3> - -<p>As other Python libraries with C or C++ extensions use Apache Arrow, they will -need to be able to return Python objects wrapping the underlying C++ -objects. In this release, we have implemented a prototype C++ API which enables -Python wrapper objects to be constructed from C++ extension code:</p> - -<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="cp">#include "arrow/python/pyarrow.h" -</span> -<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">arrow</span><span class="o">::</span><span class="n">py</span><span class="o">::</span><span class="n">import_pyarrow</span><span class="p">())</span> <span class="p">{</span> - <span class="c1">// Error -</span><span class="p">}</span> - -<span class="n">std</span><span class="o">::</span><span class="n">shared_ptr</span><span class="o">&lt;</span><span class="n">arrow</span><span class="o">::</span><span class="n">RecordBatch</span><span class="o">&gt;</span> <span class="n">cpp_batch</span> <span class="o">=</span> <span class="n">GetData</span><span class="p">(...);</span> -<span class="n">PyObject</span><span class="o">*</span> <span class="n">py_batch</span> <span class="o">=</span> <span class="n">arrow</span><span class="o">::</span><span class="n">py</span><span class="o">::</span><span class="n">wrap_batch</span><span class="p">(</span><span class="n">cpp_batch</span><span class="p">);</span></code></pre></figure> - -<p>This API is intended to be usable from Cython code as well:</p> - -<figure class="highlight"><pre><code class="language-cython" data-lang="cython">cimport pyarrow -pyarrow.import_pyarrow()</code></pre></figure> - -<h3 id="python-wheel-installers-on-macos">Python 
Wheel Installers on macOS</h3> - -<p>With this release, <code class="highlighter-rouge">pip install pyarrow</code> works on macOS (OS X) as well as -Linux. We are working on providing binary wheel installers for Windows as well.</p></content><author><name>wesm</name></author></entry><entry><title type="html">Apache Arrow 0.3.0 Release</title><link href="/blog/2017/05/07/0.3-release/" rel="alternate" type="text/html" title="Apache Arrow 0.3.0 Release" /><published>2017-05-07T21:00:00-07:00</published><updated>2017-05-07T21:00:00-07:00</updated><id>/blog/2017/05/07/0.3-release</id><content type="html" xml:base="/blog/2017/05/07/0.3-release/"><!-- - ---> - -<p>Translations: <a href="/blog/2017/05/07/0.3-release-japanese/">日本語</a></p> - -<p>The Apache Arrow team is pleased to announce the 0.3.0 release of the -project. It is the product of an intense 10 weeks of development since the -0.2.0 release from this past February. It includes <a href="https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Resolved%2C%20Closed)%20AND%20fixVersion%20%3D%200.3.0"><strong>306 resolved JIRAs</strong></a> -from <a href="https://github.com/apache/arrow/graphs/contributors"><strong>23 contributors</strong></a>.</p> - -<p>While we have added many new features to the different Arrow implementations, -one of the major development focuses in 2017 has been hardening the in-memory -format, type metadata, and messaging protocol to provide a <strong>stable, -production-ready foundation</strong> for big data applications.
We are excited to be -collaborating with the <a href="http://spark.apache.org">Apache Spark</a> and <a href="http://www.geomesa.org/">GeoMesa</a> communities on -utilizing Arrow for high performance IO and in-memory data processing.</p> - -<p>See the <a href="http://arrow.apache.org/install">Install Page</a> to learn how to get the libraries for your platform.</p> - -<p>We will be publishing more information about the Apache Arrow roadmap as we -forge ahead with using Arrow to accelerate big data systems.</p> - -<p>We are looking for more contributors from within our existing communities and -from other communities (such as Go, R, or Julia) to get involved in Arrow -development.</p> - -<h3 id="file-and-streaming-format-hardening">File and Streaming Format Hardening</h3> - -<p>The 0.2.0 release brought with it the first iterations of the <strong>random access</strong> -and <strong>streaming</strong> Arrow wire formats. See the <a href="http://arrow.apache.org/docs/ipc.html">IPC specification</a> for -implementation details and <a href="http://wesmckinney.com/blog/arrow-streaming-columnar/">example blog post</a> with some use cases. These -provide low-overhead, zero-copy access to Arrow record batch payloads.</p> - -<p>In 0.3.0 we have solidified a number of small details with the binary format -and improved our integration and unit testing particularly in the Java, C++, -and Python libraries. Using the <a href="http://github.com/google/flatbuffers">Google Flatbuffers</a> project has helped with -adding new features to our metadata without breaking forward compatibility.</p> - -<p>We are not yet ready to make a firm commitment to strong forward compatibility -(in case we find something needs to change) in the binary format, but we will -make efforts between major releases to not make unnecessary -breakages. 
Contributions to the website and component user and API -documentation would also be most welcome.</p> - -<h3 id="dictionary-encoding-support">Dictionary Encoding Support</h3> - -<p><a href="https://github.com/elahrvivaz">Emilio Lahr-Vivaz</a> from the <a href="http://www.geomesa.org/">GeoMesa</a> project contributed Java support -for dictionary-encoded Arrow vectors. We followed up with C++ and Python -support (and <code class="highlighter-rouge">pandas.Categorical</code> integration). We have not yet implemented -full integration tests for dictionaries (for sending this data between C++ and -Java), but hope to achieve this in the 0.4.0 Arrow release.</p> - -<p>This common data representation technique for categorical data allows multiple -record batches to share a common “dictionary”, with the values in the batches -being represented as integers referencing the dictionary. This data is called -“categorical” or “factor” in statistical languages, while in file formats like -Apache Parquet it is strictly used for data compression.</p> - -<h3 id="expanded-date-time-and-fixed-size-types">Expanded Date, Time, and Fixed Size Types</h3> - -<p>A notable omission from the 0.2.0 release was complete and integration-tested -support for the gamut of date and time types that occur in the wild.
These are -needed for <a href="http://parquet.apache.org">Apache Parquet</a> and Apache Spark integration.</p> - -<ul> - <li><strong>Date</strong>: 32-bit (days unit) and 64-bit (milliseconds unit)</li> - <li><strong>Time</strong>: 64-bit integer with unit (second, millisecond, microsecond, nanosecond)</li> - <li><strong>Timestamp</strong>: 64-bit integer with unit, with or without timezone</li> - <li><strong>Fixed Size Binary</strong>: Primitive values occupying a certain number of bytes</li> - <li><strong>Fixed Size List</strong>: List values with constant size (no separate offsets vector)</li> -</ul> - -<p>We have additionally added experimental support for exact decimals in C++ using -<a href="https://github.com/boostorg/multiprecision">Boost.Multiprecision</a>, though we have not yet hardened the Decimal memory -format between the Java and C++ implementations.</p> - -<h3 id="c-and-python-support-on-windows">C++ and Python Support on Windows</h3> - -<p>We have made many general improvements to development and packaging for general -C++ and Python development. 0.3.0 is the first release to bring full C++ and -Python support for Windows on Visual Studio (MSVC) 2015 and 2017.
In addition -to adding Appveyor continuous integration for MSVC, we have also written guides -for building from source on Windows: <a href="https://github.com/apache/arrow/blob/master/cpp/apidoc/Windows.md">C++</a> and <a href="https://github.com/apache/arrow/blob/master/python/doc/source/development.rst">Python</a>.</p> - -<p>For the first time, you can install the Arrow Python library on Windows from -<a href="https://conda-forge.github.io">conda-forge</a>:</p> - -<div class="language-shell highlighter-rouge"><pre class="highlight"><code>conda install pyarrow -c conda-forge -</code></pre> -</div> - -<h3 id="c-glib-bindings-with-support-for-ruby-lua-and-more">C (GLib) Bindings, with support for Ruby, Lua, and more</h3> - -<p><a href="http://github.com/kou">Kouhei Sutou</a> is a new Apache Arrow contributor and has contributed GLib C -bindings (to the C++ libraries) for Linux. Using a C middleware framework -called <a href="https://wiki.gnome.org/Projects/GObjectIntrospection">GObject Introspection</a>, it is possible to use these bindings -seamlessly in Ruby, Lua, Go, and <a href="https://wiki.gnome.org/Projects/GObjectIntrospection/Users">other programming languages</a>. We will -probably need to publish some follow up blogs explaining how these bindings -work and how to use them.</p> - -<h3 id="apache-spark-integration-for-pyspark">Apache Spark Integration for PySpark</h3> - -<p>We have been collaborating with the Apache Spark community on <a href="https://issues.apache.org/jira/browse/SPARK-13534">SPARK-13534</a> -to add support for using Arrow to accelerate <code class="highlighter-rouge">DataFrame.toPandas</code> in -PySpark. We have observed over <a href="https://github.com/apache/spark/pull/15821#issuecomment-282175163"><strong>40x speedup</strong></a> from the more efficient -data serialization.</p> - -<p>Using Arrow in PySpark opens the door to many other performance optimizations, -particularly around UDF evaluation (e.g. 
<code class="highlighter-rouge">map</code> and <code class="highlighter-rouge">filter</code> operations with -Python lambda functions).</p> - -<h3 id="new-python-feature-memory-views-feather-apache-parquet-support">New Python Feature: Memory Views, Feather, Apache Parquet support</h3> - -<p>Arrow’s Python library <code class="highlighter-rouge">pyarrow</code> is a Cython binding for the <code class="highlighter-rouge">libarrow</code> and -<code class="highlighter-rouge">libarrow_python</code> C++ libraries, which handle interoperability with NumPy, -<a href="http://pandas.pydata.org">pandas</a>, and the Python standard library.</p> - -<p>At the heart of Arrow’s C++ libraries is the <code class="highlighter-rouge">arrow::Buffer</code> object, which is a -managed memory view supporting zero-copy reads and slices. <a href="https://github.com/JeffKnupp">Jeff Knupp</a> -contributed integration between Arrow buffers and the Python buffer protocol -and memoryviews, so now code like this is possible:</p> - -<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">In</span> <span class="p">[</span><span class="mi">6</span><span class="p">]:</span> <span class="kn">import</span> <span class="nn">pyarrow</span> <span class="kn">as</span> <span class="nn">pa</span> - -<span class="n">In</span> <span class="p">[</span><span class="mi">7</span><span class="p">]:</span> <span class="n">buf</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">frombuffer</span><span class="p">(</span><span class="n">b</span><span class="s">'foobarbaz'</span><span class="p">)</span> - -<span class="n">In</span> <span class="p">[</span><span class="mi">8</span><span class="p">]:</span> <span class="n">buf</span> -<span class="n">Out</span><span class="p">[</span><span class="mi">8</span><span class="p">]:</span> <span class="o">&lt;</span><span class="n">pyarrow</span><span class="o">.</span><span class="n">_io</span><span
class="o">.</span><span class="n">Buffer</span> <span class="n">at</span> <span class="mh">0x7f6c0a84b538</span><span class="o">&gt;</span> - -<span class="n">In</span> <span class="p">[</span><span class="mi">9</span><span class="p">]:</span> <span class="n">memoryview</span><span class="p">(</span><span class="n">buf</span><span class="p">)</span> -<span class="n">Out</span><span class="p">[</span><span class="mi">9</span><span class="p">]:</span> <span class="o">&lt;</span><span class="n">memory</span> <span class="n">at</span> <span class="mh">0x7f6c0a8c5e88</span><span class="o">&gt;</span> - -<span class="n">In</span> <span class="p">[</span><span class="mi">10</span><span class="p">]:</span> <span class="n">buf</span><span class="o">.</span><span class="n">to_pybytes</span><span class="p">()</span> -<span class="n">Out</span><span class="p">[</span><span class="mi">10</span><span class="p">]:</span> <span class="n">b</span><span class="s">'foobarbaz'</span> -</code></pre> -</div> - -<p>We have significantly expanded <a href="http://parquet.apache.org"><strong>Apache Parquet</strong></a> support via the C++ -Parquet implementation <a href="https://github.com/apache/parquet-cpp">parquet-cpp</a>. This includes support for partitioned -datasets on disk or in HDFS. We added initial Arrow-powered Parquet support <a href="https://github.com/dask/dask/commit/68f9e417924a985c1f2e2a587126833c70a2e9f4">in -the Dask project</a>, and look forward to more collaborations with the Dask -developers on distributed processing of pandas data.</p> - -<p>With Arrow’s support for pandas maturing, we were able to merge in the -<a href="https://github.com/wesm/feather"><strong>Feather format</strong></a> implementation, which is essentially a special case of -the Arrow random access format. We’ll be continuing Feather development within -the Arrow codebase.
For example, Feather can now read and write with Python -file objects using Arrow’s Python binding layer.</p> - -<p>We also implemented more robust support for pandas-specific data types, like -<code class="highlighter-rouge">DatetimeTZ</code> and <code class="highlighter-rouge">Categorical</code>.</p> - -<h3 id="support-for-tensors-and-beyond-in-c-library">Support for Tensors and beyond in C++ Library</h3> - -<p>There has been increased interest in using Apache Arrow as a tool for zero-copy -shared memory management for machine learning applications. A flagship example -is the <a href="https://github.com/ray-project/ray">Ray project</a> from the UC Berkeley <a href="https://rise.cs.berkeley.edu/">RISELab</a>.</p> - -<p>Machine learning deals in additional kinds of data structures beyond what the -Arrow columnar format supports, like multidimensional arrays aka “tensors”. As -such, we implemented the <a href="http://arrow.apache.org/docs/cpp/classarrow_1_1_tensor.html"><code class="highlighter-rouge">arrow::Tensor</code></a> C++ type which can utilize the -rest of Arrow’s zero-copy shared memory machinery (using <code class="highlighter-rouge">arrow::Buffer</code> for -managing memory lifetime). In C++ in particular, we will want to provide for -additional data structures utilizing common IO and memory management tools.</p> - -<h3 id="start-of-javascript-typescript-implementation">Start of JavaScript (TypeScript) Implementation</h3> - -<p><a href="https://github.com/TheNeuralBit">Brian Hulette</a> started developing an Arrow implementation in -<a href="https://github.com/apache/arrow/tree/master/js">TypeScript</a> for use in NodeJS and browser-side applications.
We are -benefitting from Flatbuffers’ first class support for JavaScript.</p> - -<h3 id="improved-website-and-developer-documentation">Improved Website and Developer Documentation</h3> - -<p>Since 0.2.0 we have implemented a new website stack for publishing -documentation and blogs based on <a href="https://jekyllrb.com">Jekyll</a>. Kouhei Sutou developed a <a href="https://github.com/red-data-tools/jekyll-jupyter-notebook">Jekyll -Jupyter Notebook plugin</a> so that we can use Jupyter to author content for -the Arrow website.</p> - -<p>On the website, we have now published API documentation for the C, C++, Java, -and Python subcomponents. Within these you will find easier-to-follow developer -instructions for getting started.</p> - -<h3 id="contributors">Contributors</h3> - -<p>Thanks to all who contributed patches to this release.</p> - -<div class="highlighter-rouge"><pre class="highlight"><code>$ git shortlog -sn apache-arrow-0.2.0..apache-arrow-0.3.0 - 119 Wes McKinney - 55 Kouhei Sutou - 18 Uwe L.
Korn - 17 Julien Le Dem - 9 Phillip Cloud - 6 Bryan Cutler - 5 Philipp Moritz - 5 Emilio Lahr-Vivaz - 4 Max Risuhin - 4 Johan Mabille - 4 Jeff Knupp - 3 Steven Phillips - 3 Miki Tebeka - 2 Leif Walsh - 2 Jeff Reback - 2 Brian Hulette - 1 Tsuyoshi Ozawa - 1 rvernica - 1 Nong Li - 1 Julien Lafaye - 1 Itai Incze - 1 Holden Karau - 1 Deepak Majeti -</code></pre> </div></content><author><name>wesm</name></author></entry></feed> \ No newline at end of file http://git-wip-us.apache.org/repos/asf/arrow-site/blob/0a7dc418/install/index.html ---------------------------------------------------------------------- diff --git a/install/index.html b/install/index.html index c225aa5..9c82739 100644 --- a/install/index.html +++ b/install/index.html @@ -127,8 +127,9 @@ <ul> <li><strong>Source Release</strong>: <a href="https://www.apache.org/dyn/closer.cgi/arrow/arrow-0.8.0/apache-arrow-0.8.0.tar.gz">apache-arrow-0.8.0.tar.gz</a></li> - <li><strong>Verification</strong>: <a href="https://dist.apache.org/repos/dist/release/arrow/arrow-0.8.0/apache-arrow-0.8.0.tar.gz.sha512">sha512</a>, <a href="https://dist.apache.org/repos/dist/release/arrow/arrow-0.8.0/apache-arrow-0.8.0.tar.gz.asc">asc</a></li> + <li><strong>Verification</strong>: <a href="https://www.apache.org/dist/arrow/arrow-0.8.0/apache-arrow-0.8.0.tar.gz.sha512">sha512</a>, <a href="https://www.apache.org/dist/arrow/arrow-0.8.0/apache-arrow-0.8.0.tar.gz.asc">asc</a> (<a href="https://www.apache.org/dyn/closer.cgi#verify">verification instructions</a>)</li> <li><a href="https://github.com/apache/arrow/releases/tag/apache-arrow-0.8.0">Git tag 1d689e5</a></li> + <li><a href="http://www.apache.org/dist/arrow/KEYS">PGP keys for release signatures</a></li> </ul> <h3 id="java-packages">Java Packages</h3>
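The dictionary encoding described in the 0.3.0 release notes above — each distinct value stored once in a shared "dictionary", with the column itself represented as integer references into it — can be sketched in plain Python. This is an illustrative model of the technique only, not the pyarrow or Java vector API; the function names are hypothetical:

```python
def dictionary_encode(values):
    """Model of Arrow-style dictionary encoding: return (dictionary, indices)."""
    dictionary = []   # distinct values, in first-seen order
    index_of = {}     # value -> its position in the dictionary
    indices = []      # one integer code per input value
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        indices.append(index_of[v])
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    """Decoding is a simple gather through the dictionary."""
    return [dictionary[i] for i in indices]


colors = ["red", "green", "red", "blue", "green", "red"]
dictionary, indices = dictionary_encode(colors)
# dictionary == ["red", "green", "blue"]; indices == [0, 1, 0, 2, 1, 0]
assert dictionary_decode(dictionary, indices) == colors
```

Because the dictionary is stored once, repeated categorical values cost one small integer each, which is why Parquet uses the same idea purely for compression.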
