[jira] [Updated] (ARROW-3164) [Java] Port Row Set abstraction from Drill to Arrow

Paul Rogers (JIRA) Mon, 03 Sep 2018 16:41:28 -0700


     [ 
https://issues.apache.org/jira/browse/ARROW-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Paul Rogers updated ARROW-3164:
-------------------------------
    Description: 
Arrow is a great way to exchange data between systems. Somewhere in the 
process, however, data must be loaded into, and read out of, the Arrow vectors.

Arrow's vector code started with similar code in Apache Drill. The Drill 
project created a "Row Set" abstraction that:
 * Provides a simple way to define the schema for a set of batches.
 * Loads data into vectors from row-oriented inputs.
 * Reads data out of vectors in row-oriented output.
 * Controls memory consumed by the record batch when loading data into vectors.
 * Ensures maximum usage of the allocated vector space when loading data into 
vectors.
 * Optionally handles projection when reading data from an input file into a 
set of vectors.
 * Optionally handles data conversion from input to vector formats.
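
To make the row-oriented load and read steps in the list above concrete, here is a minimal sketch in plain Java. It is not the Drill or Arrow API: the names {{RowSetSketch}}, {{Batch}}, {{RowWriter}}, {{RowReader}} and {{roundTrip}} are hypothetical, the "vectors" are ordinary arrays, and a fixed row capacity stands in for the memory control described above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch only; none of these names come from Drill or Arrow. */
public class RowSetSketch {

  /** Column-oriented batch: one array per column (an int "id" and a String "name"). */
  static final class Batch {
    final int[] ids;
    final String[] names;
    int rowCount;

    Batch(int capacity) {
      ids = new int[capacity];
      names = new String[capacity];
    }
  }

  /** Accepts rows one at a time and scatters each value into its column array. */
  static final class RowWriter {
    private final Batch batch;

    RowWriter(Batch batch) {
      this.batch = batch;
    }

    /** Returns false, leaving the row unwritten, once the batch is full. */
    boolean addRow(int id, String name) {
      if (batch.rowCount == batch.ids.length) {
        return false;
      }
      batch.ids[batch.rowCount] = id;
      batch.names[batch.rowCount] = name;
      batch.rowCount++;
      return true;
    }
  }

  /** Walks the column arrays and presents them one row at a time. */
  static final class RowReader {
    private final Batch batch;
    private int row = -1;

    RowReader(Batch batch) {
      this.batch = batch;
    }

    /** Advances to the next row; returns false when no rows remain. */
    boolean next() {
      return ++row < batch.rowCount;
    }

    int id() {
      return batch.ids[row];
    }

    String name() {
      return batch.names[row];
    }
  }

  /** Loads rows until the batch is full, then reads them back as "id:name" strings. */
  static List<String> roundTrip(int capacity, int[] ids, String[] names) {
    Batch batch = new Batch(capacity);
    RowWriter writer = new RowWriter(batch);
    for (int i = 0; i < ids.length; i++) {
      if (!writer.addRow(ids[i], names[i])) {
        break; // batch is full; a real loader would start a new batch here
      }
    }
    RowReader reader = new RowReader(batch);
    List<String> rows = new ArrayList<>();
    while (reader.next()) {
      rows.add(reader.id() + ":" + reader.name());
    }
    return rows;
  }

  public static void main(String[] args) {
    // Three input rows, capacity of two: only the first two rows fit in the batch.
    System.out.println(roundTrip(2, new int[] {1, 2, 3}, new String[] {"a", "b", "c"}));
  }
}
```

The point of the sketch is the division of labor: callers work with rows, while the writer and reader own the column layout and the decision about when a batch is full.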

This mechanism is handy for any Java developer who produces or consumes Arrow 
vectors.

Detailed information is available in [this 
wiki|https://github.com/paul-rogers/arrow/wiki], including a more detailed 
description of the motivation for this project, and an analysis of the work 
required to do the Drill-to-Arrow port.

The code is in Java simply because Drill is written in Java. The same 
mechanisms can be ported to other languages if useful. Those ports would be 
separate future projects.

The code will be placed in a new Java module which can be imported by projects 
that wish to use the code. Changes may be needed to expose items from the 
{{vector}} module; we'll tackle those issues if/when they occur.

  was:
Arrow is a great way to exchange data between systems. Somewhere in the 
process, however, data must be loaded into, and read out of, the Arrow vectors.

Arrow evolved from Apache Drill. The Drill project created a "Row Set" 
abstraction that:

* Provides a simple way to define the schema for a set of batches.
* Loads data into vectors from row-oriented inputs.
* Reads data out of vectors in row-oriented output.
* Controls memory consumed by the record batch when loading data into vectors.
* Ensures maximum usage of the allocated vector space when loading data into 
vectors.
* Optionally handles projection when reading data from an input file into a set 
of vectors.
* Optionally handles data conversion from input to vector formats.

This mechanism is handy for any Java developer who produces or consumes Arrow 
vectors.

Detailed information is available in [this 
wiki|https://github.com/paul-rogers/arrow/wiki], including a more detailed 
description of the motivation for this project, and an analysis of the work 
required to do the Drill-to-Arrow port.

The code is in Java simply because Drill is written in Java. The same 
mechanisms can be ported to other languages if useful. Those ports would be 
separate future projects.

The code will be placed in a new Java module which can be imported by projects 
that wish to use the code. Changes may be needed to expose items from the 
{{vector}} module; we'll tackle those issues if/when they occur.


> [Java] Port Row Set abstraction from Drill to Arrow
> ---------------------------------------------------
>
>                 Key: ARROW-3164
>                 URL: https://issues.apache.org/jira/browse/ARROW-3164
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Java
>            Reporter: Paul Rogers
>            Priority: Major
>
> Arrow is a great way to exchange data between systems. Somewhere in the 
> process, however, data must be loaded into, and read out of, the Arrow vectors.
> Arrow's vector code started with similar code in Apache Drill. The Drill 
> project created a "Row Set" abstraction that:
>  * Provides a simple way to define the schema for a set of batches.
>  * Loads data into vectors from row-oriented inputs.
>  * Reads data out of vectors in row-oriented output.
>  * Controls memory consumed by the record batch when loading data into 
> vectors.
>  * Ensures maximum usage of the allocated vector space when loading data into 
> vectors.
>  * Optionally handles projection when reading data from an input file into a 
> set of vectors.
>  * Optionally handles data conversion from input to vector formats.
> This mechanism is handy for any Java developer who produces or consumes Arrow 
> vectors.
> Detailed information is available in [this 
> wiki|https://github.com/paul-rogers/arrow/wiki], including a more detailed 
> description of the motivation for this project, and an analysis of the work 
> required to do the Drill-to-Arrow port.
> The code is in Java simply because Drill is written in Java. The same 
> mechanisms can be ported to other languages if useful. Those ports would be 
> separate future projects.
> The code will be placed in a new Java module which can be imported by 
> projects that wish to use the code. Changes may be needed to expose items 
> from the {{vector}} module; we'll tackle those issues if/when they occur.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
