[
https://issues.apache.org/jira/browse/ARROW-10030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jorge updated ARROW-10030:
--------------------------
Description:
Proposal for comments:
[https://docs.google.com/document/d/1d6rV1WmvIH6uW-bcHKrYBSyPddrpXH8Q4CtVfFHtI04/edit?usp=sharing]
(dump of the document above)
Rust Arrow supports two main computational models:
# Batch operations, which leverage some form of vectorization
# Element-by-element operations, which arise in more complex operations
This document concerns element-by-element operations, which are common outside
of the library (and sometimes inside it).
h2. Element-by-element operations
These operations are typically written as follows:
# Downcast the array to its specific type
# Initialize buffers
# Iterate over indices and perform the operation, appending to the buffers
accordingly
# Create ArrayData with the required null bitmap, buffers, children, etc.
# Return an ArrayRef built from the ArrayData
We can split this process into three parts:
# Initialization (1 and 2)
# Iteration (3)
# Finalization (4 and 5)
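The five steps above can be sketched roughly as follows. This is a minimal illustration, not the arrow crate: {{ToyArray}}, {{ToyInt32Array}}, {{ToyArrayRef}} and {{add_one}} are hypothetical stand-ins, and a {{Vec<bool>}} plays the role of the null bitmap.

```rust
use std::any::Any;
use std::sync::Arc;

// Hypothetical stand-in for an array trait; NOT the arrow crate's `Array`.
trait ToyArray {
    fn as_any(&self) -> &dyn Any;
    fn len(&self) -> usize;
}

// Hypothetical stand-in for a primitive array of i32.
#[derive(Debug, PartialEq)]
struct ToyInt32Array {
    values: Vec<i32>,    // value buffer (0 is stored in null slots)
    validity: Vec<bool>, // stands in for the null bitmap
}

impl ToyArray for ToyInt32Array {
    fn as_any(&self) -> &dyn Any {
        self
    }
    fn len(&self) -> usize {
        self.values.len()
    }
}

type ToyArrayRef = Arc<dyn ToyArray>;

// Adds 1 to every valid element; returns None if the downcast fails.
fn add_one(array: &ToyArrayRef) -> Option<ToyArrayRef> {
    // 1. Downcast the array to its specific type.
    let input = array.as_any().downcast_ref::<ToyInt32Array>()?;

    // 2. Initialize buffers.
    let mut values = Vec::with_capacity(input.len());
    let mut validity = Vec::with_capacity(input.len());

    // 3. Iterate over indices, appending to the buffers.
    for i in 0..input.len() {
        if input.validity[i] {
            values.push(input.values[i] + 1);
            validity.push(true);
        } else {
            values.push(0);
            validity.push(false);
        }
    }

    // 4. and 5. Assemble the result and return it behind a shared pointer.
    let result: ToyArrayRef = Arc::new(ToyInt32Array { values, validity });
    Some(result)
}
```

Even in this reduced form, the caller has to keep the value buffer and the validity buffer in step by hand, which is the bookkeeping the proposal wants to hide.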
Currently, the API that we offer to our users is:
# {{as_any()}} to downcast the array based on its DataType
# Builders for all types, which users can initialize to match the downcast
array
# Iterate
## Use {{for i in (0..array.len())}}
## Use {{Array::value(i)}} and {{Array::is_valid(i)}}/{{Array::is_null(i)}}
## Use {{builder.append_value(new_value)}} or {{builder.append_null()}}
# Finish the builder and wrap the result in an Arc
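As an illustration of this builder-based pattern, here is a small sketch using toy types. {{ToyInt32Array}} and {{ToyInt32Builder}} are assumptions made for this example and only mimic the method names listed above; they are not the arrow crate's types. The downcast step is left out to keep the loop in focus.

```rust
use std::sync::Arc;

// Hypothetical stand-in for a nullable i32 array; NOT the arrow crate type.
#[derive(Debug, PartialEq)]
struct ToyInt32Array {
    values: Vec<i32>,
    validity: Vec<bool>,
}

impl ToyInt32Array {
    fn len(&self) -> usize {
        self.values.len()
    }
    fn is_valid(&self, i: usize) -> bool {
        self.validity[i]
    }
    // In this toy, Vec indexing panics on an out-of-range index.
    fn value(&self, i: usize) -> i32 {
        self.values[i]
    }
}

// Hypothetical builder mirroring append_value / append_null / finish.
#[derive(Default)]
struct ToyInt32Builder {
    values: Vec<i32>,
    validity: Vec<bool>,
}

impl ToyInt32Builder {
    fn append_value(&mut self, v: i32) {
        self.values.push(v);
        self.validity.push(true);
    }
    fn append_null(&mut self) {
        self.values.push(0); // placeholder value for a null slot
        self.validity.push(false);
    }
    fn finish(self) -> ToyInt32Array {
        ToyInt32Array {
            values: self.values,
            validity: self.validity,
        }
    }
}

// The index-based loop: check validity, read the value, append to the builder.
fn add_one_with_builder(array: &ToyInt32Array) -> Arc<ToyInt32Array> {
    let mut builder = ToyInt32Builder::default();
    for i in 0..array.len() {
        if array.is_valid(i) {
            builder.append_value(array.value(i) + 1);
        } else {
            builder.append_null();
        }
    }
    // Finish the builder and wrap the result in an Arc.
    Arc::new(builder.finish())
}
```

The user code has to pair every {{value(i)}} call with an {{is_valid(i)}} check and choose the matching builder method, so correctness depends on the caller rather than on the types.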
This API has some issues:
# {{value(i)}} +is unsafe+, even though it is not marked as such
# Builders are usually slow due to the checks that they need to perform
# The API is not intuitive
h2. Proposal
This proposal aims to improve this API in two specific ways:
* Implement {{IntoIterator}}, exposing {{Iterator<Item=T>}} and {{Iterator<Item=Option<T>>}}
* Implement {{FromIterator<T>}} and {{FromIterator<Option<T>>}}
so that users can write:
{code:java}
use std::sync::Arc;
use arrow::array::{Array, ArrayRef, Int32Array};

fn main() {
    // incoming array
    let array = Int32Array::from(vec![Some(0), None, Some(2), None, Some(4)]);
    let array = Arc::new(array) as ArrayRef;
    let array = array.as_any().downcast_ref::<Int32Array>().unwrap();

    // to and from iter, with a +1; nulls stay null
    let result: Int32Array = array
        .iter()
        .map(|e| e.map(|r| r + 1))
        .collect();

    let expected = Int32Array::from(vec![Some(1), None, Some(3), None, Some(5)]);
    assert_eq!(result, expected);
}
{code}
This results in an API that is:
# Efficient, as it is our responsibility to write {{FromIterator}} implementations
that populate the buffers, children, etc. efficiently from an iterator
# Safe, as it does not allow segfaults
# Simple, as users do not need to worry about builders, buffers, etc., only
native Rust.
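To make the idea concrete, here is one way the two traits could be wired up for a toy array. This is a sketch under stated assumptions: {{ToyInt32Array}} is a hypothetical type with a {{Vec<i32>}} value buffer and a {{Vec<bool>}} in place of the null bitmap, and nothing here is taken from the arrow crate.

```rust
use std::iter::FromIterator;

// Hypothetical stand-in for a nullable i32 array; NOT the arrow crate type.
#[derive(Debug, PartialEq)]
struct ToyInt32Array {
    values: Vec<i32>,    // value buffer (0 is stored in null slots)
    validity: Vec<bool>, // stands in for the null bitmap
}

impl ToyInt32Array {
    // Yields Some(value) for valid slots and None for null slots.
    fn iter(&self) -> impl Iterator<Item = Option<i32>> + '_ {
        self.values
            .iter()
            .zip(self.validity.iter())
            .map(|(v, ok)| if *ok { Some(*v) } else { None })
    }
}

// Collecting nullable items fills both buffers in a single pass.
impl FromIterator<Option<i32>> for ToyInt32Array {
    fn from_iter<I: IntoIterator<Item = Option<i32>>>(iter: I) -> Self {
        let iter = iter.into_iter();
        // Use the iterator's lower size bound to pre-allocate.
        let (lower, _) = iter.size_hint();
        let mut values = Vec::with_capacity(lower);
        let mut validity = Vec::with_capacity(lower);
        for item in iter {
            validity.push(item.is_some());
            values.push(item.unwrap_or_default());
        }
        ToyInt32Array { values, validity }
    }
}

// Collecting non-nullable items marks every slot as valid.
impl FromIterator<i32> for ToyInt32Array {
    fn from_iter<I: IntoIterator<Item = i32>>(iter: I) -> Self {
        let values: Vec<i32> = iter.into_iter().collect();
        let validity = vec![true; values.len()];
        ToyInt32Array { values, validity }
    }
}
```

With these impls, an expression such as {{array.iter().map(|e| e.map(|r| r + 1)).collect()}} produces a new toy array, and the pairing of values with validity lives in one place inside {{from_iter}} instead of in every caller.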
was:
Proposal for comments:
[https://docs.google.com/document/d/1d6rV1WmvIH6uW-bcHKrYBSyPddrpXH8Q4CtVfFHtI04/edit?usp=sharing]
(dump of the document above)
Rust Arrow supports two main computational models:
# Batch Operations, that leverage some form of vectorization
# Element-by-element operations, that emerge in more complex operations
This document concerns element-by-element operations, that are common outside
of the library (and sometimes in the library).
h2. Element-by-element operations
These operations are programmatically written as:
# Downcast the array to its specific type
# Initialize buffers
# Iterate over indices and perform the operation, appending to the buffers
accordingly
# Create ArrayData with the required null bitmap, buffers, childs, etc.
# return ArrayRef from ArrayData
We can split this process in 3 parts:
# Initialization (1 and 2)
# Iteration (3)
# Finalization (4 and 5)
Currently, the API that we offer to our users is:
# as_any() to downcast the array based on its DataType
# Builders for all types, that users can initialize, matching the downcasted
array
# Iterate
# Use for i in (0..array.len())
# Use Array::value(i) and Array::is_valid(i)/is_null(i)`
# use builder.append_value(new_value) or builder.append_null()
# Finish the builder and wrap the result in an Arc
This API has some issues:
# value(i) +is unsafe+, even though it is not marked as such
# builders are usually slow due to the checks that they need to perform
# The API is not intuitive
h2. Proposal
This proposal aims at improving this API in 2 specific ways:
* Implement IntoIterator Iterator<Item=T> and Iterator<Item=Option<T>>
* Implement FromIterator<Item=T> and Item=Option<T>
so that users can write:
{code:java}
let array = Int32Array::from(vec![Some(0), None, Some(2), None, Some(4)]);
// to and from iter, with a +1
let result: Int32Array = array
.iter()
.map(|e| if let Some(r) = e { Some(r + 1) } else { None })
.collect();
let expected = Int32Array::from(vec![Some(1), None, Some(3), None, Some(5)]);
assert_eq!(result, expected);
{code}
This results in an API that is:
# efficient, as it is our responsibility to create `FromIterator` that are
efficient in populating the buffers/child etc from an iterator
# Safe, as it does not allow segfaults
# Simple, as users do not need to worry about Builders, buffers, etc, only
native Rust.
> [Rust] Support fromIter and toIter
> ----------------------------------
>
> Key: ARROW-10030
> URL: https://issues.apache.org/jira/browse/ARROW-10030
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Rust
> Reporter: Jorge
> Priority: Major
>