kevingurney opened a new pull request, #36190:
URL: https://github.com/apache/arrow/pull/36190
### Rationale for this change
Now that the MATLAB interface supports some basic `arrow.array.Array` types,
it would be helpful to start building out the tabular types (e.g. `RecordBatch`
and `Table`) in parallel.
This pull request contains a basic implementation of
`arrow.tabular.RecordBatch` (name subject to change).
### What changes are included in this PR?
1. Added new `arrow.tabular.RecordBatch` class that can be constructed from
a MATLAB `table`.
2. Added new test class `tRecordBatch`.
### Are these changes tested?
Yes.
1. Added new test class `tRecordBatch` containing basic tests for the
`arrow.tabular.RecordBatch` class.
### Are there any user-facing changes?
Yes.
1. Added new class `arrow.tabular.RecordBatch`.
**Example**:
```matlab
>> matlabTable = table(uint64([1,2,3]'), [true false true]', [0.1, 0.2, 0.3]', VariableNames=["UInt64", "Boolean", "Float64"])
matlabTable =
3x3 table
UInt64 Boolean Float64
______ _______ _______
1 true 0.1
2 false 0.2
3 true 0.3
>> arrowRecordBatch = arrow.tabular.RecordBatch(matlabTable)
arrowRecordBatch =
UInt64: [
1,
2,
3
]
Boolean: [
true,
false,
true
]
Float64: [
0.1,
0.2,
0.3
]
>> convertedMatlabTable = table(arrowRecordBatch)
convertedMatlabTable =
3x3 table
UInt64 Boolean Float64
______ _______ _______
1 true 0.1
2 false 0.2
3 true 0.3
>> isequal(matlabTable, convertedMatlabTable)
ans =
logical
1
```
2. Added properties `NumColumns` and `ColumnNames` to
`arrow.tabular.RecordBatch`:
**Example**:
```matlab
>> arrowRecordBatch.NumColumns
ans =
int32
3
>> arrowRecordBatch.ColumnNames
ans =
1x3 string array
"UInt64" "Boolean" "Float64"
```
3. Added `column(i)` method to `arrow.tabular.RecordBatch` to retrieve the
`i`th column of a `RecordBatch` as an `arrow.array.Array`.
**Example**:
```matlab
>> arrowUInt64Array = arrowRecordBatch.column(1)
arrowUInt64Array =
[
1,
2,
3
]
>> class(arrowUInt64Array)
ans =
'arrow.array.UInt64Array'
>> arrowBooleanArray = arrowRecordBatch.column(2)
arrowBooleanArray =
[
true,
false,
true
]
>> class(arrowBooleanArray)
ans =
'arrow.array.BooleanArray'
>> arrowFloat64Array = arrowRecordBatch.column(3)
arrowFloat64Array =
[
0.1,
0.2,
0.3
]
>> class(arrowFloat64Array)
ans =
'arrow.array.Float64Array'
```
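The property and method examples above can be combined to walk every column of a `RecordBatch`. The following is a minimal sketch that uses only the members shown in this description (`NumColumns`, `ColumnNames`, `column(i)`). The cast to `double` for the loop bound is a precaution on my part, not a documented requirement.

```matlab
% Sketch: iterate over the columns of a RecordBatch and report each column's
% name and Arrow array class. Assumes the arrow.tabular.RecordBatch class
% from this pull request is on the MATLAB path.
matlabTable = table(uint64([1,2,3]'), [true false true]', [0.1, 0.2, 0.3]', ...
    VariableNames=["UInt64", "Boolean", "Float64"]);
arrowRecordBatch = arrow.tabular.RecordBatch(matlabTable);

% NumColumns is returned as int32; cast to double for the loop index
% (an assumption made for safety, not something the class requires).
numColumns = double(arrowRecordBatch.NumColumns);
columnNames = arrowRecordBatch.ColumnNames;

for i = 1:numColumns
    arrowArray = arrowRecordBatch.column(i);
    fprintf("%s -> %s\n", columnNames(i), class(arrowArray));
end
```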
### Future Directions
1. Implement C++ logic for `toMATLAB` when the Arrow memory for a
`RecordBatch` did not originate from a MATLAB array (e.g. read from a Parquet
file or somewhere else).
2. Add more supported construction interfaces (e.g.
`arrow.tabular.RecordBatch(array1, ..., arrayN)`,
`arrow.tabular.RecordBatch.fromArrays(arrays)`, etc.).
3. Create an `arrow.tabular.Schema` class. Expose this as a public property
on the `RecordBatch` class. Create related `arrow.type.Field` and
`arrow.type.Type` classes.
4. Create an `arrow.tabular.Table` class and a related
`arrow.array.ChunkedArray` class.
5. Add more `arrow.array.Array` types (e.g. `StringArray`, `TimestampArray`,
`Time64Array`).
6. Create a basic workflow example of serializing a `RecordBatch` to disk
using an I/O function (e.g. Parquet writing).
### Notes
1. Thanks @sgilmore10 for your help with this pull request!