kevingurney opened a new pull request, #36366:
URL: https://github.com/apache/arrow/pull/36366

   ### Rationale for this change
   
   Thanks to @sgilmore10's [recent changes to enable UTF-8 <-> UTF-16 string 
conversions](#36167), we can now add support for creating Arrow `String` arrays 
(UTF-8 encoded) from MATLAB `string` arrays (UTF-16 encoded).
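   
   To illustrate why a conversion step is needed, here is a minimal sketch that 
uses only MATLAB built-ins (`unicode2native` and `native2unicode`), not the Arrow 
interface or the conversion utilities from #36167. It shows that a single 
character outside the Basic Multilingual Plane occupies two UTF-16 code units in 
MATLAB but four bytes in UTF-8. The variable names are illustrative.
   
   ```matlab
   % Illustration only: MATLAB built-ins, not the Arrow MATLAB interface.
   s = "😊";  % U+1F60A, outside the Basic Multilingual Plane

   % MATLAB stores text as UTF-16 code units, so this is a surrogate pair.
   utf16Units = double(char(s));                  % [55357 56842] (0xD83D 0xDE0A)

   % Arrow String arrays hold UTF-8 bytes; the same character needs four.
   utf8Bytes = unicode2native(char(s), 'UTF-8');  % uint8 [240 159 152 138]

   % Decoding the UTF-8 bytes recovers the original text.
   roundTrip = string(native2unicode(utf8Bytes, 'UTF-8'));
   assert(roundTrip == s)
   ```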
   
   ### What changes are included in this PR?
   
   1. Added new `arrow.array.StringArray` class that can be converted to / from 
MATLAB `string` and `cellstr` types. **Note**: We explicitly decided to *not* 
support `char` arrays for the time being.
   2. Factored out code for extracting a "raw" `const uint8_t*` from a MATLAB 
`logical` Data Array into a new function `bit::unpacked_as_ptr` so that it can 
be reused across multiple Array `Proxy` classes.
   3. Added new `arrow.type.StringType` type class and associated 
`arrow.type.ID.String` enum value.
   4. Enabled support for creating `RecordBatch` objects from MATLAB `table`s 
containing `string` data.
   
   **Examples**
   
   *Most MATLAB `string` arrays round-trip*
   
   ```matlab
   >> matlabArray = ["A"; "B"; "C"]
   
   matlabArray = 
   
     3x1 string array
   
       "A"
       "B"
       "C"
   
   >> arrowArray = arrow.array.StringArray(matlabArray)
   
   arrowArray = 
   
   [
     "A",
     "B",
     "C"
   ]
   >> matlabArrayRoundTrip = toMATLAB(arrowArray)          
   
   matlabArrayRoundTrip = 
   
     3x1 string array
   
       "A"
       "B"
       "C"
   
   >> isequal(matlabArray, matlabArrayRoundTrip)
   
   ans =
   
     logical
   
      1
   ```
   
   *MATLAB `string(missing)` values get mapped to `null` by default*
   
   ```matlab
   >> matlabArray = ["A"; string(missing); "C"]
   
   matlabArray = 
   
     3x1 string array
   
       "A"
       <missing>
       "C"
   
   >> arrowArray = arrow.array.StringArray(matlabArray) 
   
   arrowArray = 
   
   [
     "A",
     null,
     "C"
   ]
   >> matlabArrayRoundTrip = toMATLAB(arrowArray) 
   
   matlabArrayRoundTrip = 
   
     3x1 string array
   
       "A"
       <missing>
       "C"
   
   >> isequaln(matlabArray, matlabArrayRoundTrip)
   
   ans =
   
     logical
   
      1
   
   ```
   
   *Unicode characters round-trip*
   
   ```matlab
   >> matlabArray = ["😊"; "🌲"; "➞"]
   
   matlabArray = 
   
     3×1 string array
   
       "😊"
       "🌲"
       "➞"
   
   >> arrowArray = arrow.array.StringArray(matlabArray)
   
   arrowArray = 
   
   [
     "😊",
     "🌲",
     "➞"
   ]
   
   >> matlabArrayRoundTrip = toMATLAB(arrowArray)
   
   matlabArrayRoundTrip = 
   
     3×1 string array
   
       "😊"
       "🌲"
       "➞"
   ```
   
   ### Are these changes tested?
   
   Yes.
   
   1. Added new `tStringArray` test class.
   2. Added new `tStringType` test class.
   3. Extended `tRecordBatch` test class to verify support for MATLAB `table`s 
which contain `string` data (see above).
   
   ### Are there any user-facing changes?
   
   Yes.
   
   1. Users can now create `arrow.array.StringArray` objects from MATLAB 
`string` arrays and `cellstr`s.
   2. Users can now create `arrow.type.StringType` objects.
   3. Users can now construct `RecordBatch` objects from MATLAB `table`s that 
contain `string` data.
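   
   As a rough sketch of how these user-facing changes might be exercised 
together (this is not output captured from the PR): the `cellstr` input follows 
the description above, while the `arrow.tabular.RecordBatch` package path and 
its single-`table` constructor call are assumptions, since this description only 
refers to `RecordBatch` by its short name. The variable names are illustrative.
   
   ```matlab
   % Sketch only; requires the Arrow MATLAB interface on the MATLAB path.

   % Create a StringArray from a cellstr instead of a string array.
   cellArray = {'A'; 'B'; 'C'};
   arrowArray = arrow.array.StringArray(cellArray);
   matlabArray = toMATLAB(arrowArray);

   % Build a table containing string data, including a missing value.
   T = table(["A"; string(missing); "C"], [1; 2; 3], ...
       'VariableNames', {'Name', 'Value'});

   % ASSUMPTION: the class is arrow.tabular.RecordBatch and its constructor
   % accepts a MATLAB table directly.
   recordBatch = arrow.tabular.RecordBatch(T);
   ```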
   
   ### Future Directions
   
   1. The implementation of this initial version of `StringArray` is relatively 
simple in that it does not include a `BinaryArray` class hierarchy. In the 
future, we will likely want to refactor `StringArray` to inherit from a more 
general abstract `BinaryArray` class hierarchy.
   2. Following on from 1., we will ideally want to add support for 
`LargeStringArray`, `BinaryArray`, `LargeBinaryArray`, and 
`FixedLengthBinaryArray` by creating common infrastructure for binary types. 
This first attempt is intended to help solidify the user-facing design and to 
provide a shorter-term solution for working with `string` data, since it is common.
   3. It may make sense to change the `arrow.type.Type` hierarchy (e.g. 
`arrow.type.StringType`) in the future to delegate to C++ `Proxy` classes under 
the hood. See: #36363.
   
   ### Notes
   
   1. Thank you @sgilmore10 for your help with this pull request!

