This is an automated email from the ASF dual-hosted git repository.

kevingurney pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/main by this push:
     new 95db0df5f2 GH-37046: [MATLAB] Implement `featherread` in terms of 
`arrow.internal.io.feather.Reader` (#37163)
95db0df5f2 is described below

commit 95db0df5f213bb412c0e1c3c53f462c0255f0b1c
Author: Kevin Gurney <[email protected]>
AuthorDate: Mon Aug 14 17:11:20 2023 -0400

    GH-37046: [MATLAB] Implement `featherread` in terms of 
`arrow.internal.io.feather.Reader` (#37163)
    
    ### Rationale for this change
    
    Now that #37044 is merged, we can re-implement `featherread` in terms of 
the new `arrow.internal.io.feather.Reader` class.
    
    Once this change is made, we can delete the legacy build infrastructure and 
`featherread` MEX code.
    
    ### What changes are included in this PR?
    
    1. Reimplemented `featherread` in terms of the new 
`arrow.internal.io.feather.Reader` class.
    2. We tried to maintain compatibility with the old code as much as 
possible, but since `featherread` is now implemented in terms of `RecordBatch`, 
there are some minor changes in behavior and support for some new data types 
(e.g. `Boolean`, `String`, `Timestamp`) that are introduced by these changes.
    3. Updated `arrow/matlab/io/feather/proxy/reader.cc` to prevent a `nullptr` 
dereference that was occurring when reading a Feather V1 file created from an 
empty table by using `Table::CombineChunksToBatch` rather than a 
`TableBatchReader`.
    
    **Example**
    ```matlab
    >> tWrite = table(["A"; "B"; "C"], [true; false; true], [1; 2; 3], 
VariableNames=["String", "Boolean", "Float64"])
    
    tWrite =
    
      3x3 table
    
        String    Boolean    Float64
        ______    _______    _______
    
         "A"       true         1
         "B"       false        2
         "C"       true         3
    
    >> featherwrite("test.feather", tWrite)
    
    >> tRead = featherread("test.feather")
    
    tRead =
    
      3x3 table
    
        String    Boolean    Float64
        ______    _______    _______
    
         "A"       true         1
         "B"       false        2
         "C"       true         3
    
    >> isequaln(tWrite, tRead)
    
    ans =
    
      logical
    
       1
    ```
    
    ### Are these changes tested?
    
    Yes.
    
    1. Updated the existing `tfeather.m` and `tfeathermex.m` tests to reflect 
the new behavior of `featherread`. This mainly consists of error message ID 
changes.
    2. Added a new test to verify that all MATLAB types supported by 
`arrow.tabular.RecordBatch` can be round-tripped to a Feather V1 file.
    4. Added a new test to verify that a MATLAB `table` with Unicode 
`Variablenames` can be round-tripped to a Feather V1 file.
    
    ### Are there any user-facing changes?
    
    Yes.
    
    1. Now that `featherread` is implemented in terms of 
`arrow.internal.io.feather.Reader` and `arrow.tabular.RecordBatch`, it supports 
reading more types like `Boolean`, `String`, `Timestamp`, etc. **Note**: We 
updated the code to cast `logical`/`Boolean` type columns containing null 
values to `double` and substitute null values with `NaN`. This mirrors the 
existing behavior of `featherread` for integer type columns containing null 
values.
    2. There are some minor error message ID changes.
    4. Cell arrays of strings with a single element (e.g. 
`{'filename.feather'}`) are now supported as a valid `filename` for 
`featherread`.
    
    ### Future Directions
    
    1. In the future, we may want to consider no longer casting columns with 
integer/logical type containing null values to `double` and substituting null 
values with `NaN`. This behavior isn't ideal in all cases (it can be lossy for 
types like `uint64`). This change would break compatibility.
    2. Delete legacy Feather V1 code and build infrastructure.
    
    ### Notes
    
    1. Thank you @ sgilmore10 for your help with this pull request!
    
    * Closes: #37046
    
    Authored-by: Kevin Gurney <[email protected]>
    Signed-off-by: Kevin Gurney <[email protected]>
---
 .../cpp/arrow/matlab/io/feather/proxy/reader.cc    |  6 +-
 matlab/src/matlab/featherread.m                    | 85 +++++++++-------------
 matlab/test/tfeather.m                             | 45 ++++++++++--
 matlab/test/tfeathermex.m                          |  6 +-
 4 files changed, 77 insertions(+), 65 deletions(-)

diff --git a/matlab/src/cpp/arrow/matlab/io/feather/proxy/reader.cc 
b/matlab/src/cpp/arrow/matlab/io/feather/proxy/reader.cc
index a264d24ecb..db33b410f7 100644
--- a/matlab/src/cpp/arrow/matlab/io/feather/proxy/reader.cc
+++ b/matlab/src/cpp/arrow/matlab/io/feather/proxy/reader.cc
@@ -72,10 +72,8 @@ namespace arrow::matlab::io::feather::proxy {
         std::shared_ptr<arrow::Table> table = nullptr;
         MATLAB_ERROR_IF_NOT_OK_WITH_CONTEXT(reader->Read(&table), context, 
error::FEATHER_FAILED_TO_READ_TABLE);
 
-        // Get the first RecordBatch from the Table.
-        arrow::TableBatchReader table_batch_reader{table};
-        std::shared_ptr<arrow::RecordBatch> record_batch = nullptr;
-        
MATLAB_ERROR_IF_NOT_OK_WITH_CONTEXT(table_batch_reader.ReadNext(&record_batch), 
context, error::FEATHER_FAILED_TO_READ_RECORD_BATCH);
+        // Combine all the chunks of the Table into a single RecordBatch.
+        MATLAB_ASSIGN_OR_ERROR_WITH_CONTEXT(const auto record_batch, 
table->CombineChunksToBatch(arrow::default_memory_pool()), context, 
error::FEATHER_FAILED_TO_READ_RECORD_BATCH);
 
         // Create a Proxy from the first RecordBatch.
         auto record_batch_proxy = 
std::make_shared<RecordBatchProxy>(record_batch);
diff --git a/matlab/src/matlab/featherread.m b/matlab/src/matlab/featherread.m
index 5f4952873f..736fe83288 100644
--- a/matlab/src/matlab/featherread.m
+++ b/matlab/src/matlab/featherread.m
@@ -23,64 +23,45 @@ function t = featherread(filename)
 % specific language governing permissions and limitations
 % under the License.
 
-import arrow.util.*;
+    arguments
+        filename(1, 1) string {mustBeNonmissing, mustBeNonzeroLengthText}
+    end
 
-% Validate input arguments.
-narginchk(1, 1);
-filename = convertStringsToChars(filename);
-if ~ischar(filename)
-    error('MATLAB:arrow:InvalidFilenameDatatype', ...
-        'Filename must be a character vector or string scalar.');
-end
+    typesToCast = [arrow.type.ID.UInt8, ...
+                   arrow.type.ID.UInt16, ...
+                   arrow.type.ID.UInt32, ...
+                   arrow.type.ID.UInt64, ...
+                   arrow.type.ID.Int8, ...
+                   arrow.type.ID.Int16, ...
+                   arrow.type.ID.Int32, ...
+                   arrow.type.ID.Int64, ...
+                   arrow.type.ID.Boolean];
 
-% FOPEN can be used to search for files without an extension on the MATLAB
-% path.
-fid = fopen(filename);
-if fid ~= -1
-    filename = fopen(fid);
-    fclose(fid);
-else
-    error('MATLAB:arrow:UnableToOpenFile', ...
-        'Unable to open file %s.', filename);
-end
+    reader = arrow.internal.io.feather.Reader(filename);
+    recordBatch = reader.read();
 
-% Read table variables and metadata from the given Feather file using
-% libarrow.
-[variables, metadata] = arrow.cpp.call('featherread', filename);
+    % Convert RecordBatch to a MATLAB table.
+    t = table(recordBatch);
 
-% Make valid MATLAB table variable names out of any of the
-% Feather table column names that are not valid MATLAB table
-% variable names.
-[variableNames, variableDescriptions] = 
makeValidMATLABTableVariableNames({variables.Name});
-
-% Iterate over each table variable, handling invalid (null) entries
-% and invalid MATLAB table variable names appropriately.
-% Note: All Arrow arrays can have an associated validity (null) bitmap.
-% The Apache Arrow specification defines 0 (false) to represent an
-% invalid (null) array entry and 1 (true) to represent a valid
-% (non-null) array entry.
-for ii = 1:length(variables)
-    if ~all(variables(ii).Valid)
-        switch variables(ii).Type
-            case {'uint8', 'uint16', 'uint32', 'uint64', 'int8', 'int16', 
'int32', 'int64'}
-                % MATLAB does not support missing values for integer types, so
-                % cast to double and set missing values to NaN in this case.
-                variables(ii).Data = double(variables(ii).Data);
+    % Cast integer and boolean columns containing null values
+    % to double. Substitute null values with NaN.
+    for ii = 1:recordBatch.NumColumns
+        array = recordBatch.column(ii);
+        type = array.Type.ID;
+        if any(type == typesToCast) && any(~array.Valid)
+            % Cast to double.
+            t.(ii) = double(t.(ii));
+            % Substitute null values with NaN.
+            t{~array.Valid, ii} = NaN;
         end
-
-        % Set invalid (null) entries to the appropriate MATLAB missing value 
using
-        % logical indexing.
-        variables(ii).Data(~variables(ii).Valid) = missing;
     end
-end
-
-% Construct a MATLAB table from the Feather file data.
-t = table(variables.Data, 'VariableNames', cellstr(variableNames));
 
-% Store original Feather table column names in the 
table.Properties.VariableDescriptions
-% property if they were modified to be valid MATLAB table variable names.
-if ~isempty(variableDescriptions)
-    t.Properties.VariableDescriptions = cellstr(variableDescriptions);
-end
+    % Store original Feather table column names in the 
table.Properties.VariableDescriptions
+    % property if they were modified to be valid MATLAB table variable names.
+    modifiedColumnNameIndices = t.Properties.VariableNames ~= 
recordBatch.ColumnNames;
+    if any(modifiedColumnNameIndices)
+        originalColumnNames = 
recordBatch.ColumnNames(modifiedColumnNameIndices);
+        t.Properties.VariableDescriptions(modifiedColumnNameIndices) = 
compose("Original variable name: '%s'", originalColumnNames);
+    end
 
 end
diff --git a/matlab/test/tfeather.m b/matlab/test/tfeather.m
index e4c988e1dd..8cdbef27fa 100755
--- a/matlab/test/tfeather.m
+++ b/matlab/test/tfeather.m
@@ -143,7 +143,7 @@ classdef tfeather < matlab.unittest.TestCase
         function ErrorIfUnableToOpenFile(testCase)
             filename = fullfile(pwd, 'temp.feather');
 
-            testCase.verifyError(@() featherread(filename), 
'MATLAB:arrow:UnableToOpenFile');
+            testCase.verifyError(@() featherread(filename), 
'arrow:io:FailedToOpenFileForRead');
         end
 
         function ErrorIfCorruptedFeatherFile(testCase)
@@ -156,16 +156,13 @@ classdef tfeather < matlab.unittest.TestCase
             fwrite(fileID, [1; 5]);
             fclose(fileID);
             
-            testCase.verifyError(@() featherread(filename), 
'MATLAB:arrow:status:Invalid');
+            testCase.verifyError(@() featherread(filename), 
'arrow:io:feather:FailedToCreateReader');
         end
         
         function ErrorIfInvalidFilenameDatatype(testCase)
-            filename = fullfile(pwd, 'temp.feather');
-            
             t = createTable;
             
             testCase.verifyError(@() featherwrite({table}, t), 
'MATLAB:validation:UnableToConvert');
-            testCase.verifyError(@() featherread({filename}), 
'MATLAB:arrow:InvalidFilenameDatatype');
         end
 
         function ErrorIfTooManyInputs(testCase)
@@ -179,7 +176,7 @@ classdef tfeather < matlab.unittest.TestCase
 
         function ErrorIfTooFewInputs(testCase)
             testCase.verifyError(@() featherwrite(), 'MATLAB:minrhs');
-            testCase.verifyError(@() featherread(), 
'MATLAB:narginchk:notEnoughInputs');
+            testCase.verifyError(@() featherread(), 'MATLAB:minrhs');
         end
         
         function ErrorIfMultiColVarExist(testCase)
@@ -218,5 +215,41 @@ classdef tfeather < matlab.unittest.TestCase
            
             testCase.verifyError(@() featherwrite(filename, actualTable) 
,'arrow:array:ComplexNumeric');
         end
+
+        function SupportedTypes(testCase)
+            filename = fullfile(pwd, 'temp.feather');
+
+            % Create a table with all supported MATLAB types.
+            expectedTable = table(int8   ([1, 2, 3]'), ...
+                                  int16  ([1, 2, 3]'), ...
+                                  int32  ([1, 2, 3]'), ...
+                                  int64  ([1, 2, 3]'), ...
+                                  uint8  ([1, 2, 3]'), ...
+                                  uint16 ([1, 2, 3]'), ...
+                                  uint32 ([1, 2, 3]'), ...
+                                  uint64 ([1, 2, 3]'), ...
+                                  logical([1, 0, 1]'), ...
+                                  single ([1, 2, 3]'), ...
+                                  double ([1, 2, 3]'), ...
+                                  string (["A", "B", "C"]'), ...
+                                  datetime(2023, 6, 28) + days(0:2)');
+
+            actualTable = featherRoundTrip(filename, expectedTable);
+            testCase.verifyEqual(actualTable, expectedTable);
+        end
+
+        function UnicodeVariableNames(testCase)
+            filename = fullfile(pwd, 'temp.feather');
+
+            smiley = "😀";
+            tree =  "🌲";
+            mango = "🥭";
+            columnNames = [smiley, tree, mango];
+            expectedTable = table(1, 2, 3, VariableNames=columnNames);
+
+            actualTable = featherRoundTrip(filename, expectedTable);
+            testCase.verifyEqual(actualTable, expectedTable);
+        end
+
     end
 end
diff --git a/matlab/test/tfeathermex.m b/matlab/test/tfeathermex.m
index c0620e9054..8b6b87aab1 100644
--- a/matlab/test/tfeathermex.m
+++ b/matlab/test/tfeathermex.m
@@ -50,17 +50,17 @@ classdef tfeathermex < matlab.unittest.TestCase
             filename = fullfile(pwd, 'temp.feather');
             
             % Create a table with an invalid MATLAB table variable name.
-            invalidVariable = arrow.util.createVariableStruct('double', 1, 
true, '@');
+            invalidVariable = arrow.util.createVariableStruct('double', 1, 
true, ':');
             validVariable = arrow.util.createVariableStruct('double', 1, true, 
'Valid');
             variables = [invalidVariable, validVariable];
             metadata = arrow.util.createMetadataStruct(1, 2);
             arrow.cpp.call('featherwrite', filename, variables, metadata);
             t = featherread(filename);
             
-            testCase.verifyEqual(t.Properties.VariableNames{1}, 'x_');
+            testCase.verifyEqual(t.Properties.VariableNames{1}, ':_1');
             testCase.verifyEqual(t.Properties.VariableNames{2}, 'Valid');
             
-            testCase.verifyEqual(t.Properties.VariableDescriptions{1}, 
'Original variable name: ''@''');
+            testCase.verifyEqual(t.Properties.VariableDescriptions{1}, 
'Original variable name: '':''');
             testCase.verifyEqual(t.Properties.VariableDescriptions{2}, '');
         end 
     end

Reply via email to