juarezr opened a new pull request, #57514:
URL: https://github.com/apache/airflow/pull/57514
---
# Fix MSSQLToGCSOperator MSSQL BIT data type conversion to Parquet boolean
closes #57461
## Summary
Fixes `ArrowTypeError: Expected bytes, got a 'bool' object` when exporting
MSSQL bit fields to Parquet format using `MSSQLToGCSOperator`.
### Issue
- Issue: #57461
- Problem: The `MSSQLToGCSOperator` incorrectly mapped MSSQL bit fields
(type 2) to `"BOOLEAN"` in the `type_map`, but the base class
`BaseSQLToGCSOperator._convert_parquet_schema()` expects `"BOOL"` for boolean
types.
### Root Cause
The `type_map` property in `MSSQLToGCSOperator` had an incorrect type
mapping:
- **Before**: `type_map = {2: "BOOLEAN", ...}`
- **After**: `type_map = {2: "BOOL", ...}`
This mismatch caused PyArrow schema conversion to fail when processing bit
fields in Parquet format exports.
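
The failure mode can be illustrated with a small stand-alone sketch. This is not the provider's code: `PARQUET_TYPES` and `parquet_type_for` are hypothetical names, plain strings stand in for the PyArrow types, and the string fallback for unrecognized type names is an assumption inferred from the `Expected bytes` error message.

```python
# Hypothetical stand-in for the BigQuery-name -> Parquet-type lookup.
# Plain strings replace the real PyArrow types so the sketch runs without PyArrow.
PARQUET_TYPES = {
    "INTEGER": "int64",
    "NUMERIC": "float64",
    "BOOL": "bool",
    "STRING": "string",
    "TIMESTAMP": "timestamp",
}


def parquet_type_for(bq_type: str) -> str:
    """Return the Parquet type for a type name; unknown names fall back to string (assumed)."""
    return PARQUET_TYPES.get(bq_type, "string")


BIT = 2  # MSSQL type code for bit fields, as described in this PR

type_map_before = {BIT: "BOOLEAN", 3: "INTEGER", 4: "TIMESTAMP", 5: "NUMERIC"}
type_map_after = {BIT: "BOOL", 3: "INTEGER", 4: "TIMESTAMP", 5: "NUMERIC"}

# "BOOLEAN" is not a key of the lookup, so the bit column is treated as a string column.
print(parquet_type_for(type_map_before[BIT]))  # string
# "BOOL" is a key, so the bit column gets a boolean Parquet type.
print(parquet_type_for(type_map_after[BIT]))  # bool
```

Under that assumption, Python `bool` values end up being written into a string column, which is consistent with the reported `ArrowTypeError`.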
### Impact
- **Affected Users**: Users exporting MSSQL bit fields to Parquet format
using `MSSQLToGCSOperator`
- **Breaking Changes**: None (this is a bug fix)
- **Other Export Formats**: CSV and JSON formats are unaffected (they don't
use this type mapping)
### Related Issues/PRs
closes: #57461
related: #29902 #11874
### Additional Notes
Users can temporarily work around this issue by creating a custom operator
that overrides the `type_map` property:
```python
from airflow.providers.google.cloud.transfers.mssql_to_gcs import MSSQLToGCSOperator

class FixedMSSQLToGCSOperator(MSSQLToGCSOperator):
    type_map = {2: "BOOL", 3: "INTEGER", 4: "TIMESTAMP", 5: "NUMERIC"}
```
With this fix applied, the workaround is no longer needed.
## Changes Made
### 1. Fixed Type Mapping (`mssql_to_gcs.py`)
- Changed `type_map` from `{2: "BOOLEAN"}` to `{2: "BOOL"}` to match the
expected type key in `BaseSQLToGCSOperator._convert_parquet_schema()`
### 2. Updated Tests (`test_mssql_to_gcs.py`)
- Updated `SCHEMA_JSON` and `SCHEMA_JSON_BIT_FIELDS` constants to use
`"BOOL"` instead of `"BOOLEAN"` to match the fix
- Added new test `test_exec_success_parquet_with_bit_fields()` to verify
that bit fields can be exported to Parquet format without errors
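
Beyond the tests in this PR, a generic consistency check could catch this class of mismatch early. The sketch below is not part of the PR and does not import Airflow: `KNOWN_PARQUET_KEYS` and `unknown_type_names` are hypothetical, and the set of recognized names is an assumption rather than a copy of the base class's lookup.

```python
# Hypothetical set of type names the Parquet schema conversion recognizes (assumed).
KNOWN_PARQUET_KEYS = {
    "INTEGER", "FLOAT", "NUMERIC", "BOOL", "STRING",
    "BYTES", "DATE", "DATETIME", "TIMESTAMP",
}


def unknown_type_names(type_map: dict[int, str]) -> set[str]:
    """Return the names in type_map that the assumed Parquet lookup does not know."""
    return set(type_map.values()) - KNOWN_PARQUET_KEYS


def test_fixed_type_map_uses_known_names() -> None:
    fixed = {2: "BOOL", 3: "INTEGER", 4: "TIMESTAMP", 5: "NUMERIC"}
    assert unknown_type_names(fixed) == set()


def test_old_type_map_is_flagged() -> None:
    broken = {2: "BOOLEAN", 3: "INTEGER", 4: "TIMESTAMP", 5: "NUMERIC"}
    assert unknown_type_names(broken) == {"BOOLEAN"}
```

A check of this shape compares the operator's type names against the names the schema conversion accepts, so a mismatch fails in a unit test instead of at export time.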
### Files Changed
1. `providers/google/src/airflow/providers/google/cloud/transfers/mssql_to_gcs.py` (line 70)
   - Changed `type_map = {2: "BOOLEAN", ...}` to `type_map = {2: "BOOL", ...}`
2. `providers/google/tests/unit/google/cloud/transfers/test_mssql_to_gcs.py`
   - Updated schema constants to use `"BOOL"` instead of `"BOOLEAN"`
   - Added `test_exec_success_parquet_with_bit_fields()` test
## Testing
The fix has been tested and verified:
- ✅ Tested manually with a DAG exporting MSSQL bit fields to Parquet format
- ✅ Unit tests updated to reflect the correct type mapping
- ✅ New test case added to prevent regression
Command to test the changes:
```sh
$ breeze testing providers-tests --skip-db-tests \
    providers/google/tests/unit/google/cloud/transfers/test_mssql_to_gcs.py
...
```