juarezr commented on code in PR #57514:
URL: https://github.com/apache/airflow/pull/57514#discussion_r2484935572
##########
providers/google/src/airflow/providers/google/cloud/transfers/mssql_to_gcs.py:
##########
@@ -67,7 +67,7 @@ class MSSQLToGCSOperator(BaseSQLToGCSOperator):
ui_color = "#e0a98c"
- type_map = {2: "BOOLEAN", 3: "INTEGER", 4: "TIMESTAMP", 5: "NUMERIC"}
+ type_map = {2: "BOOL", 3: "INTEGER", 4: "TIMESTAMP", 5: "NUMERIC"}
Review Comment:
## Testing the fix for bit_fields
After your advice, I have tested exporting a simple query to three formats:
Parquet, CSV, and JSON.
I also looked into the source code to understand how the map is used for
boolean fields, and found that the majority of cases use the target value `BOOL`.
However, I also found some places that map to `BOOLEAN` instead, which could
present the same issue I've described. Please see the details below.
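For context, the lookup these operators share can be sketched roughly like this (a minimal illustration, not the exact Airflow implementation; the `STRING` fallback for unmapped codes is my assumption based on the base operator's conventions):

```python
# Minimal sketch of how a *ToGCSOperator resolves a BigQuery type from a
# driver type code (illustrative; the real operators do this inside
# field_to_bigquery, and the STRING fallback here is an assumption).
type_map = {2: "BOOL", 3: "INTEGER", 4: "TIMESTAMP", 5: "NUMERIC"}

def bigquery_type(type_code: int) -> str:
    # Unknown driver codes fall back to STRING.
    return type_map.get(type_code, "STRING")

print(bigquery_type(2))   # a BIT column resolves to BOOL after the fix
print(bigquery_type(99))  # an unmapped code falls back to STRING
```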
### Test Results
To test the three formats, I used a simple DAG with two tasks: one to export
the data from MSSQL to GCS, and another to load the data into BigQuery.
I ran this DAG once for each format. In each run the data was exported
correctly, and for the Parquet format the schema was generated with the correct
data type for the boolean fields.
Here is a diagram of the data pipeline used to move the data from MSSQL
through GCS into BigQuery:
```plain
┌───────┐   ┌────────────────────┐   ┌─────┐   ┌───────────────────────┐   ┌──────────┐
│ MSSQL ├──►│ MSSQLToGCSOperator ├──►│ GCS ├──►│ GCSToBigQueryOperator ├──►│ BigQuery │
└───────┘   └────────────────────┘   └─────┘   └───────────────────────┘   └──────────┘
```
### Test Scenario
#### Query
```sql
SELECT id, name, isactive, clientid FROM dbo.DriverGroup WHERE id IN (9, 47, 83)
```
#### CSV
This is what was exported to CSV:
```csv
id,name,isactive,clientid
9,OPERACIONAL,True,6947
47,CLUSTER 13,True,7348
83,Aero,False,25
```
All tasks of this run executed successfully with no need to specify the
schema for export/import.
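As a side note, the `True`/`False` strings in that file are consistent with how Python's `csv` writer stringifies bool values; a minimal stdlib sketch (illustrative only, not the operator's actual export path):

```python
import csv
import io

# Illustrative only: Python's csv writer stringifies bool values as
# "True"/"False", matching the exported file above.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["id", "name", "isactive", "clientid"])
writer.writerow([9, "OPERACIONAL", True, 6947])
writer.writerow([83, "Aero", False, 25])

print(buf.getvalue())
```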
#### JSON
This is what was exported to JSON:
```json
{"clientid": 6947, "id": 9, "isactive": true, "name": "OPERACIONAL"}
{"clientid": 7348, "id": 47, "isactive": true, "name": "CLUSTER 13"}
{"clientid": 25, "id": 83, "isactive": false, "name": "Aero"}
```
This run failed while importing the data into BigQuery because the export did
not preserve the field order stated in the SQL query, which matches the field
order of the destination table in BigQuery.
However, I believe this is unrelated to the issue we've been discussing here,
and it can be worked around with an explicit schema mapping in the
`GCSToBigQueryOperator`.
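For illustration, pinning the layout with an explicit schema could look like this (a hedged sketch, not tested here; the bucket, object, and table names are invented, and `schema_fields` is the operator parameter for passing an explicit schema):

```python
# Illustrative configuration only; all names below are made up.
load_json = GCSToBigQueryOperator(
    task_id="load_json_to_bq",
    bucket="my-export-bucket",
    source_objects=["exports/drivergroup.json"],
    source_format="NEWLINE_DELIMITED_JSON",
    destination_project_dataset_table="my_dataset.driver_group",
    schema_fields=[
        {"name": "id", "type": "INTEGER", "mode": "NULLABLE"},
        {"name": "name", "type": "STRING", "mode": "NULLABLE"},
        {"name": "isactive", "type": "BOOLEAN", "mode": "NULLABLE"},
        {"name": "clientid", "type": "INTEGER", "mode": "NULLABLE"},
    ],
)
```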
#### Parquet
Here is a copy/paste representation of the records exported to Parquet:
```plain
id name isactive clientid
9 OPERACIONAL true 6947
47 CLUSTER 13 true 7348
83 Aero false 25
```
And this is the schema of the exported parquet file:
```plain
Column name Data type Nullable Metadata
id Int64 true {}
name String true {}
isactive Bool true {}
clientid Int64 true {}
```
## Source Code Investigation
### Operators with BOOL mapping
Looking through the source code, I found several operators that already have
the target mapping defined as `BOOL`, the same as the proposed fix. These
operators are:
- PostgresToGCSOperator
- TrinoToGCSOperator
- CassandraToGCSOperator
- PrestoToGCSOperator
I've concluded that the fix is headed in the right direction, since these
operators use the same `BOOL` target mapping as the proposed fix. But I may be
wrong in this conclusion.
#### PostgresToGCSOperator
```python
# providers/google/src/airflow/providers/google/cloud/transfers/postgres_to_gcs.py:112
class PostgresToGCSOperator(BaseSQLToGCSOperator):
    type_map = {
        1114: "DATETIME",
        1184: "TIMESTAMP",
        1082: "DATE",
        1083: "TIME",
        1005: "INTEGER",
        1007: "INTEGER",
        1016: "INTEGER",
        20: "INTEGER",
        21: "INTEGER",
        23: "INTEGER",
        16: "BOOL",  # <-------- MAPS boolean/16 to BOOL ---<
        700: "FLOAT",
        701: "FLOAT",
        1700: "FLOAT",
    }
```
#### TrinoToGCSOperator
```python
# providers/google/src/airflow/providers/google/cloud/transfers/trino_to_gcs.py:159
class TrinoToGCSOperator(BaseSQLToGCSOperator):
    type_map = {
        "BOOLEAN": "BOOL",  # <-------- MAPS BOOLEAN to BOOL ---<
        "TINYINT": "INT64",
        "SMALLINT": "INT64",
        "INTEGER": "INT64",
        "BIGINT": "INT64",
        "REAL": "FLOAT64",
        "DOUBLE": "FLOAT64",
        "DECIMAL": "NUMERIC",
        "VARCHAR": "STRING",
        "CHAR": "STRING",
        "VARBINARY": "BYTES",
        "JSON": "STRING",
        "DATE": "DATE",
        "TIME": "TIME",
        # BigQuery don't time with timezone native.
        "TIME WITH TIME ZONE": "STRING",
        "TIMESTAMP": "TIMESTAMP",
        # BigQuery supports a narrow range of time zones during import.
        # You should use TIMESTAMP function, if you want have TIMESTAMP type
        "TIMESTAMP WITH TIME ZONE": "STRING",
        "IPADDRESS": "STRING",
        "UUID": "STRING",
    }
```
#### CassandraToGCSOperator
```python
# providers/google/src/airflow/providers/google/cloud/transfers/cassandra_to_gcs.py:128
class CassandraToGCSOperator(BaseOperator):
    CQL_TYPE_MAP = {
        "BytesType": "STRING",
        "DecimalType": "FLOAT",
        "UUIDType": "STRING",
        "BooleanType": "BOOL",  # <-------- MAPS BOOLEAN to BOOL ---<
        "ByteType": "INTEGER",
        "AsciiType": "STRING",
        "FloatType": "FLOAT",
        "DoubleType": "FLOAT",
        "LongType": "INTEGER",
        "Int32Type": "INTEGER",
        "IntegerType": "INTEGER",
        "InetAddressType": "STRING",
        "CounterColumnType": "INTEGER",
        "DateType": "TIMESTAMP",
        "SimpleDateType": "DATE",
        "TimestampType": "TIMESTAMP",
        "TimeUUIDType": "STRING",
        "ShortType": "INTEGER",
        "TimeType": "TIME",
        "DurationType": "INTEGER",
        "UTF8Type": "STRING",
        "VarcharType": "STRING",
    }
```
#### PrestoToGCSOperator
```python
# providers/google/src/airflow/providers/google/cloud/transfers/presto_to_gcs.py:156
class PrestoToGCSOperator(BaseSQLToGCSOperator):
    type_map = {
        "BOOLEAN": "BOOL",  # <-------- MAPS BOOLEAN to BOOL ---<
        "TINYINT": "INT64",
        "SMALLINT": "INT64",
        "INTEGER": "INT64",
        "BIGINT": "INT64",
        "REAL": "FLOAT64",
        "DOUBLE": "FLOAT64",
        "DECIMAL": "NUMERIC",
        "VARCHAR": "STRING",
        "CHAR": "STRING",
        "VARBINARY": "BYTES",
        "JSON": "STRING",
        "DATE": "DATE",
        "TIME": "TIME",
        # BigQuery don't time with timezone native.
        "TIME WITH TIME ZONE": "STRING",
        "TIMESTAMP": "TIMESTAMP",
        # BigQuery supports a narrow range of time zones during import.
        # You should use TIMESTAMP function, if you want have TIMESTAMP type
        "TIMESTAMP WITH TIME ZONE": "STRING",
        "IPADDRESS": "STRING",
        "UUID": "STRING",
    }
```
### Operators with BOOLEAN mapping
I've also found a couple of operators with the target mapping defined as
`BOOLEAN`, unlike this proposed fix.
I believe there is a potential issue with these operators: they may exhibit
the same problem as the code before this fix, because the type names will
mismatch.
However, I'm not sure whether this is a real issue, and I don't have a
readily available environment to test this hypothesis.
#### OracleToGCSOperator
```python
# providers/google/src/airflow/providers/google/cloud/transfers/oracle_to_gcs.py:57
class OracleToGCSOperator(BaseSQLToGCSOperator):
    type_map = {
        oracledb.DB_TYPE_BINARY_DOUBLE: "DECIMAL",
        oracledb.DB_TYPE_BINARY_FLOAT: "DECIMAL",
        oracledb.DB_TYPE_BINARY_INTEGER: "INTEGER",
        oracledb.DB_TYPE_BOOLEAN: "BOOLEAN",  # <-------- MAPS BOOLEAN to BOOLEAN ---<
        oracledb.DB_TYPE_DATE: "TIMESTAMP",
        oracledb.DB_TYPE_NUMBER: "NUMERIC",
        oracledb.DB_TYPE_TIMESTAMP: "TIMESTAMP",
        oracledb.DB_TYPE_TIMESTAMP_LTZ: "TIMESTAMP",
        oracledb.DB_TYPE_TIMESTAMP_TZ: "TIMESTAMP",
    }
```
#### MySQLToGCSOperator
Notice that MySQLToGCSOperator maps BIT to an INTEGER data type instead of
BOOLEAN:
```python
# providers/google/src/airflow/providers/google/cloud/transfers/mysql_to_gcs.py:63
class MySQLToGCSOperator(BaseSQLToGCSOperator):
    type_map = {
        FIELD_TYPE.BIT: "INTEGER",  # <-------- MAPS BIT to INTEGER ---<
        FIELD_TYPE.DATETIME: "TIMESTAMP",
        FIELD_TYPE.DATE: "TIMESTAMP",
        FIELD_TYPE.DECIMAL: "FLOAT",
        FIELD_TYPE.NEWDECIMAL: "FLOAT",
        FIELD_TYPE.DOUBLE: "FLOAT",
        FIELD_TYPE.FLOAT: "FLOAT",
        FIELD_TYPE.INT24: "INTEGER",
        FIELD_TYPE.LONG: "INTEGER",
        FIELD_TYPE.LONGLONG: "INTEGER",
        FIELD_TYPE.SHORT: "INTEGER",
        FIELD_TYPE.TIME: "TIME",
        FIELD_TYPE.TIMESTAMP: "TIMESTAMP",
        FIELD_TYPE.TINY: "INTEGER",
        FIELD_TYPE.YEAR: "INTEGER",
    }
```
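Pulling together the boolean-related targets from all the maps quoted in this review, a quick stdlib check highlights the two divergent operators (the table below merely restates the snippets above; the check itself is illustrative):

```python
# Boolean target names used by each operator's type map, as quoted above.
BOOL_TARGETS = {
    "MSSQLToGCSOperator": "BOOL",      # after this fix
    "PostgresToGCSOperator": "BOOL",
    "TrinoToGCSOperator": "BOOL",
    "CassandraToGCSOperator": "BOOL",
    "PrestoToGCSOperator": "BOOL",
    "OracleToGCSOperator": "BOOLEAN",  # diverges from the majority
    "MySQLToGCSOperator": "INTEGER",   # BIT is mapped to INTEGER, not a boolean
}

# Operators whose boolean-ish target is not the majority value "BOOL".
divergent = sorted(op for op, target in BOOL_TARGETS.items() if target != "BOOL")
print(divergent)  # ['MySQLToGCSOperator', 'OracleToGCSOperator']
```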