juarezr commented on code in PR #57514:
URL: https://github.com/apache/airflow/pull/57514#discussion_r2484935572


##########
providers/google/src/airflow/providers/google/cloud/transfers/mssql_to_gcs.py:
##########
@@ -67,7 +67,7 @@ class MSSQLToGCSOperator(BaseSQLToGCSOperator):
 
     ui_color = "#e0a98c"
 
-    type_map = {2: "BOOLEAN", 3: "INTEGER", 4: "TIMESTAMP", 5: "NUMERIC"}
+    type_map = {2: "BOOL", 3: "INTEGER", 4: "TIMESTAMP", 5: "NUMERIC"}

Review Comment:
   ## Testing results for `bit_fields`
   
   Following your advice, I tested exporting a simple query in the PARQUET, CSV, and JSON formats.
   The export process worked fine for all three formats.
   I also looked into the source code and found other places that could present the same issue. Please see below for details.
   
   ### Results
   
   Query:
   
   ```sql
   SELECT id, name, isactive, clientid FROM dbo.DriverGroup WHERE id IN (9, 47, 83)
   ```
   
   This is what was exported to CSV:
   
   ```csv
   id,name,isactive,clientid
   9,OPERACIONAL,True,6947
   47,CLUSTER 13,True,7348
   83,Aero,False,25
   ```
   
   JSON:
   
   ```json
   {"clientid": 6947, "id": 9, "isactive": true, "name": "OPERACIONAL"}
   {"clientid": 7348, "id": 47, "isactive": true, "name": "CLUSTER 13"}
   {"clientid": 25, "id": 83, "isactive": false, "name": "Aero"}
   ```
   
   Here is a copy/paste representation of the records exported to Parquet:
   
   ```plain
   id name isactive clientid
   9 OPERACIONAL true 6947
   47 CLUSTER 13 true 7348
   83 Aero false 25
   ```
   
   And this is the schema of the exported Parquet file:
   
   ```csv
   Column name  Data type       Nullable        Metadata
   id   Int64   true    {}
   name String  true    {}
   isactive     Bool    true    {}
   clientid     Int64   true    {}
   ```
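   
   For reference, this is roughly the shape of the task used for the test above (a minimal sketch; the bucket, filename, and connection IDs are illustrative placeholders):
   
   ```python
   from airflow.providers.google.cloud.transfers.mssql_to_gcs import MSSQLToGCSOperator
   
   # Illustrative task definition; bucket and connection IDs are placeholders.
   export_driver_groups = MSSQLToGCSOperator(
       task_id="export_driver_groups",
       sql="SELECT id, name, isactive, clientid FROM dbo.DriverGroup WHERE id IN (9, 47, 83)",
       bucket="my-test-bucket",
       filename="driver_group/export.parquet",
       export_format="parquet",  # also tested with "csv" and "json"
       mssql_conn_id="mssql_default",
       gcp_conn_id="google_cloud_default",
   )
   ```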
   
   ## Additional Insights
   
   ### Other Operators with BOOL mapping
   
   Looking further into the source code, I found a few operators that already define the target mapping as `BOOL`, like this proposed fix:
   
   #### PostgresToGCSOperator
   
   ```python
   # providers/google/src/airflow/providers/google/cloud/transfers/postgres_to_gcs.py:112
   
   class PostgresToGCSOperator(BaseSQLToGCSOperator):
   
       type_map = {
           1114: "DATETIME",
           1184: "TIMESTAMP",
           1082: "DATE",
           1083: "TIME",
           1005: "INTEGER",
           1007: "INTEGER",
           1016: "INTEGER",
           20: "INTEGER",
           21: "INTEGER",
           23: "INTEGER",
           16: "BOOL", # <-------- MAPS boolean/16 to BOOL ---<
           700: "FLOAT",
           701: "FLOAT",
           1700: "FLOAT",
       }
   
   ```
   
   #### TrinoToGCSOperator
   
   ```python
   # providers/google/src/airflow/providers/google/cloud/transfers/trino_to_gcs.py:159
   
   class TrinoToGCSOperator(BaseSQLToGCSOperator):
   
       type_map = {
           "BOOLEAN": "BOOL",# <-------- MAPS BOOLEAN to BOOL ---<
           "TINYINT": "INT64",
           "SMALLINT": "INT64",
           "INTEGER": "INT64",
           "BIGINT": "INT64",
           "REAL": "FLOAT64",
           "DOUBLE": "FLOAT64",
           "DECIMAL": "NUMERIC",
           "VARCHAR": "STRING",
           "CHAR": "STRING",
           "VARBINARY": "BYTES",
           "JSON": "STRING",
           "DATE": "DATE",
           "TIME": "TIME",
           # BigQuery doesn't support TIME WITH TIME ZONE natively.
           "TIME WITH TIME ZONE": "STRING",
           "TIMESTAMP": "TIMESTAMP",
           # BigQuery supports a narrow range of time zones during import.
           # Use the TIMESTAMP function if you want a TIMESTAMP type.
           "TIMESTAMP WITH TIME ZONE": "STRING",
           "IPADDRESS": "STRING",
           "UUID": "STRING",
       }
   
   ```
   
   #### CassandraToGCSOperator
   
   ```python
   # providers/google/src/airflow/providers/google/cloud/transfers/cassandra_to_gcs.py:128
   
   class CassandraToGCSOperator(BaseOperator):
   
       CQL_TYPE_MAP = {
           "BytesType": "STRING",
           "DecimalType": "FLOAT",
           "UUIDType": "STRING",
           "BooleanType": "BOOL", # <-------- MAPS BOOLEAN to BOOL ---<
           "ByteType": "INTEGER",
           "AsciiType": "STRING",
           "FloatType": "FLOAT",
           "DoubleType": "FLOAT",
           "LongType": "INTEGER",
           "Int32Type": "INTEGER",
           "IntegerType": "INTEGER",
           "InetAddressType": "STRING",
           "CounterColumnType": "INTEGER",
           "DateType": "TIMESTAMP",
           "SimpleDateType": "DATE",
           "TimestampType": "TIMESTAMP",
           "TimeUUIDType": "STRING",
           "ShortType": "INTEGER",
           "TimeType": "TIME",
           "DurationType": "INTEGER",
           "UTF8Type": "STRING",
           "VarcharType": "STRING",
       }
   
   ```
   
   #### PrestoToGCSOperator
   
   ```python
   # providers/google/src/airflow/providers/google/cloud/transfers/presto_to_gcs.py:156
   
   class PrestoToGCSOperator(BaseSQLToGCSOperator):
   
       type_map = {
           "BOOLEAN": "BOOL",# <-------- MAPS BOOLEAN to BOOL ---<
           "TINYINT": "INT64",
           "SMALLINT": "INT64",
           "INTEGER": "INT64",
           "BIGINT": "INT64",
           "REAL": "FLOAT64",
           "DOUBLE": "FLOAT64",
           "DECIMAL": "NUMERIC",
           "VARCHAR": "STRING",
           "CHAR": "STRING",
           "VARBINARY": "BYTES",
           "JSON": "STRING",
           "DATE": "DATE",
           "TIME": "TIME",
           # BigQuery doesn't support TIME WITH TIME ZONE natively.
           "TIME WITH TIME ZONE": "STRING",
           "TIMESTAMP": "TIMESTAMP",
           # BigQuery supports a narrow range of time zones during import.
           # Use the TIMESTAMP function if you want a TIMESTAMP type.
           "TIMESTAMP WITH TIME ZONE": "STRING",
           "IPADDRESS": "STRING",
           "UUID": "STRING",
       }
   
   ```
   
   ### Other Operators with BOOLEAN mapping
   
   I've also found a couple of operators with the target mapping defined as `BOOLEAN`, unlike this proposed fix.
   
   It is possible that they suffer from the same issue as the code before this fix, because the type names will mismatch (see the sketch below).
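   
   A hypothetical illustration of why a name mismatch matters (this is not the actual base-operator code; I'm assuming the Parquet schema conversion resolves BigQuery type names through a lookup with a string fallback):
   
   ```python
   import pyarrow as pa
   
   # Hypothetical lookup in the spirit of BaseSQLToGCSOperator's Parquet
   # schema conversion; the real implementation may differ.
   BQ_TO_ARROW = {
       "BOOL": pa.bool_(),
       "INTEGER": pa.int64(),
       "FLOAT": pa.float64(),
       "STRING": pa.string(),
   }
   
   def arrow_type(bq_type: str) -> pa.DataType:
       # An unrecognized name such as "BOOLEAN" silently falls back to string,
       # so the column loses its boolean type in the Parquet file.
       return BQ_TO_ARROW.get(bq_type, pa.string())
   
   print(arrow_type("BOOL"))     # bool
   print(arrow_type("BOOLEAN"))  # string -- the mismatch this PR fixes
   ```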
   
   #### OracleToGCSOperator
   
   ```python
   # providers/google/src/airflow/providers/google/cloud/transfers/oracle_to_gcs.py:57
   
   class OracleToGCSOperator(BaseSQLToGCSOperator):
   
       type_map = {
           oracledb.DB_TYPE_BINARY_DOUBLE: "DECIMAL",
           oracledb.DB_TYPE_BINARY_FLOAT: "DECIMAL",
           oracledb.DB_TYPE_BINARY_INTEGER: "INTEGER",
           oracledb.DB_TYPE_BOOLEAN: "BOOLEAN",  # <-------- MAPS BOOLEAN to BOOLEAN ---<
           oracledb.DB_TYPE_DATE: "TIMESTAMP",
           oracledb.DB_TYPE_NUMBER: "NUMERIC",
           oracledb.DB_TYPE_TIMESTAMP: "TIMESTAMP",
           oracledb.DB_TYPE_TIMESTAMP_LTZ: "TIMESTAMP",
           oracledb.DB_TYPE_TIMESTAMP_TZ: "TIMESTAMP",
       }
   
   ```
   
   #### MySQLToGCSOperator
   
   Notice that MySQLToGCSOperator maps BIT to an INTEGER data type instead of BOOLEAN:
   
   ```python
   # providers/google/src/airflow/providers/google/cloud/transfers/mysql_to_gcs.py:63
   
   class MySQLToGCSOperator(BaseSQLToGCSOperator):
   
       type_map = {
           FIELD_TYPE.BIT: "INTEGER", # <-------- MAPS BIT to INTEGER ---<
           FIELD_TYPE.DATETIME: "TIMESTAMP",
           FIELD_TYPE.DATE: "TIMESTAMP",
           FIELD_TYPE.DECIMAL: "FLOAT",
           FIELD_TYPE.NEWDECIMAL: "FLOAT",
           FIELD_TYPE.DOUBLE: "FLOAT",
           FIELD_TYPE.FLOAT: "FLOAT",
           FIELD_TYPE.INT24: "INTEGER",
           FIELD_TYPE.LONG: "INTEGER",
           FIELD_TYPE.LONGLONG: "INTEGER",
           FIELD_TYPE.SHORT: "INTEGER",
           FIELD_TYPE.TIME: "TIME",
           FIELD_TYPE.TIMESTAMP: "TIMESTAMP",
           FIELD_TYPE.TINY: "INTEGER",
           FIELD_TYPE.YEAR: "INTEGER",
       }
   
   ```
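   
   For context on the BIT-to-INTEGER mapping: MySQL drivers typically return BIT values as raw bytes, which can be widened to an integer before export. An illustrative snippet (not the operator's actual conversion code):
   
   ```python
   # Illustrative only: MySQL drivers return BIT columns as bytes objects.
   def bit_to_int(value: bytes | None) -> int | None:
       # Widen a MySQL BIT value such as b"\x01" to a plain integer.
       if value is None:
           return None
       return int.from_bytes(value, byteorder="big")
   
   print(bit_to_int(b"\x01"))  # 1 (true)
   print(bit_to_int(b"\x00"))  # 0 (false)
   ```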
   


