juarezr commented on code in PR #57514:
URL: https://github.com/apache/airflow/pull/57514#discussion_r2484935572
##########
providers/google/src/airflow/providers/google/cloud/transfers/mssql_to_gcs.py:
##########
@@ -67,7 +67,7 @@ class MSSQLToGCSOperator(BaseSQLToGCSOperator):
ui_color = "#e0a98c"
- type_map = {2: "BOOLEAN", 3: "INTEGER", 4: "TIMESTAMP", 5: "NUMERIC"}
+ type_map = {2: "BOOL", 3: "INTEGER", 4: "TIMESTAMP", 5: "NUMERIC"}
Review Comment:
## Testing results bit_fields
After your advice, I have tested exporting a simple query in the formats
PARQUET, CSV, and JSON.
The export process worked fine for all three formats.
I also looked into the source code and found other places that could present
the same issue. Please see below for details.
### Results
Query:
```sql
SELECT id, name, isactive, clientid FROM dbo.DriverGroup WHERE id IN (9, 47,
83)
```
This is what was exported to CSV:
```csv
id,name,isactive,clientid
9,OPERACIONAL,True,6947
47,CLUSTER 13,True,7348
83,Aero,False,25
```
JSON:
```json
{"clientid": 6947, "id": 9, "isactive": true, "name": "OPERACIONAL"}
{"clientid": 7348, "id": 47, "isactive": true, "name": "CLUSTER 13"}
{"clientid": 25, "id": 83, "isactive": false, "name": "Aero"}
```
Here a copy/paste representation of the records exported to Parquet:
```plain
id name isactive clientid
9 OPERACIONAL true 6947
47 CLUSTER 13 true 7348
83 Aero false 25
```
And this is the schema of the exported parquet file:
```csv
Column name Data type Nullable Metadata
id Int64 true {}
name String true {}
isactive Bool true {}
clientid Int64 true {}
```
## Additional Insights
### Other Operators with BOOL mapping
Looking more in the source code I found a couple of Operators that already
have the target mapping defined as `BOOL` like this proposed fix:
#### PostgresToGCSOperator
```python
#
providers/google/src/airflow/providers/google/cloud/transfers/postgres_to_gcs.py:112
class PostgresToGCSOperator(BaseSQLToGCSOperator):
type_map = {
1114: "DATETIME",
1184: "TIMESTAMP",
1082: "DATE",
1083: "TIME",
1005: "INTEGER",
1007: "INTEGER",
1016: "INTEGER",
20: "INTEGER",
21: "INTEGER",
23: "INTEGER",
16: "BOOL", # <-------- MAPS boolean/16 to BOOL ---<
700: "FLOAT",
701: "FLOAT",
1700: "FLOAT",
}
```
#### TrinoToGCSOperator
```python
#
providers/google/src/airflow/providers/google/cloud/transfers/trino_to_gcs.py:159
class TrinoToGCSOperator(BaseSQLToGCSOperator):
type_map = {
"BOOLEAN": "BOOL",# <-------- MAPS BOOLEAN to BOOL ---<
"TINYINT": "INT64",
"SMALLINT": "INT64",
"INTEGER": "INT64",
"BIGINT": "INT64",
"REAL": "FLOAT64",
"DOUBLE": "FLOAT64",
"DECIMAL": "NUMERIC",
"VARCHAR": "STRING",
"CHAR": "STRING",
"VARBINARY": "BYTES",
"JSON": "STRING",
"DATE": "DATE",
"TIME": "TIME",
# BigQuery don't time with timezone native.
"TIME WITH TIME ZONE": "STRING",
"TIMESTAMP": "TIMESTAMP",
# BigQuery supports a narrow range of time zones during import.
# You should use TIMESTAMP function, if you want have TIMESTAMP type
"TIMESTAMP WITH TIME ZONE": "STRING",
"IPADDRESS": "STRING",
"UUID": "STRING",
}
```
#### CassandraToGCSOperator
```python
#
providers/google/src/airflow/providers/google/cloud/transfers/cassandra_to_gcs.py:128
class CassandraToGCSOperator(BaseOperator):
CQL_TYPE_MAP = {
"BytesType": "STRING",
"DecimalType": "FLOAT",
"UUIDType": "STRING",
"BooleanType": "BOOL", # <-------- MAPS BOOLEAN to BOOL ---<
"ByteType": "INTEGER",
"AsciiType": "STRING",
"FloatType": "FLOAT",
"DoubleType": "FLOAT",
"LongType": "INTEGER",
"Int32Type": "INTEGER",
"IntegerType": "INTEGER",
"InetAddressType": "STRING",
"CounterColumnType": "INTEGER",
"DateType": "TIMESTAMP",
"SimpleDateType": "DATE",
"TimestampType": "TIMESTAMP",
"TimeUUIDType": "STRING",
"ShortType": "INTEGER",
"TimeType": "TIME",
"DurationType": "INTEGER",
"UTF8Type": "STRING",
"VarcharType": "STRING",
}
```
#### PrestoToGCSOperator
```python
#
providers/google/src/airflow/providers/google/cloud/transfers/presto_to_gcs.py:156
class PrestoToGCSOperator(BaseSQLToGCSOperator):
type_map = {
"BOOLEAN": "BOOL",# <-------- MAPS BOOLEAN to BOOL ---<
"TINYINT": "INT64",
"SMALLINT": "INT64",
"INTEGER": "INT64",
"BIGINT": "INT64",
"REAL": "FLOAT64",
"DOUBLE": "FLOAT64",
"DECIMAL": "NUMERIC",
"VARCHAR": "STRING",
"CHAR": "STRING",
"VARBINARY": "BYTES",
"JSON": "STRING",
"DATE": "DATE",
"TIME": "TIME",
# BigQuery don't time with timezone native.
"TIME WITH TIME ZONE": "STRING",
"TIMESTAMP": "TIMESTAMP",
# BigQuery supports a narrow range of time zones during import.
# You should use TIMESTAMP function, if you want have TIMESTAMP type
"TIMESTAMP WITH TIME ZONE": "STRING",
"IPADDRESS": "STRING",
"UUID": "STRING",
}
```
### Other Operators with BOOLEAN mapping
I've also found a couple Operator with the target mapping defined as
`BOOLEAN`, unlike this proposed fix.
They may have the same issue as the code before this fix because the type
names will mismatch.
#### OracleToGCSOperator
```python
#
providers/google/src/airflow/providers/google/cloud/transfers/oracle_to_gcs.py:57
class OracleToGCSOperator(BaseSQLToGCSOperator):
type_map = {
oracledb.DB_TYPE_BINARY_DOUBLE: "DECIMAL",
oracledb.DB_TYPE_BINARY_FLOAT: "DECIMAL",
oracledb.DB_TYPE_BINARY_INTEGER: "INTEGER",
oracledb.DB_TYPE_BOOLEAN: "BOOLEAN", # <-------- MAPS BOOLEAN to
BOOLEAN ---<
oracledb.DB_TYPE_DATE: "TIMESTAMP",
oracledb.DB_TYPE_NUMBER: "NUMERIC",
oracledb.DB_TYPE_TIMESTAMP: "TIMESTAMP",
oracledb.DB_TYPE_TIMESTAMP_LTZ: "TIMESTAMP",
oracledb.DB_TYPE_TIMESTAMP_TZ: "TIMESTAMP",
}
```
#### MySQLToGCSOperator
Notice that MySQLToGCSOperator maps BIT to a INTEGER data type intead of
BOOLEAN:
```python
#
providers/google/src/airflow/providers/google/cloud/transfers/mysql_to_gcs.py:63
class MySQLToGCSOperator(BaseSQLToGCSOperator):
type_map = {
FIELD_TYPE.BIT: "INTEGER", # <-------- MAPS BIT to INTEGER ---<
FIELD_TYPE.DATETIME: "TIMESTAMP",
FIELD_TYPE.DATE: "TIMESTAMP",
FIELD_TYPE.DECIMAL: "FLOAT",
FIELD_TYPE.NEWDECIMAL: "FLOAT",
FIELD_TYPE.DOUBLE: "FLOAT",
FIELD_TYPE.FLOAT: "FLOAT",
FIELD_TYPE.INT24: "INTEGER",
FIELD_TYPE.LONG: "INTEGER",
FIELD_TYPE.LONGLONG: "INTEGER",
FIELD_TYPE.SHORT: "INTEGER",
FIELD_TYPE.TIME: "TIME",
FIELD_TYPE.TIMESTAMP: "TIMESTAMP",
FIELD_TYPE.TINY: "INTEGER",
FIELD_TYPE.YEAR: "INTEGER",
}
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]