patricker opened a new issue, #26248:
URL: https://github.com/apache/airflow/issues/26248

   ### Apache Airflow Provider(s)
   
   google
   
   ### Versions of Apache Airflow Providers
   
   8.3.0
   
   ### Apache Airflow version
   
   2.3.4
   
   ### Operating System
   
   OSX
   
   ### Deployment
   
   Virtualenv installation
   
   ### Deployment details
   
   _No response_
   
   ### What happened
   
   I was using `PostgresToGCSOperator` with `export_format='parquet'` to query a table containing timestamp and JSON columns. The export to Parquet fails.
   
   For the timestamp column, the error is:
   
   ```
     File "/apache-airflow/airflow/providers/google/cloud/transfers/sql_to_gcs.py", line 154, in execute
       for file_to_upload in self._write_local_data_files(cursor):
     File "/apache-airflow/airflow/providers/google/cloud/transfers/sql_to_gcs.py", line 241, in _write_local_data_files
       tbl = pa.Table.from_pydict(row_pydic, parquet_schema)
     File "pyarrow/table.pxi", line 1724, in pyarrow.lib.Table.from_pydict
     File "pyarrow/table.pxi", line 2385, in pyarrow.lib._from_pydict
     File "pyarrow/array.pxi", line 341, in pyarrow.lib.asarray
     File "pyarrow/array.pxi", line 315, in pyarrow.lib.array
     File "pyarrow/array.pxi", line 39, in pyarrow.lib._sequence_to_array
     File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
     File "pyarrow/error.pxi", line 122, in pyarrow.lib.check_status
   pyarrow.lib.ArrowTypeError: object of type <class 'str'> cannot be converted to int
   ```
   
   For the JSON column, the traceback is the same, except the final error is:
   
   ```
   pyarrow.lib.ArrowTypeError: Expected bytes, got a 'dict' object
   ```
   
   
   ### What you think should happen instead
   
   Both timestamp and JSON datatypes should be supported for export to Parquet.
   
   ### How to reproduce
   
   With `export_format='parquet'`, selecting any timestamp or JSON column reproduces the issue.
   
   ### Anything else
   
   While troubleshooting this, I concluded the fix has to be applied in every subclass of `BaseSQLToGCSOperator`. In each `convert_type` method, timestamps/dates/times are all converted to strings. That works for CSV and JSON output, but it breaks Parquet; for Parquet, the temporal types should be returned as-is.
   
   As for JSON columns, my guess is that setting `stringify_dict=True` for the Parquet path would be enough.
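
   A rough sketch of the direction I have in mind (a hypothetical standalone function, not the actual provider code): keep temporal types as native Python objects so pyarrow can map them to Parquet logical types, and JSON-encode dicts when the target column is a string.

```python
import datetime
import json
from decimal import Decimal

# Hypothetical sketch of the proposed conversion for the Parquet path;
# the real fix would live in the convert_type method of each
# BaseSQLToGCSOperator subclass.
def convert_type(value, schema_type, stringify_dict=True):
    if isinstance(value, (datetime.datetime, datetime.date, datetime.time)):
        return value  # let pyarrow map temporal types to Parquet logical types
    if stringify_dict and isinstance(value, dict):
        return json.dumps(value)  # JSON column -> string column
    if isinstance(value, Decimal):
        return float(value)
    return value
```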
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

