pauldouane opened a new issue, #55128:
URL: https://github.com/apache/airflow/issues/55128

   ### Description
   
   Add a feature to the DatabricksSqlOperator that allows direct export of 
query results to a Google Cloud Storage (GCS) bucket in Parquet and Avro 
formats.
   
   ### Use case/motivation
   
   The current DatabricksSqlOperator allows executing SQL queries and saving 
the results to a file using the output_path and output_format parameters. 
However, it has a few limitations for common data engineering workflows:
   
   It only supports csv, json, and jsonl formats. Parquet and Avro are widely 
used for their performance and schema handling.
   
   The output is saved to a local filesystem path (e.g. /tmp/), not directly 
to object storage such as GCS. This requires an additional step (a separate 
Airflow task, or in-Databricks logic) to move the file to its final 
destination, adding unnecessary complexity to the DAG.
   
   This feature would streamline simple ETL/ELT pipelines by allowing a single 
Airflow task to:
   
   1. Execute a SQL query on a Databricks warehouse.
   2. Export the result as a Parquet or Avro file.
   3. Save the file directly to a specified GCS path.
   
   This would eliminate the need for an intermediate COPY command or a separate 
Databricks job (DatabricksSubmitRunOperator) for simple export scenarios, 
simplifying DAGs and improving readability.
   
   Proposed Change:
   
   Introduce new output_format values: Add support for 'parquet' and 'avro' to 
the output_format parameter.
   
   Enhance output_path to support object storage URIs: The output_path 
parameter should be able to accept object storage URIs (e.g., 
gs://my-bucket/path/to/data.parquet).
   
   Implement the export logic: The operator's internal logic would need to be 
updated to convert the SQL query results to the specified format and write 
them directly to the GCS location using the appropriate Databricks Spark APIs 
(e.g., spark.sql(...).write.format("parquet").save(...)).
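
As a rough illustration of the first two proposed changes, the sketch below shows one way the operator could validate output_format and recognize a gs:// URI in output_path before deciding how to write the results. The names resolve_output and OutputTarget are hypothetical and not part of the Databricks provider; the set of existing formats is taken from the description above, and the actual Parquet/Avro conversion and the GCS upload are deliberately left out.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Formats listed in this issue as currently supported, plus the two proposed ones.
EXISTING_FORMATS = {"csv", "json", "jsonl"}
PROPOSED_FORMATS = {"parquet", "avro"}


@dataclass(frozen=True)
class OutputTarget:
    """Hypothetical parsed form of output_path (not a provider class)."""

    scheme: str  # "file" for local paths, "gs" for Google Cloud Storage
    bucket: str  # empty string for local paths
    path: str    # object name inside the bucket, or the local filesystem path


def resolve_output(output_path: str, output_format: str) -> OutputTarget:
    """Validate the requested format and split output_path into its parts."""
    fmt = output_format.lower()
    if fmt not in EXISTING_FORMATS | PROPOSED_FORMATS:
        raise ValueError(f"Unsupported output_format: {output_format!r}")

    parsed = urlparse(output_path)
    if parsed.scheme == "gs":
        # A GCS URI needs both a bucket (netloc) and an object name (path).
        object_name = parsed.path.lstrip("/")
        if not parsed.netloc or not object_name:
            raise ValueError(f"GCS URI needs a bucket and object name: {output_path!r}")
        return OutputTarget("gs", parsed.netloc, object_name)
    if parsed.scheme in ("", "file"):
        # Plain paths keep today's behaviour: write to the local filesystem.
        return OutputTarget("file", "", parsed.path)
    raise ValueError(f"Unsupported output_path scheme: {parsed.scheme!r}")
```

With this in place, a value such as gs://my-bucket/path/to/data.parquet would be routed to a GCS writer, while a plain path like /tmp/out.csv would keep the existing local-file behaviour. Other object stores could be added as further schemes in the same way.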
   
   ### Related issues
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

