arubafna opened a new issue #13898:
URL: https://github.com/apache/airflow/issues/13898
**Description**
Set up an automated Airflow pipeline for data transfer from S3 to GCS to BQ.
**Use case / motivation**
A major requirement of the solution is to trigger the DAG daily, so that each
day's S3 data is loaded into BigQuery external, Hive-partitioned tables.
**Current approach**
For the first part, we use the **"S3ToGoogleCloudStorageOperator"**, which
copies files from the source S3 bucket into the destination GCS bucket. The
operator copies only newly added (incremental) files and returns the list of
new filenames, which is pushed to XCom. We then **pull that XCom value** in a
**"BigQueryInsertJobOperator"**, which ensures that only the data from the
incremental files is appended to the BQ tables.
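For context, here is a minimal sketch of that wiring. In the Airflow 2 provider
packages the operator is named `S3ToGCSOperator`; all bucket names, table IDs,
and the query below are placeholders, not our real values:

```python
from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
from airflow.providers.google.cloud.transfers.s3_to_gcs import S3ToGCSOperator
from airflow.utils.dates import days_ago

# Appends only the rows coming from the newly copied files, using BigQuery's
# _FILE_NAME pseudo-column on the external table (placeholder SQL).
APPEND_QUERY = """
INSERT INTO `my-project.my_dataset.target_table`
SELECT * FROM `my-project.my_dataset.external_hive_table`
WHERE _FILE_NAME IN (
  {%- for f in ti.xcom_pull(task_ids='s3_to_gcs') %}
  'gs://my-destination-bucket/{{ f }}'{{ "," if not loop.last }}
  {%- endfor %}
)
"""

with DAG(
    dag_id="s3_to_gcs_to_bq",
    schedule_interval="@daily",
    start_date=days_ago(1),
) as dag:
    # Copies only files not already present in the destination bucket and
    # returns the list of newly copied filenames, which Airflow pushes to XCom.
    s3_to_gcs = S3ToGCSOperator(
        task_id="s3_to_gcs",
        bucket="my-source-s3-bucket",            # placeholder
        dest_gcs="gs://my-destination-bucket/",  # placeholder
        replace=False,
    )

    # Pulls the filename list from XCom inside the templated query above.
    append_to_bq = BigQueryInsertJobOperator(
        task_id="append_to_bq",
        configuration={"query": {"query": APPEND_QUERY, "useLegacySql": False}},
    )

    s3_to_gcs >> append_to_bq
```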
**Issue:**
This works perfectly fine for most of my S3 buckets; for some, however, I
receive this error:
```
ERROR - (_mysql_exceptions.DataError) (1406, "Data too long for column 'value' at row 1")
[SQL: INSERT INTO xcom (`key`, value, timestamp, execution_date, task_id, dag_id) VALUES (%s, %s, %s, %s, %s, %s)]
```
I realize this is due to the **XCom size limitation**: on MySQL the XCom
`value` column is a BLOB, capped at 64 KB, and the serialized list of string
filenames can exceed that.
An example of the list returned:
["partner=doubleverify/sfmt=v1/table=blk/seat=607671/dt=2021-01-15/file=ATT_BLK_ImpID-20210115-07.csv.gz/part-00000-10065608-4a8e-45e3-99df-3f1c7765ed3f-c000.snappy.parquet",
.....500 more elements ]
### Is there any way to override the XCom size limitation? If not, what changes to the DAG architecture would make the pipeline scalable while still ensuring that **only the newly added files in GCS are loaded into BQ**?
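One direction I have been exploring is Airflow 2's pluggable XCom backend
(`xcom_backend` under `[core]` in `airflow.cfg`): store the real payload in
GCS and keep only a short reference in the metadata database. A rough sketch,
with placeholder bucket and prefix names:

```python
import json
import uuid

from airflow.models.xcom import BaseXCom
from airflow.providers.google.cloud.hooks.gcs import GCSHook

BUCKET = "my-xcom-bucket"  # placeholder bucket for XCom payloads
PREFIX = "gcs_xcom::"      # marker distinguishing references from plain values


class GCSXComBackend(BaseXCom):
    @staticmethod
    def serialize_value(value):
        # Upload the payload to GCS and store only its object name in XCom,
        # so the row written to the metadata DB stays small.
        key = f"xcom/{uuid.uuid4()}.json"
        GCSHook().upload(bucket_name=BUCKET, object_name=key,
                         data=json.dumps(value))
        return BaseXCom.serialize_value(PREFIX + key)

    @staticmethod
    def deserialize_value(result):
        value = BaseXCom.deserialize_value(result)
        if isinstance(value, str) and value.startswith(PREFIX):
            # Resolve the reference back to the real payload.
            data = GCSHook().download(bucket_name=BUCKET,
                                      object_name=value[len(PREFIX):])
            return json.loads(data)
        return value
```

Alternatively, the S3-to-GCS task could write the manifest of new files to a
GCS object itself and push only that object's path through XCom.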