JorgenG commented on issue #27345:
URL: https://github.com/apache/airflow/issues/27345#issuecomment-1295204855

   Hey @potiuk, thanks a lot for responding. I realise my vocabulary for 
Airflow logs might be a bit inaccurate. 
   
   I'll try to be more specific about what I observe. First of all, I am not 
100% sure what you mean about parsing of top-level code in the DAG. 
   
   Essentially what we observe is that log lines _inside a single task 
execution_ appear to be duplicated.
   
   Below is an excerpt from a task that spins up an ECS container. 
   
   ```
   *** Reading remote log from Cloudwatch log_group: 
/stacks/intranet/airflow-tasks log_stream: 
dag_id=dbt-events-incremental/run_id=scheduled__2022-10-28T15_15_00+00_00/task_id=dbt_events_incremental/attempt=1.log.
   [2022-10-28, 15:21:36 UTC] Dependencies all met for <TaskInstance: 
dbt-events-incremental.dbt_events_incremental 
scheduled__2022-10-28T15:15:00+00:00 [queued]>
   [2022-10-28, 15:21:36 UTC] Dependencies all met for <TaskInstance: 
dbt-events-incremental.dbt_events_incremental 
scheduled__2022-10-28T15:15:00+00:00 [queued]>
   [2022-10-28, 15:21:36 UTC] Dependencies all met for <TaskInstance: 
dbt-events-incremental.dbt_events_incremental 
scheduled__2022-10-28T15:15:00+00:00 [queued]>
   [2022-10-28, 15:21:36 UTC] Dependencies all met for <TaskInstance: 
dbt-events-incremental.dbt_events_incremental 
scheduled__2022-10-28T15:15:00+00:00 [queued]>
   [2022-10-28, 15:21:36 UTC] 
   
--------------------------------------------------------------------------------
   [2022-10-28, 15:21:36 UTC] 
   
--------------------------------------------------------------------------------
   [2022-10-28, 15:21:36 UTC] Starting attempt 1 of 1
   [2022-10-28, 15:21:36 UTC] Starting attempt 1 of 1
   [2022-10-28, 15:21:36 UTC] 
   
--------------------------------------------------------------------------------
   [2022-10-28, 15:21:36 UTC] 
   
--------------------------------------------------------------------------------
   [2022-10-28, 15:21:36 UTC] Executing <Task(DbtOperatorV2): 
dbt_events_incremental> on 2022-10-28 15:15:00+00:00
   [2022-10-28, 15:21:36 UTC] Executing <Task(DbtOperatorV2): 
dbt_events_incremental> on 2022-10-28 15:15:00+00:00
   ...
   [2022-10-28, 15:21:37 UTC] Starting ECS Task Log Fetcher
   [2022-10-28, 15:21:37 UTC] Starting ECS Task Log Fetcher
   [2022-10-28, 15:21:37 UTC] 
/home/airflow/.local/lib/python3.9/site-packages/airflow/providers/amazon/aws/operators/ecs.py:437:
 DeprecationWarning: This class is deprecated. Please use 
`airflow.providers.amazon.aws.hooks.ecs.EcsTaskLogFetcher`.
     self.task_log_fetcher = self._get_task_log_fetcher()
   
   [2022-10-28, 15:22:07 UTC] Using connection ID 'aws_default' for task 
execution.
   [2022-10-28, 15:22:38 UTC] [2022-10-28, 15:22:15 UTC] + set -o pipefail
   [2022-10-28, 15:22:38 UTC] [2022-10-28, 15:22:15 UTC] + set -o pipefail
   ...
   [2022-10-28, 15:54:54 UTC] ECS Task has been successfully executed
   [2022-10-28, 15:54:54 UTC] ECS Task has been successfully executed
   [2022-10-28, 15:54:54 UTC] Marking task as SUCCESS. 
dag_id=dbt-events-incremental, task_id=dbt_events_incremental, 
execution_date=20221028T151500, start_date=20221028T152136, 
end_date=20221028T155454
   [2022-10-28, 15:54:54 UTC] Marking task as SUCCESS. 
dag_id=dbt-events-incremental, task_id=dbt_events_incremental, 
execution_date=20221028T151500, start_date=20221028T152136, 
end_date=20221028T155454
   [2022-10-28, 15:54:54 UTC] Task exited with return code 0
   [2022-10-28, 15:54:54 UTC] 0 downstream tasks scheduled from follow-on 
schedule check
   ```
   
   Looking through this in detail, I do find, as you say, that some lines 
are not duplicated. 
   
   For example:
   
   ```
   [2022-10-28, 15:21:37 UTC] 
/home/airflow/.local/lib/python3.9/site-packages/airflow/providers/amazon/aws/operators/ecs.py:437:
 DeprecationWarning: This class is deprecated. Please use 
`airflow.providers.amazon.aws.hooks.ecs.EcsTaskLogFetcher`.
     self.task_log_fetcher = self._get_task_log_fetcher()
   [2022-10-28, 15:22:07 UTC] Using connection ID 'aws_default' for task 
execution.
   ...
   [2022-10-28, 15:54:54 UTC] Task exited with return code 0
   [2022-10-28, 15:54:54 UTC] 0 downstream tasks scheduled from follow-on 
schedule check
   ```
   
   So based on your comment, I assume these are probably lines from DAG parsing 
then? 
   
   I realise now I was wrong in stating that all log lines were duplicated; I 
didn't pay enough attention when looking at it. (Thanks for the hint 🙂 )
   
   To elaborate further, I don't see duplication of logs elsewhere, e.g. in the 
webserver or scheduler CloudWatch outputs. 
   
   This is in no way critical for us, but I would love to help out further to 
get it fixed. Is there some configuration we can override to verify that the 
change you referenced is the cause, e.g. overriding logger propagation or 
similar? 
   
   Best, Jørgen

