ibeauvais opened a new issue #12560:
URL: https://github.com/apache/airflow/issues/12560


   **Apache Airflow version**: 1.10.10
   
   
   **Kubernetes version (if you are using kubernetes)** (use `kubectl 
version`): v1.15.12-gke.20
   
   **Environment**:
   
   - **Cloud provider or hardware configuration**: Google Cloud Platform 
(Composer 1.12.4)
   - **OS** (e.g. from /etc/os-release):
   - **Kernel** (e.g. `uname -a`):
   - **Install tools**:
   - **Others**:
   
   **What happened**:
   In an environment with many Dataproc (Spark) tasks, we see significant 
performance problems. 
   After investigation, they appear to be caused by the following:
   All Dataproc operators initialize their hook in the constructor instead 
of in the execute method. Hook initialization adds significant overhead 
because it queries the Airflow database (get_connection).
   The operator's constructor runs for every task on both the scheduler and 
the workers, so performance degrades as the number of Dataproc tasks grows.
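
To make the cost concrete, here is a minimal sketch of the pattern described above. It does not use Airflow: `FakeMetadataDb`, `EagerHook`, `EagerOperator` and `parse_dag` are hypothetical stand-ins, and the task count of 200 is an arbitrary assumption. It only shows that building operators, with nothing executed yet, already costs one connection lookup per task.

```python
# Hypothetical stand-ins for illustration; none of these are Airflow classes.
class FakeMetadataDb:
    """Counts connection lookups, standing in for the Airflow database."""

    def __init__(self):
        self.lookups = 0

    def get_connection(self, conn_id):
        self.lookups += 1
        return {"conn_id": conn_id}


class EagerHook:
    """Looks up its connection as soon as it is constructed."""

    def __init__(self, db, conn_id):
        self.connection = db.get_connection(conn_id)


class EagerOperator:
    """Builds its hook in the constructor, as in the pattern described above."""

    def __init__(self, db, task_id, conn_id="google_cloud_default"):
        self.task_id = task_id
        self.hook = EagerHook(db, conn_id)


def parse_dag(db, task_count):
    """Simulates one parse of a DAG file that declares task_count tasks."""
    return [EagerOperator(db, "task_%d" % i) for i in range(task_count)]


db = FakeMetadataDb()
parse_dag(db, task_count=200)
eager_lookups = db.lookups  # one lookup per task, before any task has run
```

In this sketch every additional parse of the same file repeats all of the lookups.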
   
   A similar problem was fixed previously for other GCP operators: #5893
   
   The code that causes the issue is in dataproc_operator.py; all the 
operators inherit from DataprocOperationBaseOperator:
   ```
   
   class DataprocOperationBaseOperator(BaseOperator):
       """The base class for operators that poll on a Dataproc Operation."""
       @apply_defaults
       def __init__(self,
                    project_id,
                    region='global',
                    gcp_conn_id='google_cloud_default',
                    delegate_to=None,
                    *args,
                    **kwargs):
           super(DataprocOperationBaseOperator, self).__init__(*args, **kwargs)
           self.gcp_conn_id = gcp_conn_id
           self.delegate_to = delegate_to
           self.project_id = project_id
           self.region = region
           # Problem: the hook is created at construction time, not in execute()
           self.hook = DataProcHook(
               gcp_conn_id=self.gcp_conn_id,
               delegate_to=self.delegate_to,
               api_version='v1beta2'
           )
   ```
   
   **What you expected to happen**:
   The Dataproc hook should be initialized in the execute method.
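
A minimal sketch of the expected behavior, under stated assumptions: `CountingHook` and `LazyHookOperator` are hypothetical stand-ins rather than Airflow classes, and the project id is made up. The constructor keeps only plain configuration, and the hook is created the first time `execute` runs. This is one possible shape for the change, not necessarily how it would be implemented in Airflow itself.

```python
# Hypothetical stand-ins for illustration; these are not Airflow classes.
class CountingHook:
    """Stand-in for a hook; counts how many instances have been created."""

    created = 0

    def __init__(self, gcp_conn_id, delegate_to=None):
        CountingHook.created += 1
        self.gcp_conn_id = gcp_conn_id
        self.delegate_to = delegate_to


class LazyHookOperator:
    """Stores configuration in the constructor and defers hook creation."""

    def __init__(self, project_id, region="global",
                 gcp_conn_id="google_cloud_default", delegate_to=None):
        # Only cheap attribute assignments happen here.
        self.project_id = project_id
        self.region = region
        self.gcp_conn_id = gcp_conn_id
        self.delegate_to = delegate_to
        self.hook = None

    def execute(self, context):
        # The hook is created only when the task actually runs.
        self.hook = CountingHook(
            gcp_conn_id=self.gcp_conn_id,
            delegate_to=self.delegate_to,
        )
        return self.hook


# Constructing many operators creates no hooks at all.
operators = [LazyHookOperator(project_id="example-project") for _ in range(100)]
hooks_after_construction = CountingHook.created

# Running a single task creates exactly one hook.
operators[0].execute(context={})
hooks_after_execute = CountingHook.created
```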
   
   



