ibeauvais opened a new issue #12560:
URL: https://github.com/apache/airflow/issues/12560
**Apache Airflow version**: 1.10.10
**Kubernetes version (if you are using kubernetes)** (use `kubectl version`): v1.15.12-gke.20
**Environment**:
- **Cloud provider or hardware configuration**: Google Cloud Platform (Composer 1.12.4)
- **OS** (e.g. from /etc/os-release):
- **Kernel** (e.g. `uname -a`):
- **Install tools**:
- **Others**:
**What happened**:
In an environment with a large number of Dataproc (Spark) tasks, we see
significant performance issues.
After investigation, they appear to be caused by the following problem:
For all dataproc operators, hooks are initialized in the constructor instead
of the execute method. The hook initialization results in a significant
overhead because it accesses the airflow database (get_connection).
The operator's constructor is executed for each task by both the scheduler
and the workers, which degrades performance when there are many Dataproc
tasks.
A similar problem was already fixed for other GCP operators in #5893.
The code that causes the issue is in dataproc_operator.py; all Dataproc
operators inherit from DataprocOperationBaseOperator:
```
class DataprocOperationBaseOperator(BaseOperator):
    """The base class for operators that poll on a Dataproc Operation."""
    @apply_defaults
    def __init__(self,
                 project_id,
                 region='global',
                 gcp_conn_id='google_cloud_default',
                 delegate_to=None,
                 *args,
                 **kwargs):
        super(DataprocOperationBaseOperator, self).__init__(*args, **kwargs)
        self.gcp_conn_id = gcp_conn_id
        self.delegate_to = delegate_to
        self.project_id = project_id
        self.region = region
        self.hook = DataProcHook(
            gcp_conn_id=self.gcp_conn_id,
            delegate_to=self.delegate_to,
            api_version='v1beta2'
        )
```
**What you expected to happen**:
The Dataproc hook should be initialized in the `execute` method rather than in the constructor, so that it is only created when a task actually runs.
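
To illustrate the expected behavior, here is a minimal sketch that runs without Airflow. `FakeDataProcHook` and `LazyHookOperator` are hypothetical stand-ins, not the real `DataProcHook` or `DataprocOperationBaseOperator`; the fake hook simply counts constructions to mimic the `get_connection` database access. It is not a proposed patch, only a demonstration of deferring hook creation to `execute`.

```python
class FakeDataProcHook:
    """Hypothetical stand-in for DataProcHook; counts simulated DB lookups."""

    lookups = 0  # incremented once per hook construction

    def __init__(self, gcp_conn_id, delegate_to=None, api_version="v1beta2"):
        # Assumption: the real hook reads the connection from the Airflow
        # database at this point (get_connection), which is the costly part.
        FakeDataProcHook.lookups += 1
        self.gcp_conn_id = gcp_conn_id
        self.delegate_to = delegate_to
        self.api_version = api_version


class LazyHookOperator:
    """Hypothetical operator that keeps only plain parameters in __init__."""

    def __init__(self, project_id, region="global",
                 gcp_conn_id="google_cloud_default", delegate_to=None):
        self.project_id = project_id
        self.region = region
        self.gcp_conn_id = gcp_conn_id
        self.delegate_to = delegate_to
        # No hook is created here, so building the operator costs no lookup.

    def execute(self, context):
        # The hook is created only when the task actually runs.
        hook = FakeDataProcHook(
            gcp_conn_id=self.gcp_conn_id,
            delegate_to=self.delegate_to,
            api_version="v1beta2",
        )
        return hook


# Simulate DAG parsing: many operators are constructed, none executed.
operators = [LazyHookOperator(project_id="my-project") for _ in range(100)]
lookups_after_parse = FakeDataProcHook.lookups

# Simulate a worker running a single task.
operators[0].execute(context={})
lookups_after_execute = FakeDataProcHook.lookups
```

With this layout, constructing the 100 operators triggers no simulated lookups; only the one executed task creates a hook.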
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]