potiuk commented on a change in pull request #20190:
URL: https://github.com/apache/airflow/pull/20190#discussion_r767181577
##########
File path: airflow/providers/apache/spark/hooks/spark_sql.py
##########
@@ -83,12 +83,13 @@ def __init__(
yarn_queue: Optional[str] = None,
) -> None:
super().__init__()
+ options: Dict = {}
+ conn: Optional[Connection] = None
try:
- conn: "Optional[Connection]" = self.get_connection(conn_id)
Review comment:
Hm @josh-fell (and @kaxil ) - something struck me, so I just looked closer
at https://github.com/apache/airflow/pull/18339
I think while discussing it I made a silent assumption (which I see now was
wrong) that the connection was created as part of the "operator's" init, but this
is about creating it in the Hook's init, which IMHO is quite a legitimate use
case (as long as you do not instantiate the Hook in the operator's init()). And
it is a pretty common pattern in Airflow (and one we actually encourage).
I double-checked the Databricks code, and the only place I could see the hook
being instantiated was _get_hook(), and the only place where _get_hook() is
called was inside the "execute" and "on_kill" methods of the Databricks
operator - so that all sounds pretty legitimate.
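To illustrate what I mean, here is a minimal sketch of that pattern (MyHook and
MyOperator are hypothetical names, not the actual Databricks classes): the hook
resolves its Connection in its own __init__, but the operator only builds the
hook on demand inside execute()/on_kill(), so nothing touches the Connection
at DAG-parse time:

```python
from airflow.hooks.base import BaseHook
from airflow.models import BaseOperator


class MyHook(BaseHook):
    """Hypothetical hook that resolves its Connection in __init__."""

    def __init__(self, conn_id: str = "my_default") -> None:
        super().__init__()
        # Fine to do here *if* the hook itself is only created at task
        # runtime (see MyOperator below), not during DAG parsing.
        self.conn = self.get_connection(conn_id)


class MyOperator(BaseOperator):
    """Hypothetical operator following the Databricks-style pattern."""

    def __init__(self, *, conn_id: str = "my_default", **kwargs) -> None:
        super().__init__(**kwargs)
        # Only remember the conn_id; do NOT instantiate the hook here.
        self.conn_id = conn_id

    def _get_hook(self) -> MyHook:
        return MyHook(conn_id=self.conn_id)

    def execute(self, context):
        hook = self._get_hook()  # hook (and Connection lookup) on demand
        ...

    def on_kill(self) -> None:
        hook = self._get_hook()
        ...
```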
Was not the whole issue caused by a misunderstanding of whose init it was? I
think we have maaaaany cases where the Hook is created "on-demand" in the
execute() method of the operator; also, if you decide to create the Hook()
inside a `@task`-decorated function, it should work really well (see the
second sketch below).
Are my eyes cheating me? Or maybe I am missing something?
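And for completeness, the `@task` case mentioned above - again just a sketch
reusing the hypothetical MyHook from before, with "my_default" as a placeholder
connection id:

```python
from airflow.decorators import task


@task
def run_query():
    # The hook (and its Connection lookup) is only created when the task
    # actually runs on a worker, never at DAG-parse time.
    hook = MyHook(conn_id="my_default")
    ...
```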