randuhmm opened a new issue #9977:
URL: https://github.com/apache/airflow/issues/9977


   I noticed that the Qubole operator strips certain parameters from the 
constructor that it wants to pass to the various commands. Some of these 
parameters collide with the BaseOperator parameters. You can see the parameters 
that are used by the various Qubole commands at this location:
   
   
https://github.com/apache/airflow/blob/master/airflow/providers/qubole/operators/qubole.py#L202
   
   This snippet flattens and combines several lists. Here are the contents of 
the various lists:
   
   ```
   from airflow.contrib.operators.qubole_operator import *
   
   print("COMMAND_ARGS\n", COMMAND_ARGS.values())
   print("HYPHEN_ARGS\n", HYPHEN_ARGS)
   print("POSITIONAL_ARGS\n", POSITIONAL_ARGS.values())
   print("QuboleOperator.qubole_hook_allowed_args_list\n",
         QuboleOperator.qubole_hook_allowed_args_list)
   ```
   
   Here's the output:
   
   ```
   COMMAND_ARGS
    dict_values([['query', 'script_location', 'macros', 'tags', 'sample_size', 
'cluster_label', 'notify', 'name', 'pool', 'hive_version', 'retry'], ['query', 
'script_location', 'macros', 'tags', 'cluster_label', 'notify', 'name', 
'retry'], ['cluster_label', 'notify', 'name', 'pool', 'tags', 'retry', 
'sub_command'], ['script', 'script_location', 'files', 'archives', 
'cluster_label', 'notify', 'tags', 'name', 'pool', 'parameters'], ['script', 
'script_location', 'cluster_label', 'notify', 'tags', 'name', 'pool', 'retry', 
'parameters'], ['program', 'cmdline', 'sql', 'note_id', 'script_location', 
'macros', 'tags', 'cluster_label', 'language', 'app_id', 'notify', 'name', 
'pool', 'arguments', 'user_program_arguments', 'retry'], ['db_tap_id', 'query', 
'notify', 'script_location', 'macros', 'tags', 'name'], ['mode', 'schema', 
'hive_table', 'partition_spec', 'dbtap_id', 'db_table', 'use_customer_cluster', 
'customer_cluster_label', 'db_update_mode', 'db_update_keys', 'export_dir', 
'fields_terminated_by', 'notify', 'tags', 'name', 'additional_options', 
'retry'], ['mode', 
'schema', 'hive_table', 'hive_serde', 'dbtap_id', 'db_table', 
'use_customer_cluster', 'customer_cluster_label', 'where_clause', 
'parallelism', 'extract_query', 'boundary_query', 'split_column', 'notify', 
'tags', 'name', 'additional_options', 'retry', 'partition_spec'], ['query', 
'script_location', 'macros', 'tags', 'sample_size', 'cluster_label', 'notify', 
'name']])
   HYPHEN_ARGS
    ['app_id', 'hive_version', 'cluster_label', 'note_id']
   POSITIONAL_ARGS
    dict_values([['sub_command'], ['parameters'], ['parameters']])
   QuboleOperator.qubole_hook_allowed_args_list
    ['command_type', 'qubole_conn_id', 'fetch_logs']
   ```
   
   All of the above parameter names are filtered out of the constructor. If we 
gather all the unique parameter names, we can identify the collisions with the 
BaseOperator:
   
   ```
   from airflow.contrib.operators.qubole_operator import *
   from airflow.models.baseoperator import BaseOperator
   from inspect import signature
   
   # Union of every parameter name that the Qubole operator filters out
   qubole_args = set(
       flatten_list(COMMAND_ARGS.values())
       + HYPHEN_ARGS
       + flatten_list(POSITIONAL_ARGS.values())
       + QuboleOperator.qubole_hook_allowed_args_list)
   base_op_args = set(signature(BaseOperator.__init__).parameters.keys())
   
   print("colliding args", qubole_args.intersection(base_op_args))
   ```
   
   The output is:
   
   ```
   colliding args {'pool'}
   ```
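
   To make the consequence concrete, here is a minimal sketch of what I 
believe happens to a colliding name. The classes below are toy stand-ins, not 
the real Airflow classes, and the set of stripped names is a hypothetical 
subset chosen for illustration:

```python
# Toy stand-ins for illustration only; these are NOT the Airflow classes.
class ToyBaseOperator:
    def __init__(self, task_id, pool="default_pool"):
        self.task_id = task_id
        self.pool = pool  # scheduler pool in the base class


class ToyQuboleOperator(ToyBaseOperator):
    # Hypothetical subset of the names stripped from the constructor.
    command_arg_names = {"query", "cluster_label", "pool", "tags"}

    def __init__(self, **kwargs):
        # Every keyword is kept for the command...
        self.command_kwargs = dict(kwargs)
        # ...and command names are removed before calling the base class.
        base_kwargs = {k: v for k, v in kwargs.items()
                       if k not in self.command_arg_names}
        super().__init__(**base_kwargs)


op = ToyQuboleOperator(task_id="t1", query="show tables", pool="etl_pool")
print(op.pool)                    # default_pool
print(op.command_kwargs["pool"])  # etl_pool
```

   In this toy model the caller's `pool` never reaches the base class, so the 
task silently falls back to the default pool while the command receives the 
value instead.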
   
   I tested this against an older version of Airflow (v1.10.6), so the 
exercise should be repeated with other versions of Airflow and of the 
`qds_sdk` to see which other parameters collide. I think the same problem is 
at the root of this related issue:
   
   https://github.com/apache/airflow/issues/9347
   
   I think we need a general fix that isolates all of the params passed to the 
Qubole commands so that they cannot collide with the BaseOperator params. 
Param names like `pool` and `tags` are too generic and should be isolated 
(along with all of the other Qubole command params) in a separate namespace so 
that these bugs are avoided in the future.
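
   One possible shape for that isolation, as a rough sketch only: the class 
names and the `command_params` argument below are hypothetical and are not an 
existing Airflow or `qds_sdk` API. The idea is that command parameters travel 
in their own dict, so no filtering of the constructor keywords is needed:

```python
# Toy stand-ins for illustration only; these are NOT the Airflow classes.
class ToyBaseOperator:
    def __init__(self, task_id, pool="default_pool"):
        self.task_id = task_id
        self.pool = pool


class NamespacedQuboleOperator(ToyBaseOperator):
    def __init__(self, command_type, command_params=None, **kwargs):
        self.command_type = command_type
        # Hypothetical `command_params`: everything meant for the Qubole
        # command lives here and never mixes with base-class keywords.
        self.command_params = dict(command_params or {})
        # Remaining keywords go to the base class untouched.
        super().__init__(**kwargs)


op = NamespacedQuboleOperator(
    task_id="t1",
    pool="etl_pool",  # base-class pool
    command_type="hivecmd",
    command_params={"query": "show tables", "pool": "command_pool"},
)
print(op.pool)                    # etl_pool
print(op.command_params["pool"])  # command_pool
```

   With this layout both meanings of `pool` can be set independently, and a 
new command parameter can never shadow a BaseOperator parameter.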
   
   

