randuhmm opened a new issue #9977: URL: https://github.com/apache/airflow/issues/9977
I noticed that the Qubole operator strips certain parameters from the constructor that it wants to pass to the various commands. Some of these parameters collide with the BaseOperator parameters.

You can see the parameters that are used by the various Qubole commands at this location: https://github.com/apache/airflow/blob/master/airflow/providers/qubole/operators/qubole.py#L202

That snippet flattens and combines several lists. Here is a script that prints the contents of each list:

```
# The star import also brings in COMMAND_ARGS, HYPHEN_ARGS and POSITIONAL_ARGS
from airflow.contrib.operators.qubole_operator import *

print("COMMAND_ARGS\n", COMMAND_ARGS.values())
print("HYPHEN_ARGS\n", HYPHEN_ARGS)
print("POSITIONAL_ARGS\n", POSITIONAL_ARGS.values())
print("QuboleOperator.qubole_hook_allowed_args_list\n", QuboleOperator.qubole_hook_allowed_args_list)
```

Here's the output:

```
COMMAND_ARGS
 dict_values([
  ['query', 'script_location', 'macros', 'tags', 'sample_size', 'cluster_label', 'notify', 'name', 'pool', 'hive_version', 'retry'],
  ['query', 'script_location', 'macros', 'tags', 'cluster_label', 'notify', 'name', 'retry'],
  ['cluster_label', 'notify', 'name', 'pool', 'tags', 'retry', 'sub_command'],
  ['script', 'script_location', 'files', 'archives', 'cluster_label', 'notify', 'tags', 'name', 'pool', 'parameters'],
  ['script', 'script_location', 'cluster_label', 'notify', 'tags', 'name', 'pool', 'retry', 'parameters'],
  ['program', 'cmdline', 'sql', 'note_id', 'script_location', 'macros', 'tags', 'cluster_label', 'language', 'app_id', 'notify', 'name', 'pool', 'arguments', 'user_program_arguments', 'retry'],
  ['db_tap_id', 'query', 'notify', 'script_location', 'macros', 'tags', 'name'],
  ['mode', 'schema', 'hive_table', 'partition_spec', 'dbtap_id', 'db_table', 'use_customer_cluster', 'customer_cluster_label', 'db_update_mode', 'db_update_keys', 'export_dir', 'fields_terminated_by', 'notify', 'tags', 'name', 'additional_options', 'retry'],
  ['mode', 'schema', 'hive_table', 'hive_serde', 'dbtap_id', 'db_table', 'use_customer_cluster', 'customer_cluster_label', 'where_clause', 'parallelism', 'extract_query', 'boundary_query', 'split_column', 'notify', 'tags', 'name', 'additional_options', 'retry', 'partition_spec'],
  ['query', 'script_location', 'macros', 'tags', 'sample_size', 'cluster_label', 'notify', 'name']])
HYPHEN_ARGS
 ['app_id', 'hive_version', 'cluster_label', 'note_id']
POSITIONAL_ARGS
 dict_values([['sub_command'], ['parameters'], ['parameters']])
QuboleOperator.qubole_hook_allowed_args_list
 ['command_type', 'qubole_conn_id', 'fetch_logs']
```

All of the above parameter names are filtered out of the constructor. If we gather all the unique parameter names, we can identify the collisions with the BaseOperator:

```
from airflow.contrib.operators.qubole_operator import *
from airflow.models.baseoperator import BaseOperator
from inspect import signature

# Every parameter name the Qubole operator claims for itself
qubole_args = set(
    flatten_list(COMMAND_ARGS.values())
    + HYPHEN_ARGS
    + flatten_list(POSITIONAL_ARGS.values())
    + QuboleOperator.qubole_hook_allowed_args_list
)
# Every parameter name accepted by the BaseOperator constructor
base_op_args = set(signature(BaseOperator.__init__).parameters.keys())
print("colliding args", qubole_args.intersection(base_op_args))
```

The output is:

```
colliding args {'pool'}
```

I am testing this against an older version of Airflow (v1.10.6), so I think this exercise needs to be repeated with other versions of Airflow, as well as of the `qds_sdk`, to see what other parameters collide.

I think the same issue is at the root of this related issue: https://github.com/apache/airflow/issues/9347

I think a general fix is needed that isolates all of the params passed to the Qubole commands, so that there is no possible way for them to collide with the BaseOperator params. Param names like `pool` and `tags` are too generic and should be isolated (along with all of the other Qubole command params) in a separate namespace so that these bugs can be completely avoided in the future.
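One way the namespacing proposal above could look is sketched below. This is only an illustration under stated assumptions: `FakeBaseOperator`, `NamespacedCommandOperator` and the `command_params` argument are hypothetical names invented for this sketch, and none of them is part of Airflow or the Qubole provider. The stand-in base class exists only so the example runs without Airflow installed.

```python
class FakeBaseOperator:
    """Hypothetical, heavily simplified stand-in for Airflow's BaseOperator."""

    def __init__(self, task_id, pool="default_pool", retries=0):
        self.task_id = task_id
        self.pool = pool  # the scheduler-level pool
        self.retries = retries


class NamespacedCommandOperator(FakeBaseOperator):
    """Hypothetical operator that keeps command params in their own dict."""

    def __init__(self, command_type, command_params=None, **kwargs):
        # Everything left in kwargs goes to the base class untouched;
        # nothing is filtered out by name.
        super().__init__(**kwargs)
        self.command_type = command_type
        # Copy the dict so later changes by the caller do not leak in.
        self.command_params = dict(command_params or {})


op = NamespacedCommandOperator(
    task_id="spark_job",
    pool="airflow_pool",  # reaches the base class
    command_type="sparkcmd",
    command_params={"pool": "qubole_pool", "tags": ["nightly"]},
)

print(op.pool)
print(op.command_params["pool"])
```

Because the command parameters never share a keyword namespace with the base class, both `pool` values survive side by side, and a name added to either side later cannot silently shadow the other.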
