msumit opened a new pull request #17100:
URL: https://github.com/apache/airflow/pull/17100


   Airflow takes a params dictionary at the DAG level or at the task level, and 
its values can be overridden through dag_run.conf. However, this dictionary is 
static: it has no notion of types, defaults, or validation, so it adds little 
value. 
   
   The community has already asked for this several times, for example in 
https://github.com/apache/airflow/issues/11054, 
https://github.com/apache/airflow/issues/16430, and 
https://github.com/apache/airflow/issues/17085
   
   ## Goal
   
   - Keep backward compatibility, i.e. simple params should keep working as 
they do right now.
   - Params should have the notion of a default value, different types 
(int, bool, str, etc.), and various options to validate the user input.
   - The UI should show proper input controls according to the type of each 
param, mark which params are required and which are optional, and pre-fill 
default values where they exist.
   - It would be good if the UI could show the list of allowed options or do 
live pattern matching when a param uses them.
   - Airflow should honor these params even when someone triggers a DAG via 
the CLI or the API.
   
   ## Proposal
   
   - We create a new class or set of classes, say `Param`, which can be used 
as the value in the params dictionary.
   - This class should hold the default value as well as the validation rules.
   - There should be a method that validates & resolves the value of a `Param`. 
The value could be the default one or one provided by the user.
   - We should be able to easily serialize it to and deserialize it from the 
DB, and use it wherever a normal params value is used.
   - It should work with the standard way of creating DAGs as well as with the 
new DAG decorator.
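
To make the proposal above concrete, here is a minimal, standard-library-only sketch of what such a class could look like. It is an illustration, not the code in this PR: the `Param` constructor arguments, the `validator` callable, and the `resolve_params` helper are all hypothetical names chosen for the example.

```python
from typing import Any, Callable, Dict, Optional

_UNSET = object()  # sentinel: distinguishes "no value given" from an explicit None


class Param:
    """Hypothetical sketch: a default value plus an optional validation rule."""

    def __init__(
        self,
        default: Any = None,
        validator: Optional[Callable[[Any], bool]] = None,
    ):
        self.default = default
        self.validator = validator

    def resolve(self, value: Any = _UNSET) -> Any:
        """Return the user value if one was given, else the default, after validating it."""
        final = self.default if value is _UNSET else value
        if self.validator is not None and not self.validator(final):
            raise ValueError(f"Invalid input for param: {final!r}")
        return final


def resolve_params(params: Dict[str, Any], conf: Dict[str, Any]) -> Dict[str, Any]:
    """Merge run-time conf over params; plain values behave as they do today."""
    resolved = {}
    for name, spec in params.items():
        if isinstance(spec, Param):
            resolved[name] = spec.resolve(conf[name]) if name in conf else spec.resolve()
        else:
            # Backward-compatible path: a simple value, overridden by conf if present.
            resolved[name] = conf.get(name, spec)
    return resolved


params = {
    "retries": Param(default=3, validator=lambda v: isinstance(v, int) and v >= 0),
    "owner": "data-team",  # a simple param without any rules
}
print(resolve_params(params, {"retries": 5}))
```

A callable validator like the one in this sketch is hard to serialize into the DB, which is one reason a declarative description of the rules is attractive.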
   
   ## Approaches
   
   ### [pydantic](https://pydantic-docs.helpmanual.io/)
   
   Pydantic is one of the fastest Python libraries for data & type 
validation ([benchmark](https://pydantic-docs.helpmanual.io/benchmarks/)). I 
implemented various params classes with it (see 
[sample](https://gist.github.com/msumit/1a596f3a98f411dae891a42cd13e2812)) but 
did not like having to write a validator for each field separately. Also, 
the order in which fields are defined determines which of them can be accessed 
inside those validator methods. 
   
   ### [attrs](https://pypi.org/project/attrs/)
   
   I have used attrs before, and it is already in use within Airflow. 
attrs simplifies writing classes and also exposes various built-in validators & 
pre/post-init hooks. Using attrs it was quite easy to create these classes 
(see 
[this](https://github.com/astronomer/airflow/blob/dag_params/airflow/models/params.py)),
 though we have to write the data validation logic ourselves. We 
also felt that users would keep asking for more such validation rules, and 
that this could turn into a big pile of code of its own. 
   
   ### [json-schema](https://json-schema.org/understanding-json-schema/)
   
   We are already using json-schema for DAG serialization. json-schema has a 
very powerful & extensive way to define properties (validations) on a field in 
a language-agnostic way. It has implementation libraries in almost all major 
[languages](https://json-schema.org/implementations.html). The custom code 
needed on top of json-schema is minimal 
([here](https://github.com/astronomer/airflow/blob/simple_params/airflow/models/param.py))
 & it provides very extensive validations. 
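
As a rough illustration of how little custom code this approach needs, the sketch below wraps the third-party `jsonschema` package. It is not the linked implementation: `SchemaParam`, its keyword arguments, and `dump` are hypothetical names, and for brevity `None` is treated as "no value given". The schema keywords are the ones that appear in the CLI output further down.

```python
from typing import Any

import jsonschema  # third-party dependency: pip install jsonschema


class SchemaParam:
    """Hypothetical param whose validation rules are plain JSON Schema keywords."""

    def __init__(self, default: Any = None, **schema: Any):
        self.default = default
        self.schema = schema  # e.g. {"type": "string", "minLength": 2}

    def resolve(self, value: Any = None) -> Any:
        """Validate the user value (or the default) against the schema."""
        final = self.default if value is None else value
        try:
            jsonschema.validate(instance=final, schema=self.schema)
        except jsonschema.ValidationError as err:
            raise ValueError(f"Invalid input: {err.message}") from err
        return final

    def dump(self) -> dict:
        """The rules are plain data, so they serialize without extra code."""
        return {"default": self.default, "schema": self.schema}


str_param = SchemaParam(default="ab", type="string", minLength=2, maxLength=4)
print(str_param.resolve("airf"))  # a four-character string satisfies the schema

try:
    str_param.resolve("hello")  # five characters, violates maxLength
except ValueError as err:
    print(err)
```

Because the rules are just a dictionary, the same schema could in principle be handed to a JavaScript validator in the UI.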
   
   We should be able to use its JavaScript implementation to validate data in 
the UI itself. The only concern here is that json-schema rules can easily 
become complex, and users might find them hard to read and understand.
   
   
   ### Trigger DAG page
   <img width="1056" alt="Screenshot 2021-07-15 at 2 06 36 PM" 
src="https://user-images.githubusercontent.com/2018407/126268835-f833c15a-cf9e-4d67-a242-d734fff43af9.png">
   <img width="1159" alt="Screenshot 2021-07-15 at 2 08 36 PM" 
src="https://user-images.githubusercontent.com/2018407/126268847-4ea5d70c-70a7-49f6-8a40-36ea03f54dbd.png">
   
   ### DAG details API
   <img width="354" alt="Screenshot 2021-07-15 at 2 10 18 PM" 
src="https://user-images.githubusercontent.com/2018407/126268850-4c853358-f146-4aa5-b6cc-187aac6180d8.png">
   
   ### DAG Trigger API
   <img width="1413" alt="Screenshot 2021-07-15 at 2 11 15 PM" 
src="https://user-images.githubusercontent.com/2018407/126268853-5706f1dd-bf87-4d33-a641-9f93d054c635.png">
   
   ### DAG trigger via CLI
   ```
   $airflow dags trigger example_complex_params --conf '{"str_param": "hello"}'
   
   ValueError: Invalid input for param 'str_param': 'hello' is too long
   
   Failed validating 'maxLength' in schema:
       {'maxLength': 4, 'minLength': 2, 'type': 'string'}
   
   On instance:
       'hello'
   ```
   
   ### Tasks test via CLI
   ```
   $airflow tasks test example_complex_params all_param 2021-07-15T08:43:45 -t '{"task_param": true}'
   
   ValueError: True is not of type 'string'
   
   Failed validating 'type' in schema:
       {'type': 'string'}
   
   On instance:
       True    
   ```
   
   Thanks a lot to @ashb & @kaxil for their input. 



