Carla6-7 opened a new issue #18159:
URL: https://github.com/apache/superset/issues/18159


   ## Preamble
   
   This is the third part of my series of design documents on refactoring Kedro to make deployment easier:
   
   - The first part is described in issue #770 and focuses on refactoring configuration to separate external and applicative configuration.
   - The second part is described in issue #904 and focuses on ``DataCatalog`` entries whose compute/storage backend is different from "python / in-memory operations" (including SQL, Spark...).
   - This third part focuses on the ability to modify the running logic at runtime, outside of the ``KedroSession``.
   
   ## Defining the feature: modifying the running logic and distributing the modifier
   
   ### Current state of Kedro's extensibility
   
   There are currently several ways to extend Kedro natively, described 
hereafter:
   
   |What is extended|Example use cases|Kedro object|Registration|Popularity|
   |---|---|---|---|---|
   |Pipeline execution at runtime|- change catalog entries on the fly (cache data, change git branch...) <br /> - log data remotely (mlflow, neptune, dolt, store kedro-viz static files...) (see the sketch below the table)|Hooks (pipeline, node)|- via an entrypoint <br /> - OR manual declaration in settings.py|High: [a quick github search](https://github.com/search?p=1&q=before_pipeline_run&type=Code) shows that many users use hooks to add custom logic at runtime|
   |CLI command|- create a configuration file <br /> - profile a catalog entry <br /> - convert a kedro pipeline to an orchestrator <br /> - visualize the pipeline in a web browser...|plugin click commands|- via an entrypoint|Medium: seems to be a more advanced use, mainly for plugin developers|
   |Data sources connection|- create a dataset which can connect to a new data source unsupported by kedro (GBQ, HDF, sklearn pipelines, Databricks, Stata, redis... are the most recent ones)|AbstractDataSet|As a module, which can be imported by its path in the DataCatalog|High: [a quick search in Kedro's past issues](https://github.com/quantumblacklabs/kedro/issues?q=is%3Aissue+dataset) shows that it is a very common request for users who need to connect to specific data sources|
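   
   To make the first row concrete, here is a minimal sketch of a hook extending the execution logic; the class name and the logged message are illustrative, only ``hook_impl`` and the ``before_pipeline_run`` hook specification come from Kedro:
   
   ```python
   # hooks.py: a minimal hook that adds custom logic right before a pipeline runs
   import logging
   
   from kedro.framework.hooks import hook_impl
   
   
   class PipelineLoggingHooks:
       @hook_impl
       def before_pipeline_run(self, run_params):
           # pluggy matches hook arguments by name, so declaring only
           # ``run_params`` is enough
           logging.getLogger(__name__).info(
               "About to run pipeline %s", run_params.get("pipeline_name")
           )
   
   
   # settings.py: manual declaration (the alternative to an entrypoint)
   # HOOKS = (PipelineLoggingHooks(),)
   ```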
   
   ### Use cases not covered by previous mechanisms
   
   However, I've encountered a number of use cases where people want to extend the **running logic** (= how to run the pipeline) rather than the execution logic (= how the pipeline behaves during runtime, which is achieved by hooks). Some examples include:
   1. Running the entire pipeline several times, e.g. with different sets of parameters for hyperparameter tuning (https://github.com/quantumblacklabs/kedro/issues/282#issuecomment-768111744, https://github.com/quantumblacklabs/kedro/discussions/948, https://github.com/Galileo-Galilei/kedro-mlflow/issues/246); see the sketch after this list
   2. Preparing a conda environment in a separate process before running the pipeline to ensure environment consistency (this is very similar to what "mlflow projects" do)
   3. Performing "CI-like" checks (lint...) before running the pipeline, especially when you launch a very long pipeline (this is very similar to what "mlflow projects" do)
   4. Forcing a commit of unstaged changes to ensure reproducibility (this is very similar to what "mlflow projects" do)
   5. Once the pipeline has finished running, exposing it as an API (this could be a convenient way to serve a Kedro pipeline)
   6. If we offer the community the ability to distribute such changes, I'm pretty sure other use cases will arise 😃 
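   
   For instance, use case 1 boils down to wrapping the session in a loop. A minimal sketch, assuming a bootstrapped project in which a ``training`` pipeline and a ``learning_rate`` parameter exist (both names are illustrative):
   
   ```python
   # hypothetical sketch: run the same pipeline with several parameter sets,
   # e.g. for hyperparameter tuning (pipeline and parameter names are made up)
   from kedro.framework.session import KedroSession
   
   for lr in (0.01, 0.1, 1.0):
       with KedroSession.create(extra_params={"learning_rate": lr}) as session:
           session.run(pipeline_name="training")
   ```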
   
   These are **real-life use cases which cannot be achieved by hooks, because we want to perform operations outside of a ``KedroSession``**.
   
   ### Pros and cons of the current workarounds
   
   Currently, I can think of two ways to achieve the previous use cases in Kedro:
   - override the `cli.py:run` command at the project level (or in a plugin) with custom logic
   - create a custom runner inheriting from ``AbstractRunner`` which contains the custom running logic, and manually inject it in your ``cli.py`` at the project level.
   
   These solutions have serious drawbacks:
   - **lack of composability**: if you want to compose two running logics, you cannot just import the ``run`` command from another project or plugin; you have to recode everything at the project level. At least the ``runner`` solution enables composing logics through inheritance, but it is not easy to maintain.
   - **difficulty of distribution**: if you create a run command in a plugin, you can ``pip install`` it and benefit from the new logic; however, you have to give up the possibility to extend your own CLI at the project level. Even worse, plugin import order can lead to inconsistent behaviour if several plugins implement a run command.
   - **difficulty of maintenance**: since it is hard to know which ``run`` command is actually running when several plugins override the command, running errors can be heavily obfuscated.
   - **lack of flexibility**: you can have only a single running logic in your project, while you often need to switch between kedro's default ``run`` command and a custom one (e.g. you want to run your pipeline normally most of the time while developing, and use another logic occasionally, such as one of those described above).
   
   The best workflow I could come up with to implement such "running logic" changes is the following:
   - create a custom ``AbstractRunner`` subclass
   - modify the ``cli.py`` on a per-project basis to use my custom runner
   - create several very similar commands (run, run_serve, run_pre_conda...) with duplicated code to run the session, each one with a different running logic, so I can pick the one I want when running `kedro run`.
   
   This way I can at least reuse my custom ``runner`` in other projects by importing it and modifying the other project's ``cli.py``, which is not very convenient.
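   
   For illustration, here is a minimal sketch of such a custom runner; ``PreflightRunner`` and the check it performs are made up, and it simply delegates the actual execution to ``SequentialRunner``:
   
   ```python
   # runner.py: a hypothetical custom runner wrapping extra "running logic"
   # around kedro's built-in SequentialRunner
   import logging
   
   from kedro.runner import SequentialRunner
   
   
   class PreflightRunner(SequentialRunner):
       """Perform CI-like checks before delegating to the standard sequential run."""
   
       def run(self, pipeline, catalog, *args, **kwargs):
           # the custom running logic lives outside the node execution itself
           logging.getLogger(__name__).info(
               "Running pre-flight checks (lint, git status...)"
           )
           return super().run(pipeline, catalog, *args, **kwargs)
   ```
   
   The ``cli.py`` then has to be modified by hand to instantiate ``PreflightRunner``, which is exactly the distribution problem described above.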
   
   ## Potential solutions
   
   ### A short-term solution: injecting the ``runner`` class at runtime
   
   Actually, kedro already has all the important building blocks to create custom running logic and choose it at runtime: the ``run`` command and the ``AbstractRunner`` class.
   
   The main shortcoming is that we can't easily distribute this logic to other users. I suggest modifying the default `run` command so that the runner can be flexibly specified at runtime by its path, with a logic similar to custom ``DataSet`` resolution in the ``DataCatalog``.
   
   
https://github.com/quantumblacklabs/kedro/blob/c2c984a260132cdb9c434099485eae05707ad116/kedro/framework/cli/project.py#L351-L392
   
   ```diff
   def run(
       tag,
       env,
       parallel,
       runner,
       is_async,
       node_names,
       to_nodes,
       from_nodes,
       from_inputs,
       to_outputs,
       load_version,
       pipeline,
       config,
       params,
   ):
       """Run the pipeline."""
       if parallel and runner:
           raise KedroCliError(
               "Both --parallel and --runner options cannot be used together. "
               "Please use either --parallel or --runner."
           )
       runner = runner or "SequentialRunner"
       if parallel:
           runner = "ParallelRunner"
   
   +   # resolve built-in runners against "kedro.runner", anything else by its
   +   # full module path, exactly like custom datasets in the DataCatalog
   +   runner_prefix = (
   +       "kedro.runner"
   +       if runner in {"SequentialRunner", "ParallelRunner", "ThreadRunner"}
   +       else ""
   +   )
   +   runner_class = load_obj(runner, runner_prefix)  # eventually "import settings" and load runner configuration from a config file to enable parameterization?
   -   runner_class = load_obj(runner, "kedro.runner")
   
       tag = _get_values_as_tuple(tag) if tag else tag
       node_names = _get_values_as_tuple(node_names) if node_names else node_names
   
       with KedroSession.create(env=env, extra_params=params) as session:
           session.run(
               tags=tag,
               runner=runner_class(is_async=is_async),
               node_names=node_names,
               from_nodes=from_nodes,
               to_nodes=to_nodes,
               from_inputs=from_inputs,
               to_outputs=to_outputs,
               load_versions=load_version,
               pipeline_name=pipeline,
           )
   ```
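   
   For reference, ``load_obj`` (from ``kedro.utils``) already supports this resolution pattern; a quick sketch of the two cases (the plugin path is a made-up example):
   
   ```python
   from kedro.utils import load_obj
   
   # built-in runner: the short name is resolved against the "kedro.runner" prefix
   runner_class = load_obj("SequentialRunner", "kedro.runner")
   
   # custom runner: a full dotted path needs no prefix
   # ("nice_plugin.runner.ServiceRunner" is hypothetical and must be installed)
   runner_class = load_obj("nice_plugin.runner.ServiceRunner")
   ```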
   
   **Advantages for kedro users:**
   - this would make it possible to **use the same command** to inject a custom running logic at runtime, e.g.:
   ```bash
   kedro run --pipeline=my-pipeline # normal use
   kedro run --pipeline=my-pipeline --runner=kedro_mlflow.runner.MlflowRunner # use mlflow projects to create a conda env, clean the git history and perform checks before running
   kedro run --pipeline=my-pipeline --runner=ServiceRunner # serve my pipeline after running
   ```
   - it would make **the transition to production very easy if you want to have a different logic** for e.g. serving the model or processing a batch.
   - this implementation is **completely backward-compatible** with kedro's running logic and completely straightforward to add to the codebase.
   - the logic is **very easy to distribute**: anyone can use my custom runner just by its module path.
   
   ### Towards more flexibility: configure runners in a configuration file
   
   The previous solution does not make it possible to inject additional parameters into the runner. This currently "feels" poorly managed (there are "if" conditions inside the run command to check whether a parameter can be used with the given runner or not...). A solution could be to have a ``runner.yml`` file behaving in a catalog-like way to enable parametrization. It would also make it possible to use the same runner with different parameters. Such a file could look like this:
   
   ```yaml
   # runner.yml
   
   my_parallel_runner_async:
       type: ParallelRunner
       is_async: true
   
   my_service_runner:
       type: nice_plugin.runner.ServiceRunner
       host: 127.0.0.1
       port: 5000
   
   my_service_runner2:
       type: nice_plugin.runner.ServiceRunner
       host: 127.0.0.1
       port: 5001
   ```
   
   And the ``run`` command could resolve a name in this ``RunnerCatalog`` and 
use it in the following fashion: 
   
   ```bash
   kedro run --pipeline=my_pipeline --runner=my_service_runner2
   ```
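   
   A hypothetical sketch of how the ``run`` command could resolve such an entry (the file location, the fallback behaviour and the helper name are all assumptions):
   
   ```python
   # hypothetical resolution of a runner entry from runner.yml
   import yaml
   
   from kedro.utils import load_obj
   
   
   def resolve_runner(name, config_path="conf/base/runner.yml"):
       with open(config_path) as f:
           entries = yaml.safe_load(f) or {}
       # fall back to treating the name as a runner class path, which preserves
       # the behaviour of the short-term solution above
       spec = dict(entries.get(name, {"type": name}))
       runner_type = spec.pop("type")
       prefix = "kedro.runner" if "." not in runner_type else ""
       runner_class = load_obj(runner_type, prefix)
       return runner_class(**spec)  # remaining keys become constructor kwargs
   
   
   # e.g. resolve_runner("my_service_runner2") would instantiate
   # nice_plugin.runner.ServiceRunner(host="127.0.0.1", port=5001)
   ```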
   
   __Originally posted by @Galileo-Galilei in 
https://github.com/kedro-org/kedro/issues/1041__

