This is cool, I will take a look shortly (sorry, I have an urgent family
matter and I am not available).

I saw an initial version of Mateusz's POC and it completely addressed my
initial reservation about the 'development maintenance' effort - when it
comes to the interface we have for developers and the boilerplate code. We
can easily make it even better than the gRPC approach, so I am cool with doing it.

The decorator approach, which I have not POCed before (but we wanted to do
it), looks cool.

I initially suggested checking pydantic, but if it is not a good idea, pickle
is a good solution - all the bad things about pickle (security and potential
incompatibilities between Python versions) do not apply in this case, and we
already use it under the hood via multiprocessing in a number of places.

But one other option we could use is our own JSON serializer - the
very same we use to serialize DAGs to the DB. Since the overhead of JSON vs.
binary seems to be small, we could go either way. I believe Andrew's
comments were also about the wire format being binary. I wonder what others
think?
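
A rough sketch of what I mean (just a sketch, assuming we can reuse the
BaseSerialization helpers behind the DAG serializer - the exact entry points
may be named differently, and args_to_json/args_from_json are only
illustrative names):

from airflow.serialization.serialized_objects import BaseSerialization

def args_to_json(kwargs: dict) -> dict:
    # reuse the same machinery that serializes DAGs to the DB
    return {key: BaseSerialization.serialize(value) for key, value in kwargs.items()}

def args_from_json(payload: dict) -> dict:
    return {key: BaseSerialization.deserialize(value) for key, value in payload.items()}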

Jm

Wed, Nov 2, 2022, 14:01 Mateusz Henc <[email protected]> wrote:

> Hi,
> *JSON RPC API*
> Finally I had some time to work on an Internal API based on JSON RPC.
> I found that this solution can be really elegant and clean from a
> developer perspective.
> You can find the OpenAPI definition here
> <https://github.com/apache/airflow/commit/e86c389a004b26f22953e804e365ffaee200107c#diff-e454d231c105c349be8619b3112c8bfda74f0d45b6f9503e206fc0c8d5f15351>
> It expects 3 parameters:
>
>    - jsonrpc - all examples had this, may be important for some
>    versioning?
>    - method_name - (string) name of the method to call
>    - params - parameters for the method.
>
> Storing this file next to the existing Airflow API allows it to run as
> part of the Web Server. Of course it can still be run independently.
>
> To switch a method to make calls to the Internal API, I implemented the
> internal_api_call
> <https://github.com/apache/airflow/compare/main...mhenc:airflow:poc_44#diff-0f4f6450d07fffead241765439dcae32988ba1aceac81ed54c884505b813e882>
> decorator.
>
> Then we just add it to the method like:
> @internal_api_call("sample_method")
> def sample_method(dag_id: str, ...):
>
> The decorator will either call the method body or send a request to the
> Internal API endpoint (depending on the configuration).
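>
> A rough sketch of the decorator idea (simplified - the real implementation
> is in the linked branch; the config option name is illustrative and
> make_rpc_call is just a placeholder):
>
> from functools import wraps
>
> from airflow.configuration import conf
>
> def internal_api_call(method_name):
>     def decorator(func):
>         @wraps(func)
>         def wrapper(*args, **kwargs):
>             # config option name is illustrative - whatever toggles "internal API" mode
>             if not conf.getboolean("core", "database_access_isolation", fallback=False):
>                 return func(*args, **kwargs)  # local mode: just run the method body
>             return make_rpc_call(method_name, args, kwargs)  # remote mode
>         return wrapper
>     return decorator
>
> def make_rpc_call(method_name, args, kwargs):
>     # placeholder: serialize the args and POST them to the Internal API endpoint
>     raise NotImplementedError
>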
> On the API-server side we have an rpc_api handler
> <https://github.com/apache/airflow/commit/eaf3f8e418d8bf00641f09837c4cf9286b1877ab#diff-024141cc0f52a9f8e84b56c9de06b0487b1bc0cd0def4749ac6490816bab3890>
> which contains the methods mapping:
> METHODS = {
>     'update_import_errors': DagFileProcessor.update_import_errors,
>     'process_file': processor.process_file
> }
> Note: `process_file` is just for testing purposes. All methods there
> should be static.
> The RPC endpoint handler then gets method_name from the request and uses the
> proper function from METHODS. We may add some presubmit check to make sure
> this mapping is properly populated.
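>
> Sketched out, the dispatch logic is roughly (a simplified sketch - the real
> handler is in the linked commit; params/results (de)serialization is covered
> in the next section, and params here are assumed to be already deserialized
> into a kwargs dict):
>
> def rpc_api_handler(body: dict):
>     # look up the mapped static method and call it with the request params
>     method = METHODS[body["method_name"]]
>     params = body.get("params") or {}
>     return method(**params)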
>
> *Args/Results serialization*
> The main problem is parameters and results serialization and sending them
> over the wire. As this is JSON RPC, serialization to JSON was my first
> idea; however, I found that it means we may need to define custom
> serialization for many methods (example
> <https://github.com/apache/airflow/commit/e86c389a004b26f22953e804e365ffaee200107c#diff-024141cc0f52a9f8e84b56c9de06b0487b1bc0cd0def4749ac6490816bab3890R55>)
> - and not a single one but 4 of them (params_to_json, params_from_json,
> result_to_json, result_from_json).
>
> On the other hand, we could just use Pickle - serialize the args
> dictionary, convert it to base64 (to send it safely inside the JSON), then
> deserialize it on the other side.
> This leads to a very elegant solution where all we need to do is change
> the method to static and add a decorator (compare the rpc_api handler: pickle
> <https://github.com/apache/airflow/commit/eaf3f8e418d8bf00641f09837c4cf9286b1877ab#diff-024141cc0f52a9f8e84b56c9de06b0487b1bc0cd0def4749ac6490816bab3890>
> vs json
> <https://github.com/apache/airflow/commit/e86c389a004b26f22953e804e365ffaee200107c#diff-024141cc0f52a9f8e84b56c9de06b0487b1bc0cd0def4749ac6490816bab3890>
> ).
> *Unless there are strong objections to using pickle, I would
> recommend this solution.*
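>
> For illustration, the pickle round trip is only a few lines (a sketch - in
> the POC this sits inside the decorator and the rpc_api handler):
>
> import base64
> import pickle
>
> def serialize_params(kwargs: dict) -> str:
>     # client side (decorator): pickle the kwargs and wrap them in base64 so
>     # they travel safely inside the JSON request body
>     return base64.b64encode(pickle.dumps(kwargs)).decode("ascii")
>
> def deserialize_params(payload: str) -> dict:
>     # server side (rpc_api handler): reverse the steps above
>     return pickle.loads(base64.b64decode(payload))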
>
> There is also yet another idea: use Pydantic; however, I found that it
> requires all objects to inherit from its base class - not sure if we should
> do that, but if you believe it would be useful, then I can check if it's
> feasible. The only problem I can see is the K8s V1Pod objects that are used in
> multiple places (to be honest, I didn't find a good solution to serialize
> and deserialize them properly to/from JSON).
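>
> For reference, the base-class requirement would mean mirroring our objects
> with Pydantic models, roughly like this (a hypothetical class - the name and
> fields are purely illustrative):
>
> from pydantic import BaseModel
>
> class TaskInstancePydantic(BaseModel):
>     dag_id: str
>     task_id: str
>     try_number: int
>
> payload = TaskInstancePydantic(dag_id="d", task_id="t", try_number=1).json()
> restored = TaskInstancePydantic.parse_raw(payload)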
>
> *Tests*
> I ran some tests to compare performance - using Jarek's testing scenarios
> <https://github.com/apache/airflow/pull/25094>:
> Tests were executed on localhost with:
> time airflow internal-api test-client --num-repeats 20 --num-callbacks 20 --test file_processor
> Results:
> - No Internal API: 104s
> - JSON serialization: 119s
> - Pickle: 115s
>
> I exported my changes to PRs:
> https://github.com/apache/airflow/pull/27468/files
> https://github.com/apache/airflow/pull/27469/files
>
> Any comments appreciated.
>
> Best regards,
> Mateusz Henc
>
