[GitHub] [airflow] ashb commented on issue #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG in DB for webserver scalability

GitBox Wed, 04 Sep 2019 10:28:01 -0700

ashb commented on issue #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG 
in DB for webserver scalability
URL: https://github.com/apache/airflow/pull/5743#issuecomment-528002396
 
 
   Kaxil and I did some benchmarking of this with 50 dags, and we noticed a 
couple of things.
   
   First each of our dags was storing 1MB (Yes, Really!) of JSON data in the 
field of the serialized dag. (We need to sanitize the dag we were using a bit 
before testing it)
   
   Secondly something was taking a long time (about 4 seconds per dag) to call 
`Serialization.from_json` which was... slow. So we ran it through py-spy and 
got this flame graph:
   
   ![Screenshot_2019-09-04 
Screenshot](https://user-images.githubusercontent.com/34150/64276885-e21e3d80-cf40-11e9-88ac-fe396f609748.png)
   
   (Calls go from top, to bottom, the wider the bar the more time spent in a 
function or it's children)
   
   This shows us that we spent 43% of the time parsing dates! Oops!
   
   A bit of benchmarking of different parsing/storage methods
   
   ```python
   In [39]: %timeit 
dateutil.parser.parse(json.loads('"2019-09-04T17:57:59.869630+00:00"'))
   158 µs ± 15 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
   
   In [23]: %timeit 
pendulum.parse(json.loads('"2019-09-04T17:57:59.869630+00:00"'))
   32.6 µs ± 2.36 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
   
   In [24]: %timeit 
datetime.datetime.fromtimestamp(json.loads("1567616279.86963"))
   4.91 µs ± 169 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
   
   In [25]: %timeit datetime.datetime(*json.loads("[2019, 9, 4, 17, 57, 59, 
869630]"))
   5.47 µs ± 394 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
   
   In [27]: %timeit pendulum.fromtimestamp(json.loads("1567616279.86963"))
   42.6 µs ± 3.09 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
   
   In [28]: %timeit pendulum.from_timestamp(json.loads("1567616279.86963"))
   14.5 µs ± 1.11 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
   
   In [29]: %timeit pendulum.from_timestamp(json.loads("1567616279.86963"))
   14.5 µs ± 571 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
   
   In [30]: %timeit pendulum.fromtimestamp(json.loads("1567616279.86963"))
   44.4 µs ± 9.43 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
   ```
   
   So storing dates as fractional-seconds since the epoch and using 
`pendulum.from_timestamp` (but not `pendulum.fromtimestamp`!) will cut the 
parsing time from dates by a factor of 10 (158 µs ± 15 µs down to  14.5 µs ± 
0.5 µs) which seems well worth it.
   
   The other thing to look at might be to store fewer dates. For instance I 
think (but haven't checked) that we are storing the start_date against the dag 
and against each and every operator -- we don't need to store it against the 
operators if it is the same value.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

[GitHub] [airflow] ashb commented on issue #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG in DB for webserver scalability

Reply via email to