Hi folks, thank you for all the questions you raised. They helped us investigate this issue. After testing the webserver locally for a while, we found that the main cause was a DAG with 54 tasks, which was the major cost when requesting a page. Parsing took 1.5 seconds. Add some connection overhead on top of that, and a long wait is to be expected.
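For anyone who wants to check whether parsing is the bottleneck in their own setup, below is a minimal sketch of timing a parse step in isolation. It is an illustration under assumptions only: `time_call` and `build_fake_dag` are hypothetical names, and `build_fake_dag` is just a stand-in for whatever actually loads your DAG file (the default of 54 mirrors the task count mentioned above).

```python
import time
from statistics import median


def time_call(fn, repeats=5):
    """Return the median wall-clock duration, in seconds, of calling fn()."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return median(durations)


def build_fake_dag(n_tasks=54):
    """Hypothetical stand-in for DAG parsing: builds n_tasks chained task records."""
    return [
        {"task_id": f"task_{i}", "upstream": [f"task_{i - 1}"] if i else []}
        for i in range(n_tasks)
    ]


if __name__ == "__main__":
    # Swap build_fake_dag for a callable that loads the real DAG file.
    parse_seconds = time_call(build_fake_dag)
    print(f"median parse time: {parse_seconds:.6f}s")
```

Taking the median over a few repeats keeps a single slow run (cold caches, first import) from dominating the number. If the measured parse time is close to the page's time to first byte, parsing is the likely culprit rather than the database or the network.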
Sorry for the alarm; it was our fault. But I am glad we learned this. 😁

On Wed, Aug 9, 2017 at 16:43, Bolke de Bruin <[email protected]> wrote:

> Did you run a tcpdump? Did you test with another web server? Did you try
> putting something like Nginx in front that is just better at doing web
> serving (we do that too, to add SSL, and we don't see this issue)? Is your
> client host name resolvable? (Logging usually reports host names; if you
> cannot do a reverse DNS lookup, it will take time to time out.)
>
> Sent from my iPad
>
> > On 9 Aug 2017, at 20:48, Victor Duarte Diniz Monteiro
> > <[email protected]> wrote:
> >
> > Hi Max,
> >
> > we have 3693 task instances in the database.
> > And we have created an index over start_date for table task_instance as
> > you suggested, but it is still slow. We don't think the problem is in the
> > database, because when we run AdHoc Queries, they return fast.
> >
> > On Wed, Aug 9, 2017 at 12:47, Maxime Beauchemin
> > <[email protected]> wrote:
> >
> >> It seems like the default sort should be on start_date, desc, and yes,
> >> there should be an index on that.
> >>
> >> Also 100 per page is probably enough.
> >>
> >> Can you try [something like] that in your environment and report on
> >> loading times?
> >>
> >> Also for context, how many task instances do you have in total?
> >>
> >> Max
> >>
> >> On Tue, Aug 8, 2017 at 12:07 PM, Victor Monteiro
> >> <[email protected]> wrote:
> >>
> >>> [image: PNG image]
> >>> Screen Shot 2017-08-08 at 2.42.55 PM.png
> >>> <https://drive.google.com/a/ubee.in/file/d/0B7u1tjyaPWJQeVVtbFIzcS11eWc/view?usp=drivesdk>
> >>>
> >>> I am sending the image one more time, hosted in google drive.
> >>> https://drive.google.com/a/ubee.in/file/d/0B7u1tjyaPWJQeVVtbFIzcS11eWc/view?usp=drivesdk
> >>>
> >>> Edgar, did you find any solution to speed up the webserver?
> >>>
> >>> On Tue, Aug 8, 2017 at 16:04, Edgar Rodriguez
> >>> <[email protected]> wrote:
> >>>
> >>>> I've been profiling the web UI for the last few days and I think I've
> >>>> been able to identify some of the issues. I've seen similar response
> >>>> times from the webserver.
> >>>> A couple of things that I found specifically for the task instance
> >>>> view are:
> >>>> 1. Page sizes on views are usually too large, and all HTML rendering
> >>>> is done server side. flask_admin introduces some latency rendering the
> >>>> templates for 500 TIs at a time in the TaskInstanceModelView, see
> >>>> [AIRFLOW-1483 <https://issues.apache.org/jira/browse/AIRFLOW-1483>]
> >>>> 2. Using an unindexed column as the default for ordering (required for
> >>>> paging), triggering a sort on TI requests, e.g. TaskInstanceModelView
> >>>> uses `job_id` as the default sort column, but there's no index for
> >>>> that, see
> >>>> [AIRFLOW-1495 <https://issues.apache.org/jira/browse/AIRFLOW-1495>]
> >>>>
> >>>> Cheers,
> >>>> Edgar
> >>>>
> >>>> On Tue, Aug 8, 2017 at 11:56 AM, Victor Monteiro
> >>>> <[email protected]> wrote:
> >>>>
> >>>>> Sorry, I am sending again.
> >>>>>
> >>>>> Also, it is always between 3s and 6s.
> >>>>>
> >>>>> On Tue, Aug 8, 2017 at 15:21, Ash Berlin-Taylor
> >>>>> <[email protected]> wrote:
> >>>>>
> >>>>>> (Your screenshot didn't come through for me, possibly because the
> >>>>>> list stripped it? That said:)
> >>>>>>
> >>>>>> Is it always 6 seconds, or after making a few requests, enough so
> >>>>>> that each worker stands a chance to have loaded the app and any
> >>>>>> deps, does it settle down?
> >>>>>>
> >>>>>> i.e. the problem might just be that of warm-up.
> >>>>>>
> >>>>>> -ash
> >>>>>>> On 8 Aug 2017, at 18:52, Victor Monteiro <[email protected]>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>> Hi everyone.
> >>>>>>>
> >>>>>>> The problem is very straightforward. When making a request to the
> >>>>>>> airflow webserver, it takes too long to send the first byte.
> >>>>>>>
> >>>>>>> As you can see in the picture, it took 6 seconds to send the first
> >>>>>>> byte. I already investigated the connection with the database and
> >>>>>>> it took 36ms to list all task instances. So, I am starting to think
> >>>>>>> there is a problem with the airflow webserver or my deployment.
> >>>>>>>
> >>>>>>> To give you more details about deployment and configurations:
> >>>>>>> web_server_worker_timeout = 120
> >>>>>>> workers = 4
> >>>>>>> sql_alchemy_pool_size = 5
> >>>>>>> sql_alchemy_pool_recycle = 3600
> >>>>>>> AWS RDS postgres
> >>>>>>> AWS m4.large
> >>>>>>> Does anyone know what can be causing this problem?
> >>>>>>>
> >>>>>>> Thank you :D
