There are two dimensions for evaluating how much resource sensors take in Airflow: the number of sensors and the duration of each sensor task. The batch/smart sensor idea is proposed for the first, and rescheduling addresses the second. For an Airflow cluster running a large number of sensor tasks, the batch/smart sensor uses less than 10% of the sensor resources compared with regular sensors.
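
A toy model may help make the two dimensions concrete. This is only a sketch under simplifying assumptions: the class and function names are hypothetical (not Airflow APIs), the input numbers are made up, and it counts worker slot time only, ignoring the scheduler and DB load that rescheduling adds.

```python
import math
from dataclasses import dataclass


@dataclass
class SensorLoad:
    """Hypothetical description of a sensor workload (not an Airflow class)."""
    num_sensors: int      # dimension 1: how many sensor tasks there are
    wait_seconds: float   # dimension 2: how long each task waits for its condition
    poke_interval: float  # seconds between two pokes
    poke_seconds: float   # seconds a single poke takes


def regular_slot_seconds(load: SensorLoad) -> float:
    # A regular sensor holds its own worker slot for the entire wait.
    return load.num_sensors * load.wait_seconds


def reschedule_slot_seconds(load: SensorLoad) -> float:
    # With rescheduling, a slot is held only while a poke is running.
    pokes = math.ceil(load.wait_seconds / load.poke_interval)
    return load.num_sensors * pokes * load.poke_seconds


def batched_slot_seconds(load: SensorLoad, batch_size: int) -> float:
    # With batching, one worker serves batch_size sensors for the entire wait.
    workers = math.ceil(load.num_sensors / batch_size)
    return workers * load.wait_seconds


# Made-up inputs, purely for illustration.
example = SensorLoad(num_sensors=1000, wait_seconds=3600.0,
                     poke_interval=60.0, poke_seconds=1.0)
print(regular_slot_seconds(example),
      reschedule_slot_seconds(example),
      batched_slot_seconds(example, batch_size=100))
```

Rescheduling shrinks the per-task duration term, while batching shrinks the number of worker processes, which is why the two should combine well.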

On Thu, Mar 7, 2019 at 2:36 AM Ash Berlin-Taylor <a...@apache.org> wrote:

> Rescheduling is of massive use for a DAG where we are waiting for a weekly
> S3 file delivery from a third party supplier with _massive_ variance in the
> delivery time. It'll appear at some point between Thursday AM and Sunday
> evening. Not having an executor slot tied up with the S3KeySensor is great
> for this.
>
> -ash
>
> > On 6 Mar 2019, at 21:51, Alex Guziel <alex.guz...@airbnb.com.INVALID>
> > wrote:
> >
> > Smart sensor seems like a good idea, but I wonder how much performance
> > will be improved in practice. And of course, one must think about
> > sharding and such.
> >
> > I'm not sure how helpful rescheduling sensors is, since it will add
> > scheduler and DB load seemingly, which is already a bottleneck.
> >
> > On Wed, Mar 6, 2019 at 12:43 PM Yingbo Wang <ybw...@gmail.com> wrote:
> >
> >> I would still like to get some feedback on the batch sensor/smart
> >> sensor idea after viewing the sensor rescheduling PR. The reschedule
> >> mode does not reduce the number of worker processes for sensors; the
> >> batch sensor idea is proposed for this purpose and should work well
> >> with reschedule mode.
> >>
> >> On Wed, Mar 6, 2019 at 11:30 AM Yingbo Wang <ybw...@gmail.com> wrote:
> >>
> >>> Wow, great work from Seelmann! Thanks Fokko for letting us know about
> >>> it. We are super happy to have this feature.
> >>>
> >>> On Wed, Mar 6, 2019 at 11:24 AM Driesprong, Fokko
> >>> <fo...@driesprong.frl> wrote:
> >>>
> >>>> Thanks for bringing this up. I've added a comment on the Wiki:
> >>>>
> >>>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-17+Airflow+sensor+optimization
> >>>>
> >>>> Have you looked into the work by Seelmann? Recently he introduced the
> >>>> ability to reschedule sensors. When rescheduling, the slot will be
> >>>> given back to the scheduler after a poke operation. Therefore the
> >>>> slot won't be occupied all the time. The details are in the PR
> >>>> https://github.com/apache/airflow/pull/3596
> >>>>
> >>>> I would propose to make this the default behavior in Airflow 2.0.
> >>>>
> >>>> Cheers, Fokko
> >>>>
> >>>> On Wed, 6 Mar 2019 at 15:32, Yingbo Wang <ybw...@gmail.com> wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> I would like to open an AIP for Airflow sensor optimization.
> >>>>>
> >>>>> *Motivation*:
> >>>>>
> >>>>> Low efficiency in Airflow Sensor Implementation
> >>>>>
> >>>>> Sensors are a special kind of operator that will keep running until
> >>>>> a certain criterion is met. Examples include a specific file landing
> >>>>> in HDFS or S3, a partition appearing in Hive, or a specific time of
> >>>>> the day. Sensors are derived from BaseSensorOperator and run a poke
> >>>>> method at a specified poke_interval until it returns True.
> >>>>>
> >>>>> Sensor tasks are inefficient because, in the current design, we
> >>>>> spawn a separate worker process for each partition sensor. This
> >>>>> worker might last a long time, until the target partition is
> >>>>> available. In the case where there are many sensor tasks that need
> >>>>> to run within certain time limits, we have to allocate a lot of
> >>>>> resources to have enough workers for the sensor tasks.
> >>>>>
> >>>>> *Idea:*
> >>>>>
> >>>>> We propose two approaches that could address this issue,
> >>>>> batch-sensor and smart-sensor.
> >>>>>
> >>>>> Batch-sensor
> >>>>>
> >>>>> The basic idea of batch-sensor is to batch process sensor tasks to
> >>>>> save resources. While running, a batch-sensor will take N partition
> >>>>> sensor requests as input and poke those N partitions periodically.
> >>>>> If the batch-sensor finds that the criteria of some sensor task are
> >>>>> met, it will update the database for that sensor task.
> >>>>>
> >>>>> To do this, we need to create a sensor base class called 'batchable'
> >>>>> and make all sensors inherit from this base class. We also need to
> >>>>> change the behavior of the scheduler regarding batchable sensor
> >>>>> tasks. The scheduler will find as many batchable sensor tasks as
> >>>>> possible and run those tasks in a batch.
> >>>>>
> >>>>> Smart-sensor
> >>>>>
> >>>>> Smart-sensor is an improvement on top of batch-sensor.
> >>>>>
> >>>>> The idea of smart-sensor is that the worker process of smart-sensor
> >>>>> will run like a service. To do this, we need to persist sensor
> >>>>> details in the Airflow DB, and the worker process periodically
> >>>>> queries the task-instance table to find sensor tasks, pokes the
> >>>>> metastore, and updates the task-instance table if it detects that a
> >>>>> certain partition or file has been created.
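
The batch-sensor loop described in the quoted proposal could be sketched roughly as below. This is a minimal illustration, not the AIP-17 implementation: `BatchSensor` and its methods are hypothetical names, the poke callables stand in for real partition checks, and appending to `done` stands in for the database update.

```python
import time
from typing import Callable, Dict, List


class BatchSensor:
    """Pokes many sensor conditions from a single process (hypothetical class)."""

    def __init__(self, pokes: Dict[str, Callable[[], bool]]):
        # Maps a task id to a poke callable that returns True once
        # that task's criterion is met.
        self.pending: Dict[str, Callable[[], bool]] = dict(pokes)
        self.done: List[str] = []

    def poke_once(self) -> List[str]:
        """Poke every pending task once and return the ids that were just met."""
        newly_met = [task_id for task_id, poke in self.pending.items() if poke()]
        for task_id in newly_met:
            del self.pending[task_id]
            self.done.append(task_id)  # stand-in for updating the database
        return newly_met

    def run(self, poke_interval: float, max_rounds: int,
            sleep: Callable[[float], None] = time.sleep) -> List[str]:
        """Poke periodically until all tasks are met or max_rounds is reached."""
        for _ in range(max_rounds):
            self.poke_once()
            if not self.pending:
                break
            sleep(poke_interval)
        return self.done
```

One process holds all N conditions, so the number of worker processes no longer grows with the number of sensor tasks. A smart-sensor would follow the same loop but load the pending tasks from the task-instance table on each round instead of receiving them up front.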