[
https://issues.apache.org/jira/browse/AIRFLOW-6093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Aitor Arjona updated AIRFLOW-6093:
----------------------------------
Description:
The aim of this issue is to add IBM-PyWren plugin operators to the default
Airflow contrib/operators. This plugin can be found
[here|https://github.com/cloudbutton/ibm-pywren_airflow-plugin].
* *What is IBM-PyWren?*
[PyWren|https://github.com/pywren/pywren] is an open source project whose goals
are massively scaling the execution of Python code and its dependencies on
serverless computing platforms and monitoring the results. PyWren delivers the
user’s code into the serverless platform in a transparent way, without
requiring knowledge of how functions are deployed, invoked and run.
[IBM-PyWren|https://github.com/pywren/pywren-ibm-cloud] is based on PyWren's
main branch and adapted for IBM Cloud Functions and IBM Cloud Object Storage.
IBM-PyWren is not, however, a mere reimplementation of PyWren’s API atop
IBM Cloud Functions. Rather, it must be viewed as an advanced extension of
PyWren that runs broader Map-Reduce jobs, based on Docker images. In extending
PyWren to work with IBM Cloud Object Storage, a partition discovery component
was also added that allows PyWren to process large amounts of data stored in
IBM Cloud Object Storage.
* *What does IBM-PyWren do, and what does the Airflow plugin offer?*
IBM-PyWren provides a clean, easy-to-use interface that implements the
typical Map and Map-Reduce programming models, but executes every map function
as a serverless function running on a serverless platform. This allows us to
run embarrassingly parallel workloads, such as many big data applications, for
which this model has already proved very efficient and useful.
The following snippet portrays the simplicity of using IBM-PyWren:
{code:python}
import pywren_ibm_cloud as pywren

def my_map_function(id, x):
    return x + 7

iterdata = [1, 2, 3, 4]
pw = pywren.ibm_cf_executor()
pw.map(my_map_function, iterdata)
result = pw.get_result()  # result = [8, 9, 10, 11]
{code}
IBM-PyWren map and map-reduce jobs are asynchronous calls; that is, user code
can continue executing until {{get_result()}} is called, at which point
execution blocks until the map results are ready.
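This call pattern mirrors Python's standard futures model. A minimal local sketch of the same pattern, using only the standard library rather than IBM-PyWren itself (the thread pool here merely stands in for the serverless platform):

```python
from concurrent.futures import ThreadPoolExecutor

def my_map_function(x):
    return x + 7

iterdata = [1, 2, 3, 4]

with ThreadPoolExecutor() as pool:
    # submit() returns immediately; other work can run here while
    # the tasks execute in the background
    futures = [pool.submit(my_map_function, x) for x in iterdata]
    # collecting the results blocks until every task has finished,
    # just like get_result() in the snippet above
    result = [f.result() for f in futures]

print(result)  # [8, 9, 10, 11]
```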
An Airflow plugin that adds the following operators and hooks has already
been implemented:
- {{IbmPyWrenCallAsyncOperator(callable, data)}}: Invokes a single function.
- {{IbmPyWrenMapOperator(map_function, map_iterdata)}}: Invokes multiple
parallel tasks, one per element of {{map_iterdata}}. It applies the function
{{map_function}} to every element in {{map_iterdata}}.
- {{IbmPyWrenMapReduceOperator(map_function, map_iterdata, reduce_function)}}:
Invokes multiple parallel tasks, one per element of {{map_iterdata}}. It
applies the function {{map_function}} to every element in {{map_iterdata}}.
Finally, it invokes a {{reduce_function}} that gathers all the map results.
- {{IbmPyWrenHook()}}: Provides a PyWren executor ready to use.
* *But Airflow already has operators to work with serverless platforms*
Using IBM-PyWren as a client and runtime for serverless functions has some
advantages that other clients don't have:
- IBM-PyWren is specifically designed and optimized to run Map and Map-Reduce
jobs, oriented towards data analytics use cases.
- IBM-PyWren automatically pickles already-written sequential user code
together with its dependencies and massively parallelizes its execution,
running it on thousands of cores across a variety of serverless platforms,
even in different regions, including Knative, IBM Cloud Functions, AWS Lambda,
GCP Functions and Azure Functions.
- The IBM-PyWren executor efficiently manages the invocation of 1000+
functions and monitors their execution by wrapping the user code and handling
exceptions when a function raises one, runs out of memory or times out. These
errors wouldn't be handled by regular serverless function runtimes.
- IBM-PyWren implements a data partitioner and discovery functionality that
automatically splits an object from IBM Cloud Object Storage (COS) into
smaller fixed-size chunks, making it easier to iterate over the data.
- IBM-PyWren eases the way user code communicates with object storage to get
and put data or pass data between functions, or to send and receive events
through RabbitMQ queues, by providing ready-to-use clients that are already
authenticated.
- IBM-PyWren implements its calls following the 'future' or 'promise'
asynchronous programming model. This allows us to execute map and map-reduce
jobs asynchronously and consult the results later in the code.
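To illustrate the partitioner idea mentioned above, splitting an object into fixed-size chunks can be sketched in plain Python. The function name and chunk size are illustrative; the real partitioner operates on COS objects and is considerably more involved:

```python
def partition(data: bytes, chunk_size: int) -> list[bytes]:
    # Split an object into fixed-size chunks so each map task can
    # process one chunk independently (the last chunk may be smaller).
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

chunks = partition(b"abcdefghij", 4)  # [b"abcd", b"efgh", b"ij"]
```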
* *Why IBM-PyWren + Airflow?*
Currently, Airflow doesn't fully take advantage of the serverless cloud
computing execution model. To scale out Airflow, we need to use the Kubernetes
executor or the Celery executor; either way, we have to provision
infrastructure that will probably sit idle most of the time. Even then, if we
want to process 1000 or more parallel tasks, we would need a lot of horsepower
in our cluster, skyrocketing the maintenance cost, or else settle for a slow
data processing pipeline. IBM-PyWren brings massive scaling and
parallelization of tasks to even the most humble Airflow cluster, in addition
to only having to pay for the resources we actually use.
> Incorporate IBM-PyWren Plugin to contrib operators
> --------------------------------------------------
>
> Key: AIRFLOW-6093
> URL: https://issues.apache.org/jira/browse/AIRFLOW-6093
> Project: Apache Airflow
> Issue Type: New Feature
> Components: contrib, hooks, operators
> Affects Versions: 1.10.6
> Reporter: Aitor Arjona
> Assignee: Aitor Arjona
> Priority: Minor
> Labels: features
> Original Estimate: 168h
> Remaining Estimate: 168h
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)