[
https://issues.apache.org/jira/browse/AIRFLOW-6093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Aitor Arjona updated AIRFLOW-6093:
----------------------------------
Description:
The aim of this issue is to add IBM-PyWren plugin operators to the default
Airflow contrib/operators. This plugin can be found
[here|https://github.com/cloudbutton/ibm-pywren_airflow-plugin].
* *What is IBM-PyWren?*
[PyWren|https://github.com/pywren/pywren] is an open source project whose goals
are massively scaling the execution of Python code and its dependencies on
serverless computing platforms and monitoring the results. PyWren delivers the
user’s code into the serverless platform in a transparent way, without
requiring knowledge of how functions are deployed, invoked and run.
[IBM-PyWren|https://github.com/pywren/pywren-ibm-cloud] is based on PyWren's
main branch and adapted for IBM Cloud Functions and IBM Cloud Object Storage.
IBM-PyWren is not, however, a mere reimplementation of PyWren’s API atop
IBM Cloud Functions. Rather, it must be viewed as an advanced extension of
PyWren that runs broader Map-Reduce jobs, based on Docker images. In extending
PyWren to work with IBM Cloud Object Storage, a partition discovery component
was also added that allows PyWren to process large amounts of data stored in
IBM Cloud Object Storage.
* *What does IBM-PyWren do, and what does the Airflow plugin offer?*
IBM-PyWren provides a clean, easy-to-use interface that implements the
typical Map and Map-Reduce programming models, but executes every map function
as a serverless function running on a serverless platform. This allows us to
run embarrassingly parallel workloads, such as many big data applications, for
which this model has already proved very efficient and useful.
The following snippet portrays the simplicity of using IBM-PyWren:
{code:python}
import pywren_ibm_cloud as pywren

def my_map_function(id, x):
    return x + 7

iterdata = [1, 2, 3, 4]
pw = pywren.ibm_cf_executor()
pw.map(my_map_function, iterdata)
result = pw.get_result()  # result = [8, 9, 10, 11]
{code}
IBM-PyWren map and map-reduce jobs are asynchronous calls; that is, user code
can continue executing until {{get_result()}} is called, at which point
execution blocks until the map results are ready.
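This call pattern mirrors Python's standard futures model. A minimal local sketch of the same pattern, using only the standard library rather than IBM-PyWren itself (the thread pool here merely stands in for the serverless platform):

```python
from concurrent.futures import ThreadPoolExecutor

def my_map_function(x):
    return x + 7

iterdata = [1, 2, 3, 4]

with ThreadPoolExecutor() as pool:
    # submit() returns immediately; other work can run here while
    # the tasks execute in the background
    futures = [pool.submit(my_map_function, x) for x in iterdata]
    # collecting the results blocks until every task has finished,
    # just like get_result() in the snippet above
    result = [f.result() for f in futures]

print(result)  # [8, 9, 10, 11]
```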
An Airflow plugin that adds the following operators and hooks has already
been implemented:
- {{IbmPyWrenCallAsyncOperator(callable, data)}}: Invokes a single function.
- {{IbmPyWrenMapOperator(map_function, map_iterdata)}}: Invokes multiple
parallel tasks, one per element of {{map_iterdata}}. It applies the function
{{map_function}} to every element in {{map_iterdata}}.
- {{IbmPyWrenMapReduceOperator(map_function, map_iterdata, reduce_function)}}:
Invokes multiple parallel tasks, one per element of {{map_iterdata}}. It
applies the function {{map_function}} to every element in {{map_iterdata}}.
Finally, it invokes a {{reduce_function}} that gathers all the map results.
- {{IbmPyWrenHook()}}: Provides a PyWren executor ready to use.
* *But Airflow already has operators to work with serverless platforms*
Using IBM-PyWren as a client and runtime for serverless functions has some
advantages that other clients don't have:
- IBM-PyWren is specifically designed and optimized to run Map and Map-Reduce
jobs, oriented towards data analytics use cases.
- IBM-PyWren automatically pickles already-written sequential user code
together with its dependencies and massively parallelizes its execution,
running it on thousands of cores across a variety of serverless platforms,
even in different regions, including Knative, IBM Cloud Functions, AWS Lambda,
GCP Functions and Azure Functions.
- The IBM-PyWren executor efficiently manages the invocation of 1000+
functions and monitors their execution by wrapping the user code and handling
exceptions when a function raises one, runs out of memory or times out. These
errors wouldn't be handled by regular serverless function runtimes.
- IBM-PyWren implements a data partitioner and discovery functionality that
automatically splits an object from IBM Cloud Object Storage (COS) into
smaller fixed-size chunks, making it easier to iterate over the data.
- IBM-PyWren eases the way user code communicates with object storage to get
and put data or pass data between functions, or to send and receive events
through RabbitMQ queues, by providing ready-to-use clients that are already
authenticated.
- IBM-PyWren implements its calls following the 'future' or 'promise'
asynchronous programming model. This allows us to execute map and map-reduce
jobs asynchronously and consult the results later in the code.
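To illustrate the partitioner idea mentioned above, splitting an object into fixed-size chunks can be sketched in plain Python. The function name and chunk size are illustrative; the real partitioner operates on COS objects and is considerably more involved:

```python
def partition(data: bytes, chunk_size: int) -> list[bytes]:
    # Split an object into fixed-size chunks so each map task can
    # process one chunk independently (the last chunk may be smaller).
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

chunks = partition(b"abcdefghij", 4)  # [b"abcd", b"efgh", b"ij"]
```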
* *Why IBM-PyWren + Airflow?*
Currently, Airflow doesn't fully take advantage of the serverless cloud
computing execution model. To scale out Airflow, we need to use the Kubernetes
executor or the Celery executor; either way, we have to provision
infrastructure that will probably sit idle most of the time. Even then, if we
want to process 1000 or more parallel tasks, we would need a lot of horsepower
in our cluster, skyrocketing the maintenance cost, or else settle for a slow
data processing pipeline. IBM-PyWren brings massive scaling and
parallelization of tasks to even the most humble Airflow cluster, in addition
to only having to pay for the resources we actually use.
> Incorporate IBM-PyWren Plugin to contrib operators
> --------------------------------------------------
>
> Key: AIRFLOW-6093
> URL: https://issues.apache.org/jira/browse/AIRFLOW-6093
> Project: Apache Airflow
> Issue Type: New Feature
> Components: contrib, hooks, operators
> Affects Versions: 1.10.6
> Reporter: Aitor Arjona
> Assignee: Aitor Arjona
> Priority: Minor
> Labels: features
> Original Estimate: 168h
> Remaining Estimate: 168h
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)