Re: [kepler-dev] extending kepler with workflow optimization algorithms

Efthymia Tsamoura Fri, 08 Jul 2011 02:23:49 -0700

Dear Christopher

I was actually thinking of developing a new director for Kepler that optimizes a workflow using an algorithm that I have recently developed as part of my PhD. The characteristics of the optimization algorithm are the following:

1. The algorithm takes into account both the cost of the services and the number of tokens passed between actors to produce an optimal (in terms of cost) workflow.

2. It also accounts for decentralized data transfers between the actors, i.e., it is assumed that the actors can be deployed anywhere in a wide-area infrastructure, while they can exchange data directly with varying data transfer costs.

3. Finally, it assumes that the data are exchanged among the actors in a pipelined fashion, so that the tuples already processed by an actor are processed by the subsequent actor in the workflow at the same time as the former processes new input data.

(If you are interested, a technical report presenting the algorithm can be found at: http://delab.csd.auth.gr/~tsamoura/tsamoura_2010_technical-report.pdf)

Although I have seen that Kepler does not support decentralized data transfers but only centralized ones (through the DistributedComposite actor), I prefer Kepler for experimentation because it is extremely well documented and can be easily extended to support new functionality.

So, I was actually wondering whether workflows of the type presented in the example of my first email occur in practice. I have seen many scientific workflows, but in the majority of them the workflow tasks must be in a predefined order, and altering the task ordering either does not produce the correct result or does not change the number of tokens passed between the actors.

I want to discover such workflow types/examples, in order to see if such an optimization algorithm would be useful for the community.


Thank you very much for the quick reply


Best regards
Efi


Quoting Christopher Brooks <[email protected]>:

Hi Efthymia,
Scheduling of actor models is a deep topic.
Kepler uses Ptolemy II as its execution engine.
Ptolemy II is based on Ptolemy Classic.
In Ptolemy Classic, we had a number of different synchronous dataflow scheduling
algorithms; many of these were oriented towards clustering for parallel
processing.

In your example model, I see there being two costs:
1) The cost of the service.  For example FICO might charge more money
per query than an email lookup service.
2) The number of tokens passed between actors varies.

There is a relationship between these two costs where #1 is usually the
more important cost, but #2 can eventually overrule #1.

Offhand, I don't know of a model of computation that does exactly what
you want.

In Ptolemy II, models like your example model are typically Kahn
Process Network (PN) models where each actor is a separate process.

http://ptolemy.eecs.berkeley.edu/ptolemyII/ptII8.0/ptII8.0.1/ptolemy/domains/pn/doc/
says:
"The following are the most important features of the operational semantics of
  PN as proposed by Kahn and MacQueen:

  * This is a network of sequential processes.
  * The processes do not share memory. Instead they communicate with each other
  only through unidirectional FIFO channels.
  * Communication between processes is asynchronous.
  * Processes block on a read from a FIFO channel if the channel is empty but
  can write to a channel whenever they want to."
In PN, there is not really a predefined schedule; the threads operate and then block.
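
To make the PN features quoted above concrete, here is a minimal sketch in plain Python (not Ptolemy II or Kepler code). Each process is a thread, channels are unbounded FIFO queues, reads block on an empty channel, and writes never block. The process names and the SENTINEL end-of-stream marker are assumptions made for this illustration only.

```python
import queue
import threading

SENTINEL = object()  # hypothetical end-of-stream marker, not part of PN itself


def source(out_ch, values):
    """Write every value to the output channel, then signal end of stream."""
    for v in values:
        out_ch.put(v)  # a write to an unbounded FIFO never blocks
    out_ch.put(SENTINEL)


def keep_even(in_ch, out_ch):
    """Forward only even tokens; acts as a filtering process."""
    while True:
        tok = in_ch.get()  # blocks while the channel is empty
        if tok is SENTINEL:
            out_ch.put(SENTINEL)
            return
        if tok % 2 == 0:
            out_ch.put(tok)


def collect(in_ch, results):
    """Drain the input channel into a list."""
    while True:
        tok = in_ch.get()
        if tok is SENTINEL:
            return
        results.append(tok)


def run_network(values):
    """Wire source -> keep_even -> collect and run each as its own thread."""
    a_to_b, b_to_c = queue.Queue(), queue.Queue()
    results = []
    threads = [
        threading.Thread(target=source, args=(a_to_b, values)),
        threading.Thread(target=keep_even, args=(a_to_b, b_to_c)),
        threading.Thread(target=collect, args=(b_to_c, results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

No schedule is computed anywhere in this sketch: the order in which the threads run is left to the runtime, yet the collected result is the same on every run because each process only communicates through its FIFO channels.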
Synchronous Dataflow (SDF) can be thought of as a subset of PN, where
the number of tokens is known in advance and a schedule is defined in advance.
Dynamic Dataflow (DDF) is between the two, where the number of tokens passed
between actors is not known in advance.
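
As an illustration of why SDF can fix a schedule in advance, the sketch below solves the standard SDF balance equations for a simple chain of actors: on every edge, the tokens produced per iteration must equal the tokens consumed. This is plain Python, not Ptolemy II code; the function name and the (produced, consumed) rate format are assumptions for this example.

```python
from fractions import Fraction
from math import lcm


def chain_repetitions(rates):
    """Return the smallest firing counts for a chain of SDF actors.

    rates[i] is (produced, consumed): tokens actor i writes per firing and
    tokens actor i+1 reads per firing on the edge between them.
    """
    firings = [Fraction(1)]  # fire the first actor once, as a starting point
    for produced, consumed in rates:
        # Balance equation: firings[i] * produced == firings[i+1] * consumed
        firings.append(firings[-1] * produced / consumed)
    # Scale the rational solution up to the smallest all-integer solution.
    scale = lcm(*(f.denominator for f in firings))
    return [int(f * scale) for f in firings]
```

For example, with A producing 2 tokens and B consuming 3, then B producing 1 and C consuming 2, `chain_repetitions([(2, 3), (1, 2)])` gives 3 firings of A, 2 of B, and 1 of C per iteration. In DDF the rates are not known in advance, so no such table can be computed before execution.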
There is another trivial director called the LeftRightDirector that fires the actors in order from left to right. This would allow you to drag actors around and try
different firings.  That director is at
ptII/doc/tutorial/domains/LeftRightDirector.java

http://ptolemy.eecs.berkeley.edu/ptolemyII/ptIIfaq.htm#kepler
says
"If you want to use a director not in Kepler tree, you have to use the "Tools/Instantiate Attribute" menu. I use it to get a DDF director frequently (class ptolemy.domains.ddf.kernel.DDFDirector)."
So, in Kepler, you would do Tools-> Instantiate Attribute and then
enter doc.tutorial.domains.LeftRightDirector, but that does
not seem to work in Kepler, so you would need to download Ptolemy II via
http://ptolemy.eecs.berkeley.edu/ptolemyII/ptII8.0/

The Timed Multitasking model (TM) is somewhat close to what you want
http://ptolemy.eecs.berkeley.edu/ptolemyII/ptII8.0/ptII8.0.1/ptolemy/domains/tm/doc/
says
--start--
The timed multitasking (TM) domain, created by Jie Liu, offers a model of computation based on priority-driven multitasking, as common in real-time operating systems (RTOSs), but with more deterministic behavior. In TM, actors (conceptually) execute as concurrent threads in reaction to inputs. The domain provides an event dispatcher, which maintains a prioritized event queue. The execution of an actor is triggered by the event dispatcher by invoking first its prefire() method. The actor may begin execution of a concurrent thread at this time. Some time later, the dispatcher will invoke the fire() and postfire() methods of the actor (unless prefire() returns false).
The amount of time that elapses between the invocation of prefire() and fire() depends on the declared /executionTime/ and /priority/ of the actor (or more specifically, of the port receiving the triggering event). The domain assumes there is a single resource, the CPU, shared by the execution of all actors. At one particular time, only one of the actors can get the resource and execute. Execution of one actor may be preempted by another eligible actor with a higher priority input event. If an actor is not preempted, then the amount of time that elapses between prefire() and fire() equals the declared /executionTime/. If it is preempted, then it equals the sum of the /executionTime/ and the execution times of the actors that preempt it. The model of computation is more deterministic than the usual priority-driven multitasking because the actor produces outputs (in its fire() method) only after it has been assured access to the CPU for its declared /executionTime/. In this domain, the model time may be synchronized to real time or not.
--end--
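
The timing rule in the TM description above can be illustrated with a small single-CPU simulation. This is a sketch in plain Python, not the Ptolemy II TM implementation. It makes two assumptions: the arrival of a triggering event is treated as the moment prefire() is invoked, and a smaller priority value means higher priority. The function name and the task tuple layout are hypothetical.

```python
import heapq


def fire_times(tasks):
    """Return {name: time at which fire() would produce outputs}.

    tasks: list of (name, arrival, priority, execution_time).
    Assumption: a smaller priority value preempts a larger one.
    """
    pending = sorted(tasks, key=lambda t: t[1])  # order by arrival time
    ready = []  # heap of (priority, arrival, name, remaining_time)
    now, i, done = 0.0, 0, {}
    while i < len(pending) or ready:
        if not ready:
            now = max(now, pending[i][1])  # CPU idles until the next arrival
        while i < len(pending) and pending[i][1] <= now:
            name, arrival, prio, exe = pending[i]
            heapq.heappush(ready, (prio, arrival, name, exe))
            i += 1
        prio, arrival, name, remaining = heapq.heappop(ready)
        next_arrival = pending[i][1] if i < len(pending) else float("inf")
        run = min(remaining, next_arrival - now)  # run until done or next event
        now += run
        if run < remaining:
            # A new event arrived: requeue and let the priorities decide.
            heapq.heappush(ready, (prio, arrival, name, remaining - run))
        else:
            done[name] = now
    return done
```

With `fire_times([("A", 0, 2, 3), ("B", 1, 1, 2)])`, actor A starts at time 0 with an execution time of 3 and is preempted at time 1 by B, whose execution time is 2. B finishes at time 3 and A at time 5, so the elapsed time for A is its own execution time plus that of the actor that preempted it (3 + 2), as the quoted text describes.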
For an overview of the models of computation, see the "Domain Overview" link at
the top of
http://ptolemy.eecs.berkeley.edu/ptolemyII/ptII8.0/ptII8.0.1/doc/index.htm

or the first chapter of


     http://www.eecs.berkeley.edu/Pubs/TechRpts/2008/EECS-2008-28.pdf


_Christopher






On 7/7/11 2:10 AM, Efthymia Tsamoura wrote:
Hello
I am a PhD student, and I am currently working on workflow optimization problems in distributed environments. I would like to ask whether there exist any cases where changing the order of task invocation in a scientific workflow changes its performance without, however, affecting the produced results. In the following, I present a small use case of the problem I am interested in:
Suppose that a company wants to obtain a list of email addresses of potential customers, selecting only those who have a good payment history for at least one card and a credit rating above some threshold. The company has the right to use the following web services:
WS1 : SSN id (ssn, threshold) -> credit rating (cr)
WS2 : SSN id (ssn) -> credit card numbers (ccn)
WS3 : card number (ccn, good) -> good history (gph)
WS4 : SSN id (ssn) -> email addresses (ea)
The input data, containing customer identifiers (ssn) and other relevant information, is stored in a local data resource. Two possible linear web service workflows that can be formed to process the input data using the above services are C1 = WS2,WS3,WS1,WS4 and C2 = WS1,WS2,WS3,WS4. In the first workflow, the customers having a good payment history are selected first (WS2,WS3), and then the remaining customers whose credit rating is below some threshold are filtered out (through WS1). The C2 workflow performs the same tasks in the reverse order. The above linear workflows may have different performance; if WS3 filters out more data than WS1, then it will be more beneficial to invoke WS3 before WS1 so that the subsequent web services in the workflow process less data.
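
The effect of the ordering can be sketched with a simple additive cost model: each service has a per-tuple cost and a selectivity (output tuples per input tuple), and the cost of a linear workflow is the sum over services of the per-tuple cost times the fraction of tuples still alive at that point. All numbers below are hypothetical, and the brute-force search is only an illustration in plain Python; it is not the optimization algorithm mentioned in this thread, and it ignores data transfer costs and pipelining.

```python
from itertools import permutations

# Hypothetical (cost per input tuple, selectivity) for each web service.
SERVICES = {
    "WS1": (2.0, 0.5),  # credit rating check, keeps half of the customers
    "WS2": (1.0, 1.0),  # card lookup, assumed one card per customer
    "WS3": (1.0, 0.2),  # payment history check, keeps one in five
    "WS4": (1.0, 1.0),  # email lookup, filters nothing
}
# WS3 needs the card numbers produced by WS2, so WS2 must come first.
PRECEDENCE = [("WS2", "WS3")]


def plan_cost(order, services=SERVICES):
    """Expected cost per input tuple of invoking the services in this order."""
    cost, tuples = 0.0, 1.0
    for name in order:
        per_tuple_cost, selectivity = services[name]
        cost += tuples * per_tuple_cost
        tuples *= selectivity  # fewer tuples reach the next service
    return cost


def is_valid(order, precedence=PRECEDENCE):
    """Check that every (before, after) pair is respected by the ordering."""
    return all(order.index(a) < order.index(b) for a, b in precedence)


def best_plan(services=SERVICES, precedence=PRECEDENCE):
    """Enumerate all valid orderings and return the cheapest one."""
    candidates = (p for p in permutations(services) if is_valid(p, precedence))
    return min(candidates, key=lambda p: plan_cost(p, services))
```

Under these assumed numbers, C1 = WS2,WS3,WS1,WS4 costs about 2.5 per input tuple and C2 = WS1,WS2,WS3,WS4 about 3.1, because WS3 is both cheaper and more selective than WS1. Swapping the two selectivities would favour C2 instead.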
It would be very useful to know if there exist similar scientific workflow examples (where the order of task invocation can change and is not known a priori by the user, while the workflow performance depends on the task invocation order) and if you are interested in extending Kepler with optimization algorithms for such workflows.
I am asking because I have recently developed an optimization algorithm for this problem and I would like to test its performance in a real-world workflow management system with real-world workflows.
P.S.: References to publications or any other information dealing with scientific workflows of the above kind would be extremely useful.
Thank you very much for your time



_______________________________________________
Kepler-dev mailing list
[email protected]
http://lists.nceas.ucsb.edu/kepler/mailman/listinfo/kepler-dev
--
Christopher Brooks, PMP                       University of California
CHESS Executive Director                      US Mail: 337 Cory Hall
Programmer/Analyst CHESS/Ptolemy/Trust        Berkeley, CA 94720-1774
ph: 510.643.9841                              (Office: 545Q Cory)
home: (F-Tu) 707.665.0131 cell: 707.332.0670




