Re: [kepler-dev] extending kepler with workflow optimization algorithms

Christopher Brooks Mon, 11 Jul 2011 10:06:19 -0700

Hi Efthymia,
I just stumbled across ptolemy/domains/sdf/optimize,
which has an OptimizingSDFDirector:
"OptimizingSDFScheduler is the scheduler companion to the OptimizingSDFDirector
It works with the synchronous dataflow (SDF) model of computation to find
an optimized schedule according to a defined criterion. "


This is the work of Marc Geilen, who visited Berkeley awhile back.
The code might be an interesting starting point for some of your ideas.

_Christopher

On 7/11/11 6:13 AM, Efthymia Tsamoura wrote:

Dear Edward

Thank you very much for the valuable feedback.

Best Regards,
Efi


Quoting "Edward A. Lee" <[email protected]>:


Dear Efi,

The question of whether order of invocation affects the computed
results is the question of determinacy. Here are a few papers
that address this:


http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-7.html

http://ptolemy.eecs.berkeley.edu/publications/papers/06/problemwithThreads/

http://ptolemy.eecs.berkeley.edu/publications/papers/95/processNets/

There are many other papers on the Ptolemy website that talk some
about the issue, but I think the ones above are examples
where it is a major focus.

Edward


On 7/8/11 2:22 AM, Efthymia Tsamoura wrote:

Dear Christopher

I was actually thinking of developing a new director for kepler that
optimizes a workflow using an algorithm that I have recently developed
as part of my PhD. The characteristics of the optimization algorithm are
the following:

1.The algorithm takes into account both the cost of the services and the
number of tokens passed between actors to produce an optimal (in terms
of cost) workflow.

2.It also accounts decentralized data transfers between the actors,
i.e., it is assumed that the actors can be deployed anywhere in a
wide-area infrastructure, while they can exchange data directly with
varying data transferring costs.

3.Finally, it assumes that the data are exchanged among the actors in a
pipelined fashion, so that the tuples already processed by an actor are
processed by the subsequent actor in the workflow at the same time as
the former processes new input data.

(If you are interested, a technical report presenting the algorithm can
be found at:
http://delab.csd.auth.gr/~tsamoura/tsamoura_2010_technical-report.pdf)

Although I have seen that kepler does not support decentralized data
transfers but only centralized ones (through the DistributedComposite
actor), i prefer kepler for experimentation because it is extremely well
documented and can be easily extended to support new functionality.

So, I was actually wondering if types of workflows, such as the ones
presented in the example of my first email, are met. I have seen many
scientific workflows but in the majority of them the workflow tasks must
be in a predefined order, while altering the task ordering either does
not produce the correct result, or does not change the amount of tokens
passed between the actors.

I want to discover such workflow types/examples, in order to see if such
an optimization algorithm would be useful for the community.

Thank you very much for the quick reply


Best regards
Efi


Quoting Christopher Brooks <[email protected]>:

Hi Efthymia,
Scheduling of actor models is a deep topic.
Kepler uses Ptolemy II as its execution engine.
Ptolemy II is based on Ptolemy Classic.
In Ptolemy Classic, we had a number of different synchronous dataflow
scheduling
algorithms, many of these were oriented towards clustering for parallel
processing.

In your example model, I see there being two costs:
1) The cost of the service. For example FICO might charge more money
per query than an email lookup service.
2) The number of tokens passed between actors varies.

There is a relationship between these two costs where #1 is usually the
more important cost, but #2 can eventually overrule #1.

Offhand, I don't know of a model of computation that does exactly what
you want.

In Ptolemy II, models like your example model are typically Kahn
Process Network (PN) models where each actor is a separate process.

http://ptolemy.eecs.berkeley.edu/ptolemyII/ptII8.0/ptII8.0.1/ptolemy/domains/pn/doc/

says:

"The following are the most important features of the operational
semantics of
PN as proposed by Kahn and MacQueen:

* This is a network of sequential processes.
* The processes do not share memory. Instead they communicate with
each other
only through unidirectional FIFO channels.
* Communication between processes is asynchronous.
* Processes block on a read from a FIFO channel if the channel is
empty but
can write to a channel whenever they want to."

In PN, there is not really a predefined schedule, the threads operate
and then block.

Synchronous Dataflow (SDF) can be thought of as a subset of PN, where
the number of tokens is known in advance and a schedule is defined in
advance.

Dynamic Dataflow (DDF) is between the two, where the number of tokens
passed
between actors is not known in advance.

There is another trivial director called the LeftToRightDirector that
fires the actors
in order from LeftToRight. This would allow you to drag actors around
and try
different firings. That director is at
ptII/doc/tutorial/domains/LeftRightDirector.java

http://ptolemy.eecs.berkeley.edu/ptolemyII/ptIIfaq.htm#kepler
says
"If you want to use a director not in Kepler tree, you have to use the
"Tools/Instantiate Attribute" menu. I use it to get a DDF director
frequently (class ptolemy.domains.ddf.kernel.DDFDirector). "

So, in Kepler, you would do Tools-> Instantiate Attribute and then
enter doc.tutorial.domains.LeftRightDirector, but that does
not seem to work in Kepler, so you would need to download Ptolemy II via
http://ptolemy.eecs.berkeley.edu/ptolemyII/ptII8.0/

The Timed Multitasking model (TM) is somewhat close to what you want
http://ptolemy.eecs.berkeley.edu/ptolemyII/ptII8.0/ptII8.0.1/ptolemy/domains/tm/doc/

says
--start--
The timed multitasking (TM) domain, created by Jie Liu, offers a model
of computation based on priority-driven multitasking, as common in
real-time operating systems (RTOSs), but with more deterministic
behavior. In TM, actors (conceptually) execute as concurrent threads
in reaction to inputs. The domain provides an event dispatcher, which
maintains a prioritized event queue. The execution of an actor is
triggered by the event dispatcher by invoking first its prefire()
method. The actor may begin execution of a concurrent thread at this
time. Some time later, the dispatcher will invoke the fire() and
postfire() methods of the actor (unless prefire() returns false).

The amount of time that elapses between the invocation of prefire()
and fire() depends on the declared /executionTime/ and /priority/ of
the actor (or more specifically, of the port of the port receiving the
triggering event). The domain assumes there is a single resource, the
CPU, shared by the execution of all actors. At one particular time,
only one of the actors can get the resource and execute. Execution of
one actor may be preempted by another eligible actor with a higher
priority input event. If an actor is not preempted, then the amount of
time that elapses between prefire() and fire() equals the declared.
/executionTime/. If it is preempted, then it equals the sum of the
/executionTime/ and the execution times of the actors that preempt it.
The model of computation is more deterministic than the usual
priority-driven multitasking because the actor produces outputs (in
its fire() method) only after it has been assured access to the CPU
for its declared /executionTime/. In this domain, the model time may
be synchronized to real time or not.

--end--



For an overview of the models of computation, see the "Domain
Overview" link at
the top of
http://ptolemy.eecs.berkeley.edu/ptolemyII/ptII8.0/ptII8.0.1/doc/index.htm


or the first chapter of


http://www.eecs.berkeley.edu/Pubs/TechRpts/2008/EECS-2008-28.pdf


_Christopher






On 7/7/11 2:10 AM, Efthymia Tsamoura wrote:

Hello

I am a phd student and during this period i am dealing with workflow
optimization problems in distributed environments. I would like to
ask, if there are exist any cases where if the order of task
invocation in a scientific workflow changes its performance changes
too without, however, affecting the produced results. In the
following, a present a small use case of the problem i am interested in:

Suppose that a company wants to obtain a list of email addresses of
potential customers selecting only those who have a good payment
history for at least one card and a credit rating above some
threshold. The company has the right to use the following web services

WS1 : SSN id (ssn, threshold) -> credit rating (cr)
WS2 : SSN id (ssn) -> credit card numbers (ccn)
WS3 : card number (ccn, good) -> good history (gph)
WS4 : SSN id (ssn) -> email addresses (ea)

The input data containing customer identifiers (ssn) and other
relevant information is stored in a local data resource. Two possible
web service linear workflows that can be formed to process the input
data using the above services are C1 = WS2,WS3,WS1,WS4 and C2 =
WS1,WS2,WS3,WS4. In the first workflow, first, the customers having a
good payment history are initially selected (WS2,WS3), and then, the
remaining customers whose credit history is below some threshold are
filtered out (through WS1). The C2 workflow performs the same tasks
in a reverse order. The above linear workflows may have different
performance; if WS3 filters out more data than WS1, then it will be
more beneficial to invoke WS3 before WS1 in order for the subsequent
web services in the workflow to process less data.

It would be very useful to know if there exist similar scientific
workflow examples (where the order of task invocation can change and
it is not known a-priori by the user, while the workflow performance
depends on the workflow task invocation order) and if you are
interested in extending kepler with optimization algorithms for such
workflows.

I am asking because i have recently developed an optimization
algorithm for this problem and i would like to test its performance
in a real-world workflow management system with real-world workflows.

P.S.: references to publications or any other information dealing
with scientific workflows of the above rationale will be extremely
useful.

Thank you very much for your time



_______________________________________________
Kepler-dev mailing list
[email protected]
http://lists.nceas.ucsb.edu/kepler/mailman/listinfo/kepler-dev


--
Christopher Brooks, PMP University of California
CHESS Executive Director US Mail: 337 Cory Hall
Programmer/Analyst CHESS/Ptolemy/Trust Berkeley, CA 94720-1774
ph: 510.643.9841 (Office: 545Q Cory)
home: (F-Tu) 707.665.0131 cell: 707.332.0670





_______________________________________________
Kepler-dev mailing list
[email protected]
http://lists.nceas.ucsb.edu/kepler/mailman/listinfo/kepler-dev


--
Christopher Brooks, PMP                       University of California
CHESS Executive Director                      US Mail: 337 Cory Hall
Programmer/Analyst CHESS/Ptolemy/Trust        Berkeley, CA 94720-1774
ph: 510.643.9841                              (Office: 545Q Cory)
home: (F-Tu) 707.665.0131 cell: 707.332.0670

_______________________________________________
Kepler-dev mailing list
[email protected]
http://lists.nceas.ucsb.edu/kepler/mailman/listinfo/kepler-dev

Re: [kepler-dev] extending kepler with workflow optimization algorithms

Reply via email to