On 14/11/2013 12:23 AM, Kenneth Wilkerson wrote:
If I read the article you sent correctly, this algorithm is using a spin
lock. It has provisions for implementing a lock-free algorithm, but none of
those are detailed.

Most of the shared_ptr implementations in the wild, including z/OS's, use a lock-free reference-counting algorithm invented by Alexander Terekhov (IBM). An atomic reference-counted smart pointer is a nice tool to have in your bag.
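Terekhov's scheme boils down to atomic increments plus an increment-if-not-zero CAS loop for weak-to-strong promotion. A minimal sketch in standard C++ (the control_block name and layout are illustrative, not IBM's actual implementation):

```cpp
#include <atomic>

// Illustrative control block: the strong count is updated with atomic
// fetch-and-add / compare-and-swap, never a spin lock.
struct control_block {
    std::atomic<long> strong{1};

    void add_ref() {
        // Plain atomic increment; the caller already owns a reference.
        strong.fetch_add(1, std::memory_order_relaxed);
    }

    bool add_ref_if_alive() {
        // Increment-if-nonzero CAS loop, as needed when promoting a
        // weak reference to a strong one (weak_ptr::lock).
        long n = strong.load(std::memory_order_relaxed);
        while (n != 0) {
            if (strong.compare_exchange_weak(n, n + 1,
                                             std::memory_order_acq_rel))
                return true;  // on failure, n is reloaded and we retry
        }
        return false;  // object already destroyed
    }

    bool release() {
        // Returns true when the last strong reference is dropped.
        return strong.fetch_sub(1, std::memory_order_acq_rel) == 1;
    }
};
```

The increment-if-nonzero loop is the subtle part: a weak reference may only be promoted while the strong count is still non-zero, which a plain fetch_add cannot express.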

PLO has no restrictions other than the limits of
my imagination.

Or patents! I notice IBM has quite a few with respect to PLO. This one, for example: http://www.google.com.mx/patents/US6636883. The US patent office seems to like patents related to multi-processing. Maybe they think it's novel.

Software patents have a funky smell!


However, in the last sentence of the performance section, the authors state
"Note that we make no claim that the atomic access approach unconditionally
outperforms the others, merely that it may be best for certain scenarios
that can be encountered in practice." I think this is the key to lock-free
implementations. I have generalized primitive functions to perform most of
my PLO operations. But I design and tune each to the specific use. I
understand the desire to provide generalized functionality for high level
language use. However, I do not accept the premise that all lists are the
same. And I would certainly use different algorithms for lists that I
expected to get "large" and lists that have more than a small percentage of
update operations.

While it may not be true on z/OS, most software developers these days use high-level languages for multi-threaded applications and prefer abstractions because they are easy to use and reason about. Of course, that doesn't mean they shouldn't understand what's happening under the covers. The trick is keeping the optimizer out of the mix, which is where inline assembler comes in handy.

There are many high-quality (free) libraries for lock-free data structures that are relatively easy to port to z/OS. Using advanced language features, it's quite simple to configure a highly granular concurrent queue using policies. The difficult part is testing the bloody things!

typedef mpmc_queue< fixed_array, lock_free, smart_ptr > multi_queue;
typedef spsc_queue< priority_list, mutex, unique_ptr > blocking_queue;
typedef thread_pool< multi_queue, max_threads<8> > task_pool;

socket_server<task_pool> server;
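Those typedefs are hypothetical names, but the policy-based technique behind them is easy to sketch: a container template takes its locking strategy as a type parameter and the combinations get baked in at compile time. A toy version, assuming nothing beyond the standard library:

```cpp
#include <deque>
#include <mutex>
#include <cstddef>

// "No locking" policy for single-threaded use; satisfies the same
// lock()/unlock() interface as std::mutex.
struct null_mutex {
    void lock() {}
    void unlock() {}
};

// Toy queue parameterized on its locking policy.
template <typename T, typename LockPolicy = std::mutex>
class basic_queue {
    std::deque<T> items_;
    LockPolicy    lock_;
public:
    void push(const T& v) {
        std::lock_guard<LockPolicy> g(lock_);
        items_.push_back(v);
    }
    bool pop(T& out) {
        std::lock_guard<LockPolicy> g(lock_);
        if (items_.empty()) return false;
        out = items_.front();
        items_.pop_front();
        return true;
    }
    std::size_t size() const { return items_.size(); }
};

// Policies compose at compile time, in the spirit of the typedefs above:
typedef basic_queue<int, std::mutex> blocking_int_queue;
typedef basic_queue<int, null_mutex> single_threaded_int_queue;
```

The null_mutex variant compiles down to no synchronization at all, which is the whole point of selecting the policy statically rather than at run time.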


Kenneth
----------------------------------------------------------------------------


-----Original Message-----
From: IBM Mainframe Discussion List [mailto:[email protected]] On
Behalf Of David Crayford
Sent: Tuesday, November 12, 2013 11:45 PM
To: [email protected]
Subject: Re: Serialization without Enque

On 13/11/2013 12:34 PM, Kenneth Wilkerson wrote:
Actually, the algorithm performs well for read-often, write-rarely
lists because the active chain count does not change and therefore
there are relatively infrequent re-drives.  The active chain count
only changes on an add or delete. So if there are infrequent adds and
deletes, there will be infrequent re-drives. And you are wrong,
readers will not contend unless two or more tasks are referencing the
exact same element simultaneously. And even then, the re-drive only
involves the update to the use count.

Thanks, I get it now. Maybe IBM should have used PLO for the z/OS C++
shared_ptr SMR algorithm, which has both weak/strong reference counts plus
the pointer. They use a lock-free algorithm based on CS:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2674.htm.
shared_ptr implements proxy based garbage collection.
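The n2674 approach keeps both counts in the control block and updates them with plain CS. A rough sketch of packing the strong and weak counts into one word so a single compare-and-swap covers both (class and member names are my own invention):

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical SMR control block keeping strong and weak counts in one
// 64-bit word so both can be updated with a single compare-and-swap,
// in the spirit of the CS-based approach in the paper.
class ref_counts {
    std::atomic<uint64_t> word_;  // high 32 bits: strong, low 32: weak

    static uint64_t pack(uint32_t s, uint32_t w) {
        return (uint64_t(s) << 32) | w;
    }
public:
    // Strong owners collectively hold one implicit weak reference.
    ref_counts() : word_(pack(1, 1)) {}

    uint32_t strong() const { return uint32_t(word_.load() >> 32); }
    uint32_t weak()   const { return uint32_t(word_.load()); }

    // Drop one strong reference; when the last one goes, also drop the
    // implicit weak reference -- both transitions in one CAS loop.
    void release_strong() {
        uint64_t old = word_.load();
        for (;;) {
            uint32_t s = uint32_t(old >> 32), w = uint32_t(old);
            uint64_t nw = (s == 1) ? pack(0, w - 1) : pack(s - 1, w);
            if (word_.compare_exchange_weak(old, nw)) return;
        }
    }
};
```

Packing both counts into one word is what lets a single CS keep them consistent; a two-word layout would need a doubleword compare (CDS) or, on z, PLO DCAS.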

There are a lot of optimizations that I do not describe in this
algorithm for simplicity. For example, when you do a PLO DCAS, the
condition code is set to indicate which compare failed. This can be used
to optimize the re-drive.
There are many other optimizations with re-drives to make this more
efficient. And if you get away from traditional lists, there are even
more optimizations.
Like what? What do you replace linked lists with, arrays with offsets
instead of pointers?

Honestly, I provided this algorithm after reading the paper on hazard
pointers. The paper was written in 2002 and claimed there was no
atomic DCAS, even though PLO DCAS became available in 2001. So I took a much
simpler algorithm that I had and modified it to use a use count to
accommodate traditional storage managers to prove that a PLO could be
used to manage a conventional list using a traditional storage manager
and provide a SMR algorithm without the need for DU level management
structures. I don't use many conventional lists and I have a proprietary
storage manager that does not use chains.
Most of my PLO operations are much simpler.
I would love to test this algorithm against any other SMR algorithm.
My point has been and remains, that PLO can be efficiently used to
serialize any list in a lock-free manner and even if it does take more
CPU this will be offset by increased throughput.

And just because UNIX has issues with PLO doesn't mean the issue is
with PLO...
UNIX doesn't have an issue with PLO. It clearly states that popping/pushing
elements at the beginning of the queue is a good performer. Surely your
algorithm would have the same problem if multiple producers/consumers were
inserting/removing elements from the middle of a long list.

Kenneth

-----Original Message-----
From: IBM Mainframe Discussion List [mailto:[email protected]]
On Behalf Of David Crayford
Sent: Tuesday, November 12, 2013 8:39 PM
To: [email protected]
Subject: Re: Serialization without Enque

Thanks for sharing your design Ken. It seems to me that PLO is best
used for data structures like double-ended queues where elements can
be inserted/removed from both ends of the queue atomically. In the
case of a read-often-write-rarely list with multiple readers that
traverse the list it doesn't seem optimal. Correct me if I'm wrong but
readers will contend with each other.

FYI, z/OS UNIX message queues can be configured to use PLO (with
fallback to a latch). The documentation states that inserting/removing
from the middle of a long list using PLO is a poor performer
http://pic.dhe.ibm.com/infocenter/zos/v1r12/topic/com.ibm.zos.r12.bpxb100/qct.htm#qct.


There isn't a one-size-fits-all solution. It depends on the usage
patterns.
Hopefully HTM will solve that. Depending on how Intel's Haswell TSX
implementation performs in the wild, we could see HTM in our cell
phones as early as next year.


On 12/11/2013 11:18 PM, Kenneth Wilkerson wrote:
I use cell pools. I also use a proprietary storage manager that
doesn't use chains. These methodologies offer me capabilities well
beyond those found in traditional methods. Much of what I do is based
on these capabilities, but the algorithms could easily be adapted to
use a conventional storage manager that uses chains.

Here is an algorithm I use that I've adapted for traditional storage
management. This algorithm will work for any list; LIFO, FIFO,
ordered or other, and for deletion from the head, middle or tail.

Setup: Allocate a word in each element. Bit 0 is one for all active
elements. Bit 1 is one for all elements pending deletion. Bit 2 is
reserved for an extension to this algorithm (such as garbage
cleanup). The remaining bits are a use count allowing for many more
DUs than are supported by MVS.

Note: When PLO Compare and Swap (CS) or Double Compare and Swap
(DCAS) is referenced, the PLO uses the use count address as the lock
word. This will serialize all updates to the use counter for that
element. For the PLO Compare and Loads or PLO update, the lock word
is the active chain counter.
To search:
Step 1: Use the PLO Compare and Load on the active chain counter to
search the chain as before. If the PLO fails, re-drive the search.

Step 2:  Before examining the element, increment the use count with a
PLO Double Compare and Swap (DCAS). Load the first register pair with
the current chain counter. The swap value will also be the current
chain counter. Essentially, we're using the active chain count to
serialize increments  to the use count to avoid accessing an area
that may have been released. The second register pair will contain
the current use count with a swap value incremented by 1 using an AL
to avoid resetting the high bit. If the PLO DCAS fails, the previous
PLO Compare and Load (Step 1) should be re-driven.

Step 3: Use the PLO Compare and Load for the next element. Save the
PLO status and decrement the use count with a SL using PLO CS. We
don't need a DCAS because the use count is not 0 and this element
can't be deleted.  If this PLO CS fails, re-drive it. If PLO Compare
and Load status is re-drive, then before re-driving the search, check
the use count (in the register used to update it). If Bit 1 (pending
delete) is set and the use count is 0, this task can release it.

To delete: (this assumes the deleting task has already updated the
use count in the process of finding the element to delete)
Step 1: Use a PLO update function to remove the element from the
active chain to avoid any future references.

Step 2: If the PLO update to remove the element fails, decrement
the use count but do NOT set bit 1 (pending delete), using a PLO CS.
If the PLO CS fails, re-drive it.

Step 3: If the PLO update to remove the element succeeds, decrement
the use count, SET bit 1 (pending delete) and RESET bit 0 (active
bit) using a PLO CS. If the PLO CS fails, re-drive it.
Step 4: Whether the PLO update succeeded or failed, check the use
count in the register used to update it:
        If bit 1 (pending delete) is set and the use count is not 0,
then this task should exit.
        If bit 1 (pending delete) is set and the use count is 0, then
this task can release it.
        Otherwise, this task must re-drive the search to find the
element to be deleted.

You can work out the various scenarios yourself. But because the
count is incremented/decremented after a PLO Compare and Load or
update, the status of the PLO provides a decision point on whether an
element may have been deleted. Using the use count address as the
lock word ensures that all use count updates for a specific use count
occur serially. There are numerous extensions to this algorithm that
are more than I want to describe. Things like adding pending deletes
to a delete chain or having an asynchronous, periodic garbage
collection task
handle the deletes.
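For anyone off-platform who wants to play with the use-count bookkeeping, here's a rough single-word C++ analogue. std::atomic compare-and-swap stands in for PLO CS, bit numbering follows the mainframe convention (bit 0 is the high-order bit), and the chain-counter serialization (PLO Compare and Load / DCAS) is deliberately not modeled:

```cpp
#include <atomic>
#include <cstdint>

// Per-element word as described above: bit 0 = active, bit 1 = pending
// delete, bit 2 reserved, remaining bits = use count. Bit 0 is the
// high-order bit, per mainframe convention.
constexpr uint32_t ACTIVE   = 0x80000000u;  // bit 0
constexpr uint32_t PENDING  = 0x40000000u;  // bit 1
constexpr uint32_t RESERVED = 0x20000000u;  // bit 2 (unused here)
constexpr uint32_t COUNT    = 0x1FFFFFFFu;  // use-count bits

struct element {
    std::atomic<uint32_t> word{ACTIVE};  // active, use count 0

    // Search step 2 analogue: bump the use count.
    void acquire() {
        word.fetch_add(1, std::memory_order_acq_rel);
    }

    // Search step 3 / delete step 4 analogue: drop the use count.
    // Returns true when this task observed pending-delete with a zero
    // count, i.e. this task must release the element's storage.
    bool release() {
        uint32_t w = word.fetch_sub(1, std::memory_order_acq_rel) - 1;
        return (w & PENDING) && (w & COUNT) == 0;
    }

    // Delete step 3 analogue: set pending delete, reset active, and
    // drop the deleter's own use count, all in one CAS loop.
    // Returns true when no readers remain and the deleter frees it now.
    bool mark_deleted() {
        uint32_t old = word.load();
        uint32_t nw;
        do {
            nw = ((old - 1) & ~ACTIVE) | PENDING;
        } while (!word.compare_exchange_weak(old, nw));
        return (nw & COUNT) == 0;
    }
};
```

The key property carried over from the algorithm: whichever task drives the count to zero after pending-delete is set, reader or deleter, is the one that frees the storage, so no task ever frees an element another task still references.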
Kenneth
-----Original Message-----
From: IBM Mainframe Discussion List [mailto:[email protected]]
On Behalf Of Jon Perryman
Sent: Monday, November 11, 2013 9:38 PM
To: [email protected]
Subject: Re: Serialization without Enque

Could you provide some insight into a design that would handle the
situation where an SSI program with a non-trivial workload uses a very
large list? This list is normally semi-static, but there can be
periods of time where entries are heavily deleted and added. Not
freeing storage is not an option because it could be a significant
amount of storage.
Thanks, Jon Perryman.



________________________________
From: Tony Harminc <[email protected]>
To: [email protected]
Sent: Monday, November 11, 2013 7:07 PM
Subject: Re: Serialization without Enque


On 11 November 2013 20:15, Jon Perryman <[email protected]> wrote:
         L    R2,QUEUE
         L    R3,NEXT_ENTRY
         CS   R2,R3,QUEUE                    New queue head

While this seems bulletproof, it's not. If there is a long delay
between the L instructions then the next entry could have been freed,
causing an abend.
If your code is referencing an item that may have been freed, then
your design is wrong.

While the delay is very unlikely, it is still a possibility (e.g.
swapout after the first "L"). This example is just removing the first
entry. If you try to remove an entry in the middle of the chain, you
run a larger risk.
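The window you describe is exactly this interleaving. A single-threaded simulation using std::atomic CAS in place of CS (the entry layout and names are made up for illustration):

```cpp
#include <atomic>

// Simulates the race: task A is "preempted" between loading the queue
// head and issuing its CS, while task B pops the entry A still points
// at. std::atomic compare_exchange stands in for CS.
struct entry { entry* next; };

// Returns true when task A's CS correctly fails after B's interleaved pop.
inline bool simulate_race() {
    entry n2{nullptr};
    entry n1{&n2};
    std::atomic<entry*> queue{&n1};

    // Task A:  L R2,QUEUE / L R3,NEXT_ENTRY ... then a long delay
    entry* a_head = queue.load();
    entry* a_next = a_head->next;

    // Task B runs inside the window: pops n1. A real storage manager
    // could now reuse n1, leaving a_head pointing at freed storage.
    entry* b_head = queue.load();
    queue.compare_exchange_strong(b_head, b_head->next);

    // Task A resumes: its CS compares against the stale head and fails,
    // forcing a re-drive. The abend case is when A loads a_head->next
    // only now, after the storage was freed, instead of earlier.
    bool cs_failed = !queue.compare_exchange_strong(a_head, a_next);
    return cs_failed && queue.load() == &n2;
}
```

The CS itself catches the stale head; what it cannot catch is a dereference of already-freed storage inside the window, which is the whole SMR problem this thread is about.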
Some of the techniques we use to minimize risk:
1. Abend recovery.
2. Don't overload the entry - only contains shared information.
3. Place entry on a timeout queue before placing on a free queue or freeing.
4. Copy the entry to a work area - reduces reference time of the actual entry.
5. Eyecatchers that can be verified.
6. Unique identifier for each entry that can be verified.
7. Turn off interrupts.
Most of these are indicative of incorrect design rather than of
careful attention to risk. If it's unclear whether a queue entry is
eligible to be worked on or even to be looked at unless you do it in
a hurry, then there is something quite wrong.

By all means provide abend recovery, but don't make it part of the
queue handling design.

Tony H.

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to [email protected] with the message: INFO IBM-MAIN



