Yes, unless it is an in-memory and local DB; otherwise you have another layer of network abstraction plus the conversion cost on lookup, unless we're talking about some kind of an OO DB. But when needing to scale to a number of instances, it may very well be unavoidable to go to disk. Also, without such a scheme (some form of persistence) there is no way to have long-term, fail-safe persistence, which I believe is a general requirement for a distributed system, and one which JavaSpaces perhaps implies since it uses transactions. What happens to the distributed system when the shared memory goes up and down, even after a transaction has completed successfully and no take operation has occurred?
Part thinking out loud and part solution... perhaps, anyways. I'm thinking that if some things are done a bit differently, then certain patterns do perhaps lend themselves to a DB, or better, to a hashmap system. Though I think the issues with a DB, the extra network hop, etc., are still there, and a local file or memory system works around them.

Thinking about the conversion a bit, and considering JavaSpaces implementations do not run or access logic in the instances they are housing, it really seems a waste of time to serialize to the Java serialized format and then instantiate those same instances just to place them into some list, hash, or other structure. The type of an Entry and its fields are what are used to find something, and only per equality.

Some in-memory implementation (in-memory from the client perspective, where the whole of the JavaSpace is never transmitted over the air or wire) could perhaps just skip serialization all the way around... except for the issue of the returned Entry being a copy. Serialization is probably always a good idea in that case; depending on the requirements of course, but for a default implementation for sure.

Now, the one place that doesn't seem a waste of time is using those unmarshalled Entries and fields for lookup; having those realized in memory already seems much faster. A way to speed up the overall effort would be to store the serialized form along with the realized/unmarshalled form in the list or whatever data structure is used, since these held instances are immutable per the write/read/take operations. No reserialization: the serialized form can be returned immediately. More memory, of course, but that could be limited with a logical archiving mechanism for memory which hasn't been accessed for a period of time (paging of the serialized form, essentially). A hash, depending on size, would work for that whether on disk or in memory.

Dwelling further, it doesn't seem workable to just take the Entry fields, serialize them, associate them with their hash (hashCode), and compare on that, because there is a possibility that hashCode hasn't been overridden, and thus equals and hashCode won't necessarily both represent equality with the same meaning. Were there some guarantee that they do, this would be possible. However, and contrary to what I just wrote, since here I'm not talking about Object.hashCode and equals, it may be enough to serialize each field of an Entry and store a hash of the whole serialized field for comparison, while the serialized form of the overall Entry is stored for reads/takes; no reserialization is needed this way when sending an object back to a caller. Perhaps those serialized fields can even be taken from the Java serialization format directly.

A distributed spaces implementation might then have the following workflow.

write:
1) A JavaSpaces service is looked up.
2) An Entry is written to the space.
3) The client impl serializes the object with regular Java serialization and sends it over the wire/air.
4) The server impl takes the serialized instance and scans it; while scanning the fields of the Entry (those matching the spec requirements, of course) and streaming them out, it generates a hash or checksum for each. In other words, it checks the format while at the same time generating hashes which can be used to check for equality, since the specification states only equality will be used for lookup. (A sketch of this per-field hashing follows the list.)
5) The serialized form is stored in memory or to disk, along with information about the types the class extends or implements.
6) The field names and hashes are stored and reference some key, such as the hash of the overall serialized form, which is used in a hash map for memory or as a filename if on disk. Equal hashes should be OK if a reference counter is used to save memory and keep from storing the exact same Entry in more than one file when it is written more than once.
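To make step 4 concrete, here is a minimal, hypothetical sketch of the per-field hashing. It cheats relative to the description above: rather than parsing the fields out of the Entry's own serialized stream, it serializes each matchable field individually and digests the bytes, and it assumes SHA-256 as the checksum. All names are mine, not a proposed API.

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.ObjectOutputStream;
    import java.lang.reflect.Field;
    import java.lang.reflect.Modifier;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.LinkedHashMap;
    import java.util.Map;

    // Hypothetical helper: one digest per matchable field of an Entry.
    // Per the Entry specification, only public, non-static, non-transient,
    // non-final fields take part in matching, so only those are hashed.
    public class EntryFieldHasher {

        public static Map<String, byte[]> fieldHashes(Object entry)
                throws IOException, IllegalAccessException, NoSuchAlgorithmException {
            Map<String, byte[]> hashes = new LinkedHashMap<>();
            for (Field f : entry.getClass().getFields()) {
                int mods = f.getModifiers();
                if (Modifier.isStatic(mods) || Modifier.isTransient(mods)
                        || Modifier.isFinal(mods)) {
                    continue; // not used for matching
                }
                Object value = f.get(entry);
                if (value == null) {
                    continue; // null means wildcard in a template
                }
                hashes.put(f.getName(), digest(serialize(value)));
            }
            return hashes;
        }

        // Entry fields are required to be Serializable, so plain Java
        // serialization stands in for the marshalled form here.
        static byte[] serialize(Object value) throws IOException {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                oos.writeObject(value);
            }
            return bos.toByteArray();
        }

        static byte[] digest(byte[] bytes) throws NoSuchAlgorithmException {
            return MessageDigest.getInstance("SHA-256").digest(bytes);
        }
    }

A template lookup would digest its non-null fields the same way, so matching compares small fixed-size hashes instead of unmarshalled objects.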
read/take (I'll leave out the removal part of take here... the simplest part):
1) A JavaSpaces service is looked up.
2) An Entry is requested with a template.
3) The client impl serializes the template and sends it.
4) The server impl performs roughly the same operations as step 4 of write, of course ignoring the null fields of the Entry (wildcards).
5) The generated hashes are used to look at only the required types and fields.
6) The first match is returned.

That seems like a good base for something which can start to incorporate clustering of the contained data as well as support disk- or memory-based contents. A single data structure would perhaps not support everything that needs to be accomplished, but locking could be reduced to a couple of locations, I believe, if memory is segmented into multiple containers; for disk, a structure of files versus a single file could work well. For instance, the above lends itself to hash maps. Hash maps can hold their contents in memory or in regular random access files, since we are just talking about buckets: the files can hold the buckets, while the keys and base file names are held in memory. The key is having segments of a good size, so as not to consume large amounts of memory by default, yet to provide as few steps as possible for a system somewhere in between. By using a hash, versus just an array or list, we can move information around per its key, and that same key can be placed into any given segment and still function just the same, provided the sizes of the hash maps are controlled and set to always be the same. Locking can then be done away with on the hash map segments altogether, since we won't actually resize and the hashes and buckets never need to be recalculated. Now, there may be times when memory needs to be regained from unused maps, so some locking would be needed for that, but the time to look up a value would be drastically reduced: leaving aside the field comparisons, a lookup would take on the order of n operations where n is the number of hash maps, which is considerably smaller than the number of Entries we scan now.

Too, since we're generating hashes while we're scanning the serialized Java class, that should help offset the cost of the new step; luckily hashes are additive and can be made part of the overall process. If during the scanning and hash generation there are any errors in the serialized format, an error can be thrown, just as before where unmarshalling an object would have taken place. Clustering, using something like Shoal, could then be possible with only the need to send the hashes, field names, types, and serialized form, or whichever of those pieces are required to do the job; this way each node in the cluster doesn't have to perform the heavier operations and instead just stores them using the hash mechanisms. At least in a replicating cluster, that is. Anyways, this email is way too long as it is, but that seems like a good start.

The other parts to figure out are the exact places locking would need to take place, how exactly the field references would be structured and stored, and how much memory that then entails, but something like class_field_hashofvalue (this becomes a hash map key) where the value pointed to is the hash used in the other hash map for the actual object lookup. That final hash can be used to pull the serialized form from disk or directly from an in-memory hash map; too, since that hash is a hash of the overall serialized form, it can be used to determine whether the final value really needs to be stored again or not. With that, there are some reference counters which need to be incorporated into a formal design. So that's part of my idea. I'm still dwelling on the overall data structure, which is why I haven't really commented more yet.
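To make the shape of that concrete, here is a rough, equally hypothetical sketch of the segmented maps and the class_field_hashofvalue keys. All names and choices are mine; java.util.HashMap stands in per segment only for brevity, where the idea calls for fixed-capacity tables that never resize.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    // Hypothetical segmented store. The segment count is fixed, so a key
    // always lands in the same segment, and segments can be paged to disk
    // or moved between nodes independently without rehashing.
    public class SegmentedEntryStore {

        private final int segments;
        private final List<Map<String, String>> fieldIndex; // class_field_hashofvalue -> entry hash
        private final List<Map<String, byte[]>> entryStore; // entry hash -> serialized form

        public SegmentedEntryStore(int segments) {
            this.segments = segments;
            this.fieldIndex = new ArrayList<>(segments);
            this.entryStore = new ArrayList<>(segments);
            for (int i = 0; i < segments; i++) {
                fieldIndex.add(new HashMap<>()); // stand-in for a fixed-capacity table
                entryStore.add(new HashMap<>());
            }
        }

        private int segmentFor(String key) {
            return Math.floorMod(key.hashCode(), segments); // handles negative hashCodes
        }

        // write: store the serialized Entry once under its overall hash and
        // index every field key against it. A reference counter (omitted)
        // would keep duplicate writes from storing the same bytes twice.
        // One candidate is kept per field key for brevity; a real index
        // would keep a set of entry hashes per key.
        public void write(String entryHash, byte[] serializedForm, Set<String> fieldKeys) {
            entryStore.get(segmentFor(entryHash)).put(entryHash, serializedForm);
            for (String fieldKey : fieldKeys) {
                fieldIndex.get(segmentFor(fieldKey)).put(fieldKey, entryHash);
            }
        }

        // read: resolve a field key to the overall hash, then hand back the
        // already-serialized form; nothing is reserialized on the way out.
        public byte[] read(String fieldKey) {
            String entryHash = fieldIndex.get(segmentFor(fieldKey)).get(fieldKey);
            return entryHash == null ? null
                    : entryStore.get(segmentFor(entryHash)).get(entryHash);
        }
    }

A real read would of course intersect the candidate sets for every non-null template field and honor the Entry's type hierarchy; this one resolves a single field key only to show the flow.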
Wade

==================
Wade Chandler
Software Engineer and Developer
NetBeans Dream Team Member and Contributor
http://wiki.netbeans.org/wiki/view/NetBeansDreamTeam
http://www.netbeans.org

----- Original Message ----
> From: Mike McGrady <mmcgr...@topiatechnology.com>
> To: "river-dev@incubator.apache.org" <river-dev@incubator.apache.org>
> Cc: "river-dev@incubator.apache.org" <river-dev@incubator.apache.org>
> Sent: Thu, December 16, 2010 10:04:55 AM
> Subject: Re: datastructure classes
>
> Intense network systems like our clients cannot work with database calls.
> They are too slow and do not scale. They either reach a cost or a
> performance ceiling.
>
> Sent from my iPhone
>
> Michael McGrady
> Principal investigator AF081_028 SBIR
> Chief Architect
> Topia Technology, Inc
> Work 1.253.572.9712
> Cel 1.253.720.3365
>
> On Dec 16, 2010, at 5:55 AM, Patricia Shanahan <p...@acm.org> wrote:
>
> > MICHAEL MCGRADY wrote:
> >> Patricia,
> >> If you don't mind, I am going to argue for sticking to the machines,
> >> but answer your question roughly at the end.
> >
> > I don't mind machines, as long as they get somewhat quantified. It was
> > "machines a minute" that really gave me trouble. However, there are
> > machines and machines, so I would rather think in terms of transaction
> > rates.
> >
> >> The questions about either memory use or connections or transactions
> >> are stressors leading to the question and the urgency whether or not
> >> something must scale, but they have nothing or very little to do with
> >> whether it will scale.
> >> If we try to involve multiple machines, then we will discover
> >> frequently that, even if we do not need to scale, we cannot because,
> >> for example, of Amdahl's Theorem. If we cannot go to multiple
> >> machines as a practical matter, we cannot scale no matter what
> >> (ceteris paribus). If we can, then we will find, it is our
> >> experience, that we can scale no matter what (ceteris paribus).
> >
> > Scaling is exactly the reason why I think the current FastList-based
> > design is seriously limited, and may need to be replaced by something
> > that indexes its data.
> >
> > Outrigger turns a JavaSpace read into a linked
> > list scan of all the entries in the space for the required type. Not too
> > bad if the list rarely changes and is always small enough to fit in
> > cache. If it is big or changing, that is going to cause a lot of memory
> > traffic.
> >
> > As implemented, it cannot be spread across machines very
> > effectively because the additional scanning due to one machine not knowing
> > that another has already found the entry would use up the added scan power.
> >
> > I think to scale it is going to need some sort of indexing, and as I
> > think about indexing for the sort of ad-hoc query that JavaSpaces face,
> > combined with frequent change, it starts looking very like maintaining
> > an indexed database table.
> >
> >> However, we should be able to do, say, hundreds of millions of
> >> transactions in a day in real-time critical systems such as the FAA
> >> or the stock market with data affinity and integrity and all the
> >> other "ilities". If Outrigger cannot do this, it is of no interest
> >> to us.
> >
> > The current record for a relational database doing simple transactions
> > is 30 million transactions per minute (Oracle/SPARC TPC-C). Your mileage
> > may vary, but there is no inherent limit on relational database scaling
> > that puts a few hundred thousand transactions per minute out of reach.
> >
> >> MG
> >> On Dec 14, 2010, at 9:48 PM, Patricia Shanahan wrote:
> >>> Please clarify "machines a minute". Can you express it e.g. in
> >>> transactions per minute?
> >>> Patricia
> >>> Mike McGrady wrote:
> >>>> Linear - 7000 machines a minute, or more.
> >>>> Sent from my iPhone
> >>>> Michael McGrady Principal investigator AF081_028 SBIR Chief
> >>>> Architect Topia Technology, Inc Work 1.253.572.9712 Cel
> >>>> 1.253.720.3365
> >>>> On Dec 14, 2010, at 2:04 PM, Patricia Shanahan
> >>>> <p...@acm.org> wrote:
> >>>>> On 12/14/2010 1:49 PM, MICHAEL MCGRADY wrote:
> >>>>>> On Dec 14, 2010, at 1:40 PM, Patricia Shanahan wrote:
> >>>>> ...
> >>>>>>> If we made a persistent version use a relational database
> >>>>>>> to represent the space,
> >>>>>> This would not be usable for us. This is too slow and does
> >>>>>> not have the correct QCC features, especially scalability.
> >>>>> ...
> >>>>> Could you put some numbers on what sort of scalability you
> >>>>> require?
> >>>>> Patricia
> >> Michael McGrady Chief Architect Topia Technology, Inc. Cel
> >> 1.253.720.3365 Work 1.253.572.9712 extension 2037
> >> mmcgr...@topiatechnology.com