Hi Heikki,
Cost of GC: I'm not sure I completely agree with your analysis.
Databases also contain small objects; for these, GC can be a
considerable overhead, effectively doubling the storage size. It is true
that memory/disk sizes have grown, but so have the amounts of data, and
I/O is still often critical. There is especially the problem that
objects that are referenced need to be updated when a reference is
added, even if the object otherwise did not change. Depending on the
scenario, this can have a small or big impact on I/O.
Another performance problem may occur when the database is used
concurrently: when modifying additionally referenced objects, these need
to be 'locked' in some way, which may interfere with other concurrent
transactions that add/remove references to the same object (or simply
want to read it).
There is also a technical problem which I couldn't immediately see how
you would solve. Assume we have 3 objects in the database, A, B, and C,
where A references B and B references C. A is a root, B and C are embedded.
Now a transaction T1 loads A and prepares to delete it.
At the same time, T2 performs a query that returns B.
Then T1 commits.
What happens now? Are B and C deleted, invalidating transaction T2? Or
are B and C retained, because T2 may create a reference to B from
another root object?
Maybe there is some information kept in memory that B has been loaded
somewhere else, so it is only deleted once it is known that no
transaction references it? But this could take a long time: the
transaction T2 may last very long, T2 may crash and never abort, or
another transaction T3 may perform a query that also returns B (or C).
And with all these things in memory, what happens if the database server
crashes? How would the database file recover from such a scenario?
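To make the A/B/C scenario above concrete, here is a toy sketch of a
reference-counted store with cascading deletion (all names and the
store itself are hypothetical, not your implementation): after T1
deletes A, T2 still holds an in-memory copy of a node the store no
longer contains.

```java
import java.util.*;

// Toy reference-counted store; all names hypothetical.
class ToyStore {
    static class Node {
        final String id;
        final List<String> refs = new ArrayList<>();
        int refCount = 0;      // references from other stored objects
        boolean isRoot = false;
        Node(String id) { this.id = id; }
    }

    final Map<String, Node> nodes = new HashMap<>();

    Node put(String id, boolean root) {
        Node n = new Node(id);
        n.isRoot = root;
        nodes.put(id, n);
        return n;
    }

    void addRef(String from, String to) {
        nodes.get(from).refs.add(to);
        nodes.get(to).refCount++;
    }

    // Deletion cascades: every referenced object whose count drops to
    // zero (and is not itself a root) is collected as well.
    void delete(String id) {
        Node n = nodes.remove(id);
        for (String t : n.refs) {
            Node target = nodes.get(t);
            if (target != null && --target.refCount == 0 && !target.isRoot) {
                delete(t);
            }
        }
    }
}

public class RaceDemo {
    public static void main(String[] args) {
        ToyStore db = new ToyStore();
        db.put("A", true);                 // root
        db.put("B", false);
        db.put("C", false);
        db.addRef("A", "B");
        db.addRef("B", "C");

        // T2 "queries" B and keeps an in-memory copy.
        ToyStore.Node t2Copy = db.nodes.get("B");

        // T1 deletes A and commits: B and C are cascade-collected.
        db.delete("A");

        System.out.println(db.nodes.containsKey("B")); // false
        System.out.println(db.nodes.containsKey("C")); // false
        // T2 still holds a node the store no longer contains:
        System.out.println(t2Copy.id);                 // B
    }
}
```

The toy version simply lets T2's copy go stale; the question is what a
real DBMS should do instead.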
I think a big difference between in-memory-GC and DB-GC is that in
memory, once something is not referenced anymore, it can clearly be
deleted, because there is no way to reach it anymore (at least in Java).
However, in a DBMS, data is always reachable via queries. Your GC
algorithm only checks whether data in the database is referenced, but
cannot check whether someone may want to query it in the future.
Obviously, the intent to query data can be specified by making it a
root object, but would that mean I should declare everything as root if
I want to query it at some point?
Btw, I ran across an old paper on O2, which discusses concurrent GC:
http://www.vldb.org/conf/1997/P356.PDF
Incidentally, it also talks about white and grey objects :-)
There is quite a lot of research if you google for "garbage collection
o2 database".
Generally, I see the point of using GC in databases (and so did the
GemStone people and the ODMG people). I'm also sure there are cases
where the ease of development is worth the additional I/O overhead.
There are probably also scenarios where using it is actually faster,
because it offloads the GC to the server (assuming transaction patterns
that cause little additional writes for updating the counters).
I'm also sure there are solutions to the concurrency and crash problem.
From personal experience (only two larger projects in industry), I have
however seen that people can be very reluctant to have anything removed
from their databases, let alone by some automatic removal algorithm. The
main point is to keep a history of data and, depending on the company,
to provide legal traceability of processes and updates. This is one
reason why many DBMS vendors support some kind of versioning for data.
Instead of overwriting or deleting old data, they create new versions.
So, at least in my personal experience, I haven't seen a project where a
garbage collection feature would have been useful. Obviously my
experience is limited, but if a DB vendor had asked me, I would have
given other features a higher priority than garbage collection.
I wonder whether these factors (performance of I/O & concurrency; the
requirement to keep data / risk of losing data) may be the reasons why
garbage collection is not widely implemented by DBMS vendors.
Maybe you could convince a popular open-source DBMS to implement it (or
a mapping layer such as Hibernate). If it becomes widely used, then it
would make more sense to put it into a standard such as JDO ...
This is my personal opinion of course, feel free to object :-)
Best regards,
Tilmann
On 05.06.2016 20:11, Hei Virk wrote:
Hi everybody,
-- Sorry for the delay in my answer.
First, I made slides (a PDF) about the embed method and how its
algorithm works:
https://drive.google.com/file/d/0B8eKvcE21R9RMkNPYlZiOTEwZnM/view
I don't know if the Smalltalk community or other communities have this
kind of method (embed), but I believe not, because otherwise we would
probably know of it!
It is true that gc is not a new concept in the db world, but perhaps
until now it has been used as a quite separate process in the db.
What is new in the embed method? I think at least the high-level
method/abstraction where the result of the embed operation is defined
clearly between two memory systems (i.e. “the network level
understanding”). Also, I think that at the next lower level, binding the
update part and the gc part together (“embed = update + gc”) is a new
idea. And what is important in this binding is that the update part is
able to collect information for the gc part, and the gc part has full
network/graph level understanding of the db even though it is done
locally and therefore efficiently.
Reference cycles (circular substructures) are not a problem for the
algorithm, as is described in the demo implementation. On the other
hand, gc can be done more easily when the garbage (or the db) does not
contain circular substructures, and both algorithms can be applied in
real systems, the simpler one first.
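For readers who wonder how cycles can be collected at all under
reference counting, one standard technique (not necessarily the one in
the demo implementation) is local trial deletion: within the subgraph
reachable from a candidate node, count only the internal references; if
no node in the subgraph has an external reference, the whole subgraph is
garbage.

```java
import java.util.*;

// Local trial deletion for cycle collection (a standard technique,
// shown here only as an illustration).
public class CycleCollector {
    static class Node {
        final String id;
        final List<Node> refs = new ArrayList<>();
        int refCount;        // total stored reference count
        Node(String id) { this.id = id; }
    }

    // Returns the set of nodes to delete, or an empty set if the
    // candidate is still externally reachable.
    static Set<Node> collect(Node candidate) {
        // 1. Find the subgraph reachable from the candidate.
        Set<Node> sub = new HashSet<>();
        Deque<Node> work = new ArrayDeque<>(List.of(candidate));
        while (!work.isEmpty()) {
            Node n = work.pop();
            if (sub.add(n)) work.addAll(n.refs);
        }
        // 2. Count the references into each node that originate
        //    inside the subgraph.
        Map<Node, Integer> internal = new HashMap<>();
        for (Node n : sub)
            for (Node t : n.refs)
                if (sub.contains(t)) internal.merge(t, 1, Integer::sum);
        // 3. Any reference from outside the subgraph keeps it alive.
        for (Node n : sub)
            if (n.refCount > internal.getOrDefault(n, 0))
                return Set.of();
        return sub;
    }

    public static void main(String[] args) {
        // B <-> C cycle that has just lost its last external reference.
        Node b = new Node("B"), c = new Node("C");
        b.refs.add(c); c.refs.add(b);
        b.refCount = 1; c.refCount = 1;   // only the internal refs remain
        System.out.println(collect(b).size());  // 2: both are garbage

        b.refCount = 2;                   // an external reference to B
        System.out.println(collect(b).size());  // 0: keep the cycle
    }
}
```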
It is true that storing reference counts takes some extra bytes per node
in the db. That does not sound like a big problem, because objects often
contain quite a lot of user data and memories are quite big today, even
in small devices.
Updating of reference counts is not a big performance problem in my
opinion. If reference counts must be updated, it often means that you
are making a big update operation in which updating the reference counts
is a small part. In addition, you can often update a reference count
with the same SQL update statement (in O/R systems) which updates the
fields containing user data. (However, in a pure reference loss you are
not updating the user data of an object.) Also note that it is possible
to keep reference count information (the sum of increments and
decrements) in run-time memory during the embed method. Therefore, if,
for example, a node loses a reference and gets a new reference in an
update operation, no updates are needed for the reference count. I
demonstrate this in the demo implementation at the level of a single
(parent) node (the method handleReferencesFromGrayNodesInDB()). Perhaps
in practice most update operations do not change references of objects.
In these cases the embed method is, of course, a pure store/update
operation.
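The run-time bookkeeping described above can be sketched like this (the
names are hypothetical, not taken from the demo implementation):
increments and decrements are summed in memory, and only nodes with a
non-zero net delta need a write when the operation finishes.

```java
import java.util.*;

// Sketch of batching reference-count changes during one embed
// operation (hypothetical names).
class RefCountDeltas {
    private final Map<String, Integer> delta = new HashMap<>();

    void referenceAdded(String nodeId)   { delta.merge(nodeId, +1, Integer::sum); }
    void referenceRemoved(String nodeId) { delta.merge(nodeId, -1, Integer::sum); }

    // Node ids whose stored reference count actually needs updating.
    Set<String> pendingWrites() {
        Set<String> out = new HashSet<>();
        delta.forEach((id, d) -> { if (d != 0) out.add(id); });
        return out;
    }
}

public class DeltaDemo {
    public static void main(String[] args) {
        RefCountDeltas tx = new RefCountDeltas();
        tx.referenceRemoved("B");   // B loses a reference ...
        tx.referenceAdded("B");     // ... and gains one: net zero, no write
        tx.referenceAdded("C");     // C gains one: needs a write
        System.out.println(tx.pendingWrites());   // [C]
    }
}
```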
Personally I think that the db model where persistence is a concept in
the db, defined by reachability from persistent root nodes, is natural
for programmers and designers, and very object-db like. Defining the
persistence of a node or some nodes temporarily, during a single update
operation, does not sound theoretically or practically clear to me. Note
also that when persistence is defined by reachability, you can at any
time make a selected node in the db a persistent root node to be sure
that it, and the nodes reachable from it, will not disappear in future
operations. This simple method is implemented in the demo
implementation. It is incrORC(); the opposite is decrORC() (which is a
general delete operation).
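A minimal sketch of how incrORC()/decrORC() might behave, as I
understand them (simplified; the demo implementation is the
authoritative source): the outer reference count (ORC) pins a node as a
persistent root, and decrementing it back to zero makes the node a gc
candidate again.

```java
import java.util.*;

// Simplified sketch of outer-reference-count root pinning; not the
// actual demo implementation.
class Node {
    final String id;
    final List<Node> refs = new ArrayList<>();
    int innerRefCount;   // references from other stored nodes
    int outerRefCount;   // explicit "root" pins

    Node(String id) { this.id = id; }

    void incrORC() { outerRefCount++; }

    // The general delete operation: unpin, then gc the node if
    // nothing references it anymore.
    void decrORC(Set<Node> store) {
        outerRefCount--;
        gc(store);
    }

    void gc(Set<Node> store) {
        if (outerRefCount > 0 || innerRefCount > 0) return;
        store.remove(this);
        for (Node t : refs) { t.innerRefCount--; t.gc(store); }
    }
}

public class RootPinning {
    public static void main(String[] args) {
        Node b = new Node("B"), c = new Node("C");
        b.refs.add(c); c.innerRefCount = 1;
        Set<Node> store = new HashSet<>(List.of(b, c));

        b.incrORC();          // pin B (and thus C) as persistent
        b.gc(store);          // a gc pass leaves both in place
        System.out.println(store.size());   // 2

        b.decrORC(store);     // unpin: B and C are collected
        System.out.println(store.size());   // 0
    }
}
```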
Just as Tilmann says, I also think that the embed method has a solid
theoretical foundation(!). Therefore, in my opinion, the embed method is
not just one new update method. If I think in terms of protocols then,
in my opinion, the embed method defines a new natural higher-level layer
between a client program and the db (remote memory). This layer is clean
and makes updating a natural and easy operation for a programmer;
updating becomes nearly as easy as modifying object structures in
run-time memory. And even though the embed method fits best for complex
update operations, it is also useful in quite simple situations, for
example, when a single reference is replaced with a new one and perhaps
some scalar fields of some objects are updated at the same time.
There are of course some worst, or tedious, cases for the plain embed
method. However, it is possible to derive additional “decorated” methods
from the plain embed method. These methods are still quite general and
easy for a user to understand as network-level operations. I describe
examples of these methods (“handlings”) in the Notes paragraph in my
blog.
http://hvirkkun.blogspot.fi/2016/04/a-new-communication-theory-on-complex.html
Gc can be made even more efficient, and it can easily be helped if an
application has extra knowledge of non-garbage nodes, as I explain in
the demo implementation. And, of course, embed can be called for any
substructure, which increases the efficiency of the operation. These
features tackle many inefficiencies and make embed more versatile.
However, simple update commands provided separately can still be useful;
for example, if you only want to update some (scalar) fields of a single
object. As you noticed above, the delete operation is simply an
operation which decreases the outer reference count of an object and
then calls gc for that object. I also introduce in the Notes paragraph a
”forceDelete” command. It should be a very effective delete/killing
command.
Something about the background of the embed method:
I invented/implemented it for a real purpose, not from a theoretical
perspective. Beforehand, I did not have a clear understanding of whether
the modified structure could always be transferred back into the
database in a well-defined way or not. And as I said, I believe that
other people did not have that understanding either, which has probably
inhibited thinking about or formulating the whole problem; therefore the
solution cannot be in any standard, as Tilmann also seems to suspect.
I soon understood that the problem is between two graphs and that
updating scalar data in nodes is only a side effect, although an
important one for the users.
Like Andy says, if JDO supports many kinds of underlying db systems
(which, for example, do not contain the concept of persistence inside
the db), then it is difficult to make modifications to the standard.
Perhaps a branch or an optional feature could be possible (??), I don't
know.
I hope that some freeware groups will implement this method (plus
related other stuff), and maybe some companies too. I believe that some
day this will be implemented in many systems, because in my opinion the
embed operation is very ”object like” and “object database like”, if
you allow the terms, and because the size of the database does not limit
the applicability of the method.
If there are representatives of companies or other instances/groups
reading this mailing list who are willing to apply these techniques, I
could take part in the development process and implementation if you
would like to hire me. I still have quite many ideas, as I explain in
the Notes paragraph in my blog, and also some other ideas. Some of them
are on the search/data mining side, which is traditionally the easier
side to approach. And of course I have good insight into the embed
method and network/graph level things generally. Please find my contact
information below.
Feel free to contact me if you are interested, or want to discuss the
details more!
Best Regards,
Heikki Virkkunen
Email: hvirk...@gmail.com <mailto:hvirk...@gmail.com>
Phone: +358 40 706 3912 <tel:%2B358%2040%20706%203912>
On Fri, May 27, 2016 at 8:32 PM, Tilmann <zaesc...@gmx.de
<mailto:zaesc...@gmx.de>> wrote:
Hi Heikki,
I found something in the old ODMG 3.0 standard:
6.3.1.2 Object Deletion
In the Smalltalk binding, as in Smalltalk, there is no notion of
explicit deletion of objects. An object is removed from the
database during garbage collection if that object is not
referenced by any other persistent object. The delete() operation
from interface 'Object' is not supported.
-> So it seems that some kind of garbage collection was envisaged
by the standard, if only for Smalltalk. However, as I said before,
I'm not aware that this was ever implemented outside the Smalltalk
community, and I don't know Smalltalk well enough to tell how much
its GC might differ from Java GC.
So while the term 'GC in databases' doesn't seem to be entirely
new, your approach may bring new ideas to the table, including a
solid theoretical foundation. Maybe it would even explain why it
hasn't been implemented in any major DBMS.
For example, POET Navajo supports ODMG 3 but apparently chose
_not_ to support GC (first page, last paragraph):
http://www.odbms.org/wp-content/uploads/2013/11/038.01-Jordan-An-Object-Database-for-Embedded-Environments-April-2000.pdf
Regards,
Tilmann
On 20.05.2016 19:55, Hei Virk wrote:
Hi everybody:
I will comment more later.
I am about to give a presentation about the topic at my university here
in Finland for theorists, and I am making those slides at the moment.
One social observation:
In my understanding the situation has been that people (me too) have not
understood that a modified object structure can always be returned back
into the database in a deterministic and meaningful way. I mean that a
well-defined result always exists for the operation. This is because the
modifications in run-time memory can be so arbitrary.
For this reason no one has been able to think about the problem, and
therefore the method cannot have been in any db standard either.
Regards,
Heikki
On Fri, May 20, 2016 at 7:49 PM, Tilmann Zäschke<zaesc...@gmx.de>
<mailto:zaesc...@gmx.de> wrote:
Hello Hei,
sorry, I posted my answer first only to the dev list, I'm not sure you are
on it.
So here it is again, with some updates:
Hi,
I haven't read the whole post in detail, but I understand the main idea is
garbage collection in databases. This reminds me of similar concepts in
some object databases, such as GemStone/Smalltalk (
https://downloads.gemtalksystems.com/docs/GemStone64/3.3.x/GS64-ReleaseNotes-3.3/GS64-ReleaseNotes-3.3.htm).
However, I'm not aware of any major Java-ODBMS that supports it.
Some comments:
- As Andy describes, it could be supported via a property of the PM.
However, there may be scenarios where 'embed' and 'makePersistent' are
regularly used in the same transaction, so if the feature were widely
used, it may make sense to have a dedicated 'embed' method.
- As far as I am aware, the JVM has moved away from reference counting.
The situation in a DBMS may be different to an in-memory JVM, but if an
ODBMS implements GC, it may be worth checking out other garbage
collection techniques. One problem is reference cycles; I haven't read
the full blog post to see how this is solved.
- It seems that enabling garbage collection via reference counting will
have quite some impact on performance, because objects get bigger
(little impact) and because it requires updating additional objects in
the database to update the reference counters (the counters also need
updating when _adding_ references). I would therefore argue that support
for garbage collection should be determined at database creation time,
or via a flag when defining the database schema. This would allow
avoiding the GC overhead when GC is not required.
- We would need to be able to identify root objects which should not be
garbage collected. This could also be done via a flag on the schema, i.e.
every object of a class that has no reference counters is automatically a
root object.
Cheers,
Tilmann
On 13.05.2016 15:49, Andy Jefferson wrote:
Hi,
-I have invented a new declarative method to update object databases
and I would like to introduce that method to you.
My question is:
Should the functionality of this method be included in some future
versions of JDO (and perhaps some other definitions/systems under
Apache)?
Also I am interested about your comments/critique on the method.
The name of the method is "embed" and it resembles the command
pm.makePersistent(s);
used in JDO-enabled applications. The embed method is called in an
equivalent way,
db.embed(s);
but it has larger functionality than the current makePersistent method.
My opinion/comments; others may have other thoughts:
For something to be part of JDO, it needs to be widely applicable across
other datastores, not just for one "type" of datastore. If it is only
implementable on 1 type of datastore then standardising it would be
"premature"; that isn't to say that it isn't applicable to others.
From what I read of your method, it is basically a makePersistent
(attach) but with "deleteOrphans" enabled; i.e. it addresses the object
graph defined by the input object, and additionally deletes orphaned
objects. Firstly, it is entirely reasonable to be able to do that right
now in terms of API (**assuming the JDO implementation provided the
internals**), since the user can call pm.makePersistent and can have set
properties on the PersistenceManager just before that call
(pm.setProperty(...), so can set some "deleteOrphans" flag). From that I
conclude that I don't think there would be any change to the API needed
to provide your mode of operation.
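Andy's point can be sketched like this. The minimal interfaces and the
"deleteOrphans" property handling below are hypothetical stand-ins, not
the real JDO API; they only show the shape of "set a property, then call
makePersistent":

```java
import java.util.*;

// Hypothetical stand-in for a JDO PersistenceManager, only to
// illustrate the property-driven approach.
interface MiniPM {
    void setProperty(String name, Object value);
    <T> T makePersistent(T obj);
}

class RecordingPM implements MiniPM {
    final Map<String, Object> props = new HashMap<>();
    final List<Object> persisted = new ArrayList<>();

    public void setProperty(String name, Object value) { props.put(name, value); }

    public <T> T makePersistent(T obj) {
        persisted.add(obj);
        if (Boolean.TRUE.equals(props.get("deleteOrphans"))) {
            // ... an implementation would additionally collect orphaned
            // objects here, i.e. behave like 'embed' ...
        }
        return obj;
    }
}

public class EmbedViaProperty {
    public static void main(String[] args) {
        MiniPM pm = new RecordingPM();
        pm.setProperty("deleteOrphans", true);  // hypothetical flag
        String s = pm.makePersistent("someObjectGraph");
        System.out.println(s);
    }
}
```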
Deleting orphans may vary depending on the precise relation, which is
why JDO allows metadata on each field, so some orphans can be deleted
and others not.
Clearly a JDO implementation would need to provide a "deleteOrphans"
mode internally to support this.
Detail of how your method works in an object database, whilst
interesting, is not of relevance to the JDO spec question, since the
spec doesn't define HOW an implementation implements features.
Regards