I totally agree. In fact, have a look at DataManagementModel::Private::m_ignoreCreationDate.

As for the duplicate statements: that is a good point also. It greatly simplifies the removal of data per app.

On 01/03/2013 10:21 AM, Vishesh Handa wrote:
Ping?


On Sun, Dec 16, 2012 at 3:03 AM, Vishesh Handa <[email protected]
<mailto:[email protected]>> wrote:

    Hey everyone

    This is another one of those big changes that I have been thinking
    about for quite some time. This email has a number of different
    proposals, all of which add up to create this really simple system,
    with the same functionality.

    Graph Introduction
    ---------------------------

    For those of you who don't know about graphs in Nepomuk, please read
    [1]. It serves as a decent introduction to where graphs are used.
    Currently, we create a new graph for each data management command.

    What does this provide?
    ----------------------------------

    We currently use graphs for 2 features -

    1. Remove Data By Application
    2. Backup

    What information do we store?
    ------------------------------------------------

    1. Creation date of each graph
    2. Modification date of each graph ( Always the same as creation date )
    3. Type of the graph - Normal or Discardable
    4. Maintained by which application

    (1) and (2) currently serve no purpose. They never have. They are
    just nice to have; I cannot name a single use case for them, except
    that they let us see when a statement was added.

    (3) is what powers Nepomuk Backup. We do not back up everything,
    only the data that is not discardable, so things like indexing
    information are not saved. Currently this system is slightly broken:
    one cannot simply select everything that is not discardable, as that
    includes things like the ontologies, so the queries get quite
    complicated. Plus, one still needs to save certain information from
    the discardable data, such as the rdf:type, nao:creation, and
    nao:lastModified, which makes the query even more complex. On my
    machine with some 10 million triples, creating a backup takes a
    sizeable amount of time ( over 5 minutes ) and a lot of CPU.

    Current query -

    select distinct ?r ?p ?o ?g where {
        graph ?g { ?r ?p ?o . }
        ?g a nrl:InstanceBase .
        FILTER( REGEX(STR(?r), '^nepomuk:/(res/|me)') ) .
        FILTER NOT EXISTS { ?g a nrl:DiscardableInstanceBase . }
    } ORDER BY ?r ?p

    + Requires additional queries to back up the type, nao:lastModified,
    and nao:created.

    Maybe it would be simpler if we did not make this distinction?
    Instead we back up everything (really fast) and just discard the
    data for files that no longer exist during restoration. It would
    save users the trouble of re-indexing their files as well. More
    importantly, it (might) save them the trouble of re-indexing their
    email, which is a very slow process.
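
    The backup query above could then shrink to something like this
    sketch ( the nrl:DiscardableInstanceBase filter simply disappears;
    everything else is kept from the current query ):

    ```sparql
    # Sketch: back up all user data, discardable or not.
    select distinct ?r ?p ?o ?g where {
        graph ?g { ?r ?p ?o . }
        ?g a nrl:InstanceBase .
        FILTER( REGEX(STR(?r), '^nepomuk:/(res/|me)') ) .
    } ORDER BY ?r ?p
    ```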

    Also, right now one can only set the graph via StoreResources, and
    not via any other Data Management command.

    ----

    (4) is the most important reason for graphs. It allows us to know
    which application added the data. Things start to get a little
    messy when two applications add the same data. In that case those
    statements need to be split out of their existing graph, and a new
    graph needs to be created which is maintained by both applications.
    This is expensive.

    I'm proposing that instead of splitting the statement out of the
    existing graph, we just create a duplicate of the statement with a
    new graph, containing the other application.

    Eg -

    Before -

    graph <G1> { <resA> a nco:Contact . }
    <G1> nao:maintainedBy <App1> .
    <G1> nao:maintainedBy <App2> .

    After -

    graph <G1> { <resA> a nco:Contact . }
    graph <G2> { <resA> a nco:Contact . }
    <G1> nao:maintainedBy <App1> .
    <G2> nao:maintainedBy <App2> .

    The advantage of this approach is that it would simplify some of
    the extremely complex queries in the DataManagementModel, which
    would result in a direct performance upgrade. It would also solve
    some of the ugly transaction problems we have when two commands
    access the same statement and one of them removes the data in order
    to move it to another graph. This has happened to me a couple of
    times.
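
    With duplicated statements, removing data by application could
    reduce to something like this sketch ( <App1> stands for the
    application resource; this is illustrative, not the actual
    DataManagementModel code ):

    ```sparql
    # Sketch: drop every triple that lives in a graph
    # maintained by <App1>. Copies of the same statement in
    # other applications' graphs are left untouched.
    DELETE { GRAPH ?g { ?s ?p ?o . } }
    WHERE {
        GRAPH ?g { ?s ?p ?o . }
        ?g nao:maintainedBy <App1> .
    }
    ```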

    ---

    My third proposal: considering that the modification and creation
    dates of a graph do not provide any benefit, perhaps we shouldn't
    store them at all? Unless there is a proper use case, why go through
    the added effort? Normally, storing a couple of extra properties
    isn't a big deal, but if we do not store them, then we can
    effectively kill the need to create a new graph for each data
    management command.

    With this one would just need 1 graph per application, in which all
    of its data would reside. We wouldn't need to check for empty graphs
    or anything. It would also reduce the number of triples in a
    database, which can get alarmingly high.
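
    Removing an application's data then becomes about as simple as it
    can get ( the graph URI below is made up for the example ):

    ```sparql
    # Sketch: with exactly one graph per application, removing
    # everything an app ever stored is a single graph drop,
    # plus cleanup of the graph's own metadata.
    CLEAR GRAPH <nepomuk:/ctx/app1> ;
    DELETE WHERE { <nepomuk:/ctx/app1> ?p ?o . }
    ```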

    This seems like a pretty good system to me, which provides all the
    benefits and none of the losses.

    What do you guys think?

    [1] http://techbase.kde.org/Projects/Nepomuk/GraphConcepts

    --
    Vishesh Handa




--
Vishesh Handa


_______________________________________________
Nepomuk mailing list
[email protected]
https://mail.kde.org/mailman/listinfo/nepomuk
