Ping?
On Sun, Dec 16, 2012 at 3:03 AM, Vishesh Handa <[email protected]> wrote:
> Hey everyone
>
> This is another one of those big changes that I have been thinking about
> for quite some time. This email has a number of different proposals, all of
> which add up to create this really simple system, with the same
> functionality.
>
> Graph Introduction
> ---------------------------
>
> For those of you who don't know about graphs in Nepomuk, please read [1].
> It serves as a decent introduction to where Graphs are used. Currently, we
> create a new graph for each data-management command.
>
> What does this provide?
> ----------------------------------
>
> We currently use graphs for 2 features -
>
> 1. Remove Data By Application
> 2. Backup
>
> What all information do we store?
> ------------------------------------------------
>
> 1. Creation date of each graph
> 2. Modification date of each graph ( Always the same as creation date )
> 3. Type of the graph - Normal or Discardable
> 4. Maintained by which application
>
> (1) and (2) currently serve us no purpose. They never have. They are just
> things that are nice to have. I cannot even name a single use case for them,
> except that they let us see when a statement was added.
>
> (3) is what powers Nepomuk Backup. We do not back up everything but only
> back up the data that is not discardable. So, stuff like indexing
> information is not saved. Currently this system is slightly broken as one
> cannot just filter on the basis of not Discardable Data, as that includes
> stuff like the Ontologies. So the queries get quite complicated. Plus, one
> still needs to save certain information from the Discardable Data such as
> the rdf:type, nao:created, and nao:lastModified. Hence, the query becomes
> even more complex. For my machine with some 10 million triples, creating a
> backup takes a sizeable amount of time ( Over 5 minutes ), with a lot of
> cpu execution.
>
> Current query -
>
> select distinct ?r ?p ?o ?g where {
>     graph ?g { ?r ?p ?o. }
>     ?g a nrl:InstanceBase .
>     FILTER( REGEX(STR(?r), '^nepomuk:/(res/|me)') ) .
>     FILTER NOT EXISTS { ?g a nrl:DiscardableInstanceBase . }
> } ORDER BY ?r ?p
>
> + Requires additional queries to back up the type, nao:lastModified, and
> nao:created.
>
> Maybe it would be simpler if we did not make this distinction? Instead we
> back up everything (really fast), and just discard the data for files that
> no longer exist during restoration? It would save users the trouble of
> re-indexing their files as well. More importantly, it (might) save them the
> trouble of re-indexing their email, which is a very slow process.
>
> Also, right now one can only set the graph via StoreResources, and not via
> any other Data Management command.
>
> ----
>
> (4) is the most important reason for graphs. It allows us to know which
> application added the data. Stuff starts to get a little messy when two
> applications add the same data. In that case those statements need to be
> split out of their existing graph and a new graph needs to be created which
> will be maintained by both the applications. This is expensive.
>
> I'm proposing that instead of splitting the statement out of the existing
> graph, we just create a duplicate of the statement with a new graph,
> containing the other application.
>
> Eg -
>
> Before -
>
> graph <G1> { <resA> a nco:Contact . }
> <G1> nao:maintainedBy <App1> .
> <G1> nao:maintainedBy <App2> .
>
> After -
>
> graph <G1> { <resA> a nco:Contact . }
> graph <G2> { <resA> a nco:Contact . }
> <G1> nao:maintainedBy <App1> .
> <G2> nao:maintainedBy <App2> .
>
> The advantage of this approach is that it would simplify some of the
> extremely complex queries in the DataManagementModel. That would result in
> a direct performance upgrade.
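
To make the duplication idea concrete, here is a minimal Python sketch. It is
only a toy in-memory model, not Nepomuk or Soprano code: the QuadStore class
and its method names are hypothetical, and it assumes each application owns
exactly one graph. It shows that when two applications add the same statement,
nothing has to be split out of an existing graph, and removing one
application's data leaves the other application's copy untouched.

```python
from collections import defaultdict


class QuadStore:
    """Toy store with one named graph per application (an assumption)."""

    def __init__(self):
        # Maps application name -> set of (subject, predicate, object) triples.
        self._graphs = defaultdict(set)

    def add(self, app, s, p, o):
        # The statement is simply duplicated into the application's own graph.
        self._graphs[app].add((s, p, o))

    def remove_by_app(self, app):
        # "Remove Data By Application" becomes dropping one graph.
        self._graphs.pop(app, None)

    def maintainers(self, s, p, o):
        # Applications whose graph contains the statement, in sorted order.
        return sorted(a for a, g in self._graphs.items() if (s, p, o) in g)

    def statements(self):
        # Distinct statements across all graphs; duplicates collapse here.
        result = set()
        for triples in self._graphs.values():
            result |= triples
        return result


store = QuadStore()
store.add("App1", "resA", "rdf:type", "nco:Contact")
store.add("App2", "resA", "rdf:type", "nco:Contact")
store.remove_by_app("App1")  # App2's copy of the statement survives
```

The cost is one extra stored statement per additional application, in exchange
for never having to move a statement between graphs.
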
> It would also solve some of the ugly
> transaction problems we have when 2 commands are accessing the same
> statement, and one command removes the data in order to move it to another
> graph. This has happened to me a couple of times.
>
> ---
>
> My third proposal: considering that the modification and creation dates of
> a graph do not serve any benefit, perhaps we shouldn't store them at all?
> Unless there is a proper use case, why go through the added effort?
> Normally, storing a couple of extra properties isn't a big deal, but if we
> do not store them, then we can effectively kill the need to create a new
> graph for each data management command.
>
> With this one would just need 1 graph per application, in which all of its
> data would reside. We wouldn't need to check for empty graphs or anything.
> It would also reduce the number of triples in a database, which can get
> alarmingly high.
>
> This seems like a pretty good system to me, which provides all the
> benefits and none of the losses.
>
> What do you guys think?
>
> [1] http://techbase.kde.org/Projects/Nepomuk/GraphConcepts
>
> --
> Vishesh Handa
>
>

--
Vishesh Handa
_______________________________________________
Nepomuk mailing list
[email protected]
https://mail.kde.org/mailman/listinfo/nepomuk
