Ok, so this discussion is starting to be fruitful. I am learning things about how clerezza works, and the reasons for the decisions made, and I understand your point of view. This has lead me to a proposal below that I think could satisfy both your requirements and mine in a way that is flexible and extensible for all future uses.
On 1 Jun 2011, at 01:52, Reto Bachmann-Gmuer wrote: > On Sun, May 29, 2011 at 7:06 PM, Henry Story <[email protected]>wrote: > >> >> On 26 May 2011, at 20:31, Reto Bachmann-Gmuer wrote: >> >>> With CLEREZZA-540 I suggest a GraphNodeProvider-Service that returns a >>> GraphNode given a named resource. Mainly this code that used to be in >>> DiscoBitsTypeHandler which has been generalized. >>> >>> The issue is described as: >>> "Implement a platform service that returns GraphNodes for URIs. The >>> GraphNode is the resource identified by that uri with as BaseGraph sources >>> considered authoritative for that resource. " >>> >>> Of course "considered authoritative" it not a very sharp description. The >>> issue is labeled with "platform" which implies it is not a generic utility >>> of clerezza.rdf but that it relies on platform default graphs. >>> >>> The solution proposed in commit >>> #1125477<http://svn.apache.org/viewvc?view=rev&rev=1125477>and >>> #1125652 <http://svn.apache.org/viewvc?view=rev&rev=1125652> sets the >>> basegarph as follows: >>> - always trust the content graph >>> - for remote resource trust the graph you get by dereferencing the uri >>> - for resources in the user-uri space trust that user >> >> Of course what one thinks of this patch depends completely on how this >> provider gets used. It has quite a lot of limitations it seems to me as >> implemented >> currently - which is of course not a final failing in a developing project, >> but it does seem like a good discussion would help us narrow perhaps on >> better solutions or on fixes, which is part of the reason I thought the ISSUE >> should not yet be closed. >> > Keeping the issue open blocks development, an issue is associated to a patch > as long as an issue is open other code should not depend on code committed > under this issue. yes. I understand this now. Sorry for putting code into clerezza earlier and not trying to fix it immediately. I have open issues in other open source projects that have been worked on for over a year. For example the issue in Chromium http://code.google.com/p/chromium/issues/detail?id=29784 > This is way an issue should be closed shortly after the > patch has been committed, and conversely why code shouldn't be kept in trunk > it the issue cannot be closed. An open issue is more than a discussion > reminder. Yes. But one has to also leave enough time for debate on an issue before it gets closed, if there is an issue that is raised. Here I was arguing that the combination of 540 and 544 is not right. 544 confirms my doubts about 540 because it shows how it is intended to be used. The aim of having a discussion is to help find the best answers to a problem. >> >> The failings I mentioned in the ISSUE-540, and develop below are: >> >> 1. it to relies on more and more Services that each require TcProviders. >> This >> feels very ad-hoc. The ones I mentioned are >> >> - UserManager: requires a TcManager >> - WebIdGraphsService: also requires a TcManager >> - PlatformConfig: requires TcManager >> - ContentGraphProvider: requires TcManager >> - TcManager >> >> Each of these is used, and makes a call to the database. And the >> TcManager itself each time >> iterates through a number of TcProviders. >> > Yes, TcManager is the main entry point to the rdf data. I don't see any code > smell here. I a classical RDBMS java application a component will typically > use jdbc and rely on other components that use jdbc, here it's TcManager > instead of jdbc, The thing to do would be to look at the number of connections that are generated to the DB. My feeling is that one could do the same and have only 1 or 2 connections to the DB. Here we could quickly end up with 10 times more.... But even if I am wrong about this, I think there is a better solution for how we can do what you want in a way that would satisfy both of us. I detail that below. >> >> 2. when asking for an external URI, you get the whole content graph too >> > You get a GraphNode > > >> On my fresh install of ZZ that is 20 times more information than the >> initial graph. >> > - What is 20 times more? > - What do you mean by "get"? > > A graphnode point to a resource and is designed for browsing from resource > to resource. It is not a graph but a node in a graph. The object is > associated to a base-graph which used to identify the propertied and > instanctiate another graphnode when hoping to a property value. The > underlying graph could for instance be Timbl's GGG (giant gloabl graph aka > the web). yes ((but clearly you don't want to dereference the whole web when working...)) Also there will be contradictions in the information on the web. Some people may trust some graphs, other trust others. Graphs can be merged easily in RDF - IF they are believed both to be true. But what is believed to be true will depend on what possible world you believe yourself to be in. I argued this in "Beatnik: change your mind" in more detail, if that helps for people following this discussion http://blogs.oracle.com/bblfish/entry/beatnik_change_your_mind From the point of view of WebID and security I want to be able to tell WHO said what. In many applications being able to be very clear about where something was said is going to be essential to giving good feedback. Some example coming from the field I am working on below. So for a foaf-browser, I want to know when TimBl declares someone to be a friend, and differentiate that from when someone declares himself to be a friend of TimBL, which is a very different thing. When I get Dan Brickley's graph I may want to know all the people he mentions in his foaf profile - even if he does not mention them as foaf:knows related to him. If the GraphNodeProvider returns a union graph of the documentation graph, content graph,... and his foaf profile then when searching for all the foaf:Person I will get the documentation writers too, the writers of content in the content graph, and who knows what else... many people will have no direct relation to Dan at all. People can say true things about Dan but those not be things Dan himself would say. I believe these use cases are not limited to the foaf browser but to a very large category of semantic web applications. Give me some linked data application, and I will easily come up with use cases of the same kind. >> How big is that going to become as one's content graph grows over time? > > The underlying graph would be the size of the virtual content graph + the > size of the remote graph - duplicate triples > >> Is this not going to create a huge bottleneck very quickly? > > No, the time complexity for accessing a property of graphnode grows linearly > to the number of graphs in the graph-union (which in this case is 2) but is > only O(log n) for the number of triples in the indexed triple store. > > I thought I had heard people mention issues with speed on this list. >> > You may check archives of this list at: > http://mail-archives.apache.org/mod_mbox/incubator-clerezza-dev/ Do you have a precise thread? > > I mentioned performance concern directly to you when I saw that your > EasyGraph class copies to memory all triples of a TripleCollection rather > than accessing them in the original graph. So yes, the performance > implications of our design choices should always be in our minds. That is perfectly good criticism. I think you may have patched it even... If not I should definitively fix that. > >> >> So to verify this do the following: >> >> zz> import org.apache.clerezza.platform.graphnodeprovider._ >> zz> val gnp = $[GraphNodeProvider] >> zz> val tbl = gnp.get(new UriRef(" >> http://www.w3.org/People/Berners-Lee/card#i")) >> zz> tbl.getGraph.size >> res0: Int = 1878 >> >> getGraph returns "the graph the node represented by this instance is in", > as I mentioned before this could be GGG. On a GraphNode you should usually > not invoke getGraph, the object exists to hop from resource to resource. (Well I thought it reveals how much information is being shipped around, but apparently not as you argue below) > >> If I get Tim Berners Lee's Graph on the command line I find >> >> $ rapper http://www.w3.org/People/Berners-Lee/card | wc >> rapper: Parsing URI http://www.w3.org/People/Berners-Lee/card with parser >> rdfxml >> rapper: Serializing with serializer ntriples >> rapper: Parsing returned 78 triples >> 78 380 9978 >> >> So here we have 78 triples, but the resulting answer schlepped around is 20 >> times bigger - on a new installation! >> > If by schlepped around you mean that they are copied in memory or something > like that this is not the case. It's a reference, that's all. Ah ok. Some graphs copy all the information into a set, others work by reference... I had not cottoned onto that. The ones I have been using were based on Hashsets. So that is less of a problem then. > > >> >> I wondered what was going on, because to my surprise on a new installation >> the Content Graph contains only 1 triple. >> So I looked into a running instance of ContentGraphProvider and found that >> the additions array contained the following graphs in addition to the >> content graph: >> >> - <urn:x-localinstance:/documentation.graph> 1002 triples >> - <urn:x-localinstance:/config.graph> 176 triples >> - <urn:x-localinstance:/web-resources.graph> 621 triples >> - <urn:x-localinstance:/enrichment.graph> 0 triples >> >> So that does then indeed add up to the number. >> > Great, you found the answer :) > > >> >> What I am wondering is in what cases is this needed? It seems like this may >> indeed what a particular application may require, but does it have to be >> a general service? The name certainly suggests a very general service, not >> one required for a particular application. >> > This is about ContentGraphProvider then, not about issue 540. It's the > ContentGraphProvider which provides the graph of instance-wide and public > information for the platform 540 the GraphNodeProvider delegates decisions to the ContentGraphProvider 544 uses the GraphNodeProvider that delegates to the ContentGraphProvider and ties it into the the core, so that when a JSR311 class requests a named graph it then gets whatever the ContentGraphProvider decides is trusted content. SKETCH OF A COMPROMISE SOLUTION =============================== Perhaps there is a way to allow make things more transparent, by for example having JSR311 classes that really want the full union returned by the current ContentGraphProvider to have that, and other applications to get something more limited. I suggest that we think of naming the content graph or at least build something so that those who need the content graph can ask for it clearly, and those who don't can make sure they don't get it. - It may be useful then to name the content graph - it would be a union graph that could be specified by a SPARQL UNION query for example or the equivalent. - have a JSR311 class return a NamedGraphNode (or something like that) which can then call the CallbackRenderer. The NamedGraphNode could name the union graph, or could be a union graph - by reference. So all the code would do is return new UnionNamedGraphNode(new UriRef("urn:x-localhost/contentGraph"),new UriRef("http://remote.example/resource/")) or some nice syntactic sugar for that. For example for apps requiring the union of Content + other that you need, something like the following would be neat return new ContentGraphNodePlus(UriRef("http://remote.example/resource/")) These objects would just be holders of the graph name(s), which the TcManagers can then hook up into the underlying triple store. Something along those lines would be very nice. One could easily write applications that get union of contents as you wish, and I could easily get very precisely defined graphs for security based application, or more flexible linked data graphs too. It could also avoid the iterative way the GraphNodeProvider currently works. Having something like that would mean that perhaps the addition of the new method in CallbackRenderer public void render(UriRef resource, GraphNode context, String mode, OutputStream os) throws IOException; would no longer be needed, or would be adapted somewhat. It could also mean that the GraphNodeProvider could be a lot more general, as its name indicates it should be. The information about graphs hard coded into the provider could then be moved to a Graph (or GraphNode or NamedGraph or NamedGraphNode object). It would then be a lot clearer when looking at JSR311 code what was being returned. [[ ps: a thought One could perhpas write implementations of such a NamedGraph that would perhaps allow links to be followed outward (from accepted named graphs to others graphs it links to, up to a certain number of hops). ]] >> Perhaps changing the name from GraphNodeProvider to >> ContentGraphPlusOtherProvider would make more sense. > It's a platform service that provides GraphNodes. Being a platform service > implies it usesthe platform means of getting trusted content. If it would > just dereference URIs the it would probably be placed in a subpackage of > clerezza.rdf. perhaps. But why not make things nice and general as explained above? Currently with changes to 544 and in particular the render method public void render(UriRef resource, GraphNode context, String mode, OutputStream os) throws IOException; when a JSR311 class returns a URI, the renderer does not get the graph named by that URI but that graph and something else, defined in some unrelated package. For me this does not make it easy to understand the code. >>> This might not match an intuitive understanding of "authoritative" and >> I'm >>> happy to redefine the issue so that no confusion arises. >> >> One thing I am not quite clear about yet, is who writes to the content >> graph? I see a lot of modules use it. >> > Modules can write to the content graph or add temporary additions to it. > Actually writing to the content graph should happen when public and trusted > information is added. An information is considered trusted when added by a > user with respective permission or verified by privileged code (e.g. that > allows the public to add see-also references). Good so say a trusted user of mine :joe truthfully says b:danbri foaf:knows :joe . then currently when I ask for http://danbri.org/foaf.rdf#danbri I will get a graph that contains the above triple even if danbri does not make that claim. Sometimes that is good, and sometimes not. In many cases as I have argued it will be important for me to know what danbri claims. Perhaps so I can ping him to tell him about my desire for him to claim friendship with me. In the current API changes it won't be clear at all why when I ask for <http://danbri.org/foaf.rdf#danbri> I get <http://danbri.org/foaf.rdf#danbri> + 5 other graphs. Or it will require the developer to know the internals of clerezza to work this out, as I have just had to do myself. > > >> >>> >>> What I do strongly believe is that the proposed patch offers a major and >>> very useful new functionality. Especially as it allows the following >>> features to be implemented: >>> - Thanks to CLEREZZA-544 one can call the render-method to delegate the >>> rendering of resources with a UriRef instead of a resource, >> >> I think you mean a "UriRef instead of a Graph". >> > "UriRef instead of a GraphNode" (GraphNode is rougly what is called > "Resource" in the jena api) > >> >> Yes, that makes sense. But why does the GraphNodeProvider have to cast >> such a wide net to catch so many triples? It seems to me that if one >> is to use a URI then it would be better that the URI refer precisely to >> that named graph (or to a node it it). One could use other tools to create >> virtual graphs, like Simon Schenk's Networked Graphs I mentioned >> >> http://blogs.oracle.com/bblfish/entry/opening_sesame_with_networked_graphs >> >> These allow one to have virtual graphs depending on a SPARQL query pattern. >> There it would be easy for different services to specify different ones. >> And I think something like that would be really good to have. >> > > The possible GGG base graph could be implemented exactly as you describe, > but using sparql would probably be quite an inefficient approach. Good so I have no problem in naming such a base graph, as long as the JSR311 code and can clearly specify to the CallbackRenderer what it really wants. > > > >> >>> in this case the >>> resource is rendered using its own baseGraph rather than the one of the >>> calling template. An example usecase for this is rendering the author of >> a >>> comment, the whole profile of the (possibly remote) commenter isn't and >>> shall not be part of the baseGraph of the GraphNode returned by the >> jax-rs >>> resource method, yet for rendering the comment-author infobox it might be >>> beneficial to render a GarphNode with a baseGraph containing all of the >>> information in the users profile-document >> >> But why also all the information from the documentation and the config >> graphs? >> It may be useful in some very limited cases, but it may mostly not be. It >> seems that >> some use cases would be useful to help describe this in more detail. >> > > The config graph will seldomly be in the reachability graph of a resource > but if it probably make sense of having it there. > > > >> >>> - With CLEREZZA-541 the GraphNodeService is accessed from TypeHandler, I >>> posted a resolution to this issue because it was already quite there on >> my >>> local machine when Herny reopened CLEREZZA-540, to respect the reopening >> of >>> the issue I didn't mark the dependent issue as resolved. I will of course >>> revert the changes if requested to do so by a qualifying -1. >>> >>> I'm not arguing that my patches solve all issues one might have around >>> getting resource descriptions but I do think it is very valuable and to >>> allow to base other stuff on this service I would like the issue to be >>> closed. As Henry reopened the issue twice and I don't want to close the >>> issue again without a broader discussion. Yet as many thing depend on the >>> issue leaving it open doesn't seem an option to me. >> >> What depends on it is something you are wanting to do in your projects it >> seems to me, and that is not that clearly laid out. > > Also because of the negative expirience with the monster issues recently I'm > trying to be very explicit on dependent issues. A motivator for this issues > was also your code around browsing remote foaf-profiles. yes, the foaf browser is a very simple example linked data application. I thought it would be good to have one in zz to help bring out these issues. > > >> Because it does not seem obvious to >> me why a service should make the decisions this one does about what is >> authoritative. >> > I mentioned earlier that other trust settings might be added with other > issues. You seem to agree that it is useful to assume the graph resulting > from dereferencing the resource as authoritative. The content graph is > authoritative by platform conventions. There is no specific decision of the > service here Ok. Again I think it will be very confusing to bake this into clerezza the way it is done now. It confused me a lot, and this thread is a testament to that. > >>> >>> Future enhancement might include: >> >>> - manually force refresh of caches for graphs related to a requested >>> resource >> >> Yes, indeed. But why here, when it is not in the WebProxy? You would think >> cache >> update functionality should go in the WebProxy right? >> > Because when you access resources you don't (usually) care about the > underlying graphs. You can already forces cache-refreshes when using the > WebProxy. Ah ok. So this is not the core issue here. > > > >> >> >>> - force an alternative set of baseGraphs to be used (e.g. Only local or >> only >>> remote sources) >> >> What I am wondering is why all this is done like this? If I go over the >> changes of the >> past few weeks this is what I see: >> >> So if we go over the history of refactorings that led us here. >> >> 1. You did not like the initial WebProxy you I wrote by refactoring your >> WebIdGraphsService. >> Neither did I in fact - but it did work at least and added minimum >> change - being new to ZZ >> I did not want to play around too much in the internals. >> > well your code massively expanded the rdf api. Compare this with the length > of discussion which is just about one method and one interface in the higher > level platform api. Ok. In any case it lead to the WebProxy being a core component which is good. It was a mistake for me to sidetrack this discussion there too. > > > >> 2. You moved the old WebProxy to what seemed like a nicer interface: the >> TcProvider interface. And >> indeed that does look a lot better. BUT but this interface is really >> meant for direct, no interpretation >> access to the database and so lack >> - key notions of caching (well I suppose they could make sense even for >> other sesame or jena graphs?) >> > No I don't think this would make sense for jena or sesame graphs. I think > that most clients don't need to force custom caching/update policies, but if > they do they can access the WebProxy services which offers methods not > available via TcManager ok > > >> - does not provide a method for returning the final name of the graph >> (for redireted resources, or foaf:knows), >> when the WebProxy gets called >> (since this the TcProvider assumes you give it exactly the correct >> name of the graph) >> => So really it is quite uncomfortable there somehow. >> > This is something you could open an issue about, something like "Graph > aliasing in TcManager/TcProvider" ok > > > > >> 3. This led you then to move to this GraphNodeProvider in order get a graph >> from a URI - which is very similar to the TcProvider in many ways, right? > > No, its something fundamentally different > > It even uses the code of the original WebProxy to do a HEAD >> on a remote resource to find the graph name (and which one would assume >> would be part of the WebProxy >> code since it will be making the real HTTP Connection, and so can follow >> the changes of the graph names.) >> > So what? (do you want to hear that your name is in the class comment or that > this particular code comes from the WebId-service which you extracted when > introducing your WebProxy) No. But I'll leave this discussion here as this is turning out to be a side issue. > > > >> But because the TcManager interface is really a database layer interface, >> that cannot be placed there, and >> so is now placed into something outside - this class you have now >> written. >> > I don't understand. I think the possibilities the new storage.web service > offers are quite cool. For example that you can do sparql-queries on remote > graphs using the clerezza endpoint (if you have appropriate privileges). You > can browse resource in remote graphs as described here: > http://mail-archives.apache.org/mod_mbox/incubator-clerezza-dev/201105.mbox/%[email protected]%3E yes, that is cool. The issue here is related to CLEREZZA-538 "cached graphs can have names with #", so that if someone writes a val planetRdf = $[TcManager].getGraph("http://planetrdf.com/index.rdf".uri) which leads him to follow a hash uri there could be an exception being thrown. Anyway, lets move that discussion to CLEREZZA-538. > > >> 4. But instead of just having a GraphNodeProvider that just returns the >> graph, you have added some twists to >> it and return more than jut the named graph. There is nothing to say that >> a named graph cannot be the union >> of many other graphs, but it seems really arbitrary for me to get the >> documentation of clerezza along with the >> triples of Tim Berners Lee's graph. >> > >> Somehow things have gone a bit haywire at the end here. > > If you call getGraph on a GraphNode you're leaving the scope of the > GraphNode. Probably all this discussion would not be necessary if had been > using getNodeContext instead of getGraph. The NodeContext is what related to > the node. Using getGraph is a bit like doing the following: > > File file = SomeService.getFileDescribing("Tim Berners Lee") > file.getParent().getParent().getParent().listChildrenRecursively() I don't think that is a good way of looking at what graphs are useful for. Graphs are more like bubbles in a comic strip. I argue this very carefully in "Are OO languages Autistic?" http://blogs.oracle.com/bblfish/entry/are_oo_languages_autistic This is a fundamental new programming element provided in the semantic web. So the context as you are defining it is not what I am looking for. I am really looking for the named graph - the entire claim made by a resource. This can be seen by considering the example I gave above where someone adds to the content graph information about Dan Brickley b:danbri foaf:knows :joe . If I only get Dan Brickely's graph back that triple will not be there. If I get Dan Brickley's + the content graph, then that information will appear even if I just ask for dan's node context. Also there may be information about people appearing in Dan Brickley's profile that are not directly linked by him, that I will also be interested in retrieving. So the context is not the tool I need - and I don't think my use cases are special. > > The listed files can contain thigs that are completely unrelated to Tim > Berners Lee > > > >> And I think this is due to a bit of confusion of the needs >> of your application with trying to keep the general architecture clean. >> > As I said, I did not made this particularly for an application, my wall > application is merely a demo. When we want to do something like a > foaf-browser we want to be able to display the resource in their context, > just a usecase. Ok, so that is where our disagreement lies. The node context is in many case not at all what we want. It both adds too much information and not enough. It may be that in the wall demo that is not visible. But in security matters and trust matters it will make a big difference. > > >> >> Now on the whole I have learnt a lot about Clerezza by following this, >> but I just can't say that this looks like >> a good long term solution. We are constantly moving around and around >> something. >> > This is your impression. I hope my explanations to the concrete points you > mention could help changing this impression. I think it should now be clear how we can come to a solution that satisfies both our needs. > > > >> >> Would any of the following work? >> >> - TcProvider extended to specify caching options? >> > Unrelated to the issue. But no, I think only storage.web needs those options ok. > > >> - Graph to be extended so that it can contain its name (so that one can >> ask for a resource in a TcProvider, >> and find out what its name really was by inspecting the resulting graph) >> > Again, his is unrelated to the issue at hand. > > Having triple collection-aliases with primary names accessible via TcManager > would be fine, but I would not agree at having them as part of the > TripleCollection ok, so perhaps we should focus on this and on the proposed solution above. > > >> -> if not, should WebProxy really be a TcProvider? >> > Since there is no way of knowing ahead of time what the name >> of a graph for a resource is, given that redirects can occur at any >> time. >> >> The WebProxy as TcProvider mostly makes sense otherwise, so it does feel >> like the above two things would help. >> >>> >>> So I'm asking you to kindly review the proposed code and vote about >> closing >>> CLEREZZA-540 >>> >>> [ ] +1, I agree with accepting the proposed code into trunk >>> [ ] 0, I don't care >>> [ ] -1, I don't want this code in trunk (must specify a technical >>> explanation, please also specify what would have to be changed for the >> patch >>> to be acceptable to you. >> >> -1 for the moment on closing the issue. (not on removing the code) >> Please answer the above points carefully. >> > -1 are against code, keeping the code in trunk if you can't accept makes > little sense to me > > > Okay, I see two reasons which could qualify as technical reasons: > - The service returns huge amount of triples: this is just wrong as it > returns graphnodes ok. > - The class should be named ContentGraphPlusOtherProvider instead of > GraphNodeProvider: As It doesn't provide a Graph but a GraphNode your name > seems wrong rather than just imprecise. A precise name might be > GraphNodeBasedOnContentGraphPlusOtherProvider. > > Would the rename be okay for you to accept the proposed path? (I really > would like to go back to productive work, so I rather have a horrible name > than seeing the project stalled by your veto). Well then the issue would be why this class should appear in the CallbackRenderer. No I think there should be a way from JSR311 code to ask to ask precisely for the type of GraphNode it wants with very little coding. So that for the use cases where walking the content graph is the right thing to do it is one line of code, and for cases where something more precise is needed it is also just one line of code. In any case it should be easy when reading the code to understand what is going to be displayed. I hope this helps, Henry > > Reto > > PS: You seem to be extensively using you're right to veto while ignoring > other's veto on your code, looking at > https://issues.apache.org/jira/browse/CLEREZZA-515 I see that the commits > have not been reverted even more than one week after my veto and request to > revert. Hmm, I did revert that using git. But I am not sure why that does not appear in the commits for that issue.... I see you brought that up in another thread. Social Web Architect http://bblfish.net/
