Re: [PubSub] collection node definition

Andy Skelton Tue, 24 Nov 2009 07:43:49 -0800

On Sat, Nov 14, 2009 at 8:00 PM, Robin Collier <[email protected]> wrote:
> While quite powerful, doesn't this make a system quite custom in nature
> to the point where discovery of capabilities and configuration become
> quite useless?  It also implies access to the backend system to be able
> to insert the custom logic.  In the end, wouldn't this only be useful in
> a very closed system?  It strikes me that this would not be too useful
> to an open system where you would not want the users to be able to
> insert code on a server.


The WordPress.com PubSub service runs on a customized Ejabberd
installation. If we had wanted to create firehose (collection) nodes
for certain types of nodes (blogs and their comments) and not had the
ability to customize the code underlying nodes, we would have had to
rely on some form of Collection Nodes and we would have been stuck
with whatever support the software offered. Custom code might always
be the best way to implement systems with complex information.

> I guess I am thinking that this capability should be determined by
> server implementations as an extended capability, and not necessarily
> as part of the spec itself.

This is exactly what I'm thinking. I can see a spec being useful for
more generic systems but it almost certainly wouldn't be useful to me.
I'll describe our service as a use case.

tl;dr: pubsub.im.wordpress.com uses custom code to implement a complex
node tree.

WordPress.com pushes new blog posts and comments through a PubSub
service in Ejabberd. For the convenience of users familiar with
WordPress feeds, A.K.A. laziness, we decided to mimic the standard
feed URLs in our node names.

In WordPress the most commonly used feed URL path is /feed/ which
delivers an RSS feed of recent posts. The URL /comments/feed/ delivers
recent comments. Adding /feed/ after any post URL (/permalink/feed/)
fetches the comments on that post only. Appending /atom/ changes the
format. We mapped these onto PubSub node names by prefixing /blogs/
and dropping /feed/:

All posts: http://domain/feed/ -> /blogs/domain
All comments: http://domain/comments/feed/ -> /blogs/domain/comments
One post's comments: http://domain/permalink/feed/ -> /blogs/domain/permalink
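
The mapping above can be sketched in a few lines. This is only an
illustration in Python, not the code the service runs (that lives in
Ejabberd). The function name feed_url_to_node is hypothetical, and the
/atom/ format suffix is deliberately left unhandled because the mapping
above does not say what happens to it.

```python
from urllib.parse import urlparse


def feed_url_to_node(url: str) -> str:
    """Map a WordPress feed URL to a PubSub node name (hypothetical helper)."""
    parsed = urlparse(url)
    # Split the path into non-empty segments, e.g. "/comments/feed/" -> ["comments", "feed"].
    segments = [s for s in parsed.path.split("/") if s]
    # Drop the trailing /feed/ segment.
    if segments and segments[-1] == "feed":
        segments.pop()
    # Prefix /blogs/ and the domain, then append whatever path remains.
    return "/".join(["/blogs", parsed.netloc] + segments)
```

For instance, under these assumptions a URL ending in /comments/feed/
maps to a node name ending in /comments, and a bare /feed/ URL maps to
the blog node itself.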

A subscription to /blogs/domain/ would not be equivalent to a
subscription to each of its sub-nodes. We could have used a stricter
hierarchy with all items flowing down toward root but we felt a more
familiar scheme would be more user-friendly. (Of course we may be
proven wrong.)

We also have nodes that function sort of like collections without
their names being prefixes of the blog nodes. Maybe they would be
described better as compilation nodes. These two nodes push public
WordPress.com posts and comments to services like Collecta. This is
the mapping:

/blogs/*[private=false] -> /firehose
/blogs/*/comments[private=false] -> /gusher

Let me explain a little about /blogs/*. There are millions of blogs on
WordPress.com. Some of them are private, i.e. access-controlled. We
have no intention of implementing subscription or browsing of /blogs/
as a collection node. We don't even want to store its sub-nodes
(blogs) in the PubSub system if we can use WordPress.com as a back
end. Thus all of the blog nodes are virtual (they only exist while
performing a task). The instantiation of nodes is handled by custom
code that uses WordPress.com as back end via php_app[1].
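
To make "virtual" concrete, here is a rough sketch of the idea in
Python. It is not our Erlang module. BlogInfo, VirtualNode, resolve_node
and the lookup_blog callback are all hypothetical names; lookup_blog
stands in for the call that asks WordPress.com (via php_app) whether a
blog exists and whether it is private.

```python
from typing import Callable, NamedTuple, Optional


class BlogInfo(NamedTuple):
    domain: str
    private: bool


class VirtualNode(NamedTuple):
    path: str
    blog: BlogInfo


def resolve_node(
    path: str,
    lookup_blog: Callable[[str], Optional[BlogInfo]],
) -> Optional[VirtualNode]:
    """Build a node object on demand; nothing is stored in the PubSub system."""
    parts = path.strip("/").split("/")
    # Only paths of the form /blogs/<domain>[/...] are virtual blog nodes.
    if len(parts) < 2 or parts[0] != "blogs":
        return None
    # Ask the back end (assumed callback) whether the blog exists.
    blog = lookup_blog(parts[1])
    if blog is None:
        return None
    return VirtualNode(path="/" + "/".join(parts), blog=blog)
```

The node object lives only for the duration of the request that needed
it, so the node tree never has to be stored or kept in sync with the
list of blogs.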

The collection-ish nodes, /firehose and /gusher, are
access-controlled, notification only, and deliver payloads. These
high-volume nodes likely never will support item browsing. So we
figured the most efficient way to feed them was to send them items
internally from the most specific nodes. We don't bother showing the
origin node in a SHIM because our subscribers don't need it and
because each Atom item contains data sufficient to reconstruct the
origin node path.

Items are published to the comments, firehose, and gusher nodes
internally. For example, when a comment is published to
/blogs/domain/permalink (the origin) our virtual node module directly
calls the function that publishes the comment in
/blogs/domain/comments. Then it checks the blog's privacy settings via
php_app and if it's a public blog the virtual node module publishes
the comment in /gusher.
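
The comment flow just described looks roughly like this. Again this is
a Python sketch under assumptions, not our module: fan_out_comment,
publish and is_private are hypothetical names, with publish standing in
for the internal publish function and is_private for the privacy check
made through php_app.

```python
from typing import Callable


def fan_out_comment(
    domain: str,
    permalink: str,
    item: str,
    is_private: Callable[[str], bool],
    publish: Callable[[str, str], None],
) -> None:
    """Publish a comment to its origin node, then re-publish it internally."""
    # 1. The origin: the node for the post the comment belongs to.
    publish(f"/blogs/{domain}/{permalink}", item)
    # 2. The blog-wide comments node.
    publish(f"/blogs/{domain}/comments", item)
    # 3. The public compilation node, only if the blog is not private.
    if not is_private(domain):
        publish("/gusher", item)
```

Because the privacy check happens at publish time, a private blog's
comments never reach /gusher, whatever the subscriber's access to
/gusher itself.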

The main downside is that it would be a terrible task to port our
system, which I hope never to do. We could switch to a strict
collection node graph but custom logic would still be required to
handle privacy. I have not seen how the switch could be worthwhile.

Our custom nodes-as-code are working just fine. If anything, I'll just
rewrite the re-publication code to avoid data duplication. If you
would like more details on our implementation, please ask.

/tl;dr

Andy

[1] http://github.com/skeltoac/php_app/
