On 7 Oct 2009, at 22:47, Alexander Klimetschek wrote:

On Wed, Oct 7, 2009 at 20:34, Ian Boston <i...@tfd.co.uk> wrote:
I agree, I would like to adopt sensible naming, but we keep hitting situations where, even with the most reasonable domain prefix, we end up with
2K items in a folder. Then the update rates go through the floor, and
contention and unmergeable changes make things fall over (usually at the
worst possible time, when load is highest).

In our case we often run out of things to slice before we reach a position where the store works. E.g. ieb -> i/ie/ieb gives 64 at level 1, which generates huge amounts of collision at level 2, which again only has 64, making the
maximum scale somewhere around 4096*1024 items, assuming a perfect
distribution, before the bottom-level folders breach 1024 children. For
messaging, for instance, I need a store that does more than about 255^3 before
colliding, i.e. 16M*1024. Am I wrong to be choosing JCR as a message store
to support this use case?
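
The arithmetic in the paragraph above can be checked with a short sketch. It assumes a perfectly even distribution, the figures quoted there (64 folders per level, two levels, 1024 children per bottom-level folder) and a hypothetical helper name `max_items`; it says nothing about how Jackrabbit behaves in practice.

```python
def max_items(fanout_per_level, levels, max_children):
    """Upper bound on stored items, assuming a perfectly even distribution.

    fanout_per_level: number of distinct folders at each sharding level
    levels:           number of sharding levels above the leaf folders
    max_children:     children allowed per leaf folder before it degrades
    """
    leaf_folders = fanout_per_level ** levels
    return leaf_folders * max_children


# Figures from the message: 64 folders at each of 2 levels, 1024 per leaf.
current = max_items(64, 2, 1024)    # 4096 leaf folders * 1024 children

# Target from the message: about 255^3 leaf folders, 1024 per leaf.
needed = max_items(255, 3, 1024)

print(current, needed, needed // current)
```

Under these assumptions the target is several thousand times larger than what the two-level prefix scheme can hold.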

I think you are really at the edge of scaling here. How many messages
are added per day? I'd think that date + maybe time (if there are more
than 2K per day) should balance it enough, for example. Organizing
messages by date is probably the best way anyway. And I guess they
won't change at all, only new ones are added, which also should reduce
contention to the node with the current time.


I agree that all of these structures help avoid the scaling issues but IMO they miss two points that have been highlighted in our use of Sling.

I am only talking about URLs here, *not* the path in JCR, unless we are forced to have a 1:1 mapping.

1. The URL space is part of the UI and "owned" by the User, UX Designer, UI developer.
2. Imposing a convention on that URL space for the affordances of the back end causes just the problem that you are concerned about. Now the UI developer needs to know the internals of how to structure those URLs to achieve scalability.

BTW, a UI developer does not write Java code. They use the REST interfaces, they might write some py, esp or rb.

On 1, our Users, UX Designers and UI developers are demanding URLs like /xxxx/yy where, for all instances of yy, yy is unique, and yy might be one of 1-200K and in some instances I know of up to 4M (the 16G is an edge case, but if I break out of the Higher Ed use case there are plenty of examples of URLs where yy is one of billions). There are two solid examples: /user/eid, where eid is the institutional ID, and /site/siteid, where siteid is the name of the site, e.g. physics101.

These URLs *must* be speakable human to human, so /site/e4f3-de45-f345-efe4 is not acceptable, and /user/i/ie/ieb, although just speakable, will remind our community of their institutional deployments of the Andrew File System. IMHO that is *not* a good thing, as for many institutions it has not been synonymous with scalability.

On 2: if we have to communicate how to structure the URL to UI developers for storage, then it hardly matters what the scheme is; we have to communicate it. An algorithm that says formatTime(now,"/{YYYY}/{MM}/{DD}/") is almost as simple as formatSha1(pathInfo,"/{01}/{23}/{45}/{67}"), but I can't ask the UI developer to do either. This is not to say that they might not decide to structure the URL in a semantic form, and I would encourage them to do so, but they always come back to the case where there is a user-generated URL space that will have > 10K items at yy.
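
To make the two schemes in point 2 concrete, here is a minimal Python sketch. The names `date_path` and `sha1_path` are illustrative stand-ins for the formatTime and formatSha1 calls mentioned above, not an existing Sling or Jackrabbit API. Reading the {01}/{23}/{45}/{67} pattern as "the first four pairs of hex digits of the SHA-1 digest" is an assumption.

```python
import hashlib
from datetime import datetime


def date_path(when):
    """Date-based prefix of the form /YYYY/MM/DD/ (hypothetical helper)."""
    return when.strftime("/%Y/%m/%d/")


def sha1_path(path_info, levels=4):
    """Hash-based prefix built from the leading hex-digit pairs of SHA-1(path_info).

    Assumption: {01}/{23}/{45}/{67} means digest characters 0-1, 2-3, 4-5, 6-7.
    """
    digest = hashlib.sha1(path_info.encode("utf-8")).hexdigest()
    pairs = [digest[i:i + 2] for i in range(0, 2 * levels, 2)]
    return "/" + "/".join(pairs) + "/"


# Date-sharded location for a message sent on the day of this thread.
print(date_path(datetime(2009, 10, 7)) + "message-1")

# Hash-sharded location for a site the user wants to reach as /site/physics101.
print(sha1_path("/site/physics101") + "physics101")
```

Either function is only a few lines, which is the point being made: the cost is not in the algorithm but in having to explain it to whoever builds the URLs.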


E.g. "What! You mean I can't just put it at /site/xxx, I have to structure the URL? But that's not what the users are saying they want. They want to be able to decide what the URL to their site is and, btw, they don't like using /site, they want /xxx, you know, like http://www.bbc.co.uk/radio4" (I paraphrase a discussion of a few months ago).





If there is some other categorization of messages, eg. like the
project or group or whatever they belong to, you can put them in the
project's folder and then do the substructure via the dates. If you
give the messages a nodetype + other metadata as properties, you can
search them across projects or months/years.

Sounds like if JCR-642 was fixed, none of this would be an issue?

Not really. First of all it's not just a "fix", it requires a complete
rewrite of the internal persistence architecture in Jackrabbit.
Something for a 3.0 maybe (and there are various ideas how to do that
and also improve other bottlenecks).

But even if Jackrabbit scales with hundreds of thousands of child nodes
per node, you still have the problem of an unbalanced tree: it will be
hard, not to say impossible, for a human to browse that tree (you'd
need a very advanced paging tree view to be able to go through it),
and it just doesn't "feel" right. Well, at least to me ;-)


Agreed, a list of all nodes at yy is explicitly not supported; we use search to provide a number of different hierarchies into that space.

eg
date organized http://host/messages/yyyy/mm/dd.json
tag organized http://host/tags/sling-dev.json

with a default paging enforced just as any search engine does.

One point here is there are *multiple* views into the information set.


Sorry the message is so long, this is a real, possibly blocking issue for us.

Ian


Regards,
Alex

--
Alexander Klimetschek
alexander.klimetsc...@day.com
