On 7 Oct 2009, at 22:47, Alexander Klimetschek wrote:
On Wed, Oct 7, 2009 at 20:34, Ian Boston <i...@tfd.co.uk> wrote:
I agree, I would like to adopt sensible naming, but we keep on hitting
situations where even with the most reasonable domain prefix we end up with
2K items in a folder, and then the update rates go through the floor, and
contention and unmergeable changes make things fall over (usually just at the
worst time possible... when load is highest).
In our case we often run out of things to slice before we reach a position
where the store works. E.g. ieb -> i/ie/ieb gives 64 at level 1, which
generates huge amounts of collision at level 2, which again only has 64,
making the maximum scale somewhere around 4096*1024 items, assuming a perfect
distribution, before the bottom-level folders breach 1024 children. For
messaging, for instance, I need a store that does more than 255^3 before
colliding, i.e. 16M*1024. Am I wrong to be choosing JCR as a message store
to support this use case?
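For illustration, the capacity arithmetic above can be sketched in a few
lines of Python. The function name and its parameters are hypothetical; the
sketch assumes a perfectly even distribution, the same fan-out at every
level, and a fixed cap on children per bottom-level folder.

```python
def shard_capacity(fanout, levels, max_children):
    """Return (leaf_folders, max_items) for an evenly filled sharded tree.

    fanout       -- distinct folder names available at each level (assumed equal)
    levels       -- number of sharding levels above the items
    max_children -- children per bottom-level folder before it degrades
    """
    leaf_folders = fanout ** levels
    return leaf_folders, leaf_folders * max_children


# Two levels of 64, as in the i/ie/ieb example.
current = shard_capacity(fanout=64, levels=2, max_children=1024)
# Three levels of 255, the scale asked for in the messaging case.
wanted = shard_capacity(fanout=255, levels=3, max_children=1024)

print(current)  # (4096, 4194304)
print(wanted)   # (16581375, 16979328000)
```

So the two-level scheme tops out at roughly 4M items, while the messaging
case asks for roughly 16M bottom-level folders.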
I think you are really at the edge of scaling here. How many messages
are added per day? I'd think that date + maybe time (if there are more
than 2K per day) should balance it enough, for example. Organizing
messages by date is probably the best way anyway. And I guess they
won't change at all, only new ones are added, which also should reduce
contention to the node with the current time.
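A minimal sketch of the date (+ optional time) layout suggested here, in
Python. The /messages root, the function name and the message id are
assumptions made for the example, not anything Sling or Jackrabbit
prescribes.

```python
from datetime import datetime


def message_path(message_id, when, with_hour=False):
    """Build a storage path such as /messages/2009/10/07/<id>.

    with_hour adds an hour level, for the case where a single day
    would otherwise collect more than ~2K messages.
    """
    fmt = "%Y/%m/%d/%H" if with_hour else "%Y/%m/%d"
    return "/messages/{}/{}".format(when.strftime(fmt), message_id)


received = datetime(2009, 10, 7, 22, 47)
print(message_path("msg-42", received))                  # /messages/2009/10/07/msg-42
print(message_path("msg-42", received, with_hour=True))  # /messages/2009/10/07/22/msg-42
```

New messages only ever land under the folder for the current day (or hour),
which is where the writes, and any contention, would concentrate.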
I agree that all of these structures help avoid the scaling issues but
IMO they miss two points that have been highlighted in our use of Sling.
I am only talking about URLs here, *not* the path in JCR, unless we are
forced to have a 1:1 mapping.
1. The URL space is part of the UI and "owned" by the User, UX
Designer, UI developer.
2. Imposing a convention on that URL space for the affordances of the
back end causes just the problem that you are concerned about. Now the
UI developer needs to know the internals of how to structure those
URLs to achieve scalability.
BTW, a UI developer does not write Java code. They use the REST
interfaces, they might write some py, esp or rb.
On 1, our Users, UX Designers and UI developers are demanding URLs
like /xxxx/yy where for all instances of yy, yy is unique, and yy
might be one of 1-200K and in some instances I know of up to 4M (the 16G
is an edge case, but if I break out of the Higher Ed use case there are
plenty of examples of URLs where yy is one of billions). There are
two solid examples: /user/eid, where eid is the institutional ID, and
/site/siteid, where siteid is the name of the site, e.g. physics101.
These URLs *must* be speakable human to human, so
/site/e4f3-de45-f345-efe4 is not acceptable, and /user/i/ie/ieb, although just
speakable, will remind our community of their institutional deployments of the
Andrew File System, IMHO *not* a good thing, as for many institutions it has
not been synonymous with scalability.
On 2, if we have to communicate how to structure the URL to UI
developers for storage, then it hardly matters what the scheme is; we
have to communicate it. An algorithm that says
formatTime(now,"/{YYYY}/{MM}/{DD}/") is almost as simple as
formatSha1(pathInfo,"/{01}/{23}/{45}/{67}"), but I can't ask the UI developer
to do either. This is not to say that they might not decide to structure the
URL in a semantic form, and I would encourage them to do so, but they always
come back to the case where there is a user-generated URL space that
will have > 10K items at yy.
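To make the hashed variant concrete, here is a rough Python sketch of what
something like formatSha1 could do. It assumes the {01}/{23}/{45}/{67}
template means "hex characters 0-1, 2-3, 4-5 and 6-7 of the SHA-1 digest of
the path"; that reading, and the function name, are my assumptions rather
than an existing Sling API.

```python
import hashlib


def format_sha1(path_info, levels=4):
    """Map a path to a hashed prefix like /ab/cd/ef/01.

    Each level uses two hex characters of the SHA-1 digest, so every
    level has 256 possible folder names.
    """
    digest = hashlib.sha1(path_info.encode("utf-8")).hexdigest()
    parts = [digest[i:i + 2] for i in range(0, 2 * levels, 2)]
    return "/" + "/".join(parts)


# The same input always maps to the same prefix, so the server can
# derive the storage location from the speakable URL on every request.
prefix = format_sha1("/site/physics101")
print(prefix + "/physics101")
```

The point of the sketch is that the mapping is cheap and deterministic, so
it could live on the server side rather than in the URL the UI developer
has to construct.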
E.g. "What! You mean I can't just put it at /site/xxx, I have to structure
the URL? But that's not what the users are saying they want; they want
to be able to decide what the URL to their site is and, btw, they don't
like using /site, they want /xxx, you know, like http://www.bbc.co.uk/radio4"
(I paraphrase a discussion of a few months ago.)
If there is some other categorization of messages, eg. like the
project or group or whatever they belong to, you can put them in the
project's folder and then do the substructure via the dates. If you
give the messages a nodetype + other metadata as properties, you can
search them across projects or months/years.
Sounds like if JCR-642 was fixed, none of this would be an issue?
Not really. First of all it's not just a "fix", it requires a complete
rewrite of the internal persistence architecture in Jackrabbit.
Something for a 3.0 maybe (and there are various ideas how to do that
and also improve other bottlenecks).
But even if Jackrabbit scales with hundreds of thousands of child nodes
per node, you still have the problem of an unbalanced tree: it will be
hard, not to say impossible, to browse that tree for a human (you'd
need a very advanced paging tree view to be able to go through that),
and it just doesn't "feel" right. Well, at least to me ;-)
Agreed, a list of all nodes at yy is explicitly not supported; we use
search to provide a number of different hierarchies into that space,
e.g.
date organized http://host/messages/yyyy/mm/dd.json
tag organized http://host/tags/sling-dev.json
with a default paging enforced just as any search engine does.
One point here is there are *multiple* views into the information set.
Sorry the message is so long, this is a real, possibly blocking issue
for us.
Ian
Regards,
Alex
--
Alexander Klimetschek
alexander.klimetsc...@day.com