Just sharing some thoughts before I lose them ...

Been looking at the case of a certain West African country with 40000
orgunits.   Which is actually not a huge number of objects and we could
very likely be looking at orgunit tables bigger than this as villages and
even households start finding their way into the tree.

Our orgunit tree (the one-true-tree) is maintained using parentid pointers.
 This is simple enough to maintain and updates and insertions are efficient.

Of course updates and insertions are relatively rare.  What we need to do
much more frequently is selecting various subtrees.  All the facilities in
a district, all the districts in a province etc.

Also the depth of our trees are relatively shallow.  Most places seem to
have around 5 levels.  There is some trend that this starts to increase,
but not exponentially. We might conceive one day of 8 or even 10 levels but
not 100s or 1000s of levels.  Its this shallowness which makes the
_orgunitstructure table viable as the number of columns in that table will
always be within a practical limit..

Selecting for tree structures (traversal) which are built using 'parentid'
is not very efficient.  Postgres offers 'WITH RECURSIVE' which is pretty
cool if you can get your head around it, but not supported in mysql (and
not necessarily fast either).

So when selecting subtrees the _orgunitstructure table is our best friend.
 I use it for the mydatamart aggregatedXXvalue queries and it looks like
Lars uses it in the scheduled datamart job as well as well.   And it is the
obvious table to use when implementing filtering as I am now looking at re
mydatamart export.

The problem is (with this and the other resource tables) how to maintain
integrity.  Currently we generate this table on demand (from the user) from
the parentid pointers.  If users forget to do it then all sorts of things
fail.  There has been some discussion on list of generating this, at least
nightly which would be an improvement.  There would be some benefit in
maintaining it dynamically, ie "triggered" during those relatively rare
updates and insertions of new orgunits.  One possible consequence of this
approach would be that the parentid on the orgunit becomes effectively
redundant. It could be argued that its *only* current use is to generate
the _orgunitstructure table.

A consequence of the parentid being redundant is that hierarchy is
maintained in a separate table to the orgunits themselves.  And there is no
reason why there should be only one _orgunitstructure table.  There could
be any number, limited only by the number of hierarchies we needed to
maintain.  There's a thought ...

Of course the other use for the parentid is when you are exporting a bundle
of orgunits in a way which reflects the (or a) hierarchy.  In which case it
is really kind to clients to serialize this tree as a breadth-first
traversal ie. they come out in order which is quick and easy to rebuild on
the client.  That is what I am looking at doing now to try and help out our
mydatamart metadata export.  And I will of course use the _orgunitstructure
table to make this trivial on the server side.

But some principles derived from the above discussion:
1.  try to avoid using parentid directly in code ... always link to the
orgunitstructuretable. This might ease its eventual demise
2.  all places where we export collections of orgunits should be in a
 proper traversal - breadth first or depth firt shouldn't matter really but
I'm going to do the latter.

Bob
_______________________________________________
Mailing list: https://launchpad.net/~dhis2-devs
Post to     : dhis2-devs@lists.launchpad.net
Unsubscribe : https://launchpad.net/~dhis2-devs
More help   : https://help.launchpad.net/ListHelp

Reply via email to