Re: [Dhis2-devs] Reconstructing a categoryoptioncombo (long story)

2015-01-29 Thread Jim Grace
Hi Bob,

Good question. I like the idea of an in-memory cache for speed, as you
suggest. You might try using a HashTable where the key is an array of
option value Strings and the value of the HashTable is the optionCombo. As
you process the import, each time the dataElement gives you a
categoryCombo you haven't seen before, get all the optionCombos for
this categoryCombo and put them into your HashTable. The order you put them
into the key array can be the same as the order of the
DataElementCategoryCombo.getCategories() method, since it returns a list.
When looking up a bunch of category values, just put them in the same order
into the array.
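
A minimal sketch of that lookup, under some assumptions: OptionCombo and OptionComboLookup are hypothetical stand-ins, not DHIS2 classes, and the key is a List<String> rather than a raw String[]. The swap matters because Java arrays hash and compare by identity, so two arrays with the same contents would not match as map keys, whereas lists compare by content.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for an option combo: its id plus one option per category, in category order. */
class OptionCombo {
    final int id;
    final List<String> optionsInCategoryOrder;

    OptionCombo(int id, String... optionsInCategoryOrder) {
        this.id = id;
        this.optionsInCategoryOrder = Arrays.asList(optionsInCategoryOrder);
    }
}

/** Lookup from an ordered list of option values to the matching option combo, for one category combo. */
class OptionComboLookup {
    // List keys compare by content; a String[] key would compare by identity and never match.
    private final Map<List<String>, OptionCombo> byOptions = new HashMap<>();

    OptionComboLookup(List<OptionCombo> combosOfOneCategoryCombo) {
        for (OptionCombo combo : combosOfOneCategoryCombo) {
            byOptions.put(combo.optionsInCategoryOrder, combo);
        }
    }

    /** Returns the combo for the given option values (in category order), or null if none matches. */
    OptionCombo find(String... optionsInCategoryOrder) {
        return byOptions.get(Arrays.asList(optionsInCategoryOrder));
    }
}
```

The caller has to pass the values in the same category order that was used when the lookup was built; swapping them gives a miss rather than a wrong match.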

Obviously once you've built the values-combo lookup, you will want to
reuse it as much as possible. You could put this into a
com.google.common.cache.Cache so that it can be reused not only by
subsequent records in the same import, but by other imports that come before
the cache entry ages out. The only danger of this in theory is that someone
could extend a category combo or add new option values, and then try an
import before the cache expires. Although this is extremely unlikely, you
can protect against it: If a values-combo lookup fails, remove the cached
HashTable for this categoryCombo and rebuild it. If it still fails, then
you've got a real error. :)
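
The rebuild-on-miss idea could look roughly like this. To keep the sketch free of dependencies, a ConcurrentHashMap stands in for com.google.common.cache.Cache, so entries here never age out; ComboLookupCache and the loader function are hypothetical names, and the loader is assumed to read the option combos of one categoryCombo from the database.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/**
 * Caches one values-to-combo map per category combo and rebuilds it once on a miss.
 * ConcurrentHashMap is a stand-in for Guava's Cache, so there is no expiry here.
 */
class ComboLookupCache {
    private final Map<String, Map<List<String>, Integer>> cache = new ConcurrentHashMap<>();
    // Hypothetical loader: given a category combo id, builds its values-to-combo map.
    private final Function<String, Map<List<String>, Integer>> loader;

    ComboLookupCache(Function<String, Map<List<String>, Integer>> loader) {
        this.loader = loader;
    }

    /** Returns the option combo id, or throws if it is still missing after one rebuild. */
    int resolve(String categoryComboId, List<String> optionsInCategoryOrder) {
        Integer id = cache.computeIfAbsent(categoryComboId, loader).get(optionsInCategoryOrder);
        if (id == null) {
            // The cached map may be stale (options added since it was built): drop it and reload once.
            cache.remove(categoryComboId);
            id = cache.computeIfAbsent(categoryComboId, loader).get(optionsInCategoryOrder);
        }
        if (id == null) {
            throw new IllegalArgumentException(
                "No option combo for " + optionsInCategoryOrder + " in " + categoryComboId);
        }
        return id;
    }
}
```

A miss on a freshly built map triggers one redundant reload in this version; that keeps the code short at the cost of an extra query for genuinely bad input.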

Cheers,
Jim


On Thu, Jan 29, 2015 at 1:38 PM, Bob Jolliffe bobjolli...@gmail.com wrote:

 Hi

 Here's a problem.  Apologies, it's a long mail, but it's a serious business
 and needs to be untangled.

 Two or more systems have matching dataelements, categorycombos, categories
 and categoryoptions.  They could be matched on uid, name, code or
 whatever.  Assuming they also have matching orgunit identifiers, those two
 systems should be able to exchange data.  There is really no need for
 either of them to know anything about the other's categoryoptioncombos.
 Which is a good thing on a number of fronts.  Not least being that if
 either one of the two is not dhis2 then it won't have the faintest notion
 of a categoryoptioncombo anyway.  And even if they were both dhis2, we all
 know that keeping these catoptcombos in sync is notoriously difficult.

 So I've been over some of this ground before, but now thinking about
 implementation, there are some missing pieces in our model (and some
 shortcomings of the java language) which make this a bit trickier than it
 should be.  Picture this datavalue being imported (using codes for
 legibility):

 <datavalue dataElement='MalariaCases' sex='M' age='under5' ... />

 1.  Once we know the dataelement we can immediately retrieve the
 categorycombo, which tells us to expect two more attributes: sex and age in
 this case.

 2.  We could go to the database at this point and query from the
  categoryoptioncombos_categoryoptions table, having first retrieved the
 primary ids for the categoryoptions.  This would certainly work, but the
 table might be quite big and the query would be required many times for a
 large datavalueset.  Given that we know the categorycombo from 1 above, we
 should only need to query from a very much smaller set of data contained in
 an in-memory data structure.

 3.  But what would such a data structure look like?  Essentially what is
 required is a multidimensional associative array which is keyed along each
 of its dimensions using the categoryoptions of a category.  For most of our
 categorycombos this would be a 1 or 2 dimensional array, but with some
 rarer cases of 3 or 4 categories.  That would allow lookups of the sort
 getCatOptCombo(sex='M', age='u5', ...)

 Such a dynamic associative array is a natural paradigm in languages like
 perl, tcl, php, javascript, and probably R, but java leaves us a bit short.
 The structure is not easily expressed, at least not efficiently.

 4.  One alternative is to model it as a tree structure.  This has a minor
 drawback that a tree has to put the categories (the layers of the tree) in
 some order which is not implicit in our model, but that's not a very big
 problem.  If you know the order they were put in, you can use the same
 order to search them out.  A bit of xml below shows more or less what the
 structure of that tree would be like for a typical age-sex combo:

 <categoryCombo name="bhj" id="hjhkjkj" code="kmjkl">
   <category name="sex">
     <categoryOption name="Male">
       <category name="Age">
         <categoryOption name="under5">
           <catoptcombo name="(Male/under5)" id="767866"/>
         </categoryOption>
       </category>
       <category name="Age">
         <categoryOption name="over5">
           <catoptcombo name="(Male/over5)" id="ghuy8y"/>
         </categoryOption>
       </category>
     </categoryOption>
     <categoryOption name="Female">
       <category name="Age">
         <categoryOption name="under5">
           <catoptcombo name="(Female/under5)" id="767876"/>
         </categoryOption>
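
The tree from point 4 might be sketched in Java as nodes holding a map of children, one level per category and one edge per categoryoption. ComboTreeNode is a hypothetical name, and the ids are the sample values from the XML; this is an illustration of the structure, not existing DHIS2 code.

```java
import java.util.HashMap;
import java.util.Map;

/** One node of the lookup tree: each level corresponds to one category, each edge to an option. */
class ComboTreeNode {
    private final Map<String, ComboTreeNode> children = new HashMap<>();
    private String comboId; // set only on leaves

    /** Adds a path of options (in the agreed category order) ending at the given combo id. */
    void insert(String comboId, String... optionsInCategoryOrder) {
        ComboTreeNode node = this;
        for (String option : optionsInCategoryOrder) {
            node = node.children.computeIfAbsent(option, key -> new ComboTreeNode());
        }
        node.comboId = comboId;
    }

    /** Walks the same path; returns null if any option along it is unknown. */
    String find(String... optionsInCategoryOrder) {
        ComboTreeNode node = this;
        for (String option : optionsInCategoryOrder) {
            node = node.children.get(option);
            if (node == null) {
                return null;
            }
        }
        return node.comboId;
    }
}
```

As noted in point 4, the only requirement is that inserts and lookups list the options in the same category order.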
 

Re: [Dhis2-devs] Reconstructing a categoryoptioncombo (long story)

2015-01-29 Thread Bob Jolliffe
Yes I've been thinking a bit about HashTables and HashMaps.  Effectively
that's the closest thing you get to an associative array, and it can
simulate a tree.  If we concede the ordering of keys then a simple
HashMap<String,Integer> would even do the trick if we concatenate the
categoryoption identifiers together, for example (for a 2 dimensional
categorycombo):

catcomboMap.put("HbIugXRzrcK.mBl2TUqeODx", 43);
catcomboMap.put("HbIugXRzrcK.h5uidbruf8nJ", 44);


where the value is the internal id of the catoptcombo, which is all we
need to perform the insert.
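
Wrapped up as a small class, the concatenated-key idea might look like the sketch below. ConcatenatedKeyLookup and its method names are hypothetical. One assumption to flag: the separator must not occur inside the identifiers. That holds for uids, which are alphanumeric, but names or codes could contain a dot, so a different separator or a list key would be safer there.

```java
import java.util.HashMap;
import java.util.Map;

/** Maps dot-joined category option identifiers to the internal id of the category option combo. */
class ConcatenatedKeyLookup {
    private final Map<String, Integer> catcomboMap = new HashMap<>();

    /** Joins the identifiers with '.'; assumes no identifier contains a dot itself. */
    static String key(String... optionIdsInCategoryOrder) {
        return String.join(".", optionIdsInCategoryOrder);
    }

    void put(int comboId, String... optionIdsInCategoryOrder) {
        catcomboMap.put(key(optionIdsInCategoryOrder), comboId);
    }

    /** Returns the internal combo id, or null if the combination is unknown. */
    Integer get(String... optionIdsInCategoryOrder) {
        return catcomboMap.get(key(optionIdsInCategoryOrder));
    }
}
```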

This is effectively a tree with keys on the branches :-) A more efficient
variant would be some sort of radix tree like what routers use to process
IP addresses, but maybe the above would be good enough.

We could leave these things around in a cache, but I doubt if it's a huge
overhead to build up a few small maps during the import. One reason for
doing them on the fly is that you need to cater for different identifier
schemes being used.  So the above will work for uids, but if the import was
using names or codes you'd want to make different maps.
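
To illustrate the point about identifier schemes, here is a rough sketch that builds a separate map per scheme from the same combos. IdScheme, CatOption and SchemeLookupBuilder are all hypothetical names, and the input shape (internal combo id mapped to its options in category order) is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Identifier schemes an import might use (illustrative names). */
enum IdScheme { UID, CODE, NAME }

/** Hypothetical minimal category option carrying all three identifiers. */
class CatOption {
    final String uid;
    final String code;
    final String name;

    CatOption(String uid, String code, String name) {
        this.uid = uid;
        this.code = code;
        this.name = name;
    }

    String identifier(IdScheme scheme) {
        switch (scheme) {
            case UID:  return uid;
            case CODE: return code;
            default:   return name;
        }
    }
}

class SchemeLookupBuilder {
    /**
     * Builds the key-to-combo-id map for one scheme. The input maps each internal
     * combo id to its options, listed in category order.
     */
    static Map<List<String>, Integer> build(Map<Integer, List<CatOption>> combos, IdScheme scheme) {
        Map<List<String>, Integer> lookup = new HashMap<>();
        for (Map.Entry<Integer, List<CatOption>> entry : combos.entrySet()) {
            List<String> key = new ArrayList<>();
            for (CatOption option : entry.getValue()) {
                key.add(option.identifier(scheme));
            }
            lookup.put(key, entry.getKey());
        }
        return lookup;
    }
}
```

An import using codes would then call build(combos, IdScheme.CODE) the first time it meets a categorycombo, and a uid-based import would get its own map from the same data.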

Agree with what you say about constructing them on encountering each new
categorycombo.  This is also what I was suggesting.
