Does anyone have advice about how to tackle a problem where sparse arrays
would be a good implementation in principle, but not in practice?

This particular problem comes from a group of colleagues that compiles
statistics.
In the proposed data model there will be about 13 dimensions.

The cardinalities of three of these dimensions are around 400.
These dimensions represent countries - either individually or grouped.
The remaining dimensions have cardinalities of between 3 and 30.

The data is very sparse - probably only 3 dimensions will be dense.
None of the high cardinality dimensions are dense.

Time period (i.e. year and month) is an additional dimension but does not
present an issue because data for each period can quite naturally
be held in its own file.


The types of operations are simple -
(i) storage and retrieval of selections for display in Excel etc
(ii) totaling and subtotaling up most of the dimensions (e.g. aggregating
countries).
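
To make the two operations concrete, here is a minimal Python sketch (not J).
It assumes the observations are held as a dictionary keyed by coordinate
tuples; the names select and total_over and the sample cells are hypothetical,
invented purely for illustration.

```python
from collections import defaultdict

def select(cells, axis, wanted):
    """Keep only the cells whose coordinate on `axis` is in `wanted`."""
    return {c: v for c, v in cells.items() if c[axis] in wanted}

def total_over(cells, axis):
    """Sum the values over one axis, dropping that coordinate from the key."""
    out = defaultdict(float)
    for coords, value in cells.items():
        reduced = coords[:axis] + coords[axis + 1:]
        out[reduced] += value
    return dict(out)

# Hypothetical data: (product, country, flow) -> value
cells = {(0, 5, 1): 10.0, (0, 7, 1): 2.5, (3, 5, 2): 4.0}

country_5 = select(cells, axis=1, wanted={5})   # retrieval of a selection
by_others = total_over(cells, axis=1)           # aggregate the country axis
```

Only the stored observations are visited, so the cost depends on the number of
observations rather than on the product of the cardinalities.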

A J-sparse array implementation would have 10 sparse axes.
This means that for every observation there would be 10 extra numbers (i.e.
integers), so for each 8 bytes of useful data there need to be 80 bytes of
support (10 x 8 bytes in J64).

The problem comes from the fact that for each period there may be
between 10 and 50 million observations.

Assuming that each element in the index array for a sparse noun
uses 8 bytes then this implies a memory requirement of 800 - 4000 MB for
each period.
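
The arithmetic behind that estimate, written out as a small Python sketch.
The 8 bytes per index element is the assumption stated above, not a confirmed
property of the J implementation, and the function name is made up for this
example.

```python
SPARSE_AXES = 10       # 13 dimensions in total, about 3 of them dense
BYTES_PER_INDEX = 8    # assumed: one 64-bit integer per sparse axis

def index_overhead_bytes(observations,
                         sparse_axes=SPARSE_AXES,
                         bytes_per_index=BYTES_PER_INDEX):
    """Bytes of index data needed to locate `observations` stored values."""
    return observations * sparse_axes * bytes_per_index

for n in (10_000_000, 50_000_000):
    mb = index_overhead_bytes(n) / 1_000_000
    print(f"{n:,} observations -> {mb:,.0f} MB of indices")
```

This counts the index array only; the 8 bytes of data per observation come on
top of it.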

If it's really true that an element for each index in each sparse dimension
needs 8 bytes then the sparse implementation is quite inefficient.

A way around this could be to combine several dimensions
using #. and #: (something old-time APL programmers did
using encode and decode).

Using this trick the number of sparse dimensions could be reduced to 3 or 4.
While this would reduce space requirements it introduces lots of complexity.
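
For anyone unfamiliar with the trick, this is roughly what it amounts to,
sketched in Python rather than J. The grouping of the three country dimensions
into one axis is only a hypothetical example; encode plays the role of #. and
decode the role of #: with the cardinalities as the left argument.

```python
def encode(coords, radices):
    """Pack one coordinate per dimension into a single integer (cf. J's #.)."""
    if len(coords) != len(radices):
        raise ValueError("need exactly one coordinate per radix")
    key = 0
    for c, r in zip(coords, radices):
        if not 0 <= c < r:
            raise ValueError(f"coordinate {c} out of range for radix {r}")
        key = key * r + c
    return key

def decode(key, radices):
    """Recover the coordinates from a packed key (cf. J's #:)."""
    coords = []
    for r in reversed(radices):
        key, c = divmod(key, r)
        coords.append(c)
    return tuple(reversed(coords))

# Hypothetical grouping: three country dimensions of cardinality 400 each.
COUNTRY_RADICES = (400, 400, 400)
key = encode((12, 399, 7), COUNTRY_RADICES)
assert decode(key, COUNTRY_RADICES) == (12, 399, 7)
```

The complexity shows up in exactly this bookkeeping: every selection or
subtotal on one of the original dimensions has to decode the combined key
first.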

As things stand, sparse arrays are not supported by mapped nouns.

Given that the source is now available, how practical would it be to
implement mapped noun support for sparse arrays? And if it is practical,
are we talking days or months?

Regards
David