Thanks Michael.

> This sounds very similar to NoSQL and Map/Reduce?

I'm not so sure about that (which may be mostly due to my ignorance of
NoSQL and Map/Reduce). The amount of data involved in my problem is
quite small and any infrastructure aimed at massive scaling may bring
a load of conceptual and implementation baggage that is unnecessary/
unhelpful.

Let me restate my problem:

I have a bunch of statistician colleagues with minimal programming
skills. (I am also a statistician, but with slightly better
programming skills.) As part of our analytical workflow we take data
sets and preprocess them by adding new variables that are typically
aggregate functions of other values. We source the data from a
database/file, add the new variables, and store the augmented data in
a database/file for subsequent, extensive analysis (over a couple of
months) with other tools (off-the-shelf statistical packages such as
SAS and R). After the analyses are complete, some subset of the
preprocessing calculations needs to be implemented in an operational
environment. This is currently done by completely re-implementing
them in yet another fairly basic imperative language.

The preprocessing in our analytical environment is usually written in
a combination of SQL and the SAS data manipulation language (think of
it as a very basic imperative language with macros but no user-defined
functions). The statisticians take a long time to get their
preprocessing right (they're not good at nested queries in SQL and
make all the usual errors iterating over arrays of values with
imperative code). So my primary goal is to find/build a query language
that minimises the cognitive impedance mismatch with the statisticians
and minimises their opportunity for error.
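
To make that concrete, and purely as a hypothetical sketch (not any
existing library), the style of thing I imagine the statisticians
writing is one declaration per derived variable - a predicate that
selects the contributing values plus an aggregate function. In
Clojure it might look something like this:

  ;; Hypothetical sketch only - not an existing library.
  ;; A derived variable = a selection predicate + an aggregate function.
  (def derived-vars
    {:max-balance-12m {:select    (fn [row] (and (= (:kind row) :balance)
                                                 (<= (:months-ago row) 12)))
                       :of        :value
                       :aggregate (fn [xs] (apply max xs))}
     :total-spend-3m  {:select    (fn [row] (and (= (:kind row) :spend)
                                                 (<= (:months-ago row) 3)))
                       :of        :value
                       :aggregate (fn [xs] (reduce + 0 xs))}})

  (defn preprocess
    "Evaluate every derived-variable definition against one case
     (a collection of row maps). Returns a map of derived values."
    [case-rows]
    (into {}
          (for [[var-name {:keys [select of aggregate]}] derived-vars]
            [var-name (aggregate (map of (filter select case-rows)))])))

The point is that the statisticians would only ever write the
selection predicate and name the aggregate; all the iteration stays
hidden. (Edge cases such as empty selections are glossed over here.)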

Another goal is that the same mechanism should be applicable in our
statistical analytical environment and the corporate deployment
environment(s). The operational environment that differs most from
ours is online and realtime: the data describing one case gets thrown
at some code that (among other things) implements the preprocessing
with some embedded imperative code. So linking in some Java bytecode
to do the preprocessing on a single case sounds feasible, whereas
replacing/augmenting the current corporate infrastructure with NoSQL
and a CPU farm is more aggravation with corporate IT than I am paid
for.
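
One attraction of Clojure here is that it compiles to ordinary JVM
bytecode, so the same per-case function could in principle be handed
to the operational Java code as an AOT-compiled class. A rough sketch
(the class and method names are invented, and preprocess stands for
whatever the real per-case function turns out to be):

  ;; Sketch only: expose the per-case preprocessing to Java callers
  ;; via AOT compilation. Class and method names are invented.
  (ns preprocess.core
    (:gen-class
      :name preprocess.Preprocessor
      :methods [#^{:static true} [deriveVars [java.util.List] java.util.Map]]))

  (declare preprocess) ; the per-case function, as sketched above

  (defn -deriveVars
    "Static entry point for the operational Java code: a case goes in
     as a java.util.List of java.util.Map rows, the derived values
     come back as a java.util.Map. Key conversion is glossed over."
    [case-rows]
    (java.util.HashMap. (preprocess (map #(into {} %) case-rows))))

The Java side would then just call
preprocess.Preprocessor.deriveVars(caseRows) once per case as it
arrives.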

The final goal is that the preprocessing mechanism should be no slower
than the current methods in each of the deployment environments. The
hardest one is probably in our statistical analysis environment, but
there we do have the option of farming the work across multiple CPUs
if needed.

Let me describe the computational scale of the problem - it is really
quite small.

Data is organised as completely independent cases. One case might
contain 500 primitive values for a total size of ~1KB. Preprocessing
might calculate another 500 values, each of those being an aggregate
function of some subset (say, 20 values) of the original 500 values.
Currently, all these new values are calculated independently of each
other, but there is a lot of overlap of intermediate results and,
therefore, potential for optimisation of the computational effort
required to calculate the entire set of results within a single case.
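
For example (an invented illustration), if many of the new values
need the same filtered subset of the raw values, that subset could be
computed once per case and shared. In Clojure the hand-optimised
version is just a let binding:

  ;; Invented illustration of sharing intermediate results within one
  ;; case. case-values is a collection of maps, one per raw value.
  (defn derive-all [case-values]
    (let [last-12 (filter #(<= (:months-ago %) 12) case-values) ; shared
          amounts (map :value last-12)]                         ; shared
      {:n-txns-12m (count last-12)
       :total-12m  (reduce + 0 amounts)
       :max-12m    (when (seq amounts) (apply max amounts))
       :mean-12m   (when (seq amounts)
                     (/ (reduce + 0 amounts) (count last-12)))}))

The interesting question is whether a declarative query language
could discover that sharing automatically across the full set of
definitions, rather than relying on someone to spot it and write the
let by hand.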

In our statistical analytical environment the preprocessing is carried
out in batch mode. A large dataset might contain 1M cases (~1GB of
data). We can churn through the preprocessing at ~300 cases/second on
a modest PC.  Higher throughput in our analytical environment would be
a bonus, but not essential.
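
Given that the cases are completely independent, the batch side looks
embarrassingly parallel. A naive Clojure sketch, just to show the
shape of it (read-cases and write-case! are placeholders for whatever
the real I/O turns out to be, and derive-all is the per-case function
sketched above):

  ;; Sketch of the batch run: map the per-case function over the
  ;; dataset, in parallel if one core is not enough.
  (defn preprocess-batch [cases]
    (pmap derive-all cases))   ; or plain map on a single core

  (comment
    ;; read-cases and write-case! are placeholders, not real functions.
    (doseq [result (preprocess-batch (read-cases "big-dataset"))]
      (write-case! result)))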

So I see the problem as primarily about the conceptual design of the
query language, with some side constraints about implementation
compatibility across a range of deployment environments and adequate
throughput performance.

As I mentioned in an earlier post, I'll probably assemble a collection
of representative queries, express them in a variety of query
languages, and try to assess how compatible the different query
languages are with the way my colleagues want to think about the
problem.

Ross



On Oct 3, 11:31 am, Michael Ossareh <ossa...@gmail.com> wrote:
> On Fri, Oct 1, 2010 at 17:55, Ross Gayler <r.gay...@gmail.com> wrote:
> > Hi,
>
> > This is probably an abuse of the Clojure forum, but it is a bit
> > Clojure-related and strikes me as the sort of thing that a bright,
> > eclectic bunch of Clojure users might know about. (Plus I'm not really
> > a software person, so I need all the help I can get.)
>
> > I am looking at the possibility of finding/building a declarative data
> > aggregation language operating on a small relational representation.
> > Each query identifies a set of rows satisfying some relational
> > predicate and calculates some aggregate function of a set of values
> > (e.g. min, max, sum). There might be ~20 input tables of up to ~1k
> > rows.  The data is immutable - it gets loaded and never changed. The
> > results of the queries get loaded as new rows in other tables and are
> > eventually used as input to other computations. There might be ~1k
> > queries. There is no requirement for transaction management or any
> > inherent concurrency (there is only one consumer of the results).
> > There is no requirement for persistent storage - the aggregation is
> > the only thing of interest. I would like the query language to map as
> > directly as possible to the task (SQL is powerful enough, but can get
> > very contorted and opaque for some of the queries). There is
> > considerable scope for optimisation of the calculations over the total
> > set of queries as partial results are common across many of the
> > queries.
>
> > I would like to be able to do this in Clojure (which I have not yet
> > used), partly for some very practical reasons to do with Java interop
> > and partly because Clojure looks very cool.
>
> > * Is there any existing Clojure functionality which looks like a good
> > fit to this problem?
>
> > I have looked at Clojure-Datalog. It looks like a pretty good fit
> > except that it lacks the aggregation operators. Apart from that the
> > deductive power is probably greater than I need (although that doesn't
> > necessarily cost me anything).  I know that there are other (non-
> > Clojure) Datalog implementations that have been extended with
> > aggregation operators (e.g. DLV,
> > http://www.mat.unical.it/dlv-complex/dlv-complex).
>
> > Tutorial D (what SQL should have been,
> > http://en.wikipedia.org/wiki/D_%28data_language_specification%29#Tuto...)
> > might be a good fit, although once again, there is probably a lot of
> > conceptual and implementation baggage (e.g. Rel,
> > http://dbappbuilder.sourceforge.net/Rel.php)
> > that I don't need.
>
> > * Is there a Clojure implementation of something like Tutorial D?
>
> > If there is no implementation of anything that meets my requirements
> > then I would be willing to look at the possibility of creating a
> > Domain Specific language.  However, I am wary of launching straight
> > into that because of the probability that anything I dreamed up would
> > be an ad hoc kludge rather than a semantically complete and consistent
> > language. Optimised execution would be a whole other can of worms.
>
> > * Does anyone know of any DSLs/formalisms for declaratively specifying
> > relational data aggregations?
>
> > Thanks
>
> > Ross
>
> This sounds very similar to NoSQL and Map/Reduce?
> http://www.basho.com/Riak.html
>
> Where your predicate is a reduce fn?
