We have just released flightcaster.com which uses statistical inference and
machine learning to predict flight delays in advance of airlines (initial
results appear to do so with 85 - 90 % accuracy.)
The webserver and webapp are all rails running on the Heroku platform; which
also serves our blackberry and iphone apps.
The research and data heavy lifting is all in Clojure.
Distributed data mining is done via a custom layer on top of cascading
(which is a layer on top of hadoop.) All run on EC2 and S3 using the very
nice cloudera AMIs and deployment scripts.
In addition to the machine learning, the layer atop cascading performs all
the complex data data filtering and transformation operations; including
distributed joins from heterogeneous data sources and transformations into a
time series view that is fed to the machine learning computations that are
rolled into mappers and reducers. Remember, this is data from airlines and
the FAA, it is not pretty. Web data is messy but we have lots of good
frameworks, libs and sanitizers for web data.
We wrapped cascading in a thin layer that we use to wrap clojure functions
in the cascading function objects and inject those into individual steps in
the workflows. This gets us very close to normal function composition for
the client code. Ultimately, we want to be able to do normal function
composition to compose cascading workflows in the same way as we would would
do vanilla function composition for small test runs on our local machines.
This is an execution agnostic programming model; client code doesn't bear
the signs of distributed execution.
As a beneficial side effect, we found that this model forces us to have more
fine grained abstractions - because each operation must be ultimately be
injectable into a map-reduce phase, otherwise your paralleizm will be
unnecessarily course grained. This steers us clear of monolithic uber
Another aspect of the design that allows us to do this is that the data
transformations write out clojure data structure literals, so we are
entirely insulated from the normal hadoop input/output formats...the wrapper
layer just uses the normal clojure reader to read in the strings from
hadoopand apply the vanilla
clojure functions to the data structures. But we are not limited to only
clojure data structure literals. We also inject other readers that can read
other strings to clojure data structures, for example. we use Dan
json lib for the initial reads of the raw json data we store.
All the analytical code is custom, so we don't use many 3rd party libs
outside of cascading, hadoop, the invaluable jets3t for working with s3.
Oh, and of course, - since we do so much with temporal analysis - joda-time
is the only way to work with dates in a sane way on the jvm. :-)
If you travel a lot, check us out: flightcaster.com ... we have iphone and
blackberry apps. Unfortunately this is domestic US air travel only at the
moment due to the difficulty of of obtaining data for international carriers
and aviation agencies.
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to email@example.com
Note that posts from new members are moderated - please be patient with your
To unsubscribe from this group, send email to
For more options, visit this group at