Joe Bogner wrote:

>Can anyone share specific examples where it was needed to scale out to
>multiple cores and machines? I am interested in learning about the types
>of problems this would be applied to. I have read some examples while
>researching but haven't run into anyone who has.

>For example, last week I had to create a database of the best 100,000
>solutions out of 56 billion combinations as part of a work deliverable.
>I am sure there may have been more elegant solutions, but brute forcing
>with 4 instances of R and 32 GB of RAM took 3 hours, which was fine.

Here are a couple I've run into:
Marketing problems typically involve running some kind of clustering and/or
survival analysis on a billion or more rows with up to dozens of explanatory
variables (channel, geo, time, past behavior, etc.). This has gotten a lot of
attention in recent years, and is what most commercial "big data" work at
websites is concerned with: you want to increase ad click-through rates, or
sift through remainder ads looking for stuff that can be targeted. You can
sample, but you might miss a lot of the most interesting stuff.
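To make the sampling point concrete, here's roughly what the chunked version
looks like. This is a Python/scikit-learn sketch rather than the R I'd
actually use, and the chunk generator and sizes are made up; the point is
just that minibatch k-means can eat the full table a block at a time instead
of clustering one small sample.

    # Sketch only: fake chunks stand in for reading blocks of the real table.
    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    km = MiniBatchKMeans(n_clusters=20, random_state=0)

    def chunks(n_chunks=10, rows=100_000, n_features=30):
        # stand-in for reading successive blocks of the real data
        rng = np.random.default_rng(0)
        for _ in range(n_chunks):
            yield rng.normal(size=(rows, n_features))

    for block in chunks():
        km.partial_fit(block)            # update centroids one block at a time

    print(km.cluster_centers_.shape)     # (20, 30)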

There are plenty of forecasting problems consisting of a few hundred thousand
or a few million highly seasonal channels with unknown but strong relationships
(basically hierarchical clustering on ~500,000 channels * 700 days, or 700 * 90
if they are evil and want shorter time intervals), then standard forecasting on
some reasonably sized aggregates. For this, I found a server with 512 GB and
just let 'er rip in R; but not every company has a big server like that, I
don't keep one 'round the house (and I might be contractually forbidden from
using it anyway), and there are always bigger problems.
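In miniature, that clustering step looks something like the following
(Python/SciPy here instead of R, with a toy 2,000 x 700 matrix standing in for
the real ~500,000 channels; the seasonal signal is invented):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(1)
    days = np.arange(700)
    base = np.sin(2 * np.pi * days / 365.25)               # shared yearly seasonality
    channels = base + 0.5 * rng.normal(size=(2000, 700))   # 2,000 noisy channels

    Z = linkage(channels, method="ward")                   # hierarchical clustering
    groups = fcluster(Z, t=50, criterion="maxclust")       # cut the tree into 50 groups

    # forecast each group's aggregate instead of every raw channel
    aggregates = np.array([channels[groups == g].sum(axis=0)
                           for g in np.unique(groups)])
    print(aggregates.shape)                                 # (50, 700)

The pairwise distances are what blow up at the real scale, which is why you
end up wanting a huge box or a way to spread the work out.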

Ones I haven't done, but am aware of:
Insurance companies and other groups interested in probabilities do things like 
run logistic regression (or relatives) on terascale databases.
Social advertising companies often need to build giant social network graphs.
These are pretty easy to do conceptually in a SQL-like thing (I'm pretty sure
this is what Hive was invented for); if you had something better than a
SQL-like thing, you could do something better.
Recommendation engines: these often do PCA- or SVD-type things on very large
data sets (the Netflix stuff); a toy sketch follows this list.
Document classification on very large databases (Yandex is probably doing this 
internally, but I know others have the problem).
Real-time ad serving on cell phones involves all kinds of big data problems:
reconstructing the geographic paths taken by the phone, correlating them with
local things and with the past behavior of the phone's owner and of other
owners with similar habits. I don't think this has been done well yet (my
cell phone has no screen, so I don't see any creepy ads), but it's definitely
being worked on.
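On the recommendation-engine item, the core computation is basically a
truncated SVD of a users-by-items matrix. A toy version (the 1,000 x 200 shape
and rank 10 are invented, and it uses a dense numpy solver; the real thing is
millions of users and needs sparse/iterative methods):

    import numpy as np

    rng = np.random.default_rng(2)
    ratings = rng.integers(0, 6, size=(1000, 200)).astype(float)  # users x items

    U, s, Vt = np.linalg.svd(ratings, full_matrices=False)
    k = 10                                      # keep the top 10 latent factors
    approx = (U[:, :k] * s[:k]) @ Vt[:k]        # low-rank reconstruction

    # predicted scores for user 0 on the first few items
    print(approx[0, :5])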


The simplest thing is just "big regression" or "big classification" of some 
kind. It's easy to construct an artificial data set. Stochastic gradient 
descent logistic regression might make a useful test algorithm against Mahout.
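If anyone wants to try that, here is roughly what I mean, as a plain numpy
sketch (the sizes, step size, and single pass are arbitrary; the idea would be
to generate the same artificial file for Mahout's SGD implementation to chew
on):

    import numpy as np

    rng = np.random.default_rng(3)
    n, d = 200_000, 20
    true_w = rng.normal(size=d)
    X = rng.normal(size=(n, d))
    p_true = 1.0 / (1.0 + np.exp(-X @ true_w))
    y = (rng.random(n) < p_true).astype(float)      # artificial labels

    w = np.zeros(d)
    lr = 0.1
    for i in rng.permutation(n):                    # one stochastic pass over the rows
        p = 1.0 / (1.0 + np.exp(-X[i] @ w))
        w += lr * (y[i] - p) * X[i]                 # per-example gradient step

    print(np.corrcoef(w, true_w)[0, 1])             # correlation with the true weights (should be high)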

-SL