Throwing away 99.99% of your data and analyzing what's left, many times over, to see the distribution of the analysis results is probably one of the most effective ways of analyzing very large amounts of data. If you have a consensus, you're in good shape; if not, your analysis has some issues. Best of all, this is an embarrassingly parallel problem after the sampling step.
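A minimal Julia sketch of that subsample-and-repeat idea (the data and the `analyze` function here are made-up stand-ins, not anything from the thread):

```julia
using Statistics

# Stand-in "analysis": here, just the mean of the sampled values.
analyze(sample) = mean(sample)

# Draw `nreps` independent random subsamples of fraction `frac`
# (with replacement) and run the analysis on each one.
function subsample_estimates(data; frac = 1e-4, nreps = 100)
    n = max(1, round(Int, frac * length(data)))
    [analyze(data[rand(1:length(data), n)]) for _ in 1:nreps]
end

data = randn(10_000_000)           # stand-in for the "very large" data set
estimates = subsample_estimates(data)

# If the estimates agree (small spread), the analysis looks stable;
# the nreps analyses are embarrassingly parallel (e.g. via pmap).
println("mean = ", mean(estimates), "  std = ", std(estimates))
```
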
On Wed, Jun 11, 2014 at 12:11 PM, John Myles White <[email protected]> wrote:

> Stefan's half-joking suggestion is probably the most statistically valid approach. It's unfortunately somewhat difficult to do with some systems (Hive would be one example), but it's the right way to go for most applications.
>
> I've seen a lot of people write a lot of SGD code that doesn't randomize the data, but just works with it in the order it's found. That can work out alright in some cases, but means that you lose all the nice theoretical guarantees people go to lengths to prove.
>
> For really large-scale learning of simple models, it's also possible to parallelize the evaluation of the log likelihood function and do optimization on it. For some models, this can work out quite well: almost all the run time of the algorithm is the log likelihood evaluation, which means that the network latency of passing parameters around isn't unforgivable.
>
> — John
>
> On Jun 10, 2014, at 10:21 PM, Stefan Karpinski <[email protected]> wrote:
>
> Throw away 99.99% of your data and analyze what's left ;-)
>
> On Jun 10, 2014, at 8:02 PM, Christopher Fusting <[email protected]> wrote:
>
> Yes, at that size I begin to see your point :). I think the current implementation is useful in the realm of gigs, not petabytes.
>
> Out of curiosity, what algorithm would you choose to handle such a large data set?
>
> _Chris
>
> On Tuesday, June 10, 2014 7:45:40 PM UTC-4, John Myles White wrote:
>>
>> Suppose you have 10 PB of data. What approach are you going to use to do this partitioning? Where are you going to store the permuted data? Are you going to delete the raw data or just pay for twice the storage capacity?
>>
>> -- John
>>
>> On Jun 10, 2014, at 4:28 PM, Christopher Fusting <[email protected]> wrote:
>>
>> Thanks for looking into this, John.
>>
>> I'm not sure if I follow your first point exactly? Randomly partitioning the data up front and sending it to each process incurs minimal network overhead and thus seems feasible.
>>
>> Your second point is absolutely correct, and I agree this does limit the theoretical application of this algorithm. However, in the authors' experiments they do assess the algorithm on the unbounded squared error despite this limitation. Perhaps it's worth testing it empirically anyway.
>>
>> _Chris
>>
>> On Monday, June 9, 2014 10:54:16 PM UTC-4, John Myles White wrote:
>>>
>>> Thanks for the link, Chris. Haven't made it through everything (the proof technique is pretty advanced), but there do seem to be a few significant limitations for practical applications. Two I noticed are:
>>>
>>> (1) Assumption that data is present across machines in a randomized order. Definitely not realistic on a standard Hadoop installation.
>>>
>>> (2) Assumption that both the cost function and its gradient are strictly bounded. Not true for lots of interesting models, including linear regression.
>>>
>>> Still very cool work (as to be expected given the authors).
>>>
>>> — John
>>>
>>> On Jun 9, 2014, at 10:53 AM, Christopher Fusting <[email protected]> wrote:
>>>
>>> The results in the paper from which this algorithm was implemented are encouraging: http://www.research.rutgers.edu/~lihong/pub/Zinkevich11Parallelized.pdf
>>>
>>> The proof is a bit beyond me, so I cannot vouch for the theory. I'm excited to test this on some non-trivial problems to see how it fares.
>>>
>>> _Chris
>>>
>>> On Monday, June 9, 2014 12:19:39 PM UTC-4, John Myles White wrote:
>>>>
>>>> My question is about the theory behind your algorithm. My understanding is that no parallel SGD implementation (except one that trivially runs on the same data) will produce correct results in general. Is that not true?
>>>>
>>>> -- John
>>>>
>>>> On Jun 9, 2014, at 9:07 AM, Christopher Fusting <[email protected]> wrote:
>>>>
>>>> John,
>>>>
>>>> There has been no rigorous testing yet. My primary concerns with the averaging algorithm are process latency, completion time, and faults. Do you have specifics you would like to share?
>>>>
>>>> _Chris
>>>>
>>>> On Mon, Jun 9, 2014 at 11:24 AM, John Myles White <[email protected]> wrote:
>>>>
>>>>> Very cool, Chris.
>>>>>
>>>>> I've done a lot of work on SGD in Julia, so I'm glad to see more.
>>>>>
>>>>> Regarding the averaging technique you're using, have you done much testing to see how well it works? My sense is that the algorithm you're using is a little brittle, but perhaps I've misunderstood it.
>>>>>
>>>>> — John
>>>>>
>>>>> On Jun 8, 2014, at 11:36 AM, Christopher Fusting <[email protected]> wrote:
>>>>>
>>>>> Hi everyone. I've been playing around with Julia for a while now and have implemented Parallel Stochastic Gradient Descent. This is my first Julia project (and attempt at implementing this algorithm), so it's not perfect, but I think I have a good start and wanted to share it: https://github.com/cfusting/PSGD.jl. I welcome any feedback.
>>>>>
>>>>> Eventually I'd like to integrate the package with DataFrames and do a little optimization, especially on the algorithm that partitions the data.
>>>>>
>>>>> _Chris
>>>>
>>>> --
>>>> Christopher W. Fusting
>>>> *Software Developer / Analyst*
>>>>
>>>> @cfusting
>>>> 828-772-0012
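For reference, a rough Julia sketch of the parameter-averaging scheme the thread is discussing (one SGD run per data shard, then average the weight vectors). The function names and least-squares objective are illustrative only, not taken from PSGD.jl or the paper:

```julia
using LinearAlgebra, Random, Statistics

# One SGD run (least squares) over a single shard of rows.
function sgd_shard(X, y; eta = 0.01, epochs = 5)
    w = zeros(size(X, 2))
    for _ in 1:epochs, i in shuffle(1:length(y))   # shuffle within the shard
        g = (dot(X[i, :], w) - y[i]) .* X[i, :]    # gradient of 0.5 * squared error
        w .-= eta .* g
    end
    return w
end

# Randomly split the rows into k shards, run SGD independently on each
# (the map below could be a pmap across workers), then average the weights.
function averaged_sgd(X, y, k)
    perm = shuffle(1:size(X, 1))
    shards = [perm[i:k:end] for i in 1:k]
    ws = map(idx -> sgd_shard(X[idx, :], y[idx]), shards)
    return mean(ws)            # element-wise average of the k weight vectors
end

# Toy usage: recover a known coefficient vector from noisy data.
X = randn(10_000, 3)
w_true = [1.0, -2.0, 0.5]
y = X * w_true .+ 0.1 .* randn(10_000)
w_hat = averaged_sgd(X, y, 4)
```
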

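And a hedged sketch of the other approach John mentions above: keep the data sharded across workers and parallelize only the log-likelihood evaluation, so that just the parameter vector crosses the network on each call. The logistic-regression likelihood and the sharding setup here are hypothetical examples, not anything from the thread:

```julia
using Distributed
addprocs(4)

# Per-shard logistic-regression log likelihood; defined on every worker.
@everywhere shard_loglik(beta, X, y) =
    sum(y .* (X * beta) .- log1p.(exp.(X * beta)))

# Toy data, split row-wise into one shard per worker.
X = randn(100_000, 5)
beta_true = randn(5)
y = Float64.(rand(100_000) .< 1 ./ (1 .+ exp.(-X * beta_true)))
nw = nworkers()

# Ship each shard to its worker once and keep a reference to it.
shard_refs = [remotecall(identity, w, (X[i:nw:end, :], y[i:nw:end]))
              for (i, w) in enumerate(workers())]

# Each evaluation ships only `beta`; the data stays where it already is,
# and one number per shard comes back.
loglik(beta) = sum(fetch.([remotecall(r -> shard_loglik(beta, fetch(r)...), r.where, r)
                           for r in shard_refs]))

loglik(zeros(5))    # e.g. hand beta -> -loglik(beta) to an optimizer
```
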