Yes, at that size I begin to see your point :). I think the current implementation is useful in the realm of gigabytes, not petabytes.
Out of curiosity, what algorithm would you choose to handle such a large data set?

_Chris

On Tuesday, June 10, 2014 7:45:40 PM UTC-4, John Myles White wrote:
>
> Suppose you have 10 PB of data. What approach are you going to use to do this partitioning? Where are you going to store the permuted data? Are you going to delete the raw data or just pay for twice the storage capacity?
>
> -- John
>
> On Jun 10, 2014, at 4:28 PM, Christopher Fusting <[email protected]> wrote:
>
> Thanks for looking into this, John.
>
> I'm not sure I follow your first point exactly. Randomly partitioning the data up front and sending it to each process incurs minimal network overhead and thus seems feasible.
>
> Your second point is absolutely correct, and I agree it limits the theoretical application of this algorithm. However, in their experiments the authors do assess the algorithm on the unbounded squared error despite this limitation. Perhaps it's worth testing it empirically anyway.
>
> _Chris
>
> On Monday, June 9, 2014 10:54:16 PM UTC-4, John Myles White wrote:
>>
>> Thanks for the link, Chris. I haven't made it through everything (the proof technique is pretty advanced), but there do seem to be a few significant limitations for practical applications. Two I noticed are:
>>
>> (1) The assumption that the data is present across machines in a randomized order. Definitely not realistic on a standard Hadoop installation.
>>
>> (2) The assumption that both the cost function and its gradient are strictly bounded. Not true for lots of interesting models, including linear regression.
>>
>> Still very cool work (as is to be expected given the authors).
>>
>> -- John
>>
>> On Jun 9, 2014, at 10:53 AM, Christopher Fusting <[email protected]> wrote:
>>
>> The results in the paper from which this algorithm was implemented are encouraging:
>> http://www.research.rutgers.edu/~lihong/pub/Zinkevich11Parallelized.pdf
>>
>> The proof is a bit beyond me, so I can't vouch for the theory. I'm excited to test this on some non-trivial problems to see how it fares.
>>
>> _Chris
>>
>> On Monday, June 9, 2014 12:19:39 PM UTC-4, John Myles White wrote:
>>>
>>> My question is about the theory behind your algorithm. My understanding is that no parallel SGD implementation (except one that trivially runs on the same data) will produce correct results in general. Is that not true?
>>>
>>> -- John
>>>
>>> On Jun 9, 2014, at 9:07 AM, Christopher Fusting <[email protected]> wrote:
>>>
>>> John,
>>>
>>> There has been no rigorous testing yet. My primary concerns with the averaging algorithm are process latency, completion time, and faults. Do you have specifics you would like to share?
>>>
>>> _Chris
>>>
>>> On Mon, Jun 9, 2014 at 11:24 AM, John Myles White <[email protected]> wrote:
>>>
>>>> Very cool, Chris.
>>>>
>>>> I've done a lot of work on SGD in Julia, so I'm glad to see more.
>>>>
>>>> Regarding the averaging technique you're using, have you done much testing to see how well it works? My sense is that the algorithm you're using is a little brittle, but perhaps I've misunderstood it.
>>>>
>>>> -- John
>>>>
>>>> On Jun 8, 2014, at 11:36 AM, Christopher Fusting <[email protected]> wrote:
>>>>
>>>> Hi everyone. I've been playing around with Julia for a while now and have implemented Parallel Stochastic Gradient Descent. This is my first Julia project (and attempt at implementing this algorithm), so it's not perfect, but I think I have a good start and wanted to share it: https://github.com/cfusting/PSGD.jl. I welcome any feedback.
>>>>
>>>> Eventually I'd like to integrate the package with DataFrames and do a little optimization, especially on the algorithm that partitions the data.
>>>>
>>>> _Chris
>>>
>>> --
>>> Christopher W. Fusting
>>> *Software Developer / Analyst*
>>>
>>> @cfusting
>>> 828-772-0012
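For readers following along, the scheme under discussion is the one from the Zinkevich et al. paper linked above: shuffle the data, split it across k workers, run plain SGD independently on each split, and average the k resulting weight vectors. Below is a minimal, single-process sketch of that idea, using linear regression with squared error as the local objective; the names (`local_sgd`, `psgd_average`) and parameter choices are illustrative assumptions and are not taken from PSGD.jl.

```julia
# Sketch of parameter-averaging parallel SGD (after Zinkevich et al.),
# simulated serially. Not PSGD.jl's API; names and defaults are illustrative.
using LinearAlgebra, Random

# Plain SGD over one partition: one pass, constant step size.
# Objective per example: (x' * w - y)^2, so the gradient is 2 * (x' * w - y) * x.
function local_sgd(X::Matrix{Float64}, y::Vector{Float64}, eta::Float64)
    w = zeros(size(X, 2))
    for i in 1:size(X, 1)
        x = vec(X[i, :])
        grad = 2.0 * (dot(x, w) - y[i]) * x
        w -= eta * grad
    end
    return w
end

# Random partition, independent local runs, then average the k weight vectors.
function psgd_average(X::Matrix{Float64}, y::Vector{Float64}, k::Int, eta::Float64)
    n = size(X, 1)
    perm = randperm(n)                    # the paper's randomized-order assumption
    chunks = [perm[r:k:n] for r in 1:k]   # k roughly equal partitions
    ws = [local_sgd(X[idx, :], y[idx], eta) for idx in chunks]
    return reduce(+, ws) / k              # average the local solutions
end

# Toy usage: recover w ≈ [2.0, -3.0] from noisy synthetic data.
X = randn(10_000, 2)
y = X * [2.0, -3.0] + 0.01 * randn(10_000)
println(psgd_average(X, y, 4, 0.01))
```

In an actual distributed run each `local_sgd` call would execute on its own worker and only the k weight vectors would travel back over the network, which is where the low communication overhead mentioned above comes from. The squared-error objective also illustrates John's point (2): neither the cost nor its gradient is bounded in w, so the paper's guarantees do not strictly apply even though the authors test on it empirically.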
