Throw away 99.99% of your data and analyze what's left ;-)
> On Jun 10, 2014, at 8:02 PM, Christopher Fusting <[email protected]> wrote:
>
> Yes, at that size I begin to see your point :). I think the current
> implementation is useful in the realm of gigs, not petabytes.
>
> Out of curiosity, what algorithm would you choose to handle such a large
> data set?
>
> _Chris
>
>> On Tuesday, June 10, 2014 7:45:40 PM UTC-4, John Myles White wrote:
>>
>> Suppose you have 10 PB of data. What approach are you going to use to do
>> this partitioning? Where are you going to store the permuted data? Are you
>> going to delete the raw data or just pay for twice the storage capacity?
>>
>> -- John
>>
>>> On Jun 10, 2014, at 4:28 PM, Christopher Fusting <[email protected]> wrote:
>>>
>>> Thanks for looking into this, John.
>>>
>>> I'm not sure I follow your first point exactly. Randomly partitioning
>>> the data up front and sending it to each process incurs minimal network
>>> overhead and thus seems feasible.
>>>
>>> Your second point is absolutely correct, and I agree it limits the
>>> theoretical application of this algorithm. However, in the authors'
>>> experiments they do assess the algorithm on the unbounded squared error
>>> despite this limitation. Perhaps it's worth testing empirically anyway.
>>>
>>> _Chris
>>>
>>>> On Monday, June 9, 2014 10:54:16 PM UTC-4, John Myles White wrote:
>>>>
>>>> Thanks for the link, Chris. I haven't made it through everything (the
>>>> proof technique is pretty advanced), but there do seem to be a few
>>>> significant limitations for practical applications. Two I noticed are:
>>>>
>>>> (1) The assumption that data is present across machines in a randomized
>>>> order. Definitely not realistic on a standard Hadoop installation.
>>>>
>>>> (2) The assumption that both the cost function and its gradient are
>>>> strictly bounded. Not true for lots of interesting models, including
>>>> linear regression.
>>>>
>>>> Still very cool work (as is to be expected given the authors).
>>>>
>>>> -- John
>>>>
>>>>> On Jun 9, 2014, at 10:53 AM, Christopher Fusting <[email protected]> wrote:
>>>>>
>>>>> The results in the paper from which this algorithm was implemented are
>>>>> encouraging:
>>>>> http://www.research.rutgers.edu/~lihong/pub/Zinkevich11Parallelized.pdf
>>>>>
>>>>> The proof is a bit beyond me, so I cannot vouch for the theory. I'm
>>>>> excited to test this on some non-trivial problems to see how it fares.
>>>>>
>>>>> _Chris
>>>>>
>>>>>> On Monday, June 9, 2014 12:19:39 PM UTC-4, John Myles White wrote:
>>>>>>
>>>>>> My question is about the theory behind your algorithm. My understanding
>>>>>> is that no parallel SGD implementation (except one that trivially runs
>>>>>> on the same data) will produce correct results in general. Is that not
>>>>>> true?
>>>>>>
>>>>>> -- John
>>>>>>
>>>>>>> On Jun 9, 2014, at 9:07 AM, Christopher Fusting <[email protected]> wrote:
>>>>>>>
>>>>>>> John,
>>>>>>>
>>>>>>> There has been no rigorous testing yet. My primary concerns with the
>>>>>>> averaging algorithm are process latency, completion time, and faults.
>>>>>>> Do you have specifics you would like to share?
>>>>>>>
>>>>>>> _Chris
>>>>>>>
>>>>>>>> On Mon, Jun 9, 2014 at 11:24 AM, John Myles White
>>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Very cool, Chris.
>>>>>>>>
>>>>>>>> I've done a lot of work on SGD in Julia, so I'm glad to see more.
>>>>>>>>
>>>>>>>> Regarding the averaging technique you're using, have you done much
>>>>>>>> testing to see how well it works? My sense is that the algorithm
>>>>>>>> you're using is a little brittle, but perhaps I've misunderstood it.
>>>>>>>>
>>>>>>>> -- John
>>>>>>>>
>>>>>>>>> On Jun 8, 2014, at 11:36 AM, Christopher Fusting <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> Hi everyone. I've been playing around with Julia for a while now and
>>>>>>>>> have implemented Parallel Stochastic Gradient Descent. This is my
>>>>>>>>> first Julia project (and attempt at implementing this algorithm), so
>>>>>>>>> it's not perfect, but I think I have a good start and wanted to share
>>>>>>>>> it: https://github.com/cfusting/PSGD.jl. I welcome any feedback.
>>>>>>>>>
>>>>>>>>> Eventually I'd like to integrate the package with DataFrames and do
>>>>>>>>> a little optimization, especially on the algorithm that partitions
>>>>>>>>> the data.
>>>>>>>>>
>>>>>>>>> _Chris
>>>>>>>
>>>>>>> --
>>>>>>> Christopher W. Fusting
>>>>>>> Software Developer / Analyst
>>>>>>>
>>>>>>> @cfusting
>>>>>>> 828-772-0012
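
For readers following along, the averaging scheme being debated above (from the linked Zinkevich et al. paper, often called SimuParallelSGD) can be sketched roughly as follows. This is a minimal illustrative Python sketch, not the PSGD.jl code: every name here (`sgd_shard`, `parallel_sgd`, the toy data) is invented for illustration, the shards are processed sequentially rather than on separate workers, and the up-front `shuffle` stands in for the randomized-data-order assumption John raised in point (1).

```python
# Rough sketch of parameter-averaging parallel SGD, assuming squared
# error on a linear model. Each shard runs plain SGD independently
# (so the shard loop could be farmed out to workers), and the final
# model is the elementwise average of the per-shard weight vectors.
import random

def sgd_shard(shard, dim, lr=0.01, epochs=20):
    """Plain SGD for squared error on one shard of (x, y) pairs."""
    w = [0.0] * dim
    for _ in range(epochs):
        random.shuffle(shard)
        for x, y in shard:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w

def parallel_sgd(data, dim, k=4):
    """Randomly partition data into k shards, run SGD on each
    independently, then average the k resulting weight vectors."""
    random.shuffle(data)  # stands in for the randomized-order assumption
    shards = [list(data[i::k]) for i in range(k)]
    ws = [sgd_shard(s, dim) for s in shards]
    return [sum(col) / k for col in zip(*ws)]

# Toy usage: recover y = 2*x1 + 3*x2 from noiseless samples.
random.seed(0)
xs = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(400)]
data = [(x, 2 * x[0] + 3 * x[1]) for x in xs]
w = parallel_sgd(data, dim=2, k=4)  # w is approximately [2.0, 3.0]
```

On this easy noiseless problem the averaged model matches the true weights; the thread's open questions are about the harder cases, e.g. when the squared-error gradient is unbounded (John's point (2)) or when shuffling petabytes of data up front is impractical.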
