Sorry, I should have hedged more and said "your analysis *may* have some issues". I'm clearly not a statistician – I don't hedge my statements nearly enough :-P
On Wed, Jun 11, 2014 at 4:09 PM, John Myles White <[email protected]> wrote:

> If you have a consensus, then you're in good shape; if not, your analysis
> has some issues
>
> This isn't always true: if you don't have consensus, it could simply be
> because some features have distributions that are very hard to measure.
> This would happen, for example, if you were analyzing extremely sparse text
> data where rare terms are extremely good predictors of outcomes. In that
> case, you'd expect that almost all subsamples will lead you to assign
> exactly zero weight to that feature, but that a very small number will
> assign substantively non-zero weights.
>
> -- John
>
> On Jun 11, 2014, at 1:04 PM, Stefan Karpinski <[email protected]> wrote:
>
> Throwing 99.99% of your data away and analyzing the rest many times to see
> the distribution of analysis results is probably one of the most effective
> ways of analyzing very large amounts of data. If you have a consensus, then
> you're in good shape; if not, your analysis has some issues. Best of all,
> this is an embarrassingly parallel problem after the sampling step.
>
> On Wed, Jun 11, 2014 at 12:11 PM, John Myles White <[email protected]> wrote:
>
>> Stefan's half-joking suggestion is probably the most statistically valid
>> approach. It's unfortunately somewhat difficult to do with some systems
>> (Hive would be one example), but it's the right way to go for most
>> applications.
>>
>> I've seen a lot of people write a lot of SGD code that doesn't randomize
>> the data, but just works with it in the order it's found. That can work
>> out alright in some cases, but means that you lose all the nice
>> theoretical guarantees people go to lengths to prove.
>>
>> For really large-scale learning of simple models, it's also possible to
>> parallelize the evaluation of the log likelihood function and do
>> optimization on it. For some models, this can work out quite well: almost
>> all the run time of the algorithm is the log likelihood evaluation, which
>> means that the network latency of passing parameters around isn't
>> unforgivable.
>>
>> -- John
>>
>> On Jun 10, 2014, at 10:21 PM, Stefan Karpinski <[email protected]> wrote:
>>
>> Throw away 99.99% of your data and analyze what's left ;-)
>>
>> On Jun 10, 2014, at 8:02 PM, Christopher Fusting <[email protected]> wrote:
>>
>> Yes, at that size I begin to see your point :). I think the current
>> implementation is useful in the realm of gigs, not petabytes.
>>
>> Out of curiosity, what algorithm would you choose to handle such a large
>> data set?
>>
>> _Chris
>>
>> On Tuesday, June 10, 2014 7:45:40 PM UTC-4, John Myles White wrote:
>>>
>>> Suppose you have 10 PB of data. What approach are you going to use to do
>>> this partitioning? Where are you going to store the permuted data? Are
>>> you going to delete the raw data or just pay for twice the storage
>>> capacity?
>>>
>>> -- John
>>>
>>> On Jun 10, 2014, at 4:28 PM, Christopher Fusting <[email protected]> wrote:
>>>
>>> Thanks for looking into this, John.
>>>
>>> I'm not sure I follow your first point exactly. Randomly partitioning
>>> the data up front and sending it to each process incurs minimal network
>>> overhead and thus seems feasible.
>>>
>>> Your second point is absolutely correct, and I agree this does limit the
>>> theoretical application of this algorithm. However, in the authors'
>>> experiments they do assess the algorithm on the unbounded squared error
>>> despite this limitation. Perhaps it's worth testing it empirically
>>> anyway.
>>>
>>> _Chris
>>>
>>> On Monday, June 9, 2014 10:54:16 PM UTC-4, John Myles White wrote:
>>>>
>>>> Thanks for the link, Chris. Haven't made it through everything (the
>>>> proof technique is pretty advanced), but there do seem to be a few
>>>> significant limitations for practical applications. Two I noticed are:
>>>>
>>>> (1) Assumption that data is present across machines in a randomized
>>>> order. Definitely not realistic on a standard Hadoop installation.
>>>>
>>>> (2) Assumption that both the cost function and its gradient are
>>>> strictly bounded. Not true for lots of interesting models, including
>>>> linear regression.
>>>>
>>>> Still very cool work (as to be expected given the authors).
>>>>
>>>> -- John
>>>>
>>>> On Jun 9, 2014, at 10:53 AM, Christopher Fusting <[email protected]> wrote:
>>>>
>>>> The results in the paper from which this algorithm was implemented are
>>>> encouraging:
>>>> http://www.research.rutgers.edu/~lihong/pub/Zinkevich11Parallelized.pdf
>>>>
>>>> The proof is a bit beyond me, so I cannot vouch for the theory. I'm
>>>> excited to test this on some non-trivial problems to see how it fares.
>>>>
>>>> _Chris
>>>>
>>>> On Monday, June 9, 2014 12:19:39 PM UTC-4, John Myles White wrote:
>>>>>
>>>>> My question is about the theory behind your algorithm. My
>>>>> understanding is that no parallel SGD implementation (except one that
>>>>> trivially runs on the same data) will produce correct results in
>>>>> general. Is that not true?
>>>>>
>>>>> -- John
>>>>>
>>>>> On Jun 9, 2014, at 9:07 AM, Christopher Fusting <[email protected]> wrote:
>>>>>
>>>>> John,
>>>>>
>>>>> There has been no rigorous testing yet. My primary concerns with the
>>>>> averaging algorithm are process latency, completion time, and faults.
>>>>> Do you have specifics you would like to share?
>>>>>
>>>>> _Chris
>>>>>
>>>>> On Mon, Jun 9, 2014 at 11:24 AM, John Myles White <[email protected]> wrote:
>>>>>
>>>>>> Very cool, Chris.
>>>>>>
>>>>>> I've done a lot of work on SGD in Julia, so I'm glad to see more.
>>>>>>
>>>>>> Regarding the averaging technique you're using, have you done much
>>>>>> testing to see how well it works? My sense is that the algorithm
>>>>>> you're using is a little brittle, but perhaps I've misunderstood it.
>>>>>>
>>>>>> -- John
>>>>>>
>>>>>> On Jun 8, 2014, at 11:36 AM, Christopher Fusting <[email protected]> wrote:
>>>>>>
>>>>>> Hi everyone. I've been playing around with Julia for a while now and
>>>>>> have implemented Parallel Stochastic Gradient Descent. This is my
>>>>>> first Julia project (and attempt at implementing this algorithm), so
>>>>>> it's not perfect, but I think I have a good start and wanted to share
>>>>>> it: https://github.com/cfusting/PSGD.jl. I welcome any feedback.
>>>>>>
>>>>>> Eventually I'd like to integrate the package with DataFrames and do a
>>>>>> little optimization, especially on the algorithm that partitions the
>>>>>> data.
>>>>>>
>>>>>> _Chris
>>>>>
>>>>> --
>>>>> Christopher W. Fusting
>>>>> *Software Developer / Analyst*
>>>>>
>>>>> @cfusting
>>>>> 828-772-0012
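
For readers skimming the thread, here is a minimal Julia sketch of the parameter-averaging scheme from the Zinkevich et al. paper under discussion. It uses current Julia syntax and illustrative names, not PSGD.jl's actual API: randomize the data, split it into k disjoint chunks, run plain SGD independently on each chunk, and average the resulting weight vectors. The chunks are processed serially here to keep the example self-contained; in the parallel setting each chunk would live on its own worker.

    using Random

    # One SGD pass over a single chunk, using squared-error loss.
    function sgd_chunk(X, y, eta, w0)
        w = copy(w0)
        for i in 1:size(X, 1)
            x = vec(X[i, :])
            err = sum(w .* x) - y[i]   # residual for this example
            w -= eta * err * x         # gradient step for squared loss
        end
        return w
    end

    # Shuffle, split into k disjoint chunks, run SGD on each, then average.
    function averaged_sgd(X, y, k, eta)
        n, d = size(X)
        perm = randperm(n)                      # randomize before partitioning
        chunks = [perm[i:k:n] for i in 1:k]     # k disjoint index sets
        ws = [sgd_chunk(X[idx, :], y[idx], eta, zeros(d)) for idx in chunks]
        return sum(ws) / k                      # average the per-chunk solutions
    end

    # Toy check: should recover roughly [2.0, -3.0].
    X = randn(10_000, 2)
    y = X * [2.0, -3.0] + 0.1 * randn(10_000)
    w_hat = averaged_sgd(X, y, 4, 0.01)

Only the d-dimensional weight vectors ever need to cross the network after the partitioning step, which is what makes the scheme attractive; John's two caveats (data must already be randomly distributed across machines, and the cost and gradient must be bounded) apply to exactly this setup.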

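And a minimal sketch of the subsample-and-compare idea Stefan describes, again with illustrative names: draw many tiny random subsamples, refit the same analysis on each, and inspect the spread of the estimates. An ordinary least-squares solve stands in for whatever analysis you actually care about.

    using Random, Statistics

    # Placeholder for the analysis of interest; here, a least-squares fit.
    analyze(X, y) = X \ y

    # Fit the analysis on `reps` random subsamples of roughly `frac` of the rows.
    function subsample_estimates(X, y; frac = 1e-4, reps = 200)
        n = size(X, 1)
        m = max(size(X, 2) + 1, round(Int, frac * n))
        [analyze(X[idx, :], y[idx]) for idx in (randperm(n)[1:m] for _ in 1:reps)]
    end

    # Per-coefficient spread across subsamples; tight clustering is the
    # "consensus" case.
    coef_spread(ests) = std(hcat(ests...); dims = 2)

    X = randn(1_000_000, 3)
    y = X * [1.0, 0.0, -2.0] + randn(1_000_000)
    ests = subsample_estimates(X, y)
    coef_spread(ests)

Each refit is independent, so this is embarrassingly parallel after the sampling step; the spread just has to be read with John's caveat in mind, since a rare but highly predictive sparse feature will show up as "zero weight almost always, large weight occasionally" rather than as noisy disagreement.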