Sorry, I should have hedged more and said "your analysis *may* have some issues". I'm clearly not a statistician – I don't hedge my statements nearly enough :-P
On Wed, Jun 11, 2014 at 4:09 PM, John Myles White <[email protected]> wrote:

> If you have a consensus, then you're in good shape; if not, your analysis
> has some issues
>
> This isn't always true: if you don't have consensus, it could simply be
> because some features have distributions that are very hard to measure.
> This would happen, for example, if you were analyzing extremely sparse text
> data where rare terms are extremely good predictors of outcomes. In that
> case, you'd expect that almost all subsamples will lead you to assign
> exactly zero weight to that feature, but that a very small number will
> assign substantively non-zero weights.
>
> -- John
>
> On Jun 11, 2014, at 1:04 PM, Stefan Karpinski <[email protected]> wrote:
>
> Throwing 99.99% of your data away and analyzing the rest many times to see
> the distribution of analysis results is probably one of the most effective
> ways of analyzing very large amounts of data. If you have a consensus, then
> you're in good shape; if not, your analysis has some issues. Best of all,
> this is an embarrassingly parallel problem after the sampling step.
>
> On Wed, Jun 11, 2014 at 12:11 PM, John Myles White <[email protected]> wrote:
>
>> Stefan's half-joking suggestion is probably the most statistically valid
>> approach. It's unfortunately somewhat difficult to do with some systems
>> (Hive would be one example), but it's the right way to go for most
>> applications.
>>
>> I've seen a lot of people write a lot of SGD code that doesn't randomize
>> the data, but just works with it in the order it's found. That can work
>> out alright in some cases, but means that you lose all the nice
>> theoretical guarantees people go to lengths to prove.
>>
>> For really large-scale learning of simple models, it's also possible to
>> parallelize the evaluation of the log likelihood function and do
>> optimization on it. For some models, this can work out quite well: almost
>> all the run time of the algorithm is the log likelihood evaluation, which
>> means that the network latency of passing parameters around isn't
>> unforgivable.
>>
>> -- John
>>
>> On Jun 10, 2014, at 10:21 PM, Stefan Karpinski <[email protected]> wrote:
>>
>> Throw away 99.99% of your data and analyze what's left ;-)
>>
>> On Jun 10, 2014, at 8:02 PM, Christopher Fusting <[email protected]> wrote:
>>
>> Yes, at that size I begin to see your point :). I think the current
>> implementation is useful in the realm of gigs, not petabytes.
>>
>> Out of curiosity, what algorithm would you choose to handle such a large
>> data set?
>>
>> _Chris
>>
>> On Tuesday, June 10, 2014 7:45:40 PM UTC-4, John Myles White wrote:
>>>
>>> Suppose you have 10 PB of data. What approach are you going to use to do
>>> this partitioning? Where are you going to store the permuted data? Are
>>> you going to delete the raw data or just pay for twice the storage
>>> capacity?
>>>
>>> -- John
>>>
>>> On Jun 10, 2014, at 4:28 PM, Christopher Fusting <[email protected]> wrote:
>>>
>>> Thanks for looking into this, John.
>>>
>>> I'm not sure I follow your first point exactly. Randomly partitioning
>>> the data up front and sending it to each process incurs minimal network
>>> overhead and thus seems feasible.
>>>
>>> Your second point is absolutely correct, and I agree this does limit the
>>> theoretical application of this algorithm. However, in the authors'
>>> experiments they do assess the algorithm on the unbounded squared error
>>> despite this limitation. Perhaps it's worth testing it empirically
>>> anyway.
>>>
>>> _Chris
>>>
>>> On Monday, June 9, 2014 10:54:16 PM UTC-4, John Myles White wrote:
>>>>
>>>> Thanks for the link, Chris. Haven't made it through everything (the
>>>> proof technique is pretty advanced), but there do seem to be a few
>>>> significant limitations for practical applications. Two I noticed are:
>>>>
>>>> (1) Assumption that data is present across machines in a randomized
>>>> order. Definitely not realistic on a standard Hadoop installation.
>>>>
>>>> (2) Assumption that both the cost function and its gradient are
>>>> strictly bounded. Not true for lots of interesting models, including
>>>> linear regression.
>>>>
>>>> Still very cool work (as to be expected given the authors).
>>>>
>>>> -- John
>>>>
>>>> On Jun 9, 2014, at 10:53 AM, Christopher Fusting <[email protected]> wrote:
>>>>
>>>> The results in the paper from which this algorithm was implemented are
>>>> encouraging:
>>>> http://www.research.rutgers.edu/~lihong/pub/Zinkevich11Parallelized.pdf
>>>>
>>>> The proof is a bit beyond me, so I cannot vouch for the theory. I'm
>>>> excited to test this on some non-trivial problems to see how it fares.
>>>>
>>>> _Chris
>>>>
>>>> On Monday, June 9, 2014 12:19:39 PM UTC-4, John Myles White wrote:
>>>>>
>>>>> My question is about the theory behind your algorithm. My
>>>>> understanding is that no parallel SGD implementation (except one that
>>>>> trivially runs on the same data) will produce correct results in
>>>>> general. Is that not true?
>>>>>
>>>>> -- John
>>>>>
>>>>> On Jun 9, 2014, at 9:07 AM, Christopher Fusting <[email protected]> wrote:
>>>>>
>>>>> John,
>>>>>
>>>>> There has been no rigorous testing yet. My primary concerns with the
>>>>> averaging algorithm are process latency, completion time, and faults.
>>>>> Do you have specifics you would like to share?
>>>>>
>>>>> _Chris
>>>>>
>>>>> On Mon, Jun 9, 2014 at 11:24 AM, John Myles White <[email protected]> wrote:
>>>>>
>>>>>> Very cool, Chris.
>>>>>>
>>>>>> I've done a lot of work on SGD in Julia, so I'm glad to see more.
>>>>>>
>>>>>> Regarding the averaging technique you're using, have you done much
>>>>>> testing to see how well it works? My sense is that the algorithm
>>>>>> you're using is a little brittle, but perhaps I've misunderstood it.
>>>>>>
>>>>>> -- John
>>>>>>
>>>>>> On Jun 8, 2014, at 11:36 AM, Christopher Fusting <[email protected]> wrote:
>>>>>>
>>>>>> Hi everyone. I've been playing around with Julia for a while now and
>>>>>> have implemented Parallel Stochastic Gradient Descent. This is my
>>>>>> first Julia project (and attempt at implementing this algorithm), so
>>>>>> it's not perfect, but I think I have a good start and wanted to share
>>>>>> it: https://github.com/cfusting/PSGD.jl. I welcome any feedback.
>>>>>>
>>>>>> Eventually I'd like to integrate the package with DataFrames and do a
>>>>>> little optimization, especially on the algorithm that partitions the
>>>>>> data.
>>>>>>
>>>>>> _Chris
>>>>>
>>>>> --
>>>>> Christopher W. Fusting
>>>>> *Software Developer / Analyst*
>>>>>
>>>>> @cfusting
>>>>> 828-772-0012
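
For readers skimming the thread, here is a minimal Julia sketch of the parameter-averaging scheme from the Zinkevich et al. paper under discussion. It uses current Julia syntax and illustrative names, not PSGD.jl's actual API: randomize the data, split it into k disjoint chunks, run plain SGD independently on each chunk, and average the resulting weight vectors. The chunks are processed serially here to keep the example self-contained; in the parallel setting each chunk would live on its own worker.

    using Random

    # One SGD pass over a single chunk, using squared-error loss.
    function sgd_chunk(X, y, eta, w0)
        w = copy(w0)
        for i in 1:size(X, 1)
            x = vec(X[i, :])
            err = sum(w .* x) - y[i]   # residual for this example
            w -= eta * err * x         # gradient step for squared loss
        end
        return w
    end

    # Shuffle, split into k disjoint chunks, run SGD on each, then average.
    function averaged_sgd(X, y, k, eta)
        n, d = size(X)
        perm = randperm(n)                      # randomize before partitioning
        chunks = [perm[i:k:n] for i in 1:k]     # k disjoint index sets
        ws = [sgd_chunk(X[idx, :], y[idx], eta, zeros(d)) for idx in chunks]
        return sum(ws) / k                      # average the per-chunk solutions
    end

    # Toy check: should recover roughly [2.0, -3.0].
    X = randn(10_000, 2)
    y = X * [2.0, -3.0] + 0.1 * randn(10_000)
    w_hat = averaged_sgd(X, y, 4, 0.01)

Only the d-dimensional weight vectors ever need to cross the network after the partitioning step, which is what makes the scheme attractive; John's two caveats (data must already be randomly distributed across machines, and the cost and gradient must be bounded) apply to exactly this setup.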

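And a minimal sketch of the subsample-and-compare idea Stefan describes, again with illustrative names: draw many tiny random subsamples, refit the same analysis on each, and inspect the spread of the estimates. An ordinary least-squares solve stands in for whatever analysis you actually care about.

    using Random, Statistics

    # Placeholder for the analysis of interest; here, a least-squares fit.
    analyze(X, y) = X \ y

    # Fit the analysis on `reps` random subsamples of roughly `frac` of the rows.
    function subsample_estimates(X, y; frac = 1e-4, reps = 200)
        n = size(X, 1)
        m = max(size(X, 2) + 1, round(Int, frac * n))
        [analyze(X[idx, :], y[idx]) for idx in (randperm(n)[1:m] for _ in 1:reps)]
    end

    # Per-coefficient spread across subsamples; tight clustering is the
    # "consensus" case.
    coef_spread(ests) = std(hcat(ests...); dims = 2)

    X = randn(1_000_000, 3)
    y = X * [1.0, 0.0, -2.0] + randn(1_000_000)
    ests = subsample_estimates(X, y)
    coef_spread(ests)

Each refit is independent, so this is embarrassingly parallel after the sampling step; the spread just has to be read with John's caveat in mind, since a rare but highly predictive sparse feature will show up as "zero weight almost always, large weight occasionally" rather than as noisy disagreement.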