Yes, at that size I begin to see your point :).  I think the current 
implementation is useful in the realm of gigabytes, not petabytes.

Out of curiosity, what algorithm would you choose to handle such a large 
data set?
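
For reference, my rough mental model of the Zinkevich et al. scheme the 
package is based on looks something like the sketch below. It's Python rather 
than Julia, runs the "workers" serially on one machine, and the data, step 
size, and shard count are all made up for illustration, so treat it as a 
cartoon of the idea (partition randomly, run independent fixed-step SGD per 
shard, average the final parameters), not as what PSGD.jl actually does:

```python
import random

random.seed(0)

# Synthetic 1-D linear data, y = 3*x + noise (illustrative only).
xs = [random.uniform(-1, 1) for _ in range(4000)]
data = [(x, 3.0 * x + random.gauss(0, 0.1)) for x in xs]

def sgd(shard, lr=0.1, epochs=5):
    """Plain SGD on squared error for a 1-D linear model."""
    w = 0.0
    for _ in range(epochs):
        for x, y in shard:
            grad = 2.0 * (w * x - y) * x  # d/dw (w*x - y)^2
            w -= lr * grad
    return w

# Step 1: random partition up front (shuffle once, split into k shards).
k = 4
random.shuffle(data)
shards = [data[i::k] for i in range(k)]

# Step 2: each "worker" runs SGD independently on its shard
# (done serially here; in the real thing these run in parallel).
weights = [sgd(shard) for shard in shards]

# Step 3: average the per-worker parameters once at the end.
w_avg = sum(weights) / k
print(w_avg)
```

On this toy problem the averaged weight lands close to the true slope of 3, 
which is the behavior the paper's analysis is getting at.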

_Chris

On Tuesday, June 10, 2014 7:45:40 PM UTC-4, John Myles White wrote:
>
> Suppose you have 10 PB of data. What approach are you going to use to do 
> this partitioning? Where are you going to store the permuted data? Are you 
> going to delete the raw data or just pay for twice the storage capacity?
>
>  -- John
>
> On Jun 10, 2014, at 4:28 PM, Christopher Fusting <[email protected]> 
> wrote:
>
> Thanks for looking into this, John.  
>
> I'm not sure I follow your first point exactly.  Randomly partitioning 
> the data up front and sending it to each process incurs minimal network 
> overhead and thus seems feasible.
>
> Your second point is absolutely correct, and I agree it limits the 
> theoretical applicability of this algorithm.  However, in the authors' 
> experiments they do evaluate the algorithm on the unbounded squared error 
> despite this limitation.  Perhaps it's worth testing empirically anyway.
>
> _Chris
>
> On Monday, June 9, 2014 10:54:16 PM UTC-4, John Myles White wrote:
>>
>> Thanks for the link, Chris. Haven’t made it through everything (the proof 
>> technique is pretty advanced), but there do seem to be a few significant 
>> limitations for practical applications. Two I noticed are:
>>
>> (1) Assumption that data is present across machines in a randomized 
>> order. Definitely not realistic on a standard Hadoop installation.
>>
>> (2) Assumption that both the cost function and its gradient are strictly 
>> bounded. Not true for lots of interesting models, including linear 
>> regression.
>>
>> Still very cool work (as to be expected given the authors).
>>
>>  — John
>>
>> On Jun 9, 2014, at 10:53 AM, Christopher Fusting <[email protected]> 
>> wrote:
>>
>> The results in the paper on which this implementation is based are 
>> encouraging: 
>> http://www.research.rutgers.edu/~lihong/pub/Zinkevich11Parallelized.pdf
>>
>> The proof is a bit beyond me, so I cannot vouch for the theory.  I'm 
>> excited to test this on some non-trivial problems to see how it fares.
>>
>> _Chris
>>
>> On Monday, June 9, 2014 12:19:39 PM UTC-4, John Myles White wrote:
>>>
>>> My question is about the theory behind your algorithm. My understanding 
>>> is that no parallel SGD implementation (except one that trivially runs on 
>>> the same data) will produce correct results in general. Is that not true?
>>>
>>>  -- John
>>>
>>> On Jun 9, 2014, at 9:07 AM, Christopher Fusting <[email protected]> 
>>> wrote:
>>>
>>> John, 
>>>
>>> There has been no rigorous testing yet.  My primary concerns with the 
>>> averaging algorithm are process latency, completion time, and faults.  Do 
>>> you have specifics you would like to share?  
>>>
>>> _Chris
>>>
>>>
>>> On Mon, Jun 9, 2014 at 11:24 AM, John Myles White <[email protected]> 
>>> wrote:
>>>
>>>> Very cool, Chris.
>>>>
>>>> I’ve done a lot of work on SGD in Julia, so I’m glad to see more.
>>>>
>>>> Regarding the averaging technique you’re using, have you done much 
>>>> testing to see how well it works? My sense is that the algorithm you’re 
>>>> using is a little brittle, but perhaps I’ve misunderstood it.
>>>>
>>>>  — John
>>>>
>>>> On Jun 8, 2014, at 11:36 AM, Christopher Fusting <[email protected]> 
>>>> wrote:
>>>>
>>>> Hi everyone.  I've been playing around with Julia for a while now and 
>>>> have implemented Parallel Stochastic Gradient Descent.  This is my first 
>>>> Julia project (and my first attempt at implementing this algorithm), so 
>>>> it's not perfect, but I think I have a good start and wanted to share it: 
>>>> https://github.com/cfusting/PSGD.jl.  I welcome any feedback.
>>>>
>>>> Eventually I'd like to integrate the package with DataFrames and do a 
>>>> little optimization, especially on the algorithms that partition the data.
>>>>
>>>> _Chris
>>>>
>>>>
>>>>
>>>
>>>
>>> -- 
>>> Christopher W. Fusting
>>> *Software Developer / Analyst*
>>>
>>> @cfusting
>>> 828-772-0012
>>>  
>>>
>>>
>>
>
