Throw away 99.99% of your data and analyze what's left ;-)
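At petabyte scale, aggressive uniform subsampling really is a pragmatic first answer. A minimal sketch of the idea (illustrative only — `subsample` is a made-up name, not part of PSGD.jl):

```julia
# Hedged sketch: keep each record independently with probability p,
# then fit on whatever survives. One streaming pass, no shared state.
using Random

subsample(rows, p, rng) = [r for r in rows if rand(rng) < p]

rng = MersenneTwister(42)
# Keep ~0.01% of a million rows; each machine could do this to its own shard.
kept = subsample(1:1_000_000, 0.0001, rng)
println(length(kept))  # on the order of 100 rows survive
```

Because each record is kept independently, the pass parallelizes trivially across machines, and only the tiny surviving sample ever needs to move over the network.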

> On Jun 10, 2014, at 8:02 PM, Christopher Fusting <[email protected]> wrote:
> 
> Yes, at that size I begin to see your point :).  I think the current 
> implementation is useful in the realm of gigabytes, not petabytes.
> 
> Out of curiosity what algorithm would you choose to handle such a large data 
> set?
> 
> _Chris
> 
>> On Tuesday, June 10, 2014 7:45:40 PM UTC-4, John Myles White wrote:
>> Suppose you have 10 PB of data. What approach are you going to use to do 
>> this partitioning? Where are you going to store the permuted data? Are you 
>> going to delete the raw data or just pay for twice the storage capacity?
>> 
>>  -- John
>> 
>>> On Jun 10, 2014, at 4:28 PM, Christopher Fusting <[email protected]> wrote:
>>> 
>>> Thanks for looking into this, John.  
>>> 
>>> I'm not sure I follow your first point exactly.  Randomly partitioning 
>>> the data up front and sending it to each process incurs minimal network 
>>> overhead, so it seems feasible.
>>> 
>>> Your second point is absolutely correct, and I agree this does limit the 
>>> theoretical application of this algorithm.  However, in the author's 
>>> experiments they do assess the algorithm on the unbounded squared error 
>>> despite this limitation.  Perhaps it's worth testing it empirically anyway.
>>> 
>>> _Chris
>>> 
>>>> On Monday, June 9, 2014 10:54:16 PM UTC-4, John Myles White wrote:
>>>> Thanks for the link, Chris. Haven’t made it through everything (the proof 
>>>> technique is pretty advanced), but there do seem to be a few significant 
>>>> limitations for practical applications. Two I noticed are:
>>>> 
>>>> (1) Assumption that data is present across machines in a randomized order. 
>>>> Definitely not realistic on a standard Hadoop installation.
>>>> 
>>>> (2) Assumption that both the cost function and its gradient are strictly 
>>>> bounded. Not true for lots of interesting models, including linear 
>>>> regression.
>>>> 
>>>> Still very cool work (as to be expected given the authors).
>>>> 
>>>>  — John
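Point (2) is easy to see concretely for least squares. A tiny sketch (illustrative names, not from the paper or PSGD.jl):

```julia
# For linear regression with squared error ℓ(w) = (w*x - y)^2, the
# gradient dℓ/dw = 2x(w*x - y) grows without bound in w, so the paper's
# bounded-cost / bounded-gradient assumption fails for this model.
loss(w, x, y) = (w * x - y)^2
grad(w, x, y) = 2 * x * (w * x - y)

x, y = 1.0, 0.0
println(grad(10.0, x, y))    # 20.0
println(grad(1000.0, x, y))  # 2000.0 — the gradient scales linearly with w
```

This is why the paper's own squared-error experiments (mentioned above in the thread) sit outside the strict scope of the theorem.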
>>>> 
>>>>> On Jun 9, 2014, at 10:53 AM, Christopher Fusting <[email protected]> 
>>>>> wrote:
>>>>> 
>>>>> The results in the paper from which this algorithm was implemented are 
>>>>> encouraging: 
>>>>> http://www.research.rutgers.edu/~lihong/pub/Zinkevich11Parallelized.pdf
>>>>> 
>>>>> The proof is a bit beyond me so I cannot vouch for the theory.  I'm 
>>>>> excited to test this on some non-trivial problems to see how it fares.
>>>>> 
>>>>> _Chris
>>>>> 
>>>>>> On Monday, June 9, 2014 12:19:39 PM UTC-4, John Myles White wrote:
>>>>>> My question is about the theory behind your algorithm. My understanding 
>>>>>> is that no parallel SGD implementation (except one that trivially runs 
>>>>>> on the same data) will produce correct results in general. Is that not 
>>>>>> true?
>>>>>> 
>>>>>>  -- John
>>>>>> 
>>>>>>> On Jun 9, 2014, at 9:07 AM, Christopher Fusting <[email protected]> 
>>>>>>> wrote:
>>>>>>> 
>>>>>>> John, 
>>>>>>> 
>>>>>>> There has been no rigorous testing yet.  My primary concern in the 
>>>>>>> averaging algorithm is process latency, completion time, and faults.  
>>>>>>> Do you have specifics you would like to share?  
>>>>>>> 
>>>>>>> _Chris
>>>>>>> 
>>>>>>> 
>>>>>>>> On Mon, Jun 9, 2014 at 11:24 AM, John Myles White 
>>>>>>>> <[email protected]> wrote:
>>>>>>>> Very cool, Chris.
>>>>>>>> 
>>>>>>>> I’ve done a lot of work on SGD in Julia, so I’m glad to see more.
>>>>>>>> 
>>>>>>>> Regarding the averaging technique you’re using, have you done much 
>>>>>>>> testing to see how well it works? My sense is that the algorithm 
>>>>>>>> you’re using is a little brittle, but perhaps I’ve misunderstood it.
>>>>>>>> 
>>>>>>>>  — John
>>>>>>>> 
>>>>>>>>> On Jun 8, 2014, at 11:36 AM, Christopher Fusting <[email protected]> 
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>> Hi everyone.  I've been playing around with Julia for a while now and 
>>>>>>>>> have implemented Parallel Stochastic Gradient Descent.  This is my 
>>>>>>>>> first Julia project (and my first attempt at implementing this 
>>>>>>>>> algorithm), so it's not perfect, but I think I have a good start and 
>>>>>>>>> wanted to share it: https://github.com/cfusting/PSGD.jl.  I welcome 
>>>>>>>>> any feedback.
>>>>>>>>> 
>>>>>>>>> Eventually I'd like to integrate the package with DataFrames and do a 
>>>>>>>>> little optimization, especially in the algorithm that partitions the 
>>>>>>>>> data.
>>>>>>>>> 
>>>>>>>>> _Chris
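For context, the averaging scheme the thread is discussing (Zinkevich et al.'s SimuParallelSGD) can be sketched in a few lines. This is a simplified single-process simulation with made-up names, not the actual PSGD.jl API:

```julia
# Sketch of parameter-averaging parallel SGD: each "worker" runs plain
# SGD on its own random shard, and the final model is the average of
# the workers' parameters. Simulated here in one process.
using Random

# One pass of SGD for 1-D least squares over a shard of (x, y) pairs.
function sgd(shard; eta = 0.01)
    w = 0.0
    for (x, y) in shard
        w -= eta * 2 * x * (w * x - y)   # gradient of (w*x - y)^2
    end
    return w
end

rng = MersenneTwister(1)
# Synthetic data with true slope 3.0; the random order stands in for the
# paper's assumption that shards are random samples of the full data.
data = [(x, 3.0 * x + 0.01 * randn(rng)) for x in rand(rng, 10_000)]
shards = [data[i:4:end] for i in 1:4]        # 4 "workers"
w_avg = sum(sgd.(shards)) / length(shards)   # average the local models
println(w_avg)  # close to 3.0
```

The appeal is that workers never communicate until the single final average; the caveats raised earlier in the thread (data randomly placed across machines, bounded gradients) are exactly the assumptions this sketch quietly relies on.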
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> -- 
>>>>>>> Christopher W. Fusting
>>>>>>> Software Developer / Analyst
>>>>>>> 
>>>>>>> @cfusting
>>>>>>> 828-772-0012