Hi,

Since we last discussed memory partitioning for SGD, I did some literature
review on single-machine and multi-machine parallel machine-learning
approaches. It seems that GPU-based learning is the dominant form of
parallelism among single-machine approaches. I found this allreduce
<https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/IS140694.pdf>
approach quite interesting: it implements a data-parallel distributed SGD
and supports parameter averaging, gradient aggregation, and iteration. I'm
interested in participating in the SGD project. Would implementing this
scheme be an appropriate GSoC project?
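
To check my understanding, here is a minimal sketch of one synchronous
variant (gradient aggregation via allreduce). The MPI calls are real, but
LocalGradient and the sizes are hypothetical stand-ins, not mlpack code:

    #include <mpi.h>
    #include <vector>

    // Hypothetical stand-in: gradient over this rank's shard of the data.
    std::vector<double> LocalGradient(const std::vector<double>& w)
    {
      return std::vector<double>(w.size(), 0.0); // placeholder
    }

    int main(int argc, char** argv)
    {
      MPI_Init(&argc, &argv);
      int size;
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      const int dim = 100, maxEpochs = 50;
      double eta = 0.01;
      std::vector<double> w(dim, 0.0);   // model replicated on every rank

      for (int epoch = 0; epoch < maxEpochs; ++epoch)
      {
        std::vector<double> g = LocalGradient(w);
        std::vector<double> gSum(dim);
        // Sum the per-rank gradients; every rank receives the total.
        MPI_Allreduce(g.data(), gSum.data(), dim, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);
        for (int j = 0; j < dim; ++j)
          w[j] -= eta * (gSum[j] / size); // identical averaged update everywhere
      }

      MPI_Finalize();
      return 0;
    }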

Looking forward to your reply.

Hsienyu Meng

On Wed, Mar 15, 2017 at 11:17 AM, 孟憲妤 <[email protected]> wrote:

> Hi,
>
> Thanks for your reply.
> It would be a great help if you could point me to information or papers
> relating to memory partitioning or "columnar storage".
>
> Thanks for your patience.
>
> Meng
>
> 2017-03-14 13:31 GMT+08:00 Stephen Tu <[email protected]>:
>
>> Hi,
>>
>> I think these definitions are for analysis purposes only. Certainly, the
>> implementation should not try to track these or "break ties at random" --
>> that defeats the whole purpose! In fact, I think part of this project could
>> be to understand (empirically) how different consistency levels (meaning:
>> when you write to the model, what fences, if any, do you use, etc.) in the
>> algorithm affect convergence.
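>>
>> For concreteness, the two extremes might look like this with C++11
>> atomics (a toy one-coordinate sketch, illustrative only, not mlpack code):
>>
>>     #include <atomic>
>>
>>     std::atomic<double> w{0.0};  // one coordinate of the shared iterate
>>
>>     // HOGWILD!-style: relaxed ordering, no fences; racing threads may
>>     // read stale values and overwrite each other's updates.
>>     void UpdateRelaxed(double grad, double eta)
>>     {
>>       double cur = w.load(std::memory_order_relaxed);
>>       w.store(cur - eta * grad, std::memory_order_relaxed);
>>     }
>>
>>     // Strictest level: sequentially consistent loads/stores, which
>>     // typically compile down to full fences (mfence on x86).
>>     void UpdateSeqCst(double grad, double eta)
>>     {
>>       double cur = w.load(std::memory_order_seq_cst);
>>       w.store(cur - eta * grad, std::memory_order_seq_cst);
>>     }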
>>
>> As for memory partitioning, this is sort of open-ended. I've always
>> wondered if a more "columnar" storage format (i.e., storing the data by
>> features instead of by data points) has any benefits -- this was inspired
>> by a lot of the literature on column stores in databases. But you are free
>> to propose other ideas.
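>>
>> To illustrate what I mean by "columnar" (just a layout sketch):
>>
>>     #include <vector>
>>
>>     // n points, d features, each stored in one flat array.
>>     //
>>     // Row-oriented (by data point): point i is contiguous, so an SGD
>>     // step reading all features of one sampled point is cache-friendly.
>>     //   feature j of point i: rows[i * d + j]
>>     std::vector<double> rows;
>>
>>     // Column-oriented (by feature): feature j is contiguous across
>>     // points, favoring coordinate-wise sweeps (as in column stores).
>>     //   feature j of point i: cols[j * n + i]
>>     std::vector<double> cols;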
>>
>> On Sun, Mar 12, 2017 at 9:33 PM, 孟憲妤 <[email protected]> wrote:
>>
>>> Hello,
>>>
>>> Thanks for your reply.
>>> On page 4 of the paper: "Even though the processors have no knowledge
>>> as to whether any of the other processors have modified x, we define x_j
>>> to be the state of the decision variable x after j updates have been
>>> performed. Since two processors can write to x at the same time, we need
>>> to be a bit careful with this definition, but we simply break ties at
>>> random."
>>>
>>> I don't fully understand what "break ties at random" means. Does it
>>> indicate that we should store all x_k(j) for evaluating G_j(x_j)? How do
>>> we know that the state we read is the newest one (i.e., that no other
>>> processor is writing)?
>>> The ideas page at
>>> https://github.com/mlpack/mlpack/wiki/SummerOfCodeIdeas#parallel-stochastic-optimization-methods
>>> mentions that we should consider how to partition data in memory for good
>>> cache locality. Does that suggest we should take the L1/L2 cache sizes and
>>> the cache block size into account, and size the arrays so they are less
>>> prone to cache misses? (A rough sketch of what I imagine is below.)
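>>>
>>> For example (sizes here are assumptions, purely illustrative):
>>>
>>>     #include <algorithm>
>>>     #include <cstddef>
>>>
>>>     // Visit points in blocks small enough to (roughly) fit in L1.
>>>     void BlockedSweep(std::size_t n, std::size_t d)
>>>     {
>>>       const std::size_t l1Bytes = 32 * 1024;  // assumed L1 size
>>>       const std::size_t blockPoints =
>>>           std::max<std::size_t>(1, l1Bytes / (d * sizeof(double)));
>>>       for (std::size_t b = 0; b < n; b += blockPoints)
>>>         for (std::size_t i = b; i < std::min(n, b + blockPoints); ++i)
>>>         {
>>>           // SGD update using point i; the block stays cache-resident.
>>>         }
>>>     }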
>>>
>>> Thanks for your patience.
>>>
>>> Meng
>>>
>>>
>>> 2017-03-08 15:35 GMT+08:00 Stephen Tu <[email protected]>:
>>>
>>>> Hi,
>>>>
>>>> The practical way to run parallel SGD is to use a constant step size per
>>>> epoch, and then set the step size \eta <- \eta * \rho for some \rho \in
>>>> (0,1) (a typical value is 0.9). I would not recommend changing the step
>>>> size at every iteration. See the discussion in Section 5 (Robust 1/k
>>>> rates).
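>>>>
>>>> In code, the schedule is just (a sketch, not mlpack's API):
>>>>
>>>>     // Step size is held constant within an epoch, then decayed by rho.
>>>>     void RunEpochs(double eta0, double rho, int maxEpochs)
>>>>     {
>>>>       double eta = eta0;
>>>>       for (int epoch = 0; epoch < maxEpochs; ++epoch)
>>>>       {
>>>>         // ... one full pass of SGD updates, all using this epoch's eta ...
>>>>         eta *= rho;  // e.g. rho = 0.9; decay per epoch, not per update
>>>>       }
>>>>     }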
>>>>
>>>> I don't understand your second question, since there shouldn't be any locks?
>>>>
>>>> On Tue, Mar 7, 2017 at 5:57 AM, 孟憲妤 <[email protected]> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I'm HsienYu Meng from Taiwan, currently in my first year of graduate
>>>>> studies at Tsinghua University. As you mentioned, the SGD project should
>>>>> implement the HOGWILD! algorithm and consider how to set the step size
>>>>> at each iteration. However, the paper assumes the step size is constant
>>>>> so that formula 7 makes sense. If we change the step size at each
>>>>> iteration, will the properties of HOGWILD! still hold? Another question:
>>>>> if the dataset is large and we partition it in memory, which kind of
>>>>> lock-release scheme should we apply -- sequential consistency, release
>>>>> consistency, or lazy release consistency?
>>>>>
>>>>> Thanks for your patience!
>>>>>
>>>>> --
>>>>> 孟憲妤
>>>>> Meng Hsienyu
>>>>> Department of Computer Science, Tsinghua University
>>>>> Beijing, China 100084
>>>>> Mobile: (+86) 18201162149
>>>>>         (+886) 0952424693
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> 孟憲妤
>>> Meng XianYu
>>> Department of Electronic Engineering, Tsinghua University
>>> Beijing, China 100084
>>> Mobile: (+86) 18201162149
>>>         (+886) 0952424693
>>>
>>>
>>
>
>
> --
> 孟憲妤
> Meng XianYu
> Department of Electronic Engineering, Tsinghua University
> Beijing, China 100084
> Mobile: (+86) 18201162149
>         (+886) 0952424693
>
>


-- 
孟憲妤
HsienYu Meng
Department of Computer Science, Tsinghua University
Beijing, China 100084
Mobile: (+86) 18201162149
        (+886) 0952424693
_______________________________________________
mlpack mailing list
[email protected]
http://knife.lugatgt.org/cgi-bin/mailman/listinfo/mlpack
