Hi,

Since we last discussed memory partitioning for SGD, I have done some literature review on single-machine and multi-machine parallel machine-learning approaches. GPU-based learning seems to be the dominant form of parallelism among single-machine approaches. I found this allreduce approach <https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/IS140694.pdf> quite interesting: it implements data-parallel distributed SGD and supports both parameter averaging and gradient aggregation at each iteration. I'm interested in participating in the SGD project. Do you think implementing this scheme would be appropriate for a GSoC project? A rough sketch of the aggregation step, as I understand it, is below.
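To make the idea concrete, here is a minimal sketch of one synchronous gradient-aggregation step, assuming an MPI-style allreduce. The function and variable names are my own, not from the paper, and the per-point gradient is passed in rather than computed; parameter averaging would be the same pattern applied to the model instead of the gradient.

#include <mpi.h>
#include <cstddef>
#include <vector>

// One synchronous step of data-parallel SGD: each worker computes a
// gradient on its own data shard, then an allreduce sums the
// per-worker gradients so every worker applies the same averaged
// update.
void ParallelSGDStep(std::vector<double>& model,
                     const std::vector<double>& localGrad,
                     double stepSize)
{
  int worldSize;
  MPI_Comm_size(MPI_COMM_WORLD, &worldSize);

  // Sum the per-worker gradients across all processes; after this
  // call, every worker holds the identical summed gradient.
  std::vector<double> sumGrad(localGrad.size());
  MPI_Allreduce(localGrad.data(), sumGrad.data(),
                static_cast<int>(localGrad.size()),
                MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

  // Apply the averaged gradient, identically on every worker, so the
  // model replicas stay in sync without a parameter server.
  for (std::size_t i = 0; i < model.size(); ++i)
    model[i] -= stepSize * sumGrad[i] / worldSize;
}

Each worker would call ParallelSGDStep once per mini-batch on its own shard, so the replicas never diverge.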
Looking forward to your reply.

HsienYu Meng

On Wed, Mar 15, 2017 at 11:17 AM, 孟憲妤 <[email protected]> wrote:
> Hi,
>
> Thanks for your reply. It would be of great help if you could point me to
> papers or other material on memory partitioning or "columnar storage".
>
> Thanks for your patience.
>
> Meng
>
> 2017-03-14 13:31 GMT+08:00 Stephen Tu <[email protected]>:
>> Hi,
>>
>> I think these definitions are for analysis purposes only. Certainly, the
>> implementation should not try to track these or "break ties at random" --
>> that defeats the whole purpose! In fact, I think part of this project
>> could be to understand (empirically) how different consistency levels
>> (meaning: when you do a write to the model, what, if any, mfences do you
>> use, etc.) affect the algorithm's convergence.
>>
>> As for memory partitioning, this is sort of open ended. I've always
>> wondered whether a more "columnar" storage format (i.e. storing the data
>> by features instead of by data points) has any benefits -- this was
>> inspired by the literature on column stores in databases. But you are
>> free to propose other ideas.
>>
>> On Sun, Mar 12, 2017 at 9:33 PM, 孟憲妤 <[email protected]> wrote:
>>> Hello,
>>>
>>> Thanks for your reply. Page 4 of the paper says: "Even though the
>>> processors have no knowledge as to whether any of the other processors
>>> have modified x, we define x_j to be the state of the decision variable
>>> x after j updates have been performed. Since two processors can write
>>> to x at the same time, we need to be a bit careful with this
>>> definition, but we simply break ties at random."
>>>
>>> I don't fully understand what "break ties at random" means. Does it
>>> indicate that we should store every x_k(j) in order to evaluate
>>> G_j(x_j)? How do we know that the state we read is the newest state
>>> (i.e. that no other process is writing)?
>>>
>>> https://github.com/mlpack/mlpack/wiki/SummerOfCodeIdeas#parallel-stochastic-optimization-methods
>>> mentions that we should consider how to partition data in memory for
>>> good cache locality. Does that mean we should take the L1/L2 cache
>>> sizes and the cache line size into account, and choose array sizes
>>> that are not prone to cache misses?
>>>
>>> Thanks for your patience.
>>>
>>> Meng
>>>
>>> 2017-03-08 15:35 GMT+08:00 Stephen Tu <[email protected]>:
>>>> Hi,
>>>>
>>>> The practical way to run parallel SGD is to use a constant stepsize
>>>> per epoch, and then set the step size \eta <- \eta * \rho for some
>>>> \rho \in (0, 1) (a typical value is 0.9). I would not recommend
>>>> changing the step size at every iteration. See the discussion in
>>>> Section 5 (Robust 1/k rates).
>>>>
>>>> I don't understand your second question, since there shouldn't be any
>>>> locks?
>>>>
>>>> On Tue, Mar 7, 2017 at 5:57 AM, 孟憲妤 <[email protected]> wrote:
>>>>> Hello,
>>>>>
>>>>> I'm HsienYu Meng from Taiwan. I'm currently in my first year of
>>>>> graduate studies at Tsinghua University. As you mentioned, the SGD
>>>>> project should implement the HOGWILD! algorithm and consider setting
>>>>> the stepsize at each iteration. However, the paper assumes the
>>>>> stepsize is constant so that formula 7 makes sense. I wonder: if we
>>>>> change the stepsize at each iteration, will the properties of
>>>>> HOGWILD! still hold? Another question: if the dataset is large and
>>>>> we partition it in memory, which kind of lock-release consistency
>>>>> should we apply? Sequential consistency, release consistency, or
>>>>> lazy release consistency?
>>>>>
>>>>> Thanks for your patience!
>>>>>
>>>>> Meng

--
孟憲妤
HsienYu Meng
Department of Computer Science, Tsinghua University
Beijing, China

Mobile: (+86)18201162149
        (+886)0952424693
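P.S. Regarding the per-epoch step-size decay and the lock-free writes discussed above, this is the training loop structure I have in mind. It is only a sketch: Hogwild and pointGradient are names of my own, not mlpack API, and I use Armadillo since that is what mlpack builds on.

#include <armadillo>
#include <cstddef>
#include <functional>

// Sketch of a HOGWILD!-style loop: threads update the shared model
// without locks, and the step size is constant within an epoch and
// decayed afterwards (eta <- eta * rho), as Stephen suggested.
void Hogwild(const arma::mat& data,
             arma::vec& model,
             const std::function<arma::vec(const arma::vec&,
                                           const arma::vec&)>& pointGradient,
             double eta, double rho, std::size_t epochs)
{
  for (std::size_t epoch = 0; epoch < epochs; ++epoch)
  {
    #pragma omp parallel for
    for (long long i = 0; i < (long long) data.n_cols; ++i)
    {
      // Gradient of the loss at one data point (one column).
      const arma::vec grad = pointGradient(data.col(i), model);

      // Lock-free writes: HOGWILD! tolerates the resulting races as
      // long as each update touches only a few coordinates; a real
      // implementation would iterate over the nonzeros of grad only.
      for (arma::uword j = 0; j < model.n_elem; ++j)
        model[j] -= eta * grad[j];
    }

    eta *= rho;  // Decay once per epoch, not at every iteration.
  }
}

On the "columnar" question: as far as I understand mlpack's convention, Armadillo matrices are column-major and each column holds one data point, so storing the data by features would amount to working with the transposed matrix; I would like to benchmark both layouts.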
_______________________________________________
mlpack mailing list
[email protected]
http://knife.lugatgt.org/cgi-bin/mailman/listinfo/mlpack
