Yeah... some of his changes are actually tracking some of our stuff (not by
looking at our stuff, but by looking at the common upstream idea source).  The
active learning stuff is a different case and would be excellent to have.
Essentially, that involves sieving out the training examples that "don't
matter", which, in my experience, is typically 99+% of the data.  In practice,
down-sampling is the poor man's active learning, but real active learning
would be much, much better.
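
To make the contrast concrete, here is a toy sketch (my own illustration, not
the IWAL algorithm from the paper linked in the quoted post below):
down-sampling keeps a uniform random slice of the data, while an active
learner only asks for a label where the current model is unsure.  The class
name, the thresholds, and the fake scores are all placeholders.

    import java.util.Random;

    // Toy contrast between uniform down-sampling and uncertainty-based
    // selection.  Everything here is illustrative only.
    public class ExampleSelection {
      private static final Random RND = new Random(42);

      // "Poor man's active learning": keep a fixed random fraction of
      // examples, regardless of whether the model would learn from them.
      static boolean keepByDownSampling(double keepFraction) {
        return RND.nextDouble() < keepFraction;
      }

      // Active-learning flavor: ask for a label only where the current model
      // is uncertain, i.e. its predicted probability is close to 0.5.
      // Confidently classified examples (usually the vast majority) are
      // sieved out entirely.
      static boolean keepByUncertainty(double predictedProbability, double band) {
        return Math.abs(predictedProbability - 0.5) < band;
      }

      public static void main(String[] args) {
        int kept = 0, queried = 0, n = 100_000;
        for (int i = 0; i < n; i++) {
          double p = RND.nextDouble();          // stand-in for a model score
          if (keepByDownSampling(0.01)) kept++;
          if (keepByUncertainty(p, 0.02)) queried++;
        }
        System.out.printf("down-sampled: %d, actively queried: %d of %d%n",
                          kept, queried, n);
      }
    }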

On Thu, Dec 23, 2010 at 10:46 AM, Robin Anil <[email protected]> wrote:

> Worth a look, maybe some new ideas for sgd package
>
> Robin
>
> http://hunch.net/?p=1594
>
> Vowpal Wabbit, version 5.0, and the second heresy <http://hunch.net/?p=1594>
> Tags: Announcements <http://hunch.net/?cat=4>, Code <http://hunch.net/?cat=42>,
> Machine Learning <http://hunch.net/?cat=29> — jl <http://hunch.net/~jl> @ 9:52 pm
>
> I’ve released version 5.0 <http://hunch.net/~vw/vowpal_wabbit_v5.0.tar.gz> of
> the Vowpal Wabbit <http://hunch.net/~jl> online learning software. The
> major number has changed since the last release <http://hunch.net/?p=1068>
> because I regard all earlier versions as obsolete—there are several new
> algorithms & features including substantial changes and upgrades to the
> default learning algorithm.
>
> The biggest changes are new algorithms:
>
>    1. Nikos <http://www.cs.cornell.edu/~nk/> and I improved the default
>    algorithm. The basic update rule still uses gradient descent, but the size
>    of the update is carefully controlled so that it’s impossible to overrun
>    the label. In addition, the normalization has changed. Computationally,
>    these changes are virtually free and yield better results, sometimes much
>    better. Less careful updates can be re-enabled with --loss_function
>    classic, although results are still not identical to previous versions due
>    to the normalization changes.
>    2. Nikos also implemented the per-feature learning rates as per these two
>    papers <http://www.colt2010.org/papers/104mcmahan.pdf>
>    <http://www.colt2010.org/papers/023Duchi.pdf>. Often, this works better
>    than the default algorithm. It isn’t the default because it isn’t (yet) as
>    adaptable in terms of learning rate decay. This is enabled with --adaptive
>    and learned regressors are compatible with the default. Computationally,
>    you might see a factor of 4 slowdown if using ‘-q’. Nikos noticed that the
>    phenomenal quake inverse square root hack
>    <http://en.wikipedia.org/wiki/Fast_inverse_square_root> applies, making
>    this substantially faster than a naive implementation.
>    3. Nikos and Daniel <http://cseweb.ucsd.edu/~djhsu/> also implemented
>    active learning derived from this paper
>    <http://cseweb.ucsd.edu/~djhsu/papers/iwal-cal.pdf>, usable via
>    --active_simulation (to test parameters on an existing supervised dataset)
>    or --active_learning (to do the real thing). This runs at full speed,
>    which is much faster than is reasonable in any active learning scenario.
>    We see this approach dominating supervised learning on all classification
>    datasets so far, often with far fewer labeled examples required, as the
>    theory predicts. The learned predictor is compatible with the default.
>    4. Olivier <http://research.yahoo.com/Olivier_Chapelle> helped me
>    implement preconditioned conjugate gradient based on Jonathan Shewchuk’s
>    <http://www.cs.berkeley.edu/~jrs/> tutorial
>    <http://www.cs.cmu.edu/~quake-papers/painless-conjugate-gradient.pdf>.
>    This is a *batch* algorithm and hence requires multiple passes over any
>    dataset to do something useful. Each step of conjugate gradient requires 2
>    passes. The advantage of cg is that it converges relatively quickly via
>    the use of second derivative information. This can be particularly helpful
>    if your features are of widely differing scales. The use of
>    --regularization 0.001 (or smaller) is almost required with
>    --conjugate_gradient as it will otherwise overfit hard. This
>    implementation has two advantages over the basic approach: it implicitly
>    computes a Hessian in *O(n)* time where *n* is the number of features, and
>    it operates out of core, hence making it applicable to datasets that don’t
>    conveniently fit in RAM. The learned predictor is compatible with the
>    default, although you’ll notice that a factor of 8 more RAM is required
>    when learning.
>    5. Matt Hoffman <http://www.cs.princeton.edu/~mdhoffma/> and I
>    implemented Online Latent Dirichlet Allocation
>    <http://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf>. This
>    code is still experimental and likely to change over the next week. It
>    really does a minibatch update under the hood. The code appears to be
>    substantially faster than Matt’s earlier Python implementation, making
>    this probably the most efficient LDA anywhere. LDA is still much slower
>    than online linear learning as it is quite computationally heavy in
>    comparison—perhaps a good candidate for GPU optimization.
>    6. Nikos, Daniel, and I have been experimenting with more online cluster
>    parallel learning algorithms (--corrective, --backprop, --delayed_global).
>    We aren’t yet satisfied with these although they are improving. Details
>    are at the LCCC workshop <http://lccc.eecs.berkeley.edu/>.
>
> In addition, Ariel <http://www.yendor.com/> added a test suite, Shravan
> <http://research.yahoo.com/Shravan_Narayanamurthy> helped with ngrams, and
> there are several other minor new features and bug fixes including a very
> subtle one caught by Vaclav <http://www.occamslab.com/petricek/>.
>
> The documentation on the website hasn’t kept up with the code. I’m planning
> to rectify that over the next week, and have a new tutorial starting at 2pm
> in the LCCC <http://lccc.eecs.berkeley.edu/schedule.html> room for those
> interested. Yes, I’ll not be skiing :)
>
>
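
For anyone poking at this from the sgd side, here is a rough, generic sketch
of the per-feature (adaptive) learning-rate idea from item 2 above, together
with the quake inverse square root trick it mentions.  This is not VW's code;
the sparse feature layout, the squared loss, and the base rate eta are my own
placeholder choices.

    // Generic sketch of per-feature adaptive learning rates in the spirit of
    // the two papers linked in item 2.  Not VW's implementation.
    public class AdaptiveSgd {
      private final double[] weights;
      private final double[] sumSqGrad;   // per-feature sum of squared gradients
      private final double eta;           // base learning rate (placeholder)

      public AdaptiveSgd(int numFeatures, double eta) {
        this.weights = new double[numFeatures];
        this.sumSqGrad = new double[numFeatures];
        this.eta = eta;
      }

      // One squared-loss SGD step on a sparse example given as parallel
      // index/value arrays.
      public void update(int[] idx, double[] val, double label) {
        double prediction = 0;
        for (int k = 0; k < idx.length; k++) {
          prediction += weights[idx[k]] * val[k];
        }
        double err = prediction - label;     // d(loss)/d(prediction) for 0.5*err^2
        for (int k = 0; k < idx.length; k++) {
          double g = err * val[k];           // gradient for this coordinate
          sumSqGrad[idx[k]] += g * g;
          // per-feature step shrinks as 1/sqrt of accumulated squared gradient
          weights[idx[k]] -= eta * g * fastInvSqrt(sumSqGrad[idx[k]] + 1e-8);
        }
      }

      // The classic "quake" fast inverse square root mentioned in item 2, in
      // single precision; one Newton step refines the bit-level estimate.
      static float fastInvSqrt(double xd) {
        float x = (float) xd;
        float half = 0.5f * x;
        int i = Float.floatToRawIntBits(x);
        i = 0x5f3759df - (i >> 1);
        float y = Float.intBitsToFloat(i);
        y = y * (1.5f - half * y * y);
        return y;
      }
    }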
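
And a minimal, in-memory sketch of what the preconditioned conjugate gradient
in item 4 amounts to: a Jacobi-preconditioned CG loop solving a
ridge-regularized least-squares problem, in the style of the Shewchuk
tutorial.  It is not VW's out-of-core implementation; the dense arrays and the
lambda value stand in for --regularization.

    // Jacobi-preconditioned conjugate gradient for (X'X + lambda*I) w = X'y.
    // Textbook loop only; illustrative, not VW's code.
    public class PcgRidge {
      static double[] solve(double[][] x, double[] y, double lambda, int iters) {
        int n = x[0].length;                  // number of features
        double[] b = multTranspose(x, y);     // b = X'y
        double[] w = new double[n];           // start at w = 0
        double[] r = b.clone();               // residual r = b - A*w = b
        double[] precond = new double[n];     // Jacobi: inverse diagonal of A
        for (int j = 0; j < n; j++) {
          double d = lambda;
          for (double[] row : x) d += row[j] * row[j];
          precond[j] = 1.0 / d;
        }
        double[] z = hadamard(precond, r);
        double[] p = z.clone();
        double rz = dot(r, z);
        for (int it = 0; it < iters && rz > 1e-12; it++) {
          double[] ap = applyA(x, p, lambda); // A*p costs two passes over the data
          double alpha = rz / dot(p, ap);
          for (int j = 0; j < n; j++) { w[j] += alpha * p[j]; r[j] -= alpha * ap[j]; }
          z = hadamard(precond, r);
          double rzNew = dot(r, z);
          double beta = rzNew / rz;
          for (int j = 0; j < n; j++) p[j] = z[j] + beta * p[j];
          rz = rzNew;
        }
        return w;
      }

      // A*p = X'(X p) + lambda*p, computed without forming X'X explicitly.
      static double[] applyA(double[][] x, double[] p, double lambda) {
        double[] xp = new double[x.length];
        for (int i = 0; i < x.length; i++) xp[i] = dot(x[i], p);
        double[] out = multTranspose(x, xp);
        for (int j = 0; j < p.length; j++) out[j] += lambda * p[j];
        return out;
      }

      static double[] multTranspose(double[][] x, double[] v) {
        double[] out = new double[x[0].length];
        for (int i = 0; i < x.length; i++)
          for (int j = 0; j < out.length; j++) out[j] += x[i][j] * v[i];
        return out;
      }

      static double[] hadamard(double[] a, double[] b) {
        double[] out = new double[a.length];
        for (int j = 0; j < a.length; j++) out[j] = a[j] * b[j];
        return out;
      }

      static double dot(double[] a, double[] b) {
        double s = 0;
        for (int j = 0; j < a.length; j++) s += a[j] * b[j];
        return s;
      }

      public static void main(String[] args) {
        double[][] x = {{1, 0}, {0, 1}, {1, 1}};
        double[] y = {1, 2, 3};
        System.out.println(java.util.Arrays.toString(solve(x, y, 0.001, 50)));
      }
    }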
