Just gave the article a quick read.  I think it pushes on some key
issues for sure.  I definitely agree with the focus on Python/Jupyter
as essential for a productive workflow that gets the best out of
research scientists.  We've been thinking about what ORES 2.0 would look
like, and event streams are the dominant proposal for improving on the
limitations of our queue-based worker pool.

One of the nice things about ORES/revscoring is that it provides a
framework for running the *exact same code* no matter the
environment.  E.g. it doesn't matter if we're calling out to an API to get
data for feature extraction or providing it via a stream.  By investing in
a dependency injection strategy, we get that flexibility.  So to me, the
hardest problem -- the one I don't quite know how to solve -- is how we'll
mix and merge streams to get all of the data we want available for feature
extraction.  If I understand correctly, that's where Kafka shines.  :)
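
To make the dependency injection point concrete, here's a minimal sketch.
It is not revscoring's actual API; every name in it (extract_features,
make_api_datasource, make_stream_datasource) is hypothetical and only
illustrates the idea that the feature code stays the same while the data
source is swapped out.

```python
from typing import Callable, Dict

# Hypothetical names throughout; this is NOT revscoring's actual API.
# A datasource maps a revision ID to the raw data for that revision.
Datasource = Callable[[int], dict]


def extract_features(rev_id: int, fetch_revision: Datasource) -> Dict[str, float]:
    """Compute features from whatever the injected datasource returns."""
    rev = fetch_revision(rev_id)
    text = rev["text"]
    return {
        "chars": float(len(text)),
        "words": float(len(text.split())),
    }


def make_api_datasource(lookup: Dict[int, dict]) -> Datasource:
    """Stand-in for an API call: looks the revision up on demand."""
    def fetch(rev_id: int) -> dict:
        return lookup[rev_id]
    return fetch


def make_stream_datasource(event: dict) -> Datasource:
    """Stand-in for a stream: the event already carries the data."""
    def fetch(rev_id: int) -> dict:
        if event["rev_id"] != rev_id:
            raise KeyError(rev_id)
        return event
    return fetch


# The same extraction code runs against either source.
revision = {"rev_id": 42, "text": "Hello stream world"}
from_api = extract_features(42, make_api_datasource({42: revision}))
from_stream = extract_features(42, make_stream_datasource(revision))
```

The extraction function never knows where the data came from, which is
the flexibility I mean.  What the sketch doesn't address is the hard part:
a real stream event rarely carries everything we need, so several streams
would have to be joined first.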

I'm definitely interested in fleshing out this proposal.  We should
probably be exploring the processes for training new types of models (e.g.
image processing) using different strategies than ORES.  In ORES, we're
almost entirely focused on using sklearn, but we have some basic
abstractions for other estimator libraries.  We also make some strong
assumptions about running on a single CPU that could probably be relaxed
to get some performance gains from real concurrency.

-Aaron

On Thu, Feb 7, 2019 at 10:05 AM Goran Milovanovic <
goran.milovanovic_...@wikimedia.de> wrote:

> Hi Andrew,
>
> I have recently started a six month AI/Machine Learning Engineering course
> which focuses exactly on the topics that you've shown interest in.
>
> So,
>
> >>>  I'd love it if we had a working group (or whatever) that focused on
> how to standardize how we train and deploy ML for production use.
>
> Count me in.
>
> Regards,
> Goran
>
>
> Goran S. Milovanović, PhD
> Data Scientist, Software Department
> Wikimedia Deutschland
>
> ------------------------------------------------
> "It's not the size of the dog in the fight,
> it's the size of the fight in the dog."
> - Mark Twain
> ------------------------------------------------
>
>
> On Thu, Feb 7, 2019 at 4:16 PM Andrew Otto <o...@wikimedia.org> wrote:
>
>> Just came across
>>
>> https://www.confluent.io/blog/machine-learning-with-python-jupyter-ksql-tensorflow
>>
>> In it, the author discusses some of what he calls the 'impedance
>> mismatch' between data engineers and production engineers.  The links to
>> Uber's Michelangelo <https://eng.uber.com/michelangelo/> (which as far as
>> I can tell has not been open sourced) and the Hidden Technical Debt in
>> Machine Learning Systems paper
>> <https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf>
>> are also very interesting!
>>
>> At All Hands I've been hearing more and more about using ML in
>> production, so these things seem very relevant to us.  I'd love it if we
>> had a working group (or whatever) that focused on how to standardize how we
>> train and deploy ML for production use.
>>
>> :)
>> _______________________________________________
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>

-- 

Aaron Halfaker

Principal Research Scientist

Head of the Scoring Platform team
Wikimedia Foundation
