FYI, since most of the movement in image classification seems to have been in GLAM/arts, I thought folks might find this interesting: in the GLAM world, the Barnes Foundation has been a leader, as their collections management has been unconventional, based more on image similarity than on traditional taxonomic classification.
You may find their observations on experimenting with machine learning interesting:
https://medium.com/barnes-foundation/using-computer-vision-to-tag-the-collection-f467c4541034
https://github.com/BarnesFoundation/barnes-tms-extract/blob/master/DATASCIENCE.md
https://collection.barnesfoundation.org/
http://www.attractionsmanagement.com/index.cfm?pagetype=news&codeID=338394

On Tue, Feb 12, 2019 at 6:25 AM Andrew Lih <[email protected]> wrote:

> FYI, folks might be interested in what we've been doing with The Met
> Museum in NYC and machine learning. Writeup in the latest GLAM newsletter:
>
> https://outreach.wikimedia.org/wiki/GLAM/Newsletter/January_2019/Contents/USA_report
>
> TL;DR - Andrew worked with Jennie Choi, The Met's General Manager of
> Collection Information, and Nina Diamond, Managing Editor and Producer,
> along with Microsoft researchers Patrick Buehler, J.S. Tan and Sam Kazemi
> Nafchi to train a machine learning model on Microsoft Azure that could
> predict labels for artworks. Using the Met's roughly 1,000-word art
> vocabulary and representative images to help train the model, a
> proof-of-concept app was developed at the hackathon. The results were
> impressive enough that Andrew finished the creation of a Wikidata
> Distributed Game, Depicts, to connect the subject keyword recommendations
> to Wikidata.
>
> -Andrew
>
> On Thu, Feb 7, 2019 at 2:06 PM Nuria Ruiz <[email protected]> wrote:
>
>> Team,
>>
>> Since everyone is here, we will be working on a machine learning
>> infrastructure program this year. I will set up meetings with everyone on
>> this thread and some others in SRE and Audiences to get a "bag of
>> requests" of things that are missing. The first round of talks, which I
>> hope to finish next week, is to hear what everyone's requests and ideas
>> are. I will be sending meeting invites today and tomorrow. I think some
>> themes will emerge from those.
>> Thus far, it is pretty clear that we need a better way to deploy models
>> to production (right now we deploy them to Elasticsearch in very crafty
>> ways, for example), we need an answer to the GPU issues around training
>> models, we need a "recommended way" in which we train and compute, we
>> need some unified system for tracking models + data + tests, and
>> finally, there are probably many learnings from the work done on ORES
>> thus far.
>>
>> Thanks,
>>
>> Nuria
>>
>> On Thu, Feb 7, 2019 at 8:40 AM Miriam Redi <[email protected]> wrote:
>>
>>> Hey Andrew!
>>>
>>> Thank you so much for sharing this and starting this conversation. We
>>> had a meeting at All Hands with all the people interested in "Image
>>> Classification" (https://phabricator.wikimedia.org/T215413), and one of
>>> the open questions was exactly how to find a "common repository" for ML
>>> models that different groups and products within the organization can
>>> use. So, please, count me in!
>>>
>>> Thanks,
>>>
>>> M
>>>
>>> On Thu, Feb 7, 2019 at 4:38 PM Aaron Halfaker <[email protected]>
>>> wrote:
>>>
>>>> Just gave the article a quick read. I think this article pushes on
>>>> some key issues for sure. I definitely agree with the focus on
>>>> Python/Jupyter as essential for a productive workflow that leverages
>>>> the best from research scientists. We've been thinking about what ORES
>>>> 2.0 would look like, and event streams are the dominant proposal for
>>>> improving on the limitations of our queue-based worker pool.
>>>>
>>>> One of the nice things about ORES/revscoring is that it provides a
>>>> nice framework for operating using the *exact same code* no matter the
>>>> environment. E.g. it doesn't matter if we're calling out to an API to
>>>> get data for feature extraction or providing it via a stream. By
>>>> investing in a dependency injection strategy, we get that flexibility.
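[Editor's note: a minimal sketch of the dependency-injection idea Aaron describes above. All names here (`chars_added`, `api_source`, `mw_api`, etc.) are illustrative stand-ins, not revscoring's actual API; the point is only that the extraction code stays identical whatever datasource is injected.]

```python
from typing import Callable, Dict

# A "datasource" is anything that returns the raw values features need.
# Injecting it means the extraction code is identical whether the values
# come from a live API call or from a consumed event stream.
Datasource = Callable[[int], Dict]  # rev_id -> raw revision data

def chars_added(raw: Dict) -> int:
    return max(0, len(raw["text"]) - len(raw["parent_text"]))

def is_anon(raw: Dict) -> bool:
    return raw.get("user_id") is None

FEATURES = {"chars_added": chars_added, "is_anon": is_anon}

def extract(rev_id: int, source: Datasource) -> Dict:
    """Run every feature against whatever datasource was injected."""
    raw = source(rev_id)
    return {name: fn(raw) for name, fn in FEATURES.items()}

# Environment 1: an API-backed source (hypothetical client call).
def api_source(rev_id: int) -> Dict:
    # e.g. return mw_api.get_revision(rev_id) in a real deployment
    return {"text": "hello world", "parent_text": "hello", "user_id": None}

# Environment 2: a stream-backed source fed by previously consumed events.
stream_cache = {42: {"text": "hello world", "parent_text": "hello", "user_id": None}}
def stream_source(rev_id: int) -> Dict:
    return stream_cache[rev_id]

# Same extraction code, different injected source:
print(extract(42, api_source) == extract(42, stream_source))  # True
```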
So to me, the
>>>> hardest problem -- the one I don't quite know how to solve -- is how
>>>> we'll mix and merge streams to get all of the data we want available
>>>> for feature extraction. If I understand correctly, that's where Kafka
>>>> shines. :)
>>>>
>>>> I'm definitely interested in fleshing out this proposal. We should
>>>> probably be exploring processes for training new types of models (e.g.
>>>> image processing) using different strategies than ORES. In ORES, we're
>>>> almost entirely focused on using sklearn, but we have some basic
>>>> abstractions for other estimator libraries. We also make some strong
>>>> assumptions about running on a single CPU that could probably be
>>>> broken for some performance gains using real concurrency.
>>>>
>>>> -Aaron
>>>>
>>>> On Thu, Feb 7, 2019 at 10:05 AM Goran Milovanovic
>>>> <[email protected]> wrote:
>>>>
>>>>> Hi Andrew,
>>>>>
>>>>> I have recently started a six-month AI/Machine Learning Engineering
>>>>> course which focuses on exactly the topics that you've shown interest
>>>>> in.
>>>>>
>>>>> So,
>>>>>
>>>>> >>> I'd love it if we had a working group (or whatever) that focused
>>>>> on how to standardize how we train and deploy ML for production use.
>>>>>
>>>>> Count me in.
>>>>>
>>>>> Regards,
>>>>> Goran
>>>>>
>>>>> Goran S. Milovanović, PhD
>>>>> Data Scientist, Software Department
>>>>> Wikimedia Deutschland
>>>>>
>>>>> ------------------------------------------------
>>>>> "It's not the size of the dog in the fight,
>>>>> it's the size of the fight in the dog."
>>>>> - Mark Twain
>>>>> ------------------------------------------------
>>>>>
>>>>> On Thu, Feb 7, 2019 at 4:16 PM Andrew Otto <[email protected]> wrote:
>>>>>
>>>>>> Just came across
>>>>>> https://www.confluent.io/blog/machine-learning-with-python-jupyter-ksql-tensorflow
>>>>>>
>>>>>> In it, the author discusses some of what he calls the "impedance
>>>>>> mismatch" between data engineers and production engineers.
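[Editor's note: a toy sketch of the stream mixing-and-merging problem Aaron raises above -- buffering partial events per `rev_id` and emitting a merged record once every topic has contributed, with plain Python lists standing in for Kafka topics. The topic names (`revision-create`, `page-meta`) and event fields are hypothetical, not actual Wikimedia stream schemas.]

```python
from typing import Dict, Iterator, List, Set, Tuple

def merge_streams(events: List[Tuple[str, dict]]) -> Iterator[dict]:
    """Merge interleaved (topic, event) pairs keyed by rev_id.

    Emits one combined record per rev_id once an event from every topic
    has been seen for it -- a stand-in for a keyed Kafka stream join.
    """
    topics = {topic for topic, _ in events}
    pending: Dict[int, dict] = {}       # partially assembled records
    seen: Dict[int, Set[str]] = {}      # which topics contributed so far
    for topic, ev in events:
        rid = ev["rev_id"]
        pending.setdefault(rid, {"rev_id": rid}).update(
            {k: v for k, v in ev.items() if k != "rev_id"})
        seen.setdefault(rid, set()).add(topic)
        if seen[rid] == topics:
            yield pending.pop(rid)      # record complete: emit it

interleaved = [
    ("revision-create", {"rev_id": 42, "text": "hello world"}),
    ("revision-create", {"rev_id": 43, "text": "lorem"}),
    ("page-meta",       {"rev_id": 42, "namespace": 0}),
]
for record in merge_streams(interleaved):
    print(record)  # {'rev_id': 42, 'text': 'hello world', 'namespace': 0}
```

(rev_id 43 never completes, so it is never emitted; a real stream join would also need windowing and expiry for such stragglers, which Kafka Streams provides.)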
The links to
>>>>>> Uber's Michelangelo <https://eng.uber.com/michelangelo/> (which as
>>>>>> far as I can tell has not been open sourced) and the Hidden
>>>>>> Technical Debt in Machine Learning Systems paper
>>>>>> <https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf>
>>>>>> are also very interesting!
>>>>>>
>>>>>> At All Hands I've been hearing more and more about using ML in
>>>>>> production, so these things seem very relevant to us. I'd love it if
>>>>>> we had a working group (or whatever) that focused on how to
>>>>>> standardize how we train and deploy ML for production use.
>>>>>>
>>>>>> :)
>>>>
>>>> --
>>>> Aaron Halfaker
>>>> Principal Research Scientist
>>>> Head of the Scoring Platform team
>>>> Wikimedia Foundation
>>>> _______________________________________________
>>>> Research-Internal mailing list
>>>> [email protected]
>>>> https://lists.wikimedia.org/mailman/listinfo/research-internal

--
-Andrew Lih
Author of The Wikipedia Revolution
US National Archives Citizen Archivist of the Year (2016)
Knight Foundation grant recipient - Wikipedia Space (2015)
Wikimedia DC - Outreach and GLAM
Previously: professor of journalism and communications, American
University, Columbia University, USC
---
Email: [email protected]
WEB: https://muckrack.com/fuzheado
PROJECT: Wikipedia Space: http://en.wikipedia.org/wiki/WP:WPSPACE
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
