FYI, since most of the movement in image classification seems to have been in GLAM/arts, I thought folks might find this interesting: in the GLAM world, the Barnes Foundation has been a leader, as their collections management has been unconventional, based more on image similarity than on traditional taxonomic classification.
You may find their observations on experimenting with machine learning interesting:
https://medium.com/barnes-foundation/using-computer-vision-to-tag-the-collection-f467c4541034
https://github.com/BarnesFoundation/barnes-tms-extract/blob/master/DATASCIENCE.md
https://collection.barnesfoundation.org/
http://www.attractionsmanagement.com/index.cfm?pagetype=news&codeID=338394

On Tue, Feb 12, 2019 at 6:25 AM Andrew Lih <[email protected]> wrote:

> FYI, folks might be interested in what we've been doing with The Met
> Museum in NYC and machine learning. Writeup in the latest GLAM newsletter:
>
> https://outreach.wikimedia.org/wiki/GLAM/Newsletter/January_2019/Contents/USA_report
>
> TL;DR - Andrew worked with Jennie Choi, The Met's General Manager of
> Collection Information, and Nina Diamond, Managing Editor and Producer,
> along with Microsoft researchers Patrick Buehler, J.S. Tan and Sam Kazemi
> Nafchi to train a machine learning model on Microsoft Azure that could
> predict labels for artworks. Using the Met's roughly 1,000-word art
> vocabulary and representative images to help train the model, a
> proof-of-concept app was developed at the hackathon. The results were
> impressive enough that Andrew finished the creation of a Wikidata
> Distributed Game, Depicts, to connect the subject keyword recommendations
> to Wikidata.
>
> -Andrew
>
> On Thu, Feb 7, 2019 at 2:06 PM Nuria Ruiz <[email protected]> wrote:
>
>> Team,
>>
>> Since everyone is here, we will be working on a machine learning
>> infrastructure program this year. I will set up meetings with everyone on
>> this thread and some others in SRE and Audiences to get a "bag of
>> requests" of things that are missing. The first round of talks, which I
>> hope to finish next week, is to hear what everyone's requests and ideas
>> are. I will be sending meeting invites today and tomorrow. I think some
>> themes will emerge from those.
>> Thus far, it is pretty clear that we need a better way to deploy models
>> to production (right now we deploy them to Elasticsearch in very crafty
>> ways, for example), we need an answer to the GPU issues around training
>> models, we need a "recommended way" in which we train and compute, we
>> need some unified system for tracking models + data + tests, and
>> finally, there are probably many learnings from the work done on ORES
>> thus far.
>>
>> Thanks,
>>
>> Nuria
>>
>> On Thu, Feb 7, 2019 at 8:40 AM Miriam Redi <[email protected]> wrote:
>>
>>> Hey Andrew!
>>>
>>> Thank you so much for sharing this and starting this conversation. We
>>> had a meeting at All Hands with all the people interested in "Image
>>> Classification" (https://phabricator.wikimedia.org/T215413), and one of
>>> the open questions was exactly how to find a "common repository" for ML
>>> models that different groups and products within the organization can
>>> use. So, please, count me in!
>>>
>>> Thanks,
>>>
>>> M
>>>
>>> On Thu, Feb 7, 2019 at 4:38 PM Aaron Halfaker <[email protected]>
>>> wrote:
>>>
>>>> Just gave the article a quick read. I think this article pushes on
>>>> some key issues for sure. I definitely agree with the focus on
>>>> Python/Jupyter as essential for a productive workflow that leverages
>>>> the best from research scientists. We've been thinking about what ORES
>>>> 2.0 would look like, and event streams are the dominant proposal for
>>>> improving on the limitations of our queue-based worker pool.
>>>>
>>>> One of the nice things about ORES/revscoring is that it provides a
>>>> nice framework for operating using the *exact same code* no matter the
>>>> environment. E.g. it doesn't matter if we're calling out to an API to
>>>> get data for feature extraction or providing it via a stream. By
>>>> investing in a dependency injection strategy, we get that flexibility.
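[Editor's note: a minimal sketch of the dependency-injection idea Aaron describes above. All names here (`chars_added`, `api_source`, `mw_api`, etc.) are illustrative stand-ins, not revscoring's actual API; the point is only that the extraction code stays identical whatever datasource is injected.]

```python
from typing import Callable, Dict

# A "datasource" is anything that returns the raw values features need.
# Injecting it means the extraction code is identical whether the values
# come from a live API call or from a consumed event stream.
Datasource = Callable[[int], Dict]  # rev_id -> raw revision data

def chars_added(raw: Dict) -> int:
    return max(0, len(raw["text"]) - len(raw["parent_text"]))

def is_anon(raw: Dict) -> bool:
    return raw.get("user_id") is None

FEATURES = {"chars_added": chars_added, "is_anon": is_anon}

def extract(rev_id: int, source: Datasource) -> Dict:
    """Run every feature against whatever datasource was injected."""
    raw = source(rev_id)
    return {name: fn(raw) for name, fn in FEATURES.items()}

# Environment 1: an API-backed source (hypothetical client call).
def api_source(rev_id: int) -> Dict:
    # e.g. return mw_api.get_revision(rev_id) in a real deployment
    return {"text": "hello world", "parent_text": "hello", "user_id": None}

# Environment 2: a stream-backed source fed by previously consumed events.
stream_cache = {42: {"text": "hello world", "parent_text": "hello", "user_id": None}}
def stream_source(rev_id: int) -> Dict:
    return stream_cache[rev_id]

# Same extraction code, different injected source:
print(extract(42, api_source) == extract(42, stream_source))  # True
```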
So to me, the
>>>> hardest problem -- the one I don't quite know how to solve -- is how
>>>> we'll mix and merge streams to get all of the data we want available
>>>> for feature extraction. If I understand correctly, that's where Kafka
>>>> shines. :)
>>>>
>>>> I'm definitely interested in fleshing out this proposal. We should
>>>> probably be exploring processes for training new types of models (e.g.
>>>> image processing) using different strategies than ORES. In ORES, we're
>>>> almost entirely focused on using sklearn, but we have some basic
>>>> abstractions for other estimator libraries. We also make some strong
>>>> assumptions about running on a single CPU that could probably be
>>>> broken for some performance gains using real concurrency.
>>>>
>>>> -Aaron
>>>>
>>>> On Thu, Feb 7, 2019 at 10:05 AM Goran Milovanovic
>>>> <[email protected]> wrote:
>>>>
>>>>> Hi Andrew,
>>>>>
>>>>> I have recently started a six-month AI/Machine Learning Engineering
>>>>> course which focuses on exactly the topics that you've shown interest
>>>>> in.
>>>>>
>>>>> So,
>>>>>
>>>>> >>> I'd love it if we had a working group (or whatever) that focused
>>>>> on how to standardize how we train and deploy ML for production use.
>>>>>
>>>>> Count me in.
>>>>>
>>>>> Regards,
>>>>> Goran
>>>>>
>>>>> Goran S. Milovanović, PhD
>>>>> Data Scientist, Software Department
>>>>> Wikimedia Deutschland
>>>>>
>>>>> ------------------------------------------------
>>>>> "It's not the size of the dog in the fight,
>>>>> it's the size of the fight in the dog."
>>>>> - Mark Twain
>>>>> ------------------------------------------------
>>>>>
>>>>> On Thu, Feb 7, 2019 at 4:16 PM Andrew Otto <[email protected]> wrote:
>>>>>
>>>>>> Just came across
>>>>>> https://www.confluent.io/blog/machine-learning-with-python-jupyter-ksql-tensorflow
>>>>>>
>>>>>> In it, the author discusses some of what he calls the "impedance
>>>>>> mismatch" between data engineers and production engineers.
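[Editor's note: a toy sketch of the stream mixing-and-merging problem Aaron raises above -- buffering partial events per `rev_id` and emitting a merged record once every topic has contributed, with plain Python lists standing in for Kafka topics. The topic names (`revision-create`, `page-meta`) and event fields are hypothetical, not actual Wikimedia stream schemas.]

```python
from typing import Dict, Iterator, List, Set, Tuple

def merge_streams(events: List[Tuple[str, dict]]) -> Iterator[dict]:
    """Merge interleaved (topic, event) pairs keyed by rev_id.

    Emits one combined record per rev_id once an event from every topic
    has been seen for it -- a stand-in for a keyed Kafka stream join.
    """
    topics = {topic for topic, _ in events}
    pending: Dict[int, dict] = {}       # partially assembled records
    seen: Dict[int, Set[str]] = {}      # which topics contributed so far
    for topic, ev in events:
        rid = ev["rev_id"]
        pending.setdefault(rid, {"rev_id": rid}).update(
            {k: v for k, v in ev.items() if k != "rev_id"})
        seen.setdefault(rid, set()).add(topic)
        if seen[rid] == topics:
            yield pending.pop(rid)      # record complete: emit it

interleaved = [
    ("revision-create", {"rev_id": 42, "text": "hello world"}),
    ("revision-create", {"rev_id": 43, "text": "lorem"}),
    ("page-meta",       {"rev_id": 42, "namespace": 0}),
]
for record in merge_streams(interleaved):
    print(record)  # {'rev_id': 42, 'text': 'hello world', 'namespace': 0}
```

(rev_id 43 never completes, so it is never emitted; a real stream join would also need windowing and expiry for such stragglers, which Kafka Streams provides.)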
The links to
>>>>>> Uber's Michelangelo <https://eng.uber.com/michelangelo/> (which as
>>>>>> far as I can tell has not been open sourced) and the Hidden
>>>>>> Technical Debt in Machine Learning Systems paper
>>>>>> <https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf>
>>>>>> are also very interesting!
>>>>>>
>>>>>> At All Hands I've been hearing more and more about using ML in
>>>>>> production, so these things seem very relevant to us. I'd love it if
>>>>>> we had a working group (or whatever) that focused on how to
>>>>>> standardize how we train and deploy ML for production use.
>>>>>>
>>>>>> :)
>>>>
>>>> --
>>>> Aaron Halfaker
>>>> Principal Research Scientist
>>>> Head of the Scoring Platform team
>>>> Wikimedia Foundation
>>>> _______________________________________________
>>>> Research-Internal mailing list
>>>> [email protected]
>>>> https://lists.wikimedia.org/mailman/listinfo/research-internal

--
-Andrew Lih
Author of The Wikipedia Revolution
US National Archives Citizen Archivist of the Year (2016)
Knight Foundation grant recipient - Wikipedia Space (2015)
Wikimedia DC - Outreach and GLAM
Previously: professor of journalism and communications, American
University, Columbia University, USC
---
Email: [email protected]
WEB: https://muckrack.com/fuzheado
PROJECT: Wikipedia Space: http://en.wikipedia.org/wiki/WP:WPSPACE
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
