AbdealiJK created this task.
Herald added subscribers: pywikibot-bugs-list, Aklapper.

TASK DESCRIPTION
  Port catimages.py -> pywikibot-core
  ===================================
  
  Personal Information
  --------------------
  
  1. Name : Abdeali J Kothari
  2. Email : [email protected]
  3. Github : https://github.com/AbdealiJK
  4. IRC nick : AbdealiJK
  5. Time Zone : UTC+5:30 (IST - India)
  6. Typical working time : 9:00am to 6:00pm IST
  7. Location : Chennai, India
  8. School / Degree : B.Tech. Engineering Physics at Indian Institute of 
Technology, Madras. Expected to graduate in August 2016
  
  Abstract
  --------
  
  The aim of the project is to bring to life the catimages.py script from 
pywikibot-compat. This involves heavy refactoring of the script. While doing 
this refactoring, it’d be useful to modularize the script and make it a generic 
package. The generic package can then be used in pywikibot to provide the same 
functionality as the script provided earlier.
  
  - **Possible Mentors** : DrTrigon (@DrTrigon), John Vandenberg (@jayvdb)
  - **Languages used** : Python and C/C++ for opencv dependency
  - **Related phabricator issue** : https://phabricator.wikimedia.org/T66838
  
  Project Description
  -------------------
  
  Motivation
  ----------
  
  **To wikimedia**: catimages.py automates the categorization of images. It’s 
an invaluable tool to have, and it can be extremely accurate considering the 
recent innovations in Computer Vision (CV). Using catimages.py we can give 
uploads more meaningful metadata to work with.
  
  **To pywikibot**: We can bring back automated categorization without manual 
intervention to pywikibot with this project. Pywikibot already has the scripts 
imagerecat.py and checkimages.py. imagerecat doesn’t work right now as it uses 
CommonSense from wikisense which is dead 
(https://phabricator.wikimedia.org/T60869#1365653). And checkimages requires 
manual input. As catimages attempts to categorize without any prior information 
(using just the file itself), if done right, it would be easier to use.
  
  **To the rest of the world**: Wikimedia is all about providing more data to 
everyone. One awesome outcome of moving catimages.py to pywikibot-core is that 
all the dependencies either become external pypi packages or get moved upstream 
to other packages. This is really good as everyone can use them! For example, 
the pycolorname code that I worked on in the microtask is now the standalone 
package <https://github.com/AbdealiJK/pycolorname> - which is a great resource 
for developers handling color. Even non-developers can refer to the charts 
<http://abdealijk.github.io/pycolorname/#/chart/pycolorname.pantone.pantonepaint.json>
 generated by it.
  
  Implementation
  --------------
  
  The project can be broadly divided into 2 parts. The first involves checking 
the dependencies, updating the code to replace deprecated deps, and general 
refactoring. The second part involves optimizing the script to work better and 
faster.
  
  To explain the first part, we need to take a look at all the libraries that 
catimages.py can handle (Refer to the table at 
https://phabricator.wikimedia.org/T66838). As can be seen in the table, the 
following needs to be done to the dependencies:
  
  - Some packages are usable directly
  - Some need to be patched upstream
  - Some need to be packaged and uploaded to pypi
  - Some need to be replaced as they are deprecated
  
  The second part of optimizing the script catimages.py does not have a 
concrete game plan right now. This will be decided over the course of the 
project. I have worked on similar domains at my university. Possible algorithms 
that could be used:
  
  - R-CNN <http://arxiv.org/abs/1311.2524> (2014, Girshick et al.); this may 
be a good idea for faster machines.
  - SPP-net <http://arxiv.org/abs/1406.4729> (2015, Shaoqing Ren et al.) is a 
faster version of R-CNN but it takes longer to train. This may be beneficial in 
certain scenarios though.
  - LCKSVD <http://www.umiacs.umd.edu/~lsd/papers/CVPR2011_LCKSVD_final.pdf> is 
another algorithm I’ve used in the past. Its results were not so great, but it 
works very fast.
  
  One simple idea would be to provide a good interface with VLFeat 
<http://www.vlfeat.org/> - a library of Computer Vision algorithms specializing 
in image understanding. We can integrate VLFeat into the package and use the 
common algorithms directly. Although these algorithms are old (most were 
published in 2012 or earlier), they are useful and the library itself is 
flexible. This has the benefit that we will not have to maintain the 
algorithm-specific code. Also, VLFeat, which was created to maintain a set of 
these algorithms, will probably be better at it.
  This may not only be about the algorithm, but also about using other data 
sets. Specifically, I would like to try using the ILSVRC (or ImageNet) dataset 
to improve classification accuracy. The benefit of using this dataset is that 
it is an order of magnitude larger than Pascal (a detailed comparison can be 
found here <http://image-net.org/challenges/LSVRC/2013/>), implying we have 
more supervised training data.
  
  Timeline
  --------
  
  Community Bonding (April 22 - May 22)
  -------------------------------------
  
  The three major aspects of the project that should be completed here are:
  
  1. Hack catimages.py (and deps) to be usable without pywikibot on a personal 
github repository. This is to get the basic functionality usable and testable.
  2. Understand the math behind catimages.py. The algorithm and logic used 
in the script can be refactored to use newer methods from opencv, sklearn, and 
scipy which are now optimized and more stable.
  3. Find methods to optimize catimages.py using new techniques.
  
  Week 1 (May 23 - May 29)
  ------------------------
  
  The aim in Week 1 is to handle all the modules which have patches. The reason 
to do this first is that it involves 3rd party developers who may take time 
to reply if upstream changes are needed. This includes the modules: bob, jseg, 
music21, xbob-flandmark.
  What needs to be done is to find out why the patches have been applied, 
identify whether the patches are still needed, and make the appropriate changes 
upstream or inside catimages.py. Interestingly, most of these patches have 
small changes which simply add paths to sys.path or specify args. Hence, this 
can be solved within a week.
  
  Week 2 (May 30 - June 5)
  ------------------------
  
  The aim for Week 2 is to replace unsupported packages and make pypi packages 
for packages which currently rely on archives (like .zip). Unsupported packages 
are: zbar and pyexiv2. Archives are: yaafelib, slic, jseg, and jseg/jpeg-6b.
  We need to move these into github repositories and deploy them to pypi if 
needed, and then update catimages.py to use their new versions.
  
  Week 3 (June 6 - June 12)
  -------------------------
  
  In week 3, it would be a good idea to pause and review the code written so 
far, fixing any minor issues that have come up. This is also a buffer week in 
case any work from earlier weeks is still pending. I’d also like to write a 
blog post about my experience so far.
  
  Week 4 (June 13 - June 19)
  --------------------------
  
  In week 4, the aim is to begin revamping the opencv code. First off, cv is 
deprecated and cv2 needs to be used instead. Also, the latest cv2 python 
bindings are not backward compatible, so the code needs to be updated. This 
would give me a good understanding of the Computer Vision part in catimages.py.
  
  Week 5 - Midterm Evaluation (June 20 - June 26)
  -----------------------------------------------
  
  The opencv package is a little problematic. It has custom C++/C code (lots of 
it) which needs to be refactored. Hence, I’ll be using my knowledge (from last 
week and community bonding) about how sklearn and OpenCV can clean up the 
opencv package.
  For the midterm evaluation I plan to have a pypi package which can perform 
all the functions that catimages.py could. This would be independent of 
pywikibot-core, probably just as a simple package. Let’s call this pypi package 
pypi-catimages to avoid confusion. Note: Suggestions for the name are welcome :)
  
  Week 6 (June 27 - July 3)
  -------------------------
  
  By Week 6 I plan to begin integrating the previously created pypi package 
with pywikibot. This would be to use cli arguments from pywikibot, use the 
page-generators in core, and update all pywikibot related functionality.
  I am not yet sure where this script will live. Possible approaches are:
  
  - pywikibot-core and pypi-catimages are requirements for another pypi package 
pywikibot-catimages for it to be used with mediawiki.
  - A script is created in pywikibot-core which uses the pypi-catimages and 
handles the args required for the script to be used with mediawiki.
  
  The outcome here would be a Pull Request / Patch Set which adds functionality 
to interface with pywikibot to the pypi-catimages.
  
  Week 7 (July 4 - July 10)
  -------------------------
  
  In Week 7, again, a pause for testing and review of code being pushed to 
pywikibot-core. I would want to write unit tests here and make sure the PR gets 
reviewed and accepted. This would be a good time to make a blog post about how 
catimages.py was ported so that other scripts that still reside in compat can 
refer to the methods I used. This is also a buffer week. In case one of the 
above weeks did not go according to plan, this is the week to fix this problem.
  
  Week 8 (July 11 - July 17)
  --------------------------
  
  Week 8 marks the beginning of the second part of GSoC and the second stage of 
this project too. Here, I would like to do some research on computer vision. 
Probably recent literature on classification. It would be good to use some 
newer CV techniques with various pros/cons. We could bundle this into 
pypi-catimages and allow the user to choose which algorithm to use.
  As mentioned earlier, this may not only be about the idea, but also using 
other data sets like ILSVRC to improve classification accuracy.
  
  Week 9 (July 18 - July 24)
  --------------------------
  
  In Week 9, I’ll be implementing things based on the above analysis, and 
forming a generic interface to use alternative algos or datasets in 
pypi-catimages.
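  
  One possible shape for such a generic interface is a small registry of 
classifier classes that the user selects by name. This is only a sketch under 
my own assumptions: the names Classifier, register, get_classifier and the 
"dummy" backend are hypothetical and do not exist in catimages.py today.

```python
from abc import ABC, abstractmethod

# Hypothetical registry: maps a user-facing name to a classifier class.
_CLASSIFIERS = {}


def register(name):
    """Class decorator that makes a classifier selectable by name."""
    def decorator(cls):
        _CLASSIFIERS[name] = cls
        return cls
    return decorator


class Classifier(ABC):
    """Common interface every algorithm backend would implement."""

    @abstractmethod
    def classify(self, image_path):
        """Return a dict mapping category name to a score in [0, 1]."""


@register("dummy")
class DummyClassifier(Classifier):
    """Placeholder backend, standing in for a real CV algorithm."""

    def classify(self, image_path):
        return {"Unidentified": 1.0}


def get_classifier(name):
    """Instantiate the backend registered under the given name."""
    try:
        return _CLASSIFIERS[name]()
    except KeyError:
        raise ValueError("unknown classifier %r; choose from %s"
                         % (name, sorted(_CLASSIFIERS)))
```

  With this layout, adding a new algorithm or a model trained on another 
dataset would only mean registering one more class; the calling script would 
not change.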
  
  Week 10 (July 25 - July 31)
  ---------------------------
  
  In Week 10, once again it’s time for a buffer week. I’d love to create a blog 
post at this time about the different algorithms I have researched in the past 
2 weeks. Along with this, reviews, documentation, and tests will be needed to 
get this part merged.
  
  Week 11 (Aug 1 - Aug 7)
  -----------------------
  
  In Week 11, as I am not sure if I will be as free as in the earlier weeks 
(See section on other commitments), I’d like to improve the usability of the 
pywikibot interface to catimages.py - which does not require as much math. A 
few things that can be implemented are:
  
  - Prompt the user for what category should be used if the bot finds that it 
is not sure of the category, i.e. no category has a considerably higher 
probability than the rest.
  - Log results appropriately for corner cases, so that the user can see all 
the results after a bulk run and verify them later.
  - Allow the user to give some hints on the category (maybe a list of probable 
categories) and ignore other categories even if they rank higher than these.
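  
  The prompting and hinting ideas above could be sketched as a single decision 
function. Everything here is an assumption for illustration: the function name 
choose_category, the dict-of-probabilities input, and the margin of 0.2 used 
to decide what "considerably higher" means.

```python
def choose_category(scores, hints=None, margin=0.2):
    """Pick a category, or return None when the user should be asked.

    scores: dict mapping category name -> probability from the classifier.
    hints:  optional iterable of allowed categories; all others are ignored,
            even if they rank higher.
    margin: hypothetical threshold by which the best score must beat the
            runner-up to count as "considerably higher".
    """
    if hints is not None:
        allowed = set(hints)
        scores = {cat: p for cat, p in scores.items() if cat in allowed}
    if not scores:
        return None
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) == 1:
        return ranked[0][0]
    (best, best_p), (_, second_p) = ranked[0], ranked[1]
    if best_p - second_p >= margin:
        return best
    # Too close to call: the caller should prompt the user or log the case.
    return None
```

  A caller would prompt the user (or write a log entry during a bulk run) 
whenever the function returns None.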
  
  Week 12 (Aug 8 - Aug 14)
  ------------------------
  
  Week 12 is meant to be a buffer week, in case something did not go as 
planned. This is to complete any backlogs, finish up unit-tests, add more 
documentation, etc. One key thing I’d like to work on in this week is 
documentation on how to setup catimages.py and how to use it. It would be good 
to provide more information for future contributors.
  
  Week 13 (Aug 15 - Aug 23)
  -------------------------
  
  Week 13 is also meant to be a buffer week. All pending documentation and 
tests will be completed during this time.
  I’ll also be writing my final blog post and wikitech-l mail before the 
deadline sometime this week.
  
  Extra time:
  -----------
  
  In the event that I have extra time because of the project going exceedingly 
well (I can hope, can’t I? :D), I’d like to work on testing. I’ve seen posts on 
the mailing list about unit-tests failing and there’s a tracking issue about 
the unit tests and integration tests (https://phabricator.wikimedia.org/T67192 
and https://phabricator.wikimedia.org/T72336). Testing is something that I’ve 
not done enough of and think it’s an avenue where I can learn and help the 
community simultaneously. Another thing I was interested in is the python3 
support.
  
  Current Experience
  ------------------
  
  1. Have set up the mediawiki core and pywikibot-core on my local machine.
  2. Familiarity with code and coding conventions in pywikibot.
  3. Worked on the micro tasks: https://phabricator.wikimedia.org/T76211 
related to catimages.py, https://phabricator.wikimedia.org/T67192 related to 
pywikibot-core.
  4. Adept in python. Refactored pycolorname (gerrit 
<https://gerrit.wikimedia.org/r/#/admin/projects/pywikibot/pycolorname> -> 
github <https://github.com/AbdealiJK/pycolorname>) to use classes and work with 
python 2 & 3.
  5. Good familiarity with continuous integration - set up circleci 
<https://circleci.com/> in pycolorname's new github repo 
<https://github.com/AbdealiJK/pycolorname>.
  6. Good understanding of setuptools and pip - set up pypi deployment 
<https://pypi.python.org/pypi/pycolorname> in pycolorname. Also set up 
automated nightly releases using rultor <http://www.rultor.com/>.
  7. Basic understanding of Computer Vision, Math and Data Analytics (through 
courses at my university).
  About Me
  --------
  
  I am a fourth year undergraduate from Indian Institute of Technology, Madras. 
I have been passionate about programming and development since High School, 
and got introduced to the world of FLOSS, which made me interested in 
contributing. I have been involved in a few hackathons held in my college. My 
enthusiasm for developing FLOSS was intensified after interacting with Richard 
Stallman, the founder of the Free Software Movement, who was in my college in 
2014.
  I started my journey in FLOSS by hacking with coala, and got introduced to 
the gnome community through it. I participated in GSoC with them and had an 
amazing experience with the folks there. I’ve been interested in the wikimedia 
community since a friend of mine (Kunal Grover) did his GSoC with wikimedia. 
Hence, I wish to participate and get to know the awesome community that created 
the website I used most in the 4 years at my university :)
  I always try to stay logged into IRC (Channels: mediawiki, wikimedia-dev and 
pywikibot) during my working hours and will try to contribute back to the 
community as much as I can (code, documentation, and IRC). I am regular in 
replying to emails, hangout chats, phabricator, gerrit, and IRC (as long as my 
name is mentioned in the IRC chat). All source code written by me will be 
regularly published, reviewed and improved. I will keep my mentors updated 
about the progress through e-mails, IRC, or phabricator. All discussions 
regarding design and implementation will be public.
  
  Other commitments (May 23 to August 23)
  ---------------------------------------
  
  I am currently in my final semester and have 3 courses going on. My final 
thesis viva is tentatively around May 15 and I will have no course work after 
that. 
  Other than this, I have no plans for the summer. I propose to spend about 
35-40 hours every week on this project. Towards the end of the GSoC period I 
may begin work for my job after university. The dates for this have not been 
fixed, and I will plan appropriately to not affect my GSoC in any way. 
Obviously, these times are subject to the project status, and extra time to 
meet deadlines will not be an issue from my side.
  I will be active between 9:00am and 5:00pm, with an hour break sometime in 
between. These working times are complementary to the time my mentors are 
active.

TASK DETAIL
  https://phabricator.wikimedia.org/T129611

To: AbdealiJK
Cc: Aklapper, pywikibot-bugs-list, AbdealiJK, jayvdb, DrTrigon, tahteche, 
Lethexie, droid, Jay8g


