What was done
=============
Improved the script that separates the popularity-contest data into clusters
- Use sparse matrices to represent the submissions, specifically the
row-based linked list sparse matrix (LIL) and the compressed sparse row
matrix (CSR)
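A minimal sketch of how the two formats work together (scipy.sparse assumed; the dimensions and indices below are illustrative, not the script's real data):

```python
from scipy.sparse import lil_matrix

# Build the submissions matrix incrementally with a row-based linked
# list (LIL) matrix, which supports cheap element-by-element assignment.
n_submissions, n_packages = 3, 5
matrix = lil_matrix((n_submissions, n_packages), dtype=int)
matrix[0, 1] = 1  # submission 0 has package 1 installed
matrix[1, 3] = 1
matrix[2, 1] = 1

# Convert to compressed sparse row (CSR) once the matrix is fully
# built; CSR is efficient for row slicing and arithmetic.
csr = matrix.tocsr()
print(csr.nnz)  # number of stored (non-zero) entries
```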
- The identification of the package names in each submission was changed:
instead of reading line by line and taking the third field, a regex now
extracts all package names at once, which greatly reduces the runtime
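A hypothetical sketch of the regex approach; the submission body and the assumed line layout ("<atime> <ctime> <package-name> <mru-file>") are illustrative:

```python
import re

# Example popularity-contest submission body (the line layout here is
# an assumption: "<atime> <ctime> <package-name> <mru-file>").
submission = (
    "1472681707 1472681700 vim /usr/bin/vim\n"
    "1472681707 1472681600 gcc /usr/bin/gcc\n"
    "1472681707 1472681500 python /usr/bin/python\n"
)

# One regex pass over the whole text captures every package name,
# instead of splitting each line and taking the third field.
package_regex = re.compile(r'^\d+\s+\d+\s+(\S+)', re.MULTILINE)
packages = package_regex.findall(submission)
print(packages)  # ['vim', 'gcc', 'python']
```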
- Refactored the code to remove all uses of numpy.matrix, relying only on
the sparse matrices
- Use multiple processes to load the popularity-contest submissions and to
generate the clusters with KMeans
- Previously the script could not run to completion with over 12GB of data,
because it consumed all the memory of my PC, which has 8GB. Now the script
runs to completion with the 12GB of data, but to make this work it needs
to call the Python garbage collector explicitly, which decreases the
performance of reading the submissions data
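One way this can look (a sketch, not the script's actual code; process and the collection interval are illustrative):

```python
import gc


def process(path):
    # Hypothetical per-submission work; the real script parses the
    # popularity-contest file here.
    return path.upper()


def read_submissions(paths, collect_every=1000):
    results = []
    for i, path in enumerate(paths):
        results.append(process(path))
        # Explicit garbage collection keeps memory usage bounded, at
        # the cost of slower reading of the submissions data.
        if i % collect_every == 0:
            gc.collect()
    return results


print(read_submissions(['a', 'b', 'c'], collect_every=2))
```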
- Changed the output to save only the packages of each cluster; the
submissions' packages were removed, which reduced the output data to 340K
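A sketch of the cluster-only output idea; the line format and cluster contents below are invented for illustration, the real script's file layout may differ:

```python
# Hypothetical clusters: only each cluster's package list is written
# out, not the packages of every individual submission.
clusters = {0: ['vim', 'gcc'], 1: ['gimp']}

lines = []
for cluster_id, packages in sorted(clusters.items()):
    # One line per cluster: "<cluster-id>: <space-separated packages>"
    lines.append('{}: {}'.format(cluster_id, ' '.join(packages)))

output = '\n'.join(lines)
print(output)
```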
- Milestone:
https://gitlab.com/AppRecommender/AppRecommender/milestones/5
For the next week
=================
- Refactor the code, mainly the part that creates the multiple processes
- Study more about user data security
- Make the AppRecommender strategies work with the new output data format
of the create-clusters script
- Milestone: https://gitlab.com/AppRecommender/AppRecommender/milestones/7