Hi Xun,
I posted my SoC review to Google and I expect the process over there to
move along quickly. Thanks to your work, I think we now have several
interesting things:
- a working parcel to run experiments with
- a set of experiments run on Slashdot feeds showing that auto tagging
after a learning phase is feasible
- a path to incorporate PCA into Chandler (using MDP)
I've been discussing with vikSIT on IRC what the next steps should be.
Eventually, the goal will be to have auto tagging turned on (optionally)
on the trunk, but we won't even see that in Chandler before we have a
tagging feature in (planned for alpha6). In the meantime, there's a set
of short-term things we should do:
1- clean up
http://wiki.osafoundation.org/bin/view/Journal/InternProjectMVA : I
posted a set of comments on this page back in July but they still
haven't been properly incorporated (assuming I was right). I think you
should do that, Xun.
2- update the parcel to run against the 0.7alpha4 trunk code:
right now, it runs against 0.7alpha1, which is rather old. I know that
the egg stuff sort of threw a wrench in your plans but we should get past
that hurdle. Maybe vikSIT or other contributors (Markku?) could help there.
3- incorporate the MDP library: we need to find a place to park it
and start using it. Typically in Chandler we download tarballs and
create a specific Makefile for such projects under /external. That's
what we do for icu, PyLucene, twisted and the like, so it looks like we
should do the same for MDP. Maybe bear can help or give us some
advice here.
4- make the relevant changes in the MVA project to call the MDP library:
the first thing will be to compute the eigenvectors correctly. Right now
the code simply computes an average vector (cumulative, not normalized)
per tag (until we reach a given threshold), then projects the new vectors
onto them. The threshold for accumulation is not grounded in data, the
threshold for attribution (when projecting) is not grounded in data
either, and we run no analysis of which dimensions contribute to the
variance. We need to improve all of that and use the MDP calls to do so.
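To make the discussion of the current approach concrete, here is a
rough sketch in plain Python of the scheme described above: one
average vector per tag, projection of a new vector onto each of them,
and an attribution threshold. It is not the parcel's actual code and
it does not call MDP. All names (centroid, suggest_tags,
variance_share), the toy data and the 0.9 threshold are made up for
illustration, and the projection is normalized (cosine), which the
current code does not do. The last function only measures variance
along the original dimensions; a PCA would measure it along the
eigenvectors instead.

```python
import math

def centroid(vectors):
    """Average the term vectors accumulated for one tag."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(u, v):
    """Normalized projection of u on v; 0.0 if either vector is null."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0

def suggest_tags(vector, centroids, threshold):
    """Return the tags whose average vector the new vector projects onto
    at or above the (hypothetical) attribution threshold."""
    return sorted(tag for tag, c in centroids.items()
                  if cosine(vector, c) >= threshold)

def variance_share(vectors):
    """Fraction of the total variance carried by each original dimension."""
    n = len(vectors)
    columns = list(zip(*vectors))
    means = [sum(col) / n for col in columns]
    variances = [sum((x - m) ** 2 for x in col) / n
                 for col, m in zip(columns, means)]
    total = sum(variances)
    return [v / total if total else 0.0 for v in variances]

# Hypothetical toy data: three term counts per item, grouped by tag.
training = {
    "linux": [[4, 0, 1], [6, 0, 1]],
    "games": [[0, 5, 1], [0, 3, 1]],
}
centroids = {tag: centroid(vs) for tag, vs in training.items()}
all_vectors = [v for vs in training.values() for v in vs]

print(suggest_tags([3, 0, 1], centroids, threshold=0.9))
print(variance_share(all_vectors))
```

On this toy data the third dimension is constant, so it carries no
variance at all. That is the kind of information we currently ignore
when choosing thresholds.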
For the time being, we'll continue to use sandbox/xluo as a repository
for this code. Eventually, we'll want to move it to chandler/projects
once we prove that it's worthwhile but, for the moment, it would be a
drag on the project to maintain that code on the trunk (it would be
subject to all the QA/build/release engineering constraints we have for
everything that lands on the trunk) and most of the would-be
contributors don't have svn trunk privileges anyway. However, working
with patches from several sandboxes would be annoying, so I'm proposing
that we consider sandbox/xluo the official repo for MVA.
Also, I propose we continue with the Slashdot experiment you started.
Clearly, we'll have to outgrow it at some point, but I see no
reason to do that as long as we don't have a better MVA model.
What do you guys think?
BTW, just as a roll call, who on this list would be interested in
contributing to this project going forward?
Cheers,
- Philippe
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
Open Source Applications Foundation "chandler-dev" mailing list
http://lists.osafoundation.org/mailman/listinfo/chandler-dev