Re: [Chandler-dev] An another push on MVA based automatic tagging

Markku Mielityinen Mon, 25 Sep 2006 09:08:32 -0700

Hi Viksit,

>> I will be happy to get suggestions and patches from you once I get the
>> initial release ready.
>
> Great. So if I interpret the current work in progress rightly - you're
> going to be releasing Xun's code into one of the releases, which can
> then be used as a platform for further enhancements to the project?

Xun's work is specific to the feeds parcel, and we aim to go Chandler-wide right from the start. As the number of affected code lines in Xun's parcel is also small, I can see little value in porting it to newer Chandler versions. Instead, I plan to go for the real thing as soon as possible.


>> I have opened a new discussion in this forum for the selection of a
>> computational platform. I suggested using SciPy (or just NumPy) and as
>> there has not been any feedback I think that we are going to include
>> that into our Chandler distribution. Implementing the necessary MVA
>> operations is an easy task (and we can always use other complementary
>> libraries if necessary).
>
> Right, I meant one of the above. So the final thing here would be to
> include SciPy into Chandler.

I would say that this is one of the first things to do ;)

>> MY CURRENT TODO LIST:

> BTW - I'm looking at some stuff here right now, but is there anything in
> specific which I might look at?
>

Learn to use Lucene and PyLucene. We plan to use it, at least to some degree, in this project. It is an interesting open source project with applications outside ours as well, so you cannot lose. Also, you might want to refresh your statistical skills, such as how to perform LDA, QDA, PCA, etc. with matrix operations (you will learn to like singular value decomposition here).
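
To illustrate the kind of matrix operation meant here, the following is a minimal sketch of PCA computed through singular value decomposition with NumPy. The function name `pca_svd` and its return values are assumptions made for this example only; nothing here is taken from Chandler or from any existing project code.

```python
import numpy as np


def pca_svd(X, n_components):
    """Illustrative PCA via SVD (hypothetical helper, not Chandler code).

    X is an (n_samples, n_features) array. Returns the projected scores,
    the principal axes, and the variance explained by each kept axis.
    """
    X = np.asarray(X, dtype=float)
    # PCA is defined on mean-centered data.
    Xc = X - X.mean(axis=0)
    # Thin SVD: Xc = U @ diag(s) @ Vt, singular values in descending order.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Rows of Vt are the principal axes (orthonormal directions).
    components = Vt[:n_components]
    # Coordinates of each sample along the kept axes.
    scores = Xc @ components.T
    # Sample variance along each axis follows from the singular values.
    explained_variance = s[:n_components] ** 2 / (X.shape[0] - 1)
    return scores, components, explained_variance
```

The SVD route avoids forming the covariance matrix explicitly, which tends to be numerically better behaved when there are many features relative to samples.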


>> [preliminary work]

>> 1) Learn how to use PyLucene (I am currently reading a book: Lucene in
>> Action).
>

> Right, I've been experimenting with PyLucene myself. Got that book too,
> in fact.

Great. As I already said, now is the time to complete one's training in the use of this library.


>> 2) Obtain a real world data set with tags (Philippe has agreed to
>> prepare one).
>
> I see. Could you elaborate on this point a bit? By real world dataset
> with tags - you mean already tagged email with relevant content that we
> would use to train the system?

Yes. We plan to make an empirical data set that contains over a thousand real world emails with human-assigned tags. This data set is then partitioned into train, test, and validation data sets, with bootstrapping, to perform statistical inference. This is the only way to assess how the developed system will perform in a real world setting. Simulated data sets tend to yield misleading results...
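
A rough sketch of the partitioning and bootstrapping idea, using only the Python standard library, could look like the following. The function names, the 60/20/20 split, and the percentile interval are all assumptions chosen for illustration; the actual procedure for the project has not been decided.

```python
import random


def split_dataset(items, train_frac=0.6, test_frac=0.2, seed=0):
    """Shuffle items and cut them into train, test, and validation lists.

    The fractions are illustrative defaults; whatever is left after the
    train and test portions becomes the validation set.
    """
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_test = round(len(shuffled) * test_frac)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]
    return train, test, validation


def bootstrap_mean_interval(values, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of values.

    Each resample draws len(values) elements with replacement.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        sample = [rng.choice(values) for _ in values]
        means.append(sum(sample) / len(sample))
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

For example, `values` could be a list of 0/1 outcomes recording whether the predicted tag matched the human-assigned tag on each test email, in which case the interval gives a rough uncertainty range for tagging accuracy.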

>> 3) Implement necessary MVA operations (model building, clustering, etc).
>> 4) Play with the empirical data to see how well the system actually
>> works (I am going to make a feasibility study).
>> 5) Get tagging implemented in Chandler (I need to discuss this
>> with Grant next week when he is back here in the office).
>
> Right.
>
>> 6) Select a computational platform that is capable of matrix
>> operations and have it included into our Chandler distribution (I will
>> discuss this with Bear and Heikki next week).
>> [after 1-6 have been taken care of]

>> 7) Decide the best way to implement automatic tagging functionality in
>> Chandler.
>> 8) Make all the necessary changes to repository schema, GUI, etc...
>

>> There is plenty of work to do so I really hope that we get a quick start
>> on items 1-6.
>

> Indeed. The sooner, the better :) I'd be happy to discuss further on IRC
> or email about which tasks in particular need to be taken care of right
> now.

I appreciate your offer and will pitch ideas to you as soon as there is something to start working with. For example, once we get a data set and a first version of the system working, I am sure that there will be a lot of work in evaluating different models and in fine-tuning their parameters. This is also the most fun part of the work ;) At the moment it is the lack of a data set that is holding us back. On Monday I will open a new wiki page for the project, on which I will start to document all developments as they materialize. You should ask Philippe whether it is possible for you to have write access to this wiki page, as this would make our collaboration so much easier.


Cheers,
   Markku
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Open Source Applications Foundation "chandler-dev" mailing list
http://lists.osafoundation.org/mailman/listinfo/chandler-dev
