> I'd be interested to know how you made the "PCA work for predefined tags".
First, I have to say that the feed plugin in Chandler 0.7alpha3 is very different from its counterpart in 0.6.1, so all my work is on the 0.6.1 code.
1. Detailed implementation of classification for predefined tags. 
1.1 Obtaining labeled data
    To make ContentItem tagging a supervised learning task, I chose the feed channel, with slashdot.org news as input. The category (more accurately, the 'section') of each article is used as the pre-defined label.
    Two changes were made to the Chandler feed channel implementation to achieve this. The code is based on version 0.6.1.
     a) Although Chandler displays the 'category' of each article, it is actually (incorrectly) mapped to the 'subject' field of the feed. I changed this to the 'section' field, so it better reflects the slashdot.org categories.
     b) The original feed channel does the retrieval only once, so the feed articles are limited to 10-20. I added a time delay to the fetching code so it could retrieve up to 300 articles (a sketch of the idea appears after the file list below).
    After these modifications, I was able to get 300 articles across 5 pre-defined categories (please refer to slashdot.org for its categorization).
    file modified:
           parcel/feeds/block.py
           parcel/feeds/channel.py
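Here is a minimal sketch of the repeated-fetch idea in (b), not the actual 0.6.1 patch; the use of the feedparser library, the feed URL, the interval, and the field names are my own illustrative assumptions:

    # Sketch only -- not the actual Chandler 0.6.1 code.  Assumes the
    # third-party 'feedparser' library; FEED_URL, FETCH_INTERVAL and the
    # 300-article cap are illustrative values.
    import time
    import feedparser

    FEED_URL = "http://rss.slashdot.org/Slashdot/slashdot"  # hypothetical
    FETCH_INTERVAL = 30 * 60    # seconds between fetches
    TARGET_COUNT = 300

    seen = {}                   # guid -> (title, section)
    while len(seen) < TARGET_COUNT:
        feed = feedparser.parse(FEED_URL)
        for entry in feed.entries:
            guid = entry.get("id", entry.get("link"))
            # slashdot publishes the category in its 'slash:section' element
            section = entry.get("slash_section", "unknown")
            seen.setdefault(guid, (entry.title, section))
        time.sleep(FETCH_INTERVAL)  # wait for new articles to appear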

1.2 Building the category vectors
   
It is fortunate that Lucene has an API for term vector access (please refer to the 'Lucene in Action' book, chapter 5, for details about term vectors). This satisfies a critical need for PCA. A term vector is a tuple of {'term', 'frequency'} pairs and maps one-to-one to each document. It contains all the information we need for the PCA calculation; only the format is a little different from a regular matrix row representation (think of each vector as a compact representation of a matrix row holding only the non-zero elements). The 300 articles can be represented in the form below, greatly simplified of course:
    Article 1:  { {'ipod', 2}, {'california', 1}, {'sales', 1}}    
    Article 2:  { {'spam', 1}, {'microsoft', 1}}.
    .....
    Article 300: ....
By extending each vector to a length equal to the number of distinct words across all 300 articles, these 300 vectors form a regular matrix:
    Article 1:  { {'ipod', 2}, {'california', 1}, {'sales', 1}, {'spam', 0}, {'microsoft', 0}}    
    Article 2:  { {'ipod', 0}, {'california', 0}, {'sales', 0}, {'spam', 1}, {'microsoft', 1}}.
    .....
    Article 300: ....
Then PCA is used to transform the matrix. Articles that have the same category are summed up in the transformed space, and thus a "category vector" is obtained for each category.
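To make this concrete, here is a minimal numpy sketch of 1.2. The dict-based term vectors stand in for Lucene's {term, frequency} pairs; the SVD-based PCA and all variable names are my own choices for illustration, not the actual patch:

    import numpy as np

    # Term vectors as in the simplified example above, one dict per article.
    articles = [
        {"ipod": 2, "california": 1, "sales": 1},    # Article 1
        {"spam": 1, "microsoft": 1},                 # Article 2
        # ... up to Article 300
    ]
    labels = ["hardware", "it"]    # hypothetical slashdot sections

    # Extend each vector to the full vocabulary -> a regular dense matrix.
    vocab = sorted({t for a in articles for t in a})
    X = np.array([[a.get(t, 0) for t in vocab] for a in articles], float)

    # PCA via SVD of the mean-centered matrix; keep the top k components.
    k = 2
    mean = X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
    Z = (X - mean) @ Vt[:k].T      # the articles in PCA space

    # Sum the transformed vectors per category -> one "category vector" each.
    category_vectors = {}
    for z, label in zip(Z, labels):
        category_vectors[label] = category_vectors.get(label, np.zeros(k)) + z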
    file modified:
           parcel/feeds/block.py
           parcel/feeds/channel.py
           repository/persistence/FileContainer.py
           repository/persistence/DBRepository.py
           repository/persistence/DBTermIO.py

1.3 Classifying the new article
   Then, when a new article comes in, it is first transformed into PCA space, and the cosine between it and each category vector is computed:
        cosine(v1, v2) = [v1 (dot product) v2] / [length(v1) * length(v2)]
The category whose vector has the largest cosine value (i.e. the smallest angle) with the new article is chosen as the category the new article belongs to. Actually, for my test I did not classify genuinely new articles, but instead used part of the 300 articles as unlabeled data, because retrieving new articles would entail quite some coding in the modified feed parcel. The classification accuracy is about 76%.
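Continuing the sketch from 1.2 (vocab, mean, Vt, k, and category_vectors are the hypothetical names defined there), the classification step could look like:

    import numpy as np

    def cosine(v1, v2):
        # cosine(v1, v2) = (v1 . v2) / (length(v1) * length(v2))
        return float(np.dot(v1, v2)) / (np.linalg.norm(v1) * np.linalg.norm(v2))

    def classify(term_vector):
        # Project the new article into PCA space, then pick the category
        # vector with the largest cosine (smallest angle) to it.
        x = np.array([term_vector.get(t, 0) for t in vocab], float)
        z = (x - mean) @ Vt[:k].T
        return max(category_vectors, key=lambda c: cosine(z, category_vectors[c]))

    print(classify({"ipod": 1, "sales": 2}))    # -> e.g. "hardware"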
    file modified:
           parcel/feeds/block.py
           parcel/feeds/channel.py
           repository/persistence/FileContainer.py
           repository/persistence/DBRepository.py
           repository/persistence/DBTermIO.py

> I've high hopes that the Lucene synonym recognizer is going to help a
> lot here to extract named entities beyond this small set. This is  a
> great set though to start with.

2. Will synonyms help in named entity extraction?
I don't think so. The reason is that synonym sets and named entity sets are two different things. Lucene uses WordNet to generate synonym sets. For example, if you look up 'paris' in WordNet, you will find the following synset:
    "city of light", "French capital".
But you will never find "London", although both "Paris" and "London" are named city entities. The same goes for people's names. Entity recognition has to be done by NLP routines. At the same time, some named entities could be of great importance and could serve directly as tags for the ContentItem. That's my idea of how unsupervised tagging could be achieved, though I admit it's very hard, and I think it would be good work for Diana. There would need to be some heuristic rules, such as:
     if there is a <people name> and the term "meeting" in the ContentItem, and they form a sentence where the <people name> is the subject,
     then use the <people name> as one of the tags for the ContentItem
This work is really out of my scope, yet I would like to help as best I can to experiment with the rules (a toy sketch follows). The result is uncertain; it depends on how well they are tuned.
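A toy sketch of what such a rule could look like, assuming an NLP tool (e.g. Minipar) has already extracted, per sentence, the named entities with their classes and the grammatical subject. Every structure here is invented for illustration:

    def suggest_tags(sentences):
        # Each sentence is a dict like:
        #   {"subject": "Smith",
        #    "entities": {"Smith": "People"},
        #    "terms": {"smith", "called", "meeting"}}
        tags = set()
        for s in sentences:
            subj = s.get("subject")
            # Rule: <people name> is the subject AND the term "meeting"
            # occurs in the same sentence -> use the name as a tag.
            if (subj and s["entities"].get(subj) == "People"
                    and "meeting" in s["terms"]):
                tags.add(subj)
        return tags

    print(suggest_tags([{"subject": "Smith",
                         "entities": {"Smith": "People"},
                         "terms": {"smith", "called", "meeting", "friday"}}]))
    # -> {'Smith'}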

3. Minipar
I use Minipar simply because it is a lightweight library that helps me get to a fast prototype.
> - What is MINIPAR's license?
     In its readme file, it says "A royalty-free license is granted for the use of this software for NON_COMMERCIAL PURPOSES ONLY"
> - Does MINIPAR gives you a clue as to what the semantic of the entity
> is? (i.e. if it's a people name or location name)
     Yes, it does.
> - What about prefixing such entities with their semantic class? ( e.g.
> "Smith" is coded in the content_vector as "People:Smith"). That way, if
> we extract other semantic entities from Chandler's data, we could cross
     This is feasible in Chandler, as Lucene supports a keyword for each term. In Chandler, this kind of keyword is already used when persisting data into the Lucene index. This is a very good idea.
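A tiny illustration of the prefixing idea (a purely hypothetical helper, not Chandler or Lucene API):

    def encode_entities(entities):
        # entities: dict mapping entity text -> semantic class.
        # Emitting "Class:Entity" tokens lets semantic entities from
        # different sources be cross-matched in the index.
        return ["%s:%s" % (cls, text) for text, cls in sorted(entities.items())]

    print(encode_entities({"Smith": "People", "Paris": "Location"}))
    # -> ['Location:Paris', 'People:Smith']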

4. Issues
4.1 Following Chandler's evolution
The code change from 0.6.1 to 0.7alpha3 is REALLY huge. I honestly do not think I will be able to port MVA to 0.7alpha3 within the SoC timeframe on my own. I am confident I can have things working on 0.6.1, though.
4.2 Getting more interested parties involved
I hope my explanation in section 1 is clear. If so, we can safely conclude that MVA classification for pre-defined tags is fully practical in Chandler. That is good news and proves the idea is not a toy. It would be great if more people at OSAF became interested in this topic and joined the discussion.
I would say that NLP, which covers parts of sections 2 and 3, is also very promising and will be practical to implement, too.

Best,
Xun
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Open Source Applications Foundation "chandler-dev" mailing list
http://lists.osafoundation.org/mailman/listinfo/chandler-dev
