-------
I am still using the sub word-set approach for article classification.
Since PCA works reasonably well for predefined tags, I have not looked
into adapting more advanced statistical methods, such as ICA or a
refined sampling method. I have also made little progress on the Lucene
wrapper, apart from running through the examples in the "Lucene in
Action" book.

++++
PCA and similar dimension reduction methods can keep track of basic
semantics, but they lack the power to capture high-level semantics,
such as the example Phillippe gave for address extraction. So I am now
trying a new add-on algorithm that extracts named entities and uses
them as automatic tags. Named entity extraction is not a substitute
for PCA but a higher-priority add-on. The pseudo-code of the algorithm
is as follows:

   global predefined_set = {'lbl_1', 'lbl_2', 'lbl_3', ...}
   assign_tag(content_vector cv)
   {
        /* extract named entities from the content vector */
        named_entity_set = extract_named_entities(cv);
        sort named entities in named_entity_set by term frequency (TF);
        foreach named entity that has TF > threshold (tentatively 2)
            add the named entity to the tag set of the item;

        /* then use the PCA classifier to classify the content vector */
        tag_classified = classify(cv, predefined_set);
        add tag_classified to the tag set of the item;
   }

Here, named entities are obtained through unsupervised learning, while
tag_classified comes from supervised learning. The two should
complement each other to some extent.
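
To make the control flow concrete, here is a minimal Python sketch of the
pseudo-code above. It is only an illustration under several assumptions:
it takes raw text rather than a content vector, extract_named_entities()
is a crude capitalized-word heuristic standing in for a real recognizer,
and the classifier is passed in as a callable rather than being the actual
PCA classifier. The label names are placeholders.

```python
import re
from collections import Counter

# Placeholder labels and the tentative threshold from the pseudo-code.
PREDEFINED_SET = {"lbl_1", "lbl_2", "lbl_3"}
TF_THRESHOLD = 2


def extract_named_entities(text):
    """Hypothetical stand-in for a real recognizer.

    Returns every run of capitalized words, so it will also pick up
    ordinary words that merely start a sentence.
    """
    return re.findall(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b", text)


def assign_tags(text, classify, predefined=PREDEFINED_SET,
                threshold=TF_THRESHOLD):
    """Return named-entity tags first, then the classifier's tag."""
    # Count how often each candidate entity occurs (term frequency).
    counts = Counter(extract_named_entities(text))

    # most_common() yields entities sorted by descending frequency;
    # keep only those strictly above the threshold.
    tags = [entity for entity, tf in counts.most_common() if tf > threshold]

    # Then add the tag chosen by the (assumed) supervised classifier.
    label = classify(text, predefined)
    if label is not None and label not in tags:
        tags.append(label)
    return tags


if __name__ == "__main__":
    sample = ("Xun met Alice Smith. Alice Smith called. "
              "Alice Smith wrote. Bob left.")
    # Dummy classifier that always answers 'lbl_1'.
    print(assign_tags(sample, lambda text, labels: "lbl_1"))
```

Keeping the named-entity tags ahead of the classifier tag in the result
reflects the "higher-priority add-on" idea; whether the final tag set
should be ordered at all is an open design choice.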

I have already downloaded a simple named entity recognizer, MINIPAR
(http://www.cs.ualberta.ca/~lindek/minipar.htm), which can recognize
people's names, organization names and city names. Any suggestions or
input are welcome.

Xun
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Open Source Applications Foundation "chandler-dev" mailing list
http://lists.osafoundation.org/mailman/listinfo/chandler-dev
