-------
I am still using the sub word-set approach for article classification.
Since PCA works o.k. for the predefined tags, I have not looked further
into adapting more advanced statistical algorithms, such as ICA and
refined sampling methods. I have also made little progress on the Lucene
wrapper, beyond running through the examples in the "Lucene in Action" book.
++++
PCA and similar dimension reduction methods can keep track of basic
semantics, but they lack the power to capture high-level semantics,
as in the example Phillippe gave for address extraction.
So I am now trying a new add-on algorithm that extracts named entities
and uses them as automatic tags. Named entity extraction is not a
substitute for PCA but a higher-priority add-on. The pseudo-code of
the algorithm is as follows:
global predefined_set = {'lbl_1', 'lbl_2', 'lbl_3', ...}
assign_tag(content_vector cv)
{
    /* extract named entities from the content vector */
    named_entity_set = extract_named_entities(cv);
    sort named entities in named_entity_set by term frequency (TF);
    foreach named entity that has TF > threshold (tentatively 2)
        add the named entity to the tag set of the item;
    /* then use the PCA classifier to classify the content vector */
    tag_classified = classify(cv, predefined_set);
    add tag_classified to the tag set of the item;
}
Here, the named entities are obtained through unsupervised learning while
tag_classified comes from supervised learning. The two should complement
each other to some extent.
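
For anyone who wants to experiment, the pseudo-code above could be sketched
in Python roughly as follows. This is only an illustration under stated
assumptions: the content vector is taken to be a plain list of tokens, and
both extract_named_entities() and classify() are hypothetical stand-ins (a
capitalization heuristic and a simple label lookup), not the real named
entity recognizer or the PCA classifier. Only the control flow and the
tentative TF threshold of 2 come from the pseudo-code.

```python
from collections import Counter

# Placeholder labels, as in the pseudo-code; the real set is not shown here.
PREDEFINED_SET = {"lbl_1", "lbl_2", "lbl_3"}
TF_THRESHOLD = 2  # tentative value from the pseudo-code


def extract_named_entities(tokens):
    """Hypothetical stand-in for a real recognizer:
    treats every capitalized token as a named entity."""
    return [t for t in tokens if t[:1].isupper()]


def classify(tokens, predefined_set):
    """Hypothetical stand-in for the PCA classifier:
    returns the predefined label seen most often, or None."""
    counts = Counter(t for t in tokens if t in predefined_set)
    return counts.most_common(1)[0][0] if counts else None


def assign_tags(tokens):
    """Return the tag list for one item: frequent named entities first
    (highest TF first), then the tag from the classifier."""
    tags = []
    tf = Counter(extract_named_entities(tokens))
    # most_common() yields entities sorted by descending term frequency
    for entity, freq in tf.most_common():
        if freq > TF_THRESHOLD:
            tags.append(entity)
    tag_classified = classify(tokens, PREDEFINED_SET)
    if tag_classified is not None and tag_classified not in tags:
        tags.append(tag_classified)
    return tags
```

With a token list in which "Chandler" appears three times and "OSAF" twice,
only "Chandler" passes the TF > 2 test, so the threshold filters out entities
that are mentioned only in passing.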
I have already downloaded MINIPAR, which includes a simple named entity
recognizer: http://www.cs.ualberta.ca/~lindek/minipar.htm. It can recognize
people's names, organization names and city names. Any suggestions or
input are welcome.
Xun
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
Open Source Applications Foundation "chandler-dev" mailing list
http://lists.osafoundation.org/mailman/listinfo/chandler-dev