Hi Xun,
Xun Luo wrote:
I am still using the sub word-set approach for article classification.
As PCA works OK for predefined tags, I did not look further into
adapting more advanced statistical algorithms, such as ICA and refined
sampling methods. I have also made little progress on Lucene wrapping,
beyond running through the examples in the "Lucene in Action" book.
I'd be interested to know how you made "PCA work for predefined tags".
++++
PCA and similar dimension reduction methods can keep track of basic
semantics, but they obviously lack the power to capture high-level
semantics, such as the example Philippe gave for address extraction.
So I am now trying a new add-on algorithm that extracts named entities
and uses them as automatic tags. Named entity extraction is not
a substitute for PCA but a higher-priority add-on. The
pseudo-code of the algorithm is as follows:
global predefined_set = {'lbl_1', 'lbl_2', 'lbl_3', ...}
assign_tag(content_vector)
{
    /* extract named entities from the content vector */
    named_entity_set = extract_named_entities(content_vector);
    sort named entities in named_entity_set by term frequency (TF);
    foreach named entity that has TF > threshold (tentatively 2)
        add the named entity to the tag set of the item;
    /* then use the PCA classifier to classify the content vector */
    tag_classified = classify(content_vector, predefined_set);
    add tag_classified to the tag set of the item;
}
Here, named entities are obtained through unsupervised learning while
tag_classified comes from supervised learning. They will complement
each other to some extent.
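A minimal Python sketch of the pseudo-code above may make the ordering of the two steps concrete. It is only an illustration under stated assumptions: the item is represented as a list of tokens, and `extract_named_entities` and `classify` are passed in as callables because their real implementations (the named entity recognizer and the PCA classifier) are not shown in this thread. The toy stand-ins `toy_extract` and `toy_classify` are hypothetical and exist only so the sketch runs.

```python
from collections import Counter

# Placeholder labels, as in the pseudo-code.
PREDEFINED_SET = {"lbl_1", "lbl_2", "lbl_3"}
# Tentative threshold from the pseudo-code.
TF_THRESHOLD = 2


def assign_tags(tokens, extract_named_entities, classify, threshold=TF_THRESHOLD):
    """Return the tags for one item: frequent named entities first, then the classifier tag."""
    # Assumption: extract_named_entities(tokens) returns one string per entity mention.
    entity_tf = Counter(extract_named_entities(tokens))
    # most_common() yields entities sorted by descending term frequency.
    tags = [entity for entity, tf in entity_tf.most_common() if tf > threshold]
    # Assumption: classify(tokens, labels) returns exactly one label from the predefined set.
    tag_classified = classify(tokens, PREDEFINED_SET)
    if tag_classified not in tags:
        tags.append(tag_classified)
    return tags


# Hypothetical stand-ins: capitalized tokens count as entities,
# and the "classifier" always returns the alphabetically first label.
def toy_extract(tokens):
    return [t for t in tokens if t[:1].isupper()]


def toy_classify(tokens, labels):
    return sorted(labels)[0]


tokens = "Smith met Jones . Smith called Smith about Edmonton".split()
tags = assign_tags(tokens, toy_extract, toy_classify)
print(tags)
```

With the toy stand-ins, only "Smith" appears more than twice, so it is the only named entity that becomes a tag; the classifier label is appended after it, which mirrors the "higher priority add-on" ordering.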
I have already downloaded a simple named entity recognizer, MINIPAR:
http://www.cs.ualberta.ca/~lindek/minipar.htm, which can recognize
people's names, organization names and city names. Any suggestions or
input are welcome.
I've high hopes that the Lucene synonym recognizer is going to help a
lot here to extract named entities beyond this small set. This is a
great set to start with, though.
Couple of questions:
- What is MINIPAR's license?
- Does MINIPAR give you a clue as to the semantic class of the entity?
(i.e. whether it's a person's name or a location name)
- What about prefixing such entities with their semantic class? (e.g.
"Smith" is coded in the content_vector as "People:Smith"). That way, if
we extract other semantic entities from Chandler's data, we could
cross-reference from the text to other Chandler fields (e.g. "[EMAIL PROTECTED]" from
the "From" field could be coded as "People:Smith" and would map the same
way as "People:Smith" extracted from the text).
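The prefixing idea in the last question could look roughly like the Python sketch below. Everything in it is hypothetical: the `ADDRESS_BOOK` mapping, the sample address `smith@example.org`, and the assumption that a recognizer hands back (name, semantic class) pairs are made up for illustration, and none of it reflects how Chandler or MINIPAR actually expose their data.

```python
def entity_key(semantic_class, name):
    """Build the prefixed form, e.g. "People:Smith"."""
    return f"{semantic_class}:{name}"


# Hypothetical lookup from an email address to a person's name.
ADDRESS_BOOK = {"smith@example.org": "Smith"}


def encode_text_entities(entities):
    """Encode (name, semantic_class) pairs, as a recognizer might emit them."""
    return [entity_key(semantic_class, name) for name, semantic_class in entities]


def encode_from_field(address, address_book=ADDRESS_BOOK):
    """Encode a "From" address with the same key as the matching person, if known."""
    name = address_book.get(address.lower())
    return entity_key("People", name) if name else None


text_keys = encode_text_entities([("Smith", "People"), ("Edmonton", "Location")])
from_key = encode_from_field("Smith@Example.org")
# The "From" field and the body text now share one key for the same person.
print(from_key in text_keys)
```

Because both sources produce the same string key, an entity found in the message body and one derived from a structured field land on the same dimension of the content vector.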
Cheers,
- Philippe
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
Open Source Applications Foundation "chandler-dev" mailing list
http://lists.osafoundation.org/mailman/listinfo/chandler-dev