Hi guys
Here is which of the items
listed in my previous status email are now done:
1) Prepare Similarity for release from sandbox:
a) improve readme.txt, add 'The entry point to the Similarity component is
SentencePairMatchResult matchRes = sm.assessRelevance(sentence1, sentence2);
where matchRes includes the similarity score (weighted number of common terms)
and the set of maximum common parse trees.'
>>> Done
b) improve caching. Now it is implemented via Java object serialization;
make it via CSV files
>>> Done
c) proper location for cache files and resources
joernkottmann: src/test/resources
>>> Done
d) verify porter stemmer (remove lucene dependencies, remove porter stemmer
from /similarity)
>>> That will be done outside of Similarity. The downloadable opennlp-tools
1.5.2 does not currently include a Porter stemmer, so I am temporarily keeping
it within Similarity.
e) re-format code, use eclipse template for re-format
joernkottmann: http://opennlp.apache.org/code-conventions.html
>>> Done
f) package into separate jar/src using Maven
2) Next major feature of Similarity: taxonomy auto learning and using taxonomy
to improve search relevance
a) see how Similarity component can help with search tasks
>>> Done
3) More examples and docs for similarity component
a) examples for finding similar news at allvoices.com
>>> Started, but not easy to integrate into Similarity because the code is
tightly coupled to the original project
email the code which generates search query for news articles
b) email the link to the papers on joernkottmann:
https://cwiki.apache.org/OPENNLP/nlp-papers.html
>>> I extended the list with a new section on the papers on similarity
4) Other future features/improvements for Similarity <<< These are FUTURE items
a) how can we create a more accurate Parse object by running the chunker
separately and then applying an alignment algorithm
b) Coreference component
joernkottmann: TreebankNameFinder
c) apply machine learning to parse trees + coreferences. "parse forest": is it
a good name?
joernkottmann: CorefSample.
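
On 1b, for anyone who wants to see the general idea of CSV-based caching as
opposed to Java object serialization, here is a minimal sketch. It is not the
code that is in Similarity: the class name CsvScoreCache, the key -> score
layout, and the two-column file format are all assumptions made for
illustration. It also assumes Java 7 or later and keys without line breaks.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, NOT part of Similarity: stores a key -> score cache
// as a two-column CSV file instead of a serialized Java object.
class CsvScoreCache {

  // Quote a field and double any embedded quotes, so commas in keys are safe.
  static String quote(String s) {
    return "\"" + s.replace("\"", "\"\"") + "\"";
  }

  // Write one row per entry: "key",score
  static void save(Map<String, Double> cache, Path file) throws IOException {
    try (BufferedWriter out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
      for (Map.Entry<String, Double> e : cache.entrySet()) {
        out.write(quote(e.getKey()) + "," + e.getValue());
        out.newLine();
      }
    }
  }

  // Read the rows back. The score is everything after the last comma;
  // the key is the quoted field before it (assumed to contain no line breaks).
  static Map<String, Double> load(Path file) throws IOException {
    Map<String, Double> cache = new LinkedHashMap<>();
    try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
      String line;
      while ((line = in.readLine()) != null) {
        if (line.isEmpty()) {
          continue;
        }
        int sep = line.lastIndexOf(',');
        String quoted = line.substring(0, sep);
        String key = quoted.substring(1, quoted.length() - 1).replace("\"\"", "\"");
        cache.put(key, Double.parseDouble(line.substring(sep + 1)));
      }
    }
    return cache;
  }
}
```

The practical benefit of this kind of format is that the cache file can be
opened and edited in a text editor and does not break when a class changes,
which is the usual weak point of serialized objects.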
Regards
Boris