Hi guys,

Per Aliaksandr's suggestion, below are the minutes of our conversation with
Jorn about the Similarity component and other related issues.

1) Prepare Similarity for release from the sandbox:
   a) Improve readme.txt; add: 'The entry point to the Similarity component is
        SentencePairMatchResult matchRes =
            sm.assessRelevance(sentence1, sentence2);
      where matchRes includes the similarity score (weighted number of common
      terms) and the set of maximum common parse trees.'
   b) Improve caching. It is currently implemented via Java object
      serialization; switch it to CSV files.
   c) Find a proper location for cache files and resources.
      joernkottmann: src/test/resources
   d) Verify the Porter stemmer (remove Lucene dependencies, remove the Porter
      stemmer from /similarity).
   e) Re-format the code using the Eclipse template.
      joernkottmann: http://opennlp.apache.org/code-conventions.html
   f) Package into a separate jar/src using Maven.
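
Note on item 1b: a minimal sketch of what a CSV-backed cache could look like,
in place of Java object serialization. The class name CsvCache and its methods
are hypothetical and are not part of the current Similarity code. It assumes
the cache can be reduced to string key/value pairs with no line breaks in
them, which may not hold for the objects cached today.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical key/value cache persisted as a two-column CSV file. */
class CsvCache {
  private final Map<String, String> entries = new LinkedHashMap<>();

  public void put(String key, String value) {
    // Assumption of this sketch: one record per line, so no line breaks.
    if (key.matches("(?s).*[\\r\\n].*") || value.matches("(?s).*[\\r\\n].*")) {
      throw new IllegalArgumentException("line breaks are not supported");
    }
    entries.put(key, value);
  }

  public String get(String key) {
    return entries.get(key);
  }

  /** Wraps a field in double quotes and doubles any embedded quotes. */
  static String quote(String field) {
    return "\"" + field.replace("\"", "\"\"") + "\"";
  }

  /** Splits one CSV line into fields, honouring quoted commas and quotes. */
  static List<String> parseLine(String line) {
    List<String> fields = new ArrayList<>();
    StringBuilder cur = new StringBuilder();
    boolean inQuotes = false;
    for (int i = 0; i < line.length(); i++) {
      char c = line.charAt(i);
      if (inQuotes) {
        if (c == '"' && i + 1 < line.length() && line.charAt(i + 1) == '"') {
          cur.append('"'); // doubled quote inside a quoted field
          i++;
        } else if (c == '"') {
          inQuotes = false; // closing quote
        } else {
          cur.append(c);
        }
      } else if (c == '"') {
        inQuotes = true;
      } else if (c == ',') {
        fields.add(cur.toString());
        cur.setLength(0);
      } else {
        cur.append(c);
      }
    }
    fields.add(cur.toString());
    return fields;
  }

  /** Writes all entries to the given file, one "key","value" pair per line. */
  public void save(Path file) {
    try (BufferedWriter w = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
      for (Map.Entry<String, String> e : entries.entrySet()) {
        w.write(quote(e.getKey()) + "," + quote(e.getValue()));
        w.newLine();
      }
    } catch (IOException ex) {
      throw new UncheckedIOException(ex);
    }
  }

  /** Reads a file written by save(); lines without exactly two fields are skipped. */
  public static CsvCache load(Path file) {
    CsvCache cache = new CsvCache();
    try (BufferedReader r = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
      String line;
      while ((line = r.readLine()) != null) {
        List<String> f = parseLine(line);
        if (f.size() == 2) {
          cache.put(f.get(0), f.get(1));
        }
      }
    } catch (IOException ex) {
      throw new UncheckedIOException(ex);
    }
    return cache;
  }
}
```

The main gain over serialization would be that the cache file can be read and
edited by hand and does not break when class definitions change; the cost is
that anything cached has to be flattened to text first.
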
2) Next major feature of Similarity: taxonomy auto-learning, and using the
   taxonomy to improve search relevance:
   a) See how the Similarity component can help with search tasks.
   b) Integration with Solr (compare/complement github.com/tamingtext of
      Grant Ingersoll with Similarity). There are some JIRA issues open for
      hooking some of the tamingtext code into the analyzers modules in Solr.
3) More examples and docs for the Similarity component:
   a) Examples for finding similar news at allvoices.com: email the code
      which generates the search query for news articles.
   b) Email the link to the papers.
      joernkottmann: https://cwiki.apache.org/OPENNLP/nlp-papers.html
4) Other future features/improvements for Similarity:
   a) How can we create a more accurate Parse object by running the chunker
      separately and then applying an alignment algorithm?
   b) Coreference component.
      joernkottmann: TreebankNameFinder
   c) Apply machine learning to parse trees + coreferences. "Parse forest":
      is it a good name?
      joernkottmann: CorefSample.

Regards,
Boris