[Ankur-core] Use of TreeTagger for Anubadok and its licence

Golam Mortuza Hossain Wed, 15 Jun 2005 11:33:47 -0700

Hi,

        This post is about the licensing of "TreeTagger" for use
with the translator "Anubadok". Prof. Helmut Schmid (the creator
of TreeTagger) has granted me a license to integrate and
distribute TreeTagger as a component of Anubadok. Its use is
free for free translations of free stuff :-). So I have made a
tarball of the necessary files (6.9M) with a (relatively) simpler
installer script, and it is available for download. Please note
that it does not include Anubadok itself; it is just for
"tree-tagger-english".



    Frankly, he has given me more rights than I had
asked for. However, it's a non-GPL license. A part-of-speech
tagger is a core requirement for any natural language
processing (NLP) software. Unfortunately, there aren't many
such tools available under the GPL (even for English!!) that can
be readily used by other programs.

        For the last two weeks, I was mainly hacking on
"LPost", which is based on Brill's rule-based tagger, written in C
and available under the GPL. However, it didn't have a built-in
tokenizer (something that understands that in the sentence
"Mr. Sayamindu is writing his examination.", the dot at the end
of "Mr." is not really the end of the sentence but the dot at the
end of "examination." really is :-)). Also, it didn't have a
lemmatizer (something that tells you that the base form of "ate" is
"eat"). In fact, this was the crucial drawback (especially for a
translator for a Sanskrit-based language, where it is essential to
know the root verb). Nevertheless, its measured tagging accuracy
(on the Wall Street Journal corpus) is reasonably high: around
91%, whereas the accuracy of TreeTagger (which uses a probabilistic
algorithm) is around 93%.
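
To make the two missing pieces concrete, here is a minimal Python
sketch of what a tokenizer and a lemmatizer do. It is only an
illustration, not the tagger's actual C code: the abbreviation list,
the irregular-form table and the function names are all made up for
this example, and a real tool would need far larger tables.

```python
# Hypothetical abbreviation list (assumption); a real tokenizer needs many more.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "prof."}

# Hypothetical table of irregular forms (assumption): inflected form -> base form.
IRREGULAR_LEMMAS = {"ate": "eat", "went": "go", "wrote": "write"}


def split_sentences(text):
    """Split text into sentences, ignoring dots that end a known abbreviation."""
    sentences, current = [], []
    for word in text.split():
        current.append(word)
        ends_sentence = word.endswith((".", "?", "!"))
        if ends_sentence and word.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing words without final punctuation
        sentences.append(" ".join(current))
    return sentences


def lemmatize(word):
    """Return the base form from the lookup table, else the lowercased word."""
    lowered = word.lower()
    return IRREGULAR_LEMMAS.get(lowered, lowered)
```

With these definitions, the text "Mr. Sayamindu is writing his
examination. He ate early." is split after "examination." and after
"early.", but not after "Mr.", and lemmatize("ate") gives "eat".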

  So after two weeks of hacking (sometimes I felt that
it might be easier to write a new one than to understand someone
else's C program :-)), its new avatar is now fitted with a tokenizer
and a lemmatizer and can be used as a "drop-in" substitute for
TreeTagger. Although it is slightly less accurate, it has
great advantages. For example, you can easily train it for newer
verbs like "download", whereas you can't do that yourself for
TreeTagger as its database comes in binary form. Also, it's around
0.67M compared to 6.9M for TreeTagger. I will release it with both
C and Perl APIs, and most likely it will be integrated with Anubadok
to make it stand-alone GPL software.
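
As a rough illustration of why a text-based database is easy to
extend, here is a small Python sketch. It assumes a plain-text lexicon
with one word per line followed by its possible tags; that layout, the
sample entries and the function names are assumptions made for this
example, and the tagger's real file format may differ.

```python
def parse_lexicon(text):
    """Parse lines of the form 'word TAG1 TAG2 ...' into a dict of tag lists."""
    lexicon = {}
    for line in text.splitlines():
        fields = line.split()
        if fields:  # skip blank lines
            lexicon[fields[0]] = fields[1:]
    return lexicon


def add_entry(lexicon, word, tags):
    """Add a word with its tags, keeping any tags that are already listed."""
    existing = lexicon.setdefault(word, [])
    for tag in tags:
        if tag not in existing:
            existing.append(tag)


def format_lexicon(lexicon):
    """Write the lexicon back out as text, one word per line, sorted by word."""
    return "\n".join(
        f"{word} {' '.join(tags)}" for word, tags in sorted(lexicon.items())
    )


# Hypothetical sample entries (assumption), using Penn Treebank style tags.
lexicon = parse_lexicon("eat VB\nfile NN VB\n")
add_entry(lexicon, "download", ["VB", "NN"])
print(format_lexicon(lexicon))
```

Nothing comparable is possible when the database ships only as a
binary file.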

 Unfortunately, I really couldn't give any time to
Anubadok itself, apart from some bug-fixing. The tagger
took almost all of my free time for the last two weeks. I hope
I can now switch modes: bug-fixing for the tagger and
coding for Anubadok :-) BTW, if any of you have used Anubadok and
have suggestions for improvements, please let me know.


Cheers,
golam


