Hi,

This post is about some license issues around using "TreeTagger" with the translator "Anubadok". Prof. Helmut Schmid (creator of TreeTagger) has granted a license allowing me to integrate and distribute TreeTagger as a component of Anubadok. Its use is free for free translations of free stuff :-). So I have made a tarball of the necessary files (6.9M) with a (relatively) simpler installer script, and it is available for download. Please note that it does not include Anubadok itself; it is just for "tree-tagger-english".
Frankly, he has given me more rights than I had asked for. However, it's a non-GPL license. A part-of-speech tagger is a core requirement for any natural language processing (NLP) software. Unfortunately, there aren't many such tools available under the GPL (even for English!!) which can be readily used by other programs.

For the last two weeks, I was mainly hacking on "LPost", which is based on Brill's rule-based tagger, written in C and available under the GPL. However, it didn't have a built-in tokenizer (something that understands that in the sentence "Mr. Sayamindu is writing his examination.", the dot at the end of "Mr." is not really the end of the sentence but the dot at the end of "examination." really is :-)). It also didn't have a lemmatizer (something that tells you that the base form of "ate" is "eat"). In fact, this was the crucial drawback (especially for a translator into a Sanskrit-based language, where it is essential to know the root verb). Nevertheless, its measured tagging accuracy (on the Wall Street Journal corpus) is reasonably high: around 91%, whereas the accuracy of TreeTagger (which uses a probabilistic algorithm) is around 93%.

So after two weeks of hacking (sometimes I felt that it might be easier to write a new one than to understand someone else's C program :-)), its new avatar is now fitted with a tokenizer and a lemmatizer and can be used as a "drop-in" substitute for TreeTagger. Although it has slightly lower accuracy, it has great advantages. For example, you can easily train it on newer verbs like "download", whereas you can't do that yourself for TreeTagger, as its database comes in binary form. Also, it's around 0.67M compared to 6.9M for TreeTagger. I will release it with both a C and a Perl API, and most likely it will be integrated with Anubadok to make it stand-alone GPL software.

Unfortunately, I really couldn't give any time to Anubadok itself, apart from some bug-fixing. The tagger itself took almost all of my free time for the last two weeks. I hope I can now switch modes.
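For anyone wondering what the tokenizer and lemmatizer mentioned above actually have to do, here is a minimal Python sketch. It is only an illustration, not the LPost or TreeTagger code: the abbreviation list, the lookup table and the function names `tokenize` and `lemmatize` are all made up for this example, and a real tool needs far larger tables plus rules for regular forms.

```python
# Hypothetical abbreviation list; a real tokenizer would need a much larger one.
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "Prof."}

# Hypothetical table of irregular forms; not taken from LPost or TreeTagger.
IRREGULAR_LEMMAS = {"ate": "eat", "went": "go", "written": "write"}


def tokenize(text):
    """Split text on whitespace, detaching a trailing dot from a word
    unless the word is a known abbreviation such as "Mr."."""
    tokens = []
    for chunk in text.split():
        if chunk in ABBREVIATIONS:
            # The dot belongs to the abbreviation, so keep it attached.
            tokens.append(chunk)
        elif chunk.endswith(".") and len(chunk) > 1:
            # Treat the dot as a separate end-of-sentence token.
            tokens.append(chunk[:-1])
            tokens.append(".")
        else:
            tokens.append(chunk)
    return tokens


def lemmatize(word):
    """Return the base form from the lookup table, or the lowercased
    word unchanged when it is not listed."""
    lower = word.lower()
    return IRREGULAR_LEMMAS.get(lower, lower)


if __name__ == "__main__":
    print(tokenize("Mr. Sayamindu is writing his examination."))
    print(lemmatize("ate"))
```

The lookup-table approach is also why a plain-text database is handy: teaching the sketch a new word just means adding one more entry to a table.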
Now I will do bug-fixing for tagger and coding for Anubadok :-)

BTW, if any of you have used Anubadok and have some suggestions for improvements, please let me know.

Cheers,
golam

_______________________________________________
Bengalinux-core mailing list
Bengalinux-core@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bengalinux-core