Hi all,

I created a GitHub organization for acopost and uploaded the source repository there.
This command is useful if you want to switch without cloning from scratch, assuming that origin is your current remote for the SourceForge repository:

    git remote set-url origin g...@github.com:acopost/acopost.git

On 09/02/2016 18:57, Ulrik Sandborg-Petersen wrote:
> I agree with all of the points on the TODO list.

I agree with most of them. I moved the items we had already agreed on to GitHub issues, and left the other points here for reference and/or further discussion.

> On 2016-02-09 16:08, Tiago Tresoldi wrote:
>> More important, trying to sum everything up for version 2.0 by extending
>> your TODO:
>> - (my suggestion) Decide for acopost to keep being a pure-C system with no
>> true dependencies, preferably going on with a collection of single
>> sources+headers using as fewer libraries as possible (acopost was born of
>> academic research and is useful to understand how tagging works, I'd love
>> to keep it that way);
>> - Eventually replace parts of our library (hashes, memory allocation, etc.)
>> with some alternatives if valid;
> I like the idea of a pure C implementation.

I fully agree. So I think the decision is: a pure-C implementation without dependencies, to be rediscussed when we feel an urgent need to add one. My main candidate is libICU at the moment, but I will be happy to avoid it as much as possible, as long as that does not prevent us from implementing a proper solution for a given task.

>> - Implement the tagging routines as a library (or as a collection of files
>> that can be statically linked as if it were a library), having some simple
>> wrapper executables (met, t3, etc.) that can be called from command line;

Here I would prefer a shared library. I have experience with shared and static libraries on Linux, Windows and Mac OS X, so I think it would not be too difficult to maintain a library once we are ready. I also have some experience with SWIG, so the library could eventually be made usable from other languages, if needed.

>> - Allow multi-layer tagging

With this item I meant the possibility of tagging already annotated data, so that further information can be annotated on top of it (e.g. implementing shallow parsing on top of POS tagging).

>> - Make the tagger generic

Can you elaborate on this? (Maybe I have lost the original context of this point.)

>> - make CLIs uniform

I consider this point almost complete, but some review and bug fixing are still needed.

>> - Start working in a more complex voting system, written in C or in a
>> scripting language, intended for "actual" tagging, such as from command line;

I prefer C, so that it will be easier to use programmatically. However, I am in favour of experimenting with a script run from the command line, if that simplifies the experiments.

>> - Add some "modules" to our library, such as a unit testing and (maybe) a
>> library for utf8 handling and manipulation (I might also try my hand at a
>> simple RNN) -- it is important that additions are in line with acopost, in
>> terms of simple sources, no reimplementations, etc.
>> - Support UTF-8 -- Currently util.c, tbt.c, t3.c, met.c, et.c contain
>> hard-coded latin1 strings, that are used to deal with ä, ö, ü, Ä, Ö, and Ü
>> there is no support for other common latin1 symbols, nor for unicode.
>> Complete UTF-8 support would be a good option to improve that code.

Another option (and the more I think about it, the more convinced I am) is to identify the logical purpose of these strings and let users of the taggers specify callbacks to achieve that purpose. In this way there is no need for explicit UTF-8 support in the taggers, and it will still be possible to create correct UTF-8-aware taggers. Eventually we should add libICU or UTF-8 support only for the command-line tools, and even make it conditional.

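A minimal sketch of the callback idea, under one loud assumption: that the hard-coded latin1 strings serve to decide whether a word starts with an uppercase letter (this has not been checked against every use in the sources). All names below (starts_upper_fn, tagger_callbacks, word_starts_upper and the two example callbacks) are hypothetical and do not exist in acopost.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback type (not part of acopost): returns nonzero when
   `word` starts with an uppercase letter in the caller's own encoding. */
typedef int (*starts_upper_fn)(const char *word, void *user_data);

/* Hypothetical bundle of user-supplied callbacks handed to a tagger. */
typedef struct {
    starts_upper_fn starts_upper;
    void *user_data;   /* passed back unchanged to the callback */
} tagger_callbacks;

/* Example callback that only knows about ASCII letters. */
static int ascii_starts_upper(const char *word, void *user_data)
{
    (void)user_data;
    return word[0] >= 'A' && word[0] <= 'Z';
}

/* Example callback for UTF-8 input: ASCII plus the three uppercase
   umlauts, encoded as Ä = C3 84, Ö = C3 96, Ü = C3 9C. */
static int utf8_german_starts_upper(const char *word, void *user_data)
{
    const unsigned char *w = (const unsigned char *)word;
    (void)user_data;
    if (w[0] >= 'A' && w[0] <= 'Z')
        return 1;
    /* w[1] is safe to read: if w[0] is 0xC3 the string has not ended yet. */
    return w[0] == 0xC3 && (w[1] == 0x84 || w[1] == 0x96 || w[1] == 0x9C);
}

/* What tagger code would call instead of comparing against hard-coded
   strings; the tagger itself never needs to know the encoding. */
static int word_starts_upper(const tagger_callbacks *cb, const char *word)
{
    assert(cb != NULL && cb->starts_upper != NULL && word != NULL);
    return cb->starts_upper(word, cb->user_data);
}
```

With this shape, a command-line tool could install the ASCII callback by default and a UTF-8 or libICU-backed one only when built with that support, which matches the idea of keeping such support conditional and outside the library.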
>> - Discuss a possible scripting language (the more I think about it, the more
>> I convince myself that Python scripts would be enough, with no actual
>> integration to the C code);
> If we need a scripting language for anything, Python would be my preferred
> scripting language, since it is the one I know best.

Apparently it is two against one here. I really dislike Python, which has always caused me trouble. I think it is quite difficult to write a Python script that can be trusted to work reliably in environments other than the original developer's. Maybe I have been very unlucky in my experience with Python, but I keep fighting against improper handling of locales and charsets, wrong automatic assumptions about files (especially when redirection is involved), and awkward handling of .pyc files when scripts are installed in a shared location (used by multiple computers with different architectures) or when multiple Python versions are available on the system. I think these issues are alleviated in Python 3, but I still do not feel comfortable trusting this language.

On the other hand, I agree that Perl syntax is not very nice. My reasons for continuing to use Perl are:

1) it is already used and, unless we replace all the tools we have in acopost, adding a script in a different language would also add a new dependency;
2) its behaviour is usually much more predictable than Python's (e.g. no automatic charset conversion happens when writing to stdout or reading from stdin, so a program cannot fail just because you are using a different terminal emulator or a different locale, or are redirecting to a file);
3) Perl is installed by default on Debian (even in a minimal installation), so it is not an extra dependency there.

>> - Provide as many language models (including for textual transformation) as
>> possible;

I am generally positive about this, but I am not sure whether it is best to:

a) store the language models in the acopost sources;
b) store them in a separate repository;
c) store them in multiple separate repositories;
d) store the sources of the models as well, or not.

What is your opinion? I personally do not like option a), but I am afraid of the complexity of the other options. I would really like to have a catalogue of tagged corpora that can be used for POS tagging development.

>> - Along with unit testing, write modules (it doesn't have to be in C) to
>> stress test the taggers, creating large, bogus but natural language like
>> corpora to train and tag (it is much more a matter of technology, with
>> memory allocation, unicode and the like, than tagging performance).
>> - Remove segmentation faults (part of the unit testing and stress testing)

For unit testing and other testing, I would like to have TAP output (https://en.wikipedia.org/wiki/Test_Anything_Protocol), which is simple to produce from any language, so that the test suite can easily be written in a mix of languages. If you agree with this idea, I can configure autoconf to support it.

>> A lot of work, to be sure, but it would make acopost one of the best
>> alternatives out there.

I agree.
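
To illustrate why TAP output, as proposed above for the test suite, is simple to produce from any language, here is a rough sketch in C of the two kinds of lines a basic TAP stream needs: the plan line ("1..N") and one result line per test ("ok" or "not ok", then the test number and a description). The helper names tap_format_plan and tap_format_result are hypothetical and not part of acopost or of any TAP library.

```c
#include <stdio.h>

/* Hypothetical helper: format the TAP plan line announcing `count` tests.
   Returns what snprintf returns (the length the full line needs). */
static int tap_format_plan(char *buf, size_t size, int count)
{
    return snprintf(buf, size, "1..%d\n", count);
}

/* Hypothetical helper: format one TAP result line. `number` is the
   1-based test number, `passed` is nonzero when the test succeeded. */
static int tap_format_result(char *buf, size_t size, int number,
                             int passed, const char *description)
{
    return snprintf(buf, size, "%s %d - %s\n",
                    passed ? "ok" : "not ok", number, description);
}
```

A test program would print the plan line once and then one result line per check to stdout; a TAP harness reads those lines regardless of the language that produced them, so a script-based test could sit next to a C one in the same suite.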