Re: [acopost-devel] Plans for 2.0.0 release

Giulio Paci Tue, 09 Feb 2016 17:37:17 -0800

Hi to all,
        I created a GitHub organization for acopost and uploaded the source 
repository there.

This command is useful if you want to switch without cloning from scratch, 
assuming that origin is your current remote for the SourceForge repository:

git remote set-url origin g...@github.com:acopost/acopost.git

On 09/02/2016 18:57, Ulrik Sandborg-Petersen wrote:
> I agree with all of the points on the TODO list.

I agree with most of them. I moved the items we have already agreed on to 
GitHub issues and left the other points here, for reference and/or further 
discussion.

> On 2016-02-09 16:08, Tiago Tresoldi wrote:
>> More important, trying to sum everything up for version 2.0 by extending 
>> your TODO:

>> - (my suggestion) Decide for acopost to keep being a pure-C system with no 
>> true dependencies, preferably going on with a collection of single 
>> sources+headers using as few libraries as possible (acopost was born of 
>> academic research and is useful to understand how tagging works, I'd love 
>> to keep it that way);
>> - Eventually replace parts of our library (hashes, memory allocation, etc.) 
>> with some alternatives if valid;

> I like the idea of a pure C implementation.

I fully agree. So I think the decision is a pure-C implementation without 
dependencies, to be rediscussed when we feel an urgent need to add one.
My main candidate is libICU at the moment, but I will be happy to avoid it as 
much as possible, as long as this does not prevent us from implementing a 
proper solution for a given task.

>> - Implement the tagging routines as a library (or as a collection of files 
>> that can be statically linked as if it were a library), having some simple 
>> wrapper executables
>> (met, t3, etc.) that can be called from command line;

Here I would prefer a shared library. I have experience with shared and static 
libraries on Linux, Windows and Mac OS X, so I think it would not be too 
difficult to maintain a library, once we are ready. I also have some 
experience with SWIG, so that eventually the library will be usable from other 
languages, if needed.

>> - Allow multi-layer tagging

By this I meant the ability to tag already annotated data, so that further 
information can be annotated on top of it (e.g., implementing shallow parsing 
on top of POS tagging).

>> - Make the tagger generic

Can you elaborate on this? (Maybe I lost connection with the original context 
of this point)

>> - make CLIs uniform

I consider this point almost complete, but some review and bug fixing are 
still needed.

>> - Start working in a more complex voting system, written in C or in a 
>> scripting language, intended for "actual" tagging, such as from command line;

I prefer C, so that it will be easier to use programmatically.
However, I am in favour of experimenting with a script run from the command 
line, if that simplifies experimentation.

>> - Add some "modules" to our library, such as a unit testing and (maybe) a 
>> library for utf8 handling and manipulation (I might also try my hand at a 
>> simple RNN) -- it is
>> important that additions are in line with acopost, in terms of simple 
>> sources, no reimplementations, etc.

>> - Support UTF-8 -- Currently util.c, tbt.c, t3.c, met.c, et.c contain 
>> hard-coded latin1 strings that are used to deal with ä, ö, ü, Ä, Ö, and Ü; 
>> there is no support for other common latin1 symbols, nor for Unicode. 
>> Complete UTF-8 support would be a good option to improve that code.

Another option (and the more I think about it, the more I am convinced of it) 
is to identify the logical purpose of these strings and let users of the 
taggers specify callbacks to achieve that purpose. In this way there is no 
need for explicit UTF-8 support, and it will still be possible to create 
correct UTF-8 aware taggers.
Eventually we should add libICU or UTF-8 support only for the command line 
tools, and even make it conditional.
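
A minimal sketch of the callback idea, under the assumption that one logical 
purpose of the hard-coded strings is to detect whether a token starts with an 
uppercase letter (I have not checked this against every use in the sources). 
All names here (starts_upper_fn, tagger_hooks, token_is_capitalized and the 
two example callbacks) are hypothetical and not part of the current acopost 
code:

```c
#include <stddef.h>

/* Hypothetical callback type: answers "does this token start with an
   uppercase letter?" so that the tagger never inspects raw bytes itself. */
typedef int (*starts_upper_fn)(const char *token, void *user_data);

/* Hypothetical set of hooks that a user of the library fills in. */
typedef struct {
    starts_upper_fn starts_upper;
    void *user_data;
} tagger_hooks;

/* Possible default: plain ASCII only, independent of the locale. */
static int ascii_starts_upper(const char *token, void *user_data)
{
    (void)user_data;
    return token[0] >= 'A' && token[0] <= 'Z';
}

/* Example of a user-supplied callback for UTF-8 German text: besides A-Z it
   accepts the two-byte encodings of Ä (C3 84), Ö (C3 96) and Ü (C3 9C). */
static int utf8_german_starts_upper(const char *token, void *user_data)
{
    const unsigned char *p = (const unsigned char *)token;
    (void)user_data;
    if (p[0] >= 'A' && p[0] <= 'Z')
        return 1;
    if (p[0] == 0xC3) /* p[1] is at worst the terminating NUL byte */
        return p[1] == 0x84 || p[1] == 0x96 || p[1] == 0x9C;
    return 0;
}

/* Tagger-side code only asks the question through the hook. */
static int token_is_capitalized(const tagger_hooks *hooks, const char *token)
{
    return hooks->starts_upper(token, hooks->user_data);
}
```

With something like this, the library stays encoding-agnostic and a command 
line tool could select the callback (ASCII, latin1, UTF-8, or one backed by 
libICU) at build time or at run time.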

>> - Discuss a possible scripting language (the more I think about it, the more 
>> I convince myself that Python scripts would be enough, with no actual 
>> integration to the C code);

> If we need a scripting language for anything, Python would be my preferred 
> scripting language, since it is the one I know best.

Apparently we are two against one here. I really dislike Python, which has 
always caused me trouble.
I think it is quite difficult to develop a reliable Python script that can be 
trusted in environments other than the original developer's.
Maybe I have been very unlucky in my experience with Python, but I keep 
fighting against improper handling of locales and charsets, wrong automatic 
assumptions about files (especially if redirection is involved) and difficult 
handling of .pyc files when scripts are installed in a shared location (used 
by multiple computers with several architectures) or when multiple Python 
versions are available on the system. I think these issues are alleviated with 
Python 3, but I still do not feel comfortable trusting this language.
On the other hand I agree that Perl syntax is not very nice. My reasons for 
continuing to use Perl are: 1) it is already used and, unless we replace all 
the tools we have in acopost, adding a script in a different language will 
also add a new dependency; 2) its behaviour is usually much more predictable 
than Python's (e.g., no automatic charset conversion happens when writing to 
stdout or reading from stdin, so a program cannot fail just because you are 
using a different terminal emulator or a different locale, or are redirecting 
to a file); 3) Perl is installed by default on Debian (even in a minimal 
installation), so it is not an additional dependency there.

>> - Provide as many language models (including for textual transformation) as 
>> possible;

I am generally positive about it, but I am not sure if it is best to:
a) store the language models in the acopost sources;
b) store them in a separate repository;
c) store them in multiple separate repositories;
d) store sources of the models as well or not.

What is your opinion? I personally do not like option a), but I am afraid of 
the complexity of the other options.
I would really like a catalogue of tagged corpora that can be used for POS 
tagging development.

>> - Along with unit testing, write modules (it doesn't have to be in C) to 
>> stress test the taggers, creating large, bogus but natural language like 
>> corpora to train and tag
>> (it is much more a matter of technology, with memory allocation, unicode and 
>> the like, than tagging performance).
>> - Remove segmentation faults (part of the unit testing and stress testing)

For unit testing and other testing, I would like to have TAP output 
(https://en.wikipedia.org/wiki/Test_Anything_Protocol), which is simple to 
produce from any language, so that the test suite can easily be written in a 
mix of languages. If you agree with this idea, I can configure autoconf to 
support it.
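
To show how little is needed to produce TAP from C, here is a minimal sketch. 
The helper names (tap_format, tap_plan, tap_result) are hypothetical and only 
cover the plan line and the "ok" / "not ok" result lines, not the rest of the 
protocol (diagnostics, skip and todo directives):

```c
#include <stdio.h>

/* Hypothetical helper: format one TAP result line (without the newline)
   into buf, e.g. "ok 1 - description" or "not ok 2 - description". */
static int tap_format(char *buf, size_t size, int passed, int number,
                      const char *description)
{
    return snprintf(buf, size, "%s %d - %s",
                    passed ? "ok" : "not ok", number, description);
}

/* Print the plan line announcing how many tests will run: "1..N". */
static void tap_plan(FILE *out, int count)
{
    fprintf(out, "1..%d\n", count);
}

/* Print one result line for test number `number`. */
static void tap_result(FILE *out, int passed, int number,
                       const char *description)
{
    char line[256];
    tap_format(line, sizeof line, passed, number, description);
    fprintf(out, "%s\n", line);
}
```

A test program would call tap_plan(stdout, 2) once and then tap_result() for 
each check. Any TAP consumer can read that output, regardless of whether the 
test itself was written in C, Perl or shell.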

>> A lot of work, to be sure, but it would make acopost one of the best 
>> alternatives out there.

I agree.

