On 10/02/2016 13:18, Tiago Tresoldi wrote:
GP> I want a shared library

> Here I am very biased, partly because I have never truly worked on a shared
> library, partly because I tend to prefer monolithic, static systems. I can,
> however, understand why it could be desirable, and I don't have any real
> objections, especially if it means we can isolate the tagging itself from
> the textual manipulation (as above).

Actually, if we use libtool (and I suggest we do), and you prefer monolithic
static systems, you just have to request that during the configure phase, so
this is not a problem.
Planning for a shared library will simply lead us to a much better design and
will force us to think in terms of reusable interfaces rather than
cut-paste-and-modify code.
This will also let us implement some unit testing, rather than relying only
on higher-level tests.
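For example, with a stock libtool-generated configure script the choice is
just a flag (these are the standard libtool options):

    # Build only static libraries (the monolithic option):
    ./configure --disable-shared --enable-static

    # Build the shared library as well (libtool's default):
    ./configure --enable-shared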

>     >> - Allow multi-layer tagging
> 
>     With this line I meant the possibility of tagging already annotated
>     data, so that it would be possible to annotate further information
>     (e.g. implementing shallow parsing on top of POS tagging).
> 
> 
> Oh yes, that was what I had understood. I think it is an important feature,
> considering the path NLP is on. Of course it is possible to just simulate
> it (I remember a recipe for NLTK that just ran multiple times, combining
> the previous text and tag into a new tag, and it worked pretty well), and
> thus I would not put it as a priority, but proper handling of multi-layer
> tags would be desirable.

Perfect.
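Just to make the idea concrete, a multi-layer token could be represented
roughly like this in C (a sketch only; all names are hypothetical):

    /* Hypothetical sketch: a token carrying several annotation layers,
     * e.g. layer 0 = POS tag, layer 1 = chunk label from shallow parsing. */
    typedef struct {
        const char *word;      /* surface form */
        const char **layers;   /* one tag string per annotation layer */
        size_t n_layers;       /* number of layers currently attached */
    } token_t;

Each new tagging pass would then append a layer instead of overwriting the
previous tag, which is exactly what the NLTK recipe above simulates by
concatenating text and tag.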

>     >> - Make the tagger generic
> 
>     Can you elaborate on this? (Maybe I lost track of the original context
>     of this point.)
> 
> 
> I don't know if Giulio meant the same thing I did, but I use "generic" in
> the sense that the tagging functions don't make any assumptions about the
> data they are processing, not even that it is natural language, or even
> textual. While I don't really believe in technical analysis, for example, I
> have seen some people "tagging" financial data to find bull and bear
> movements, and I have read about some genetic data processing that, in a
> way, could be called tagging. Which goes back to my ideas on unicode
> handling...

Ok, if those words are mine, I was referring to a different concept: the fact
that the interface should be general and not impose limits such as requiring
that the input come from a file.
On the other hand, I completely agree with you that it would be nice to let
people tag generic data. However, this will involve two changes:
1) we need to define a format that is capable of handling binary data. If we
want to keep compatibility with the cooked format, we must think about how to
properly escape spaces and newlines in the first place, and how to encode
binary strings.
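For the escaping part, something along these lines would be enough (a
minimal sketch; the backslash scheme is an assumption, not the actual cooked
format):

    /* Hypothetical sketch: escape the characters that a whitespace- and
     * line-oriented format cannot contain literally. */
    #include <stdio.h>

    static void escape_token(const char *s, FILE *out)
    {
        for (; *s != '\0'; s++) {
            switch (*s) {
            case ' ':  fputs("\\s",  out); break;  /* space   -> \s */
            case '\n': fputs("\\n",  out); break;  /* newline -> \n */
            case '\\': fputs("\\\\", out); break;  /* escape the escape */
            default:   fputc(*s, out);     break;
            }
        }
    }

Binary data could then be layered on top, e.g. by base64-encoding it before
escaping.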

>     >> - Start working on a more complex voting system, written in C or in
> a scripting language, intended for "actual" tagging, such as from the
> command line;
> 
>     I prefer C, so that it will be easier to use programmatically.
>     However, I am in favour of experimenting with a scripting program from
>     the command line, if it simplifies experimentation.
> 
> 
> At heart, I am a pythonist (and I am doing a lot of work in Python
> nowadays) and maybe a lisper, but I too would prefer a pure C
> implementation. As Ulrik is in favor of Python, too, I think we could write
> some helper scripts in Python, while still making sure that Python is not
> needed for essential tagging functions.

>     Another option (and the more I think about it, the more I am convinced
>     of it) is to identify the logical purpose of these strings and let
>     users of the taggers specify callbacks to achieve their purpose. In
>     this way there is no need for explicit UTF-8 support, and it will still
>     be possible to create correct UTF-8-aware taggers.
>     Eventually we should add libICU or UTF-8 support only for the
>     command-line tools, and even make it conditional.
> 
> 
> I guess this means that we agree, which is great. It still means we can
> offer some defaults when libICU is not available/desirable, at least for
> some alphabetic (or at least European) languages. It would be great to have
> someone using acopost for Arabic, Chinese...

Yes, we agree. :-) The only default we can reasonably try to implement
without libICU is latin1 support. With libICU it is possible to support all
the languages whose texts can be written in UTF-8. I agree that we can
provide an implementation using libICU and a replacement supporting the
whole range of latin1.
If possible, we should design the code so that a program can select between
them at run time, without touching the code.
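A minimal sketch of what such a callback interface could look like (all
names here are hypothetical, just to illustrate the shape of it):

    /* Hypothetical sketch: the tagger core never interprets characters
     * itself; the caller plugs in the functions that know the encoding. */
    #include <stddef.h>

    typedef struct {
        int  (*is_upper)(const char *s, size_t len); /* capitalization test */
        void (*to_lower)(char *s, size_t len);       /* in-place lowercasing */
    } charset_ops_t;

A libICU-backed implementation and a plain latin1 fallback would both fill in
the same struct, so a program can pick one at run time without the core code
knowing the difference.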

>     > If we need a scripting language for anything, Python would be my 
> preferred scripting language, since it is the one I know best.
> 
>     Apparently we are two against one here. I really dislike Python, which
>     has always created trouble for me.
>     I think it is quite difficult to develop a reliable script in Python
>     that can be trusted in environments different from the original
>     developer's.
>     Maybe I have been very unlucky in my experience with Python, but I keep
>     fighting against improper locale and charset handling, wrong automatic
>     assumptions about files (especially if redirection is involved), and
>     difficult handling of .pyc files when scripts are installed in a shared
>     location (where multiple computers with several architectures exist) or
>     when multiple Python versions are available on the system. I think
>     these issues are alleviated with Python 3, but I still do not feel
>     comfortable trusting this language.
>     On the other hand, I agree that Perl syntax is not very nice. My points
>     in favour of keeping Perl are: 1) it is already used, and unless we are
>     replacing all the tools we have in acopost, adding a script in a
>     different language will also add a new dependency; 2) its behaviour is
>     usually much more predictable than Python's (e.g., no automatic charset
>     conversion happens when writing to stdout or reading from stdin, so a
>     program cannot fail just because you are using a different terminal
>     emulator or a different locale, or are redirecting to a file); 3) Perl
>     is installed by default on Debian (even in a minimal installation), so
>     it adds no dependency there.
> 
> 
> For me, the problem is that I don't see any alternative. I once needed to
> handle utf-8 in Lua, and Lua is indeed C when it comes to strings: it
> doesn't really care and lets you shoot yourself in the foot. However, there
> are libraries that could be used, and maybe we could settle on Lua (which
> is easier to integrate into C and far lighter than Python).
> 
> As for Perl, it is in a development limbo, and since I have personally
> never used it, I couldn't contribute much. Other languages are too exotic
> to make sense in what we decided will be a pure-C system.

I think we should divide this into two cases:
1) embedded scripting in one of the programs we are developing, where I am in
favour of Lua, even if I do not know it very well;
2) scripts for other tools, where I agree we should avoid exotic tools. My
personal preference goes to a mixture of Bourne shell (avoiding Bash-only
syntax where possible) and Perl.

Part of the reason I do not like Python is that I always find myself running
into some issue whenever Python is involved.
I think this page gives a brief overview of why I do not think Python is a
good tool for text processing when a language different from English (or
rather, different from the language defined by the locale of the Python
programmer) is involved:

https://pythonhosted.org/kitchen/unicode-frustrations.html

I have seen all of these problems in real scenarios. I even worked on a
project where two Python scripts were affected by different issues of this
kind, so that it was not possible to use both of them in the same
environment.
It is possible to overcome these issues, and I have done so several times in
the past, but it is very difficult to do properly.

>     >> - Provide as many language models (including for textual 
> transformation) as possible;
> 
>     I am generally positive about it, but I am not sure if it is best to:
>     a) store the language models in the acopost sources;
>     b) store them in a separate repository;
>     c) store them in multiple separate repositories;
>     d) store the sources of the models as well, or not.
> 
>     What is your opinion? I personally do not like option a), but I am
>     afraid of the complexity of the other options.
>     I would really like a catalogue of tagged corpora that can be used for
>     POS tagging development.
> 
> 
> I am between A and B, closer to A. But given this is not a problem yet, I'd
> keep them in the current repository, moving them to their own when (and if)
> needed.

If you think there could ever be this need, then A is not a viable
possibility. Once you store a file in git, it is not possible to remove it
from the history, and every person cloning a repository is going to clone the
whole history. So, decisions B and C can be reverted; A is forever.
A hybrid approach could be to have a separate repository for the language
models and their sources, and have the language models imported automatically
into the sources using git submodules.
The sources would then keep only one link in their history, and people
retrieving acopost with git would still get everything.
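For reference, the setup could look roughly like this (the repository URL
and path are hypothetical):

    # Hypothetical: import a separate models repository as a submodule.
    git submodule add https://example.org/acopost-models.git models
    git commit -m "Import language models as a submodule"

    # People cloning acopost would fetch the models with:
    git clone --recursive <acopost-repo-url>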

However, I have to say that I have yet to figure out how to live peacefully
with git submodules.

>     For unit testing and other testing, I would like to have TAP output
>     (https://en.wikipedia.org/wiki/Test_Anything_Protocol), which is simple
>     to produce from any language, so that the test suite can easily be
>     created in mixed languages. If you agree with this idea, I can
>     configure autoconf to support it.
> 
> 
> I don't really know TAP, but I was thinking about something far simpler:
> just a sequence of asserts to test function calls, and a bogus corpus
> generator for stressing them (particularly in terms of memory allocation,
> combined with valgrind). Let's wait for Ulrik's opinion.

TAP is just a way to define the output of the tests. The main benefit is that
it is quite general and machine readable.
It is perfect for unit testing (e.g. to keep track of "just a sequence of
asserts to test function calls"), integration testing and regression testing.

Probably it is easiest if I set up an example, which we can revert later if
you do not like it.
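To give an idea of how little is needed, here is a minimal sketch of a C
unit test emitting TAP (the tested values are placeholders; automake's TAP
driver or Perl's prove can consume the output):

    /* Minimal TAP-emitting unit test in plain C; no framework required. */
    #include <stdio.h>
    #include <string.h>

    static int test_num = 0;

    static void check(int ok, const char *name)
    {
        printf("%sok %d - %s\n", ok ? "" : "not ", ++test_num, name);
    }

    int main(void)
    {
        printf("1..2\n");  /* the plan: this run contains two tests */
        check(strcmp("NN", "NN") == 0, "tag strings compare equal");
        check(1 + 1 == 2, "placeholder for a real tagger assertion");
        return 0;
    }

Running it prints "1..2" followed by one "ok"/"not ok" line per assertion,
which is all a TAP harness needs to see.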

Cheers,
        Giulio


