Hello all,
> I created a GitHub organization for acopost and uploaded the source
> repository there.
>
Great, thank you. We can migrate (and update!) the homepage later, and for
the time being keep the discussion here.
> > I like the idea of a pure C implementation.
>
> I fully agree. So I think the decision is a pure-C implementation without
> dependencies, to be rediscussed when we feel an urgent need to add a
> dependency.
> My main candidate is libICU at the moment, but I will be happy to avoid it
> as much as possible, as long as this does not prevent implementing a
> proper solution for a given task.
>
Regarding dependencies, when we agree that they make sense for
performance/usability (such as for Unicode handling), I have no problem
with using them as long as we keep a pure-C fallback. For example, if we
decide on libICU or something else, I have no problem with the code using
it as long as the current lowering function remains available and is
called when the library is not installed or I don't want to use it.
Regarding libICU, it seems a perfectly reasonable library for Unicode
handling, but given that textual transformations are somewhat ad hoc and
may involve more than work on the encoding (they can be integrated with a
stemmer, for example), I am still not entirely sure about it from an
architectural point of view. Maybe we should just expect that the tagging
functions (that is, the ones to be "isolated" in the libraries) receive
properly normalized data? An actual tagger (i.e., the executable) calling
libICU could be provided as an example, so that the tagging functions are
completely general (and, thus, could be used for tagging other than POS).
> Here I would prefer a shared library. I have experience with shared and
> static libraries on Linux, Windows and Mac OSX, so I think it would not be
> too difficult to maintain a
> library, once we are ready. I also have some experience with SWIG, so that
> eventually the library will be usable from other languages, if needed.
>
Here I am very biased, partly because I never truly worked on a shared
library, partly because I tend to prefer monolithic, static systems. I can,
however, understand why it could be desirable, and don't have any real
objections, especially if it means we can isolate the tagging itself from
the textual manipulation (as above).
> >> - Allow multi-layer tagging
>
> With this line I meant the possibility to tag already annotated data, so
> that it would be possible to annotate further information (e.g.: implement
> shallow parsing on top
> of POS tagging).
>
Oh yes, that was what I had understood. I think it is an important feature,
considering the path NLP is on. Of course, it is possible to just simulate
it (I remember a recipe for NLTK that just ran multiple times, combining
the previous text and tag into a new tag, and it worked pretty well), and
thus I would not make it a high priority, but proper handling of
multi-layer tags would be desirable.
> >> - Make the tagger generic
>
> Can you elaborate on this? (Maybe I lost connection with the original
> context of this point)
>
I don't know if Giulio meant the same as I did, but I use "generic" in the
sense that the tagging functions don't make any assumptions about the data
they are processing, not even that it is natural language, or even textual.
While I don't really believe in technical analysis, for example, I have
seen some people "tagging" financial data to find bull and bear movements,
and have read about some genetic data processing that, in a way, could be
called tagging. Which goes back to my ideas on Unicode handling...
> >> - make CLIs uniform
>
> I consider this point almost complete, but some review and bug fixing are
> needed.
>
And thus we are closer to 2.0, thanks to you. :)
> >> - Start working in a more complex voting system, written in C or in a
> scripting language, intended for "actual" tagging, such as from command
> line;
>
> I prefer C, so that it will be easier to use programmatically.
> However, I am in favour of experimentation using a scripting program from
> the command line, if it simplifies experimentation.
>
At heart, I am a Pythonist (and I am doing a lot of work in Python
nowadays) and maybe a Lisper, but I too would prefer a pure-C
implementation. As Ulrik is in favor of Python, too, I think we could write
some helper scripts in Python, while still making sure that Python is not
needed for essential tagging functions.
> Another option (and the more I think about it, the more I am convinced
> about it) is to identify the logical purpose of these strings and let
> users of the taggers specify
> callbacks to achieve that purpose. In this way there is no need for
> explicit support for UTF-8, and still it will be possible to create
> correct UTF-8-aware taggers.
> Eventually we should add libICU or UTF-8 support only for the command-line
> tools, and even make it conditional.
>
I guess this means that we agree, which is great. It still means we can
offer some defaults when libICU is not available/desirable, at least for
some alphabetic (or at least European) languages. It would be great to have
someone using acopost for Arabic, Chinese...
> > If we need a scripting language for anything, Python would be my
> preferred scripting language, since it is the one I know best.
>
> Apparently we are two against one here. I really dislike Python, which
> has always caused me trouble.
> I think it is quite difficult to develop a reliable script in Python that
> can be trusted on environments different from the original developer's.
> Maybe I have been very unlucky in my experience with Python, but I keep
> fighting against improper locale and charset handling, wrong automatic
> assumptions about files
> (especially if redirection is involved) and difficult handling of .pyc
> files when scripts are installed in a shared location (where multiple
> computers with several
> architectures exist) or when multiple Python versions are available on the
> system. I think these issues are alleviated in Python 3, but I still do
> not feel comfortable
> trusting this language.
> On the other hand, I agree that Perl syntax is not very nice. My points
> in favour of keeping Perl are: 1) it is already used and, unless we are
> replacing all the tools we have in
> acopost, adding a script in a different language will also add a new
> dependency; 2) its behaviour is usually much more predictable than
> Python's (e.g., no automatic charset
> conversion happens when writing to stdout or reading from stdin, so a
> program cannot fail just because you are using a different terminal
> emulator or a different locale, or are redirecting to a file); 3) Perl is
> installed by default on Debian (even in a minimal installation), so it
> adds no dependency there.
>
For me, the problem is that I don't see any alternative. I once needed to
handle UTF-8 in Lua, and Lua is indeed C when it comes to strings: it
doesn't really care and lets you shoot yourself in the foot. However, there
are libraries that could be used, and maybe we could decide on Lua (which
is easier to integrate into C and far lighter than Python).
Perl is in a development limbo, and since I personally have never used it,
I couldn't contribute much. Other languages are too exotic to make sense in
what we decided should be a pure-C system.
> >> - Provide as many language models (including for textual
> transformation) as possible;
>
> I am generally positive about it, but I am not sure if it is best to:
> a) store the language models in the acopost sources;
> b) store them in a separate repository;
> c) store them in multiple separate repositories;
> d) store sources of the models as well or not.
>
> What is your opinion? I personally do not like option a), but am afraid of
> the complexity of other options.
> I would really like a catalogue of tagged corpora that can be used for POS
> tagging development.
>
I am between a) and b), closer to a). But given this is not a problem yet,
I'd keep them in the current repository, moving them to their own when (and
if) needed.
> For unit testing and other testing, I would like to have tap output (
> https://en.wikipedia.org/wiki/Test_Anything_Protocol), that is simple to
> produce from any language, so
> that the test suite can easily be created in mixed languages. If you agree
> with this idea, I can configure autoconf to support it.
>
I don't really know TAP, but I was thinking about something far simpler:
just a sequence of asserts to test function calls, and a bogus-corpus
generator for stressing them (particularly in terms of memory allocation,
combined with Valgrind). Let's wait for Ulrik's opinion.
Best,
Tiago
_______________________________________________
acopost-devel mailing list
acopost-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/acopost-devel