Giulio,
Indeed, I was wrong about your UTF-8 changes; I was confused.
More importantly, here is an attempt to sum everything up for version 2.0 by
extending your TODO:
- Move to GitHub and (maybe) gradually leave SF, if everyone agrees;
- (my suggestion) Decide for acopost to remain a pure-C system with no
true dependencies, preferably continuing as a collection of single
sources+headers using as few libraries as possible (acopost was born of
academic research and is useful to understand how tagging works, I'd love
to keep it that way);
- Allow multi-layer tagging
- Implement the tagging routines as a library (or as a collection of files
that can be statically linked as if it were a library), having some simple
wrapper executables (met, t3, etc.) that can be called from command line;
-- tokenizer/strtok for the wrappers (the tokenizer reverts its changes to
the input string when finished)
- Make the tagger generic
-- No library function should call exit (the library functions should not
deal with file reading and parsing);
-- Make main functions re-entrant and avoid global variables (extremely
important)
- Make CLIs uniform
-- Update usage display, so that every command help has the same layout
-- Update UserGuide according to changes in CLIs
- Start working on a more complex voting system, written in C or in a
scripting language, intended for "actual" tagging, such as from the command
line;
- Eventually replace parts of our library (hashes, memory allocation, etc.)
with alternatives, if suitable ones exist;
- Add some "modules" to our library, such as unit testing and (maybe) a
library for UTF-8 handling and manipulation (I might also try my hand at a
simple RNN) -- it is important that additions are in line with acopost, in
terms of simple sources, no reimplementations, etc.
-- substr
- Support UTF-8 -- Currently util.c, tbt.c, t3.c, met.c, et.c contain
hard-coded latin1 strings that are used to deal with ä, ö, ü, Ä, Ö, and Ü;
there is no support for other common latin1 symbols, nor for Unicode.
Complete UTF-8 support would be a good way to improve that code.
- Discuss a possible scripting language (the more I think about it, the
more I convince myself that Python scripts would be enough, with no actual
integration to the C code);
- Provide as many language models (including for textual transformation) as
possible;
- Along with unit testing, write modules (they don't have to be in C) to
stress test the taggers, creating large, bogus but natural-language-like
corpora to train and tag (this is much more a matter of technology, with
memory allocation, Unicode and the like, than of tagging performance).
- Remove segmentation faults (part of the unit testing and stress testing)
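
To make the library points above more concrete (no exit() in library code,
no globals, no file handling), here is a minimal sketch of the shape such an
API could take. All names (tagger_ctx, tagger_init, tagger_tag, the status
codes) are hypothetical and not taken from the current acopost sources, and
the "model" is a toy that assigns one fixed tag to every token:

```c
#include <stddef.h>

/* Hypothetical status codes: library functions report failures to the
   caller instead of calling exit(). */
typedef enum {
    TAGGER_OK = 0,
    TAGGER_ERR_ARG,     /* NULL or otherwise invalid argument */
    TAGGER_ERR_NOSPACE  /* output array too small */
} tagger_status;

/* All state lives in a caller-owned context rather than in globals, so
   several taggers can coexist in one process. */
typedef struct {
    const char *default_tag;  /* toy model: one tag for every token */
    size_t tokens_tagged;     /* per-context statistics */
} tagger_ctx;

tagger_status tagger_init(tagger_ctx *ctx, const char *default_tag)
{
    if (ctx == NULL || default_tag == NULL)
        return TAGGER_ERR_ARG;
    ctx->default_tag = default_tag;
    ctx->tokens_tagged = 0;
    return TAGGER_OK;
}

/* Tags already-tokenized input; reading and parsing files is left to the
   wrapper executable. */
tagger_status tagger_tag(tagger_ctx *ctx, const char *const *tokens,
                         size_t n_tokens, const char **tags_out,
                         size_t tags_cap)
{
    size_t i;

    if (ctx == NULL || tokens == NULL || tags_out == NULL)
        return TAGGER_ERR_ARG;
    if (tags_cap < n_tokens)
        return TAGGER_ERR_NOSPACE;
    for (i = 0; i < n_tokens; i++)
        tags_out[i] = ctx->default_tag;
    ctx->tokens_tagged += n_tokens;
    return TAGGER_OK;
}
```

A wrapper executable such as t3 or met would then do the file reading and
tokenizing, call these functions, and decide for itself whether an error is
fatal.
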
A lot of work, to be sure, but it would make acopost one of the best
alternatives out there.
Best regards,
Tiago
2016-02-08 22:36 GMT-02:00 Giulio Paci <giuliop...@gmail.com>:
>
> On 09/02/2016 00:17, Tiago Tresoldi wrote:
> > > - do you think it is a good idea to move from SourceForge? I guess
> > > the best alternative would be GitHub, keeping the homepage and this
> > > mailing list for the time being.
> > As long as git is used, [...] I am perfectly fine with it.
> > Ok, two votes, and I share your opinion on pull-requests. There is
> > absolutely no point in moving from git, but I was more inclined to
> > leaving SF (now that they are even involved in some shady business,
> > like appending their wares to Windows installers) than actually
> > favoring GitHub.
> >
> > Anyway, two votes, but I would only be comfortable with moving from SF
> > if everyone agrees. Ulrik, please feel free to say you prefer not to,
> > if that is the case.
>
>
>
> > > - we have already discussed the "voting system", where multiple
> > > taggers are run, their results collected and the best one is selected
> > > ("voting" is not the best description, as there could be neural
> > > networks, hard-coded rules, etc.); do you think it is best to code it
> > > in C or should we explore some alternative, like Lua? While I'd like
> > > the flexibility of a scripting language (for the users, too), I am not
> > > very favorable to the idea of embedding a full language or having
> > > that kind of dependency. I believe that being a pure-C, no-dependency
> > > system is one of the strengths of acopost.
> >
> > I would prefer to stick to pure-C, if it is going to be a usable tagger
> > or a framework.
> > My main goal is to have a library that I can use in a program, and
> > maintaining a C-library is much more convenient than maintaining
> > anything else (e.g., ABI compatibility is easy to assess, there is no
> > runtime that must be loaded and can interfere with other libraries,
> > memory usage is predictable, ...).
> >
> >
> > I understand. My idea was to leave the "core" in pure-C, with only
> > perfectly reasonable dependencies (standard headers, posix, etc.);
> > however, I still think that it might be interesting to have a) some
> > basic unit testing
>
> I agree, I also would like to see them.
>
> > and b) some default scripting, so that we don't depend on a shell
> > (while bash is pretty much a default choice nowadays), Python or
> > Perl 5.
>
> I do not think I would like to see any "outside" script with an inside
> scripting.
> Anyway, in my mind there is a clear distinction between programs and
> library. The library should stick to C and try to harm its users as
> little as possible (including a scripting language may easily become
> harmful); the programs can do whatever they want.
>
> > One thing to note is that, thanks to Ingo's original goal and
> > resources, acopost has the quality of being a good C-project, in the
> > sense that it takes care of everything by itself (not just being
> > "self-contained") and incorporates the bare minimum to work. If it were
> > in GitHub, for example, I suspect people would end up copying hash.c
> > just to have a simple, single source C hash implementation that does
> > not involve dependencies, libraries, etc.
>
> Probably you are right, but there are also solutions like
> https://github.com/fragglet/c-algorithms that could be used.
> I would see benefits in trying to merge acopost hash.c (if better than
> what is already in c-algorithms) there and use that c-algorithms in
> acopost: the code will receive much more attention and testing. If a
> person uses that code and a bug is found or an improvement is
> implemented, it could be updated in a central place...
>
> > In a way, this is why I'd favor some simple internal scripting for
> > the voting system: such a system is not essential for tagging, and thus
> > would be a complement to acopost -- if necessary, people could rewrite
> > it in C and incorporate it. However, I still don't know what
> > language/implementation I would support: Lua is the closest to what I'd
> > like, but it is still far too big. At the same time, I wouldn't like
> > "yet another" poor implementation of a Lisp that serves no one and only
> > bloats the project.
>
> If I had to vote, I would probably pick Lua as well, as it is rather
> intuitive for imperative-oriented programmers.
> But, in general, I would avoid it unless it is really needed for a strong
> use case (i.e., something stronger than replacing bash, perl or python)
> that is really difficult to implement in C.
>
> > On the other hand, I think that we should at least think about
> > introducing some dependency or carefully plan alternatives.
> > I understand that it is easier to bootstrap a system without any
> > dependency, but I would like to see UTF-8 support properly implemented
> > and it is very, very difficult to implement properly, while libICU
> > (http://site.icu-project.org/) is already there (although it is
> > suboptimal that internally it is using UTF-16).
> > Another option, which will probably be more appropriate for a library,
> > is to avoid the "lowercase" transformation and implement a
> > normalization mechanism based on callbacks.
> > In this way the burden of dealing with UTF-8 is delegated to the
> > library user and we do not need to add a libICU dependency.
> >
> >
> > I was looking at your Unicode work earlier.
>
> The only work that I have done in acopost up to now is fixing the
> encoding of the source files. I have not yet tried to implement anything
> for Unicode support.
>
> > I have to confess that I am a bit torn. Proper Unicode handling is
> > needed, no doubt -- in fact, I had this problem when I started working
> > with acopost for my "tesi" at the Università di Pisa, and handled the
> > text with a poor Python script before and after tagging it.
>
> This is a suboptimal solution, dealing only with the minor part of the
> UTF-8 problem, that is, charset encoding.
>
> > Acopost is a collection of taggers, however, and I wonder if this kind
> > of handling isn't something that should be up to the user,
>
> If we can let users use proper support, for whatever they prefer, I
> would favor letting the user deal with it.
> However, this is not something that can be done with a simple script: the
> user will need to provide some specific callbacks.
>
> > with the system providing some helping tools: shouldn't we decide for
> > a default (utf-8) and give tools for textual transformations (with
> > defaults and examples, such as the language models)?
>
> The issue I am referring to is the fact that Unicode defines several
> functions to perform lower/upper/case-insensitive transformations (that
> are language and context dependent) and that there are four equivalent
> representations for many strings, that should be interpreted as the same
> string.
> As far as I know, in acopost there is only an attempt to simulate
> case-insensitive collation by using lower case in dictionaries and a few
> other parts.
> If this is true, the only thing that we need in order to let library
> users have proper Unicode support is to add a "normalization" callback
> that is used where lowercase is currently used. This would also allow
> further normalization (e.g., removing diacritics).
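
A rough sketch of what such a callback could look like, assuming
hypothetical names (normalize_fn, normalize_ascii_lower, lexicon_find) that
do not exist in acopost today. The default normalizer only touches ASCII and
copies UTF-8 multi-byte sequences through unchanged; anything smarter would
be supplied by the library user:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical callback type: writes a normalized copy of `in` into `out`
   (capacity `cap`, including the terminating NUL). Returns 0 on success
   or -1 if the result does not fit. */
typedef int (*normalize_fn)(const char *in, char *out, size_t cap,
                            void *user_data);

/* Default normalizer: lowercases ASCII letters only. Bytes >= 0x80
   (parts of UTF-8 multi-byte sequences) are copied untouched. */
int normalize_ascii_lower(const char *in, char *out, size_t cap,
                          void *user_data)
{
    size_t i, len = strlen(in);

    (void)user_data;  /* unused by this normalizer */
    if (len + 1 > cap)
        return -1;
    for (i = 0; i < len; i++) {
        unsigned char c = (unsigned char)in[i];
        out[i] = (c >= 'A' && c <= 'Z') ? (char)(c + ('a' - 'A')) : in[i];
    }
    out[len] = '\0';
    return 0;
}

/* Toy lexicon lookup: keys are assumed to be stored already normalized.
   Returns the index of the matching key, or -1 if there is none (or the
   word does not fit the internal buffer). */
int lexicon_find(const char *const *keys, size_t n_keys, const char *word,
                 normalize_fn normalize, void *user_data)
{
    char buf[256];
    size_t i;

    if (normalize(word, buf, sizeof buf, user_data) != 0)
        return -1;
    for (i = 0; i < n_keys; i++)
        if (strcmp(keys[i], buf) == 0)
            return (int)i;
    return -1;
}
```

The user_data pointer is there so that a caller can hand its own state (a
libICU handle, a locale, a table) to the callback without the library
knowing anything about it.
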
>
> > If I remember correctly, Ulrik used acopost for pos-tagging of Ancient
> > Greek (and I still think it would be a great tool to start extending
> > the Project Perseus treebank), he probably has something to add from
> > direct experience.
>
> In Ancient Greek you will encounter the lower/upper/case-insensitive
> issue, as SIGMA has one uppercase form and two lowercase forms.
>
> Turkish has different rules for i: the uppercase of i is İ instead of I,
> while the lowercase of I is ı instead of i.
>
> In German the uppercase of muß is MUSS. The lowercase of MUSS can be both
> muß and muss.
>
> These are just a few examples, but they explain quite well why Unicode
> explicitly specifies a case-insensitive form.
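
These cases are why a byte-wise or one-to-one character mapping cannot be
enough. As an illustration only, here is a hypothetical user-side normalizer
(fold_example is an invented name) that works directly on UTF-8 bytes and
folds ß to ss and final sigma to σ, so that muß and MUSS end up as the same
key. It deliberately ignores the Turkish case, which depends on the language
and cannot be decided from the bytes alone:

```c
#include <stddef.h>

/* Hypothetical normalizer working on raw UTF-8 bytes:
   - ASCII A-Z becomes a-z
   - "ß" (U+00DF, bytes C3 9F) becomes "ss"
   - final sigma "ς" (U+03C2, bytes CF 82) becomes "σ" (U+03C3, CF 83)
   Everything else is copied unchanged. Returns 0 on success, -1 if the
   result does not fit in `out` (capacity `cap`, including the NUL). */
int fold_example(const char *in, char *out, size_t cap)
{
    const unsigned char *p = (const unsigned char *)in;
    size_t o = 0;

    if (cap == 0)
        return -1;
    while (*p != '\0') {
        unsigned char b0, b1 = 0;
        size_t n_in = 1, n_out = 1;

        if (p[0] == 0xC3 && p[1] == 0x9F) {          /* ß -> ss */
            b0 = 's'; b1 = 's'; n_in = 2; n_out = 2;
        } else if (p[0] == 0xCF && p[1] == 0x82) {   /* ς -> σ */
            b0 = 0xCF; b1 = 0x83; n_in = 2; n_out = 2;
        } else if (p[0] >= 'A' && p[0] <= 'Z') {     /* ASCII uppercase */
            b0 = (unsigned char)(p[0] + ('a' - 'A'));
        } else {                                     /* copy as is */
            b0 = p[0];
        }
        if (o + n_out + 1 > cap)  /* keep room for the terminating NUL */
            return -1;
        out[o++] = (char)b0;
        if (n_out == 2)
            out[o++] = (char)b1;
        p += n_in;
    }
    out[o] = '\0';
    return 0;
}
```

Note that one input character (ß) becomes two output characters here, which
is exactly what a per-character tolower() replacement cannot express.
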
>
> Anyway, the more I think about it, the more I am convinced that we
> should abstract these transformations away from the core library and
> clearly define an interface to use whatever is needed by the user.
>
> Cheers,
> Giulio.
>
>
>
> > > 2016-02-04 23:23 GMT-02:00 Giulio Paci <giuliop...@gmail.com>:
> > >
> > > Hi to all!
> > >
> > > On 01/02/2016 21:27, Giulio Paci wrote:
> > > > On 01 Feb 2016 21:12, "Ulrik Sandborg-Petersen"
> > > > <ulr...@scripturesys.com> wrote:
> > > >> Secondly, I think the proposed changes to the options are good,
> > > >> so from me it is a "go ahead" on the options.
> > > >
> > > > Perfect.
> > > > @Tiago: any opinion?
> > >
> > > I finally decided to push my current master branch.
> > > The work is not complete and there are a few bugs, but as I do not
> > > know if I will be able to update the code again in the near future, I
> > > prefer to share it and tell you what should be fixed.
> > >
> > > 1) Updating the CLIs of commands, I decided that the same option
> > > should be associated with almost the same meaning in every command.
> > > So I had to rename several of them in several commands. However, I
> > > have not updated the documentation yet.
> > >
> > > Here follows the list of CLI changes, so that it is possible to
> > > update old command lines:
> > >
> > > acopost-et, acopost-t3, acopost-tbt, acopost-met:
> > > stricter separation between options and other parameters
> > > collapsing multiple options into one is no longer supported
> > > -- can be used to explicitly terminate options
> > > acopost-et:
> > > added -h
> > > -t => -o test
> > > lexiconfile => -l lexiconfile
> > > acopost-tbt:
> > > added -h
> > > -r => -R
> > > -n => -r
> > > -o accepts tag, test and train in addition to 0, 1 and 2
> > > acopost-t3
> > > -u => -Z
> > > lexiconfile => -l lexiconfile
> > > -q => -v 0
> > > -t => -o test
> > > -d => -o debug
> > > -m => -o
> > > -l => -L
> > > acopost-met:
> > > -c => -o
> > > -s => -C
> > > -m => -P
> > > -t => -M
> > > -p => -K
> > > acopost-lex2theta:
> > > added -h
> > > added -r <int>
> > > lexiconfile => -l lexiconfile
> > > acopost-complementary-rate:
> > > -q => -v 0
> > > acopost-evaluate:
> > > -v => -v 1 (this is the default now)
> > > -i => -C
> > > acopost-split-corpus:
> > > -v => -v 1 (this is the default now)
> > > -m => -k
> > > -p => -F
> > > acopost-cooked2fntbl:
> > > -v => -v 1 (this is the default now)
> > > acopost-interchange-matrix:
> > > -q => -v 0
> > > acopost-mean-and-sd:
> > > -s => -D
> > > acopost-cooked2wtree:
> > > -e => -X
> > > -i => -I
> > > -d => -o debug
> > > -a => -A
> > > -b => -B
> > >
> > > 2) I introduced a BUG in met that breaks the tagging functionality
> > > (in viterbi mode). The BUG was introduced after
> > > ce310074f1b194f192cec0cf4822bb8ec7b87e78 (I checked, and that commit
> > > produces much more reasonable results) and is probably related to the
> > > lowercase function replacement;
> > >
> > > 3) Using the -n option on met gives a segmentation fault in my
> > > environment. I have not yet investigated the cause.
> > >
> > > Bests,
> > > Giulio
> > >
> > >
------------------------------------------------------------------------------
> > > Site24x7 APM Insight: Get Deep Visibility into Application
Performance
> > > APM + Mobile APM + RUM: Monitor 3 App instances at just
$35/Month
> > > Monitor end-to-end web transactions and take corrective
actions now
> > > Troubleshoot faster and improve end-user experience. Signup
Now!
> > >
http://pubads.g.doubleclick.net/gampad/clk?id=272487151&iu=/4140
> > > _______________________________________________
> > > acopost-devel mailing list
> > > acopost-devel@lists.sourceforge.net
> > > https://lists.sourceforge.net/lists/listinfo/acopost-devel
> > >