Giulio,
indeed, I was wrong about your UTF-8 changes; I was confused.
More importantly, here is an attempt to sum everything up for version 2.0
by extending your TODO:
- Move to GitHub and (maybe) gradually leave SF, if everyone agrees;
- (my suggestion) Decide that acopost stays a pure-C system with no true
dependencies, preferably continuing as a collection of single
sources+headers using as few libraries as possible (acopost was born of
academic research and is useful to understand how tagging works, I'd love
to keep it that way);
- Allow multi-layer tagging
- Implement the tagging routines as a library (or as a collection of
files that can be statically linked as if it were a library), having
some simple wrapper executables (met, t3, etc.) that can be called
from command line;
-- tokenizer/strtok for the wrappers (the tokenizer reverts its changes to
the input string when finished)
- Make the tagger generic
-- No library function should call exit (the library functions should
not deal with file reading and parsing);
-- Make main functions re-entrant and avoid global variables
(extremely important)
- Make CLIs uniform
-- Update usage display, so that every command's help has the same layout
-- Update UserGuide according to changes in CLIs
- Start working on a more complex voting system, written in C or in a
scripting language, intended for "actual" tagging, such as from the
command line;
- Eventually replace parts of our library (hashes, memory allocation,
etc.) with alternatives, if suitable ones exist;
- Add some "modules" to our library, such as unit testing and (maybe) a
library for UTF-8 handling and manipulation (I might also try my hand at
a simple RNN) -- it is important that additions are in line with acopost,
in terms of simple sources, no reimplementations, etc.
-- substr
- Support UTF-8 -- Currently util.c, tbt.c, t3.c, met.c, et.c contain
hard-coded latin1 strings that are used to deal with ä, ö, ü, Ä, Ö, and
Ü; there is no support for other common latin1 symbols, nor for Unicode.
Complete UTF-8 support would be a good option to improve that code.
- Discuss a possible scripting language (the more I think about it,
the more I convince myself that Python scripts would be enough, with
no actual integration to the C code);
- Provide as many language models (including for textual
transformation) as possible;
- Along with unit testing, write modules (they don't have to be in C) to
stress test the taggers, creating large, bogus but natural-language-like
corpora to train and tag (it is much more a matter of technology, with
memory allocation, Unicode and the like, than of tagging performance).
- Remove segmentation faults (part of the unit testing and stress testing)
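
To make the "no exit, no globals" items above concrete, here is a minimal
sketch of what a re-entrant library entry point could look like. Every
name in it (tag_status, tagger_ctx, tagger_new, tagger_feed, tagger_free)
is hypothetical and does not exist in acopost; the "tagging" is only a
token counter standing in for the real routines.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical status codes; none of these names exist in acopost today. */
typedef enum { TAG_OK = 0, TAG_ERR_BADARG = 1 } tag_status;

/* All mutable state lives in a caller-owned context, not in globals,
   so two contexts can be used side by side. */
typedef struct {
    size_t tokens_seen;
    char last_error[64];
} tagger_ctx;

/* Returns NULL on allocation failure instead of calling exit(). */
tagger_ctx *tagger_new(void) {
    return calloc(1, sizeof(tagger_ctx));
}

void tagger_free(tagger_ctx *ctx) {
    free(ctx);
}

/* Stand-in for a tagging routine: it only counts tokens. Errors are
   reported through the return value and ctx->last_error. */
tag_status tagger_feed(tagger_ctx *ctx, const char *const *tokens, size_t n) {
    if (ctx == NULL)
        return TAG_ERR_BADARG;
    if (tokens == NULL && n > 0) {
        snprintf(ctx->last_error, sizeof ctx->last_error,
                 "token array is NULL but n is %lu", (unsigned long)n);
        return TAG_ERR_BADARG;
    }
    ctx->tokens_seen += n;
    return TAG_OK;
}
```

A wrapper executable would create one context, feed it the parsed input,
and decide by itself whether a non-zero status should end the process; the
library never takes that decision for its caller.
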
A lot of work, to be sure, but it would make acopost one of the best
alternatives out there.
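
On the tokenizer item in the list: the idea of a tokenizer that leaves the
input string untouched once it is done can be sketched as below. This is
not the code currently in acopost, only an illustration under my own
assumptions; restoring_tokenizer and the rtok_* functions are made-up
names. Unlike strtok, it keeps its state in a struct and puts back every
delimiter it temporarily overwrote.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical tokenizer state; nothing here is taken from acopost. */
typedef struct {
    char *next;          /* where scanning resumes */
    char *patched;       /* byte currently overwritten with '\0', or NULL */
    char saved;          /* original character at *patched */
    const char *delims;  /* set of delimiter characters */
} restoring_tokenizer;

void rtok_init(restoring_tokenizer *t, char *input, const char *delims) {
    t->next = input;
    t->patched = NULL;
    t->saved = '\0';
    t->delims = delims;
}

/* Undo the last temporary '\0', if there is one. */
static void rtok_restore(restoring_tokenizer *t) {
    if (t->patched != NULL) {
        *t->patched = t->saved;
        t->patched = NULL;
    }
}

/* Returns the next token as a NUL-terminated string inside the input
   buffer, or NULL when no tokens are left. The previous token's
   terminator is restored before the next one is cut out. */
char *rtok_next(restoring_tokenizer *t) {
    char *start, *end;

    rtok_restore(t);
    start = t->next + strspn(t->next, t->delims);
    if (*start == '\0') {
        t->next = start;
        return NULL;
    }
    end = start + strcspn(start, t->delims);
    if (*end != '\0') {
        t->patched = end;
        t->saved = *end;
        *end = '\0';
        t->next = end + 1;
    } else {
        t->next = end;
    }
    return start;
}

/* Call when done (or when stopping early) to leave the input unchanged. */
void rtok_finish(restoring_tokenizer *t) {
    rtok_restore(t);
}
```

A token returned by rtok_next is only valid until the next call, so a
caller that needs to keep it has to copy it first.
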
Best regards,
Tiago
2016-02-08 22:36 GMT-02:00 Giulio Paci <giuliop...@gmail.com>:
>
> On 09/02/2016 00:17, Tiago Tresoldi wrote:
> > > - do you think it is a good idea to move from SourceForge? I guess
> > > the best alternative would be GitHub, keeping the homepage and this
> > > mailing list for the time being.
> > As long as git is used, [...] I am perfectly fine with it.
> > Ok, two votes, and I share your opinion on pull-requests. There is
> > absolutely no point in moving from git, but I was more inclined to
> > leave SF (now that they are even involved in some shady business, like
> > appending their wares to Windows installers) than actually favoring
> > GitHub.
> >
> > Anyway, two votes, but I would only be comfortable with moving from SF
> > if everyone agrees. Ulrik, please feel free to say you prefer not to,
> > if that is the case.
>
>
>
> > > - we have already discussed the "voting system", where multiple
> > > taggers are run, their results collected and the best one is selected
> > > ("voting" is not the best description, as there could be neural
> > > networks, hard-coded rules, etc.); do you think it is best to code it
> > > in C, or should we explore some alternative, like Lua? While I'd like
> > > the flexibility of a scripting language (for the users, too), I am
> > > not very favorable to the idea of embedding a full language or having
> > > that kind of dependency. I believe that being a pure-C, no-dependency
> > > system is one of the strengths of acopost.
> >
> > I would prefer to stick to pure-C, if it is going to be a usable
> > tagger or a framework.
> > My main goal is to have a library that I can use in a program and
> > maintaining a C-library is much more convenient than maintaining
> > anything else (e.g., ABI compatibility is easy to assess, there is no
> > runtime that must be loaded and can interfere with other libraries,
> > memory usage is predictable, ...).
> >
> >
> > I understand. My idea was to leave the "core" in pure-C, with only
> > perfectly reasonable dependencies (standard headers, posix, etc.);
> > however, I still think that it might be interesting to have a) some
> > basic unit testing
>
> I agree, I also would like to see them.
>
> > and b) some default scripting, so that we don't depend on a shell
> > (while bash is pretty much a default choice nowadays), Python or
> > Perl 5.
>
> I do not think I would like to see any "outside" script with an inside
> scripting.
> Anyway in my mind there is a clear distinction between programs and
> library. The library should stick to C and try, as much as possible,
> not to harm its users (including a scripting language may easily become
> harmful); the programs can do whatever they want.
>
> > One thing to note is that, thanks to Ingo's original goal and
> > resources, acopost has the quality of being a good C-project, in the
> > sense that it takes care of everything by itself (not just being
> > "self-contained") and incorporates the bare minimum to work. If it
> > were in GitHub, for example, I suspect people would end up copying
> > hash.c just to have a simple, single source C hash implementation that
> > does not involve dependencies, libraries, etc.
>
> Probably you are right, but there are also solutions like
> https://github.com/fragglet/c-algorithms that could be used.
> I would see benefits in trying to merge acopost hash.c (if better than
> what is already in c-algorithms) there and use that c-algorithms in
> acopost: the code will receive much more attention and testing. If a
> person uses that code and a bug is found or an improvement is
> implemented, it could be updated in a central place...
>
> > In a way, this is why I'd favor some simple internal scripting for the
> > voting system: such a system is not essential for tagging, and thus
> > would be a complement to acopost -- if necessary, people could rewrite
> > it in C and could incorporate it. However, I still don't know what
> > language/implementation I would support: Lua is the closest to what
> > I'd like, but it is still far too big. At the same time, I wouldn't
> > like "yet another" poor implementation of a Lisp that serves no one
> > and only bloats the project.
>
> If I had to vote, I would probably pick Lua as well, as it is rather
> intuitive for imperative-oriented programmers.
> But, in general, I would avoid it if it is not really needed for a
> strong use case (i.e., something stronger than replacing bash, perl or
> python) that is really difficult to implement in C.
>
> > On the other hand, I think that we should at least think about
> > introducing some dependency or carefully plan alternatives.
> > I understand that it is easier to bootstrap a system without any
> > dependency, but I would like to see UTF-8 support properly implemented
> > and it is very, very difficult to implement properly, while libICU
> > (http://site.icu-project.org/) is already there (although it is
> > suboptimal that internally it is using UTF-16).
> > Another option, that will probably be more appropriate for a library,
> > is to avoid the "lowercase" transformation and implement a
> > normalization mechanism based on callbacks.
> > In this way the burden of dealing with UTF-8 is delegated to the
> > library user and we do not need to add a libICU dependency.
> >
> >
> > I was looking at your Unicode work earlier.
>
> The only work that I have done in acopost up to now is just fixing the
> encoding of the source files. I have not yet tried to implement anything
> related to Unicode support.
>
> > I have to confess that I am a bit torn. Proper Unicode handling is
> > needed, no doubt -- in fact, I had this problem when I started working
> > with acopost for my "tesi" at the Università di Pisa, and handled the
> > text with a poor Python script before and after tagging it.
>
> This is a suboptimal solution, dealing only with the minor problem of
> UTF-8, namely charset encoding.
>
> > Acopost is a collection of taggers, however, and I wonder if this kind
> > of handling isn't something that should be up to the user,
>
> If we can let users use proper support, of whatever kind they prefer, I
> would favor letting the user deal with it.
> However this is not something that can be done with a simple script: the
> user will need to provide some specific callbacks.
>
> > with the system providing some helping tools: shouldn't we decide for
> > a default (utf-8) and give tools for textual transformations (with
> > defaults and examples, such as the language models)?
>
> The issue I am referring to is the fact that Unicode defines several
> functions to perform lower/upper/case-insensitive transformations (that
> are language and context dependent) and that many strings have several
> equivalent representations (Unicode defines four normalization forms)
> that should be interpreted as the same string.
> As far as I know, in acopost there is only an attempt to simulate
> case-insensitive collation by using lower case in dictionaries and a few
> other parts.
> If this is true, the only thing that we need to let library users have
> proper Unicode support is to add a "normalization" callback that is used
> where lowercase is currently used. This would also allow further
> normalization (e.g., removing diacritics).
>
> > If I remember correctly, Ulrik used acopost for pos-tagging of Ancient
> > Greek (and I still think it would be a great tool to start extending
> > the Project Perseus treebank), he probably has something to add from
> > direct experience.
>
> In Ancient Greek you will encounter the lower/upper/case-insensitive
> issue, as SIGMA has one uppercase form and two lowercase forms.
>
> Turkish has different rules: the uppercase of i is İ instead of I, while
> the lowercase of I is ı instead of i.
>
> In German the uppercase of muß is MUSS. The lowercase of MUSS can be
> both muß and muss.
>
> These are just a few examples, but they explain quite well why Unicode
> explicitly specifies a case-insensitive form.
>
> Anyway, the more I think about it, the more I am convinced that we
> should abstract these transformations away from the core library and
> clearly define an interface to use whatever is needed by the user.
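
A minimal sketch of what such a normalization interface could look like,
purely as an assumption for discussion: normalize_fn, normalize_ascii_lower
and keys_match are hypothetical names, and the fixed 128-byte buffers are
an arbitrary choice. The default callback only folds ASCII letters and
passes every other byte (including UTF-8 multi-byte sequences) through
unchanged; a user who needs Greek, Turkish or German rules would plug in
their own function, possibly backed by an external library.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical callback type: writes a normalized copy of `in` into `out`
   (capacity `cap`, terminator included). Returns 0 on success, -1 if the
   result does not fit. `user` carries caller-defined state. */
typedef int (*normalize_fn)(const char *in, char *out, size_t cap, void *user);

/* Default callback: lowercases A-Z only, independent of the C locale.
   Bytes >= 0x80 are copied as they are, so UTF-8 text is not corrupted
   (but not case-folded either). */
int normalize_ascii_lower(const char *in, char *out, size_t cap, void *user) {
    size_t i, n = strlen(in);

    (void)user; /* unused in this default implementation */
    if (n + 1 > cap)
        return -1;
    for (i = 0; i < n; i++) {
        char c = in[i];
        out[i] = (c >= 'A' && c <= 'Z') ? (char)(c - 'A' + 'a') : c;
    }
    out[n] = '\0';
    return 0;
}

/* Example of a library-side use: compare two dictionary keys after
   normalization. Returns 1 if equal, 0 if different, -1 on error. */
int keys_match(const char *a, const char *b, normalize_fn norm, void *user) {
    char na[128], nb[128];

    if (norm(a, na, sizeof na, user) != 0)
        return -1;
    if (norm(b, nb, sizeof nb, user) != 0)
        return -1;
    return strcmp(na, nb) == 0;
}
```

With this shape the library never needs to know which case or
normalization rules apply; it just calls the function it was given
wherever it lowercases today.
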
>
> Cheers,
> Giulio.
>
>
>
> > > 2016-02-04 23:23 GMT-02:00 Giulio Paci <giuliop...@gmail.com>:
> > >
> > > Hi to all!
> > >
> > > On 01/02/2016 21:27, Giulio Paci wrote:
> > > > On 01 Feb 2016 21:12, "Ulrik Sandborg-Petersen"
> > > > <ulr...@scripturesys.com> wrote:
> > > >> Secondly, I think the proposed changes to the options are good,
> > > >> so from me it is a "go ahead" on the options.
> > > >
> > > > Perfect.
> > > > @Tiago: any opinion?
> > >
> > > I finally decided to push my current master branch.
> > > The work is not complete and there are a few bugs, but as I do not
> > > know if I will be able to update the code again in the near future,
> > > I prefer to share it and tell you what should be fixed.
> > >
> > > 1) Updating the CLIs of commands, I decided that the same option
> > > should be associated with almost the same meaning in every command.
> > > So I had to rename several of them in several commands. However, I
> > > have not updated the documentation yet.
> > >
> > > Here follows the list of CLI changes, so that it is possible to
> > > update old command lines:
> > >
> > > acopost-et, acopost-t3, acopost-tbt, acopost-met:
> > >     stricter separation between options and other parameters
> > >     collapsing multiple options into one is not supported anymore
> > >     -- can be used to explicitly terminate options
> > > acopost-et:
> > >     added -h
> > >     -t => -o test
> > >     lexiconfile => -l lexiconfile
> > > acopost-tbt:
> > >     added -h
> > >     -r => -R
> > >     -n => -r
> > >     -o accepts tag, test and train in addition to 0, 1 and 2
> > > acopost-t3:
> > >     -u => -Z
> > >     lexiconfile => -l lexiconfile
> > >     -q => -v 0
> > >     -t => -o test
> > >     -d => -o debug
> > >     -m => -o
> > >     -l => -L
> > > acopost-met:
> > >     -c => -o
> > >     -s => -C
> > >     -m => -P
> > >     -t => -M
> > >     -p => -K
> > > acopost-lex2theta:
> > >     added -h
> > >     added -r <int>
> > >     lexiconfile => -l lexiconfile
> > > acopost-complementary-rate:
> > >     -q => -v 0
> > > acopost-evaluate:
> > >     -v => -v 1 (this is the default now)
> > >     -i => -C
> > > acopost-split-corpus:
> > >     -v => -v 1 (this is the default now)
> > >     -m => -k
> > >     -p => -F
> > > acopost-cooked2fntbl:
> > >     -v => -v 1 (this is the default now)
> > > acopost-interchange-matrix:
> > >     -q => -v 0
> > > acopost-mean-and-sd:
> > >     -s => -D
> > > acopost-cooked2wtree:
> > >     -e => -X
> > >     -i => -I
> > >     -d => -o debug
> > >     -a => -A
> > >     -b => -B
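
The three general rules at the top of that list (strict separation, no
collapsed options, "--" as terminator) can be illustrated with a small
sketch. This is not the parser in the master branch, only one possible
reading of the rules: first_operand is a made-up helper, and which option
letters take a value is passed in by the caller as an assumption. It also
treats any argument like "-ab" or "-v0" as malformed, which may be
stricter than what the commands actually do.

```c
#include <string.h>

/* Hypothetical helper: returns the index of the first non-option argument
   (argc if there is none), or -1 if an option is malformed.
   `takes_value` lists the option letters that consume the next argument. */
int first_operand(int argc, char *const argv[], const char *takes_value) {
    int i = 1;

    while (i < argc) {
        const char *a = argv[i];

        if (strcmp(a, "--") == 0)
            return i + 1;          /* explicit end of options */
        if (a[0] != '-' || a[1] == '\0')
            return i;              /* plain argument, or a lone "-" */
        if (a[2] != '\0')
            return -1;             /* collapsed form such as "-ab" */
        if (strchr(takes_value, a[1]) != NULL) {
            if (i + 1 >= argc)
                return -1;         /* option value is missing */
            i += 2;                /* skip the option and its value */
        } else {
            i += 1;
        }
    }
    return argc;
}
```

For a command line such as "-v 0 -l lexiconfile -- -corpus", with v and l
declared as taking a value, the helper reports "-corpus" as the first
operand even though it starts with a dash.
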
> > >
> > > 2) I introduced a BUG in met, that breaks tagging functionality (in
> > > viterbi mode). The BUG has been introduced after
> > > ce310074f1b194f192cec0cf4822bb8ec7b87e78 (I checked that commit and
> > > it produces much more reasonable results) and is probably related to
> > > the lowercase function replacement;
> > >
> > > 3) Using the -n option on met gives a segmentation fault in my
> > > environment. I have not yet investigated the cause.
> > >
> > > Bests,
> > > Giulio
> > >
> > >
> > > ------------------------------------------------------------------------------
> > > Site24x7 APM Insight: Get Deep Visibility into Application Performance
> > > APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
> > > Monitor end-to-end web transactions and take corrective actions now
> > > Troubleshoot faster and improve end-user experience. Signup Now!
> > > http://pubads.g.doubleclick.net/gampad/clk?id=272487151&iu=/4140
> > > _______________________________________________
> > > acopost-devel mailing list
> > > acopost-devel@lists.sourceforge.net
> > > https://lists.sourceforge.net/lists/listinfo/acopost-devel