Tiago and Giulio

Thanks for your recent work, both of you.

I have no objection to moving to GitHub.  My GitHub Username is 'emg'.

I like the idea of a pure C implementation. If we need a scripting language for anything, Python would be my preferred scripting language, since it is the one I know best.

I agree with all of the points on the TODO list.

Best wishes,


Ulrik



On 2016-02-09 16:08, Tiago Tresoldi wrote:
Giulio,

indeed, I was wrong about your changes to UTF-8, I was confused.

More important, trying to sum everything up for version 2.0 by extending your TODO:

- Move to GitHub and (maybe) gradually leave SF, if everyone agrees;
- (my suggestion) Decide for acopost to keep being a pure-C system with no true dependencies, preferably continuing as a collection of single sources+headers using as few libraries as possible (acopost was born of academic research and is useful to understand how tagging works, I'd love to keep it that way);
- Allow multi-layer tagging
- Implement the tagging routines as a library (or as a collection of files that can be statically linked as if it were a library), having some simple wrapper executables (met, t3, etc.) that can be called from command line;
-- tokenizer/strtok for the wrappers (tokenizer revert the changes to the input string when finished)
- Make the tagger generic
-- No library function should call exit (the library functions should not deal with file reading and parsing);
-- Make main functions re-entrant and avoid global variables (extremely important)
- make CLIs uniform
-- Update usage display, so that every command help has the same layout
-- Update UserGuide according to changes in CLIs
- Start working on a more complex voting system, written in C or in a scripting language, intended for "actual" tagging, such as from command line;
- Eventually replace parts of our library (hashes, memory allocation, etc.) with some alternatives if valid;
- Add some "modules" to our library, such as unit testing and (maybe) a library for utf8 handling and manipulation (I might also try my hand at a simple RNN) -- it is important that additions are in line with acopost, in terms of simple sources, no reimplementations, etc.
-- substr
- Support UTF-8
-- Currently util.c, tbt.c, t3.c, met.c, et.c contain hard-coded latin1 strings that are used to deal with ä, ö, ü, Ä, Ö, and Ü; there is no support for other common latin1 symbols, nor for Unicode. Complete UTF-8 support would be a good option to improve that code.
- Discuss a possible scripting language (the more I think about it, the more I convince myself that Python scripts would be enough, with no actual integration to the C code);
- Provide as many language models (including for textual transformation) as possible;
- Along with unit testing, write modules (they don't have to be in C) to stress test the taggers, creating large, bogus but natural-language-like corpora to train and tag (it is much more a matter of technology, with memory allocation, unicode and the like, than tagging performance).
- Remove segmentation faults (part of the unit testing and stress testing)
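
The library-related points above (no exit inside library functions, no global variables, re-entrant entry points) could look roughly like the following. This is only a minimal sketch: every name in it (tagger_ctx, tagger_status, tagger_init, tagger_lookup, tagger_free) is hypothetical and does not exist in acopost. It shows the general technique of keeping all state in a caller-owned context and reporting failures through return codes.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical API sketch: none of these names exist in acopost. */

typedef enum {
    TAGGER_OK = 0,
    TAGGER_ERR_BADARG,
    TAGGER_ERR_NOMEM,
    TAGGER_ERR_NOTFOUND
} tagger_status;

/* All state lives in a caller-owned context, not in global variables,
 * so several independent taggers can coexist in one process. */
typedef struct {
    size_t ntags;
    char **tags;   /* private copy of the tag inventory */
} tagger_ctx;

/* Release everything owned by the context; safe to call twice. */
void tagger_free(tagger_ctx *ctx)
{
    size_t i;
    if (ctx == NULL || ctx->tags == NULL)
        return;
    for (i = 0; i < ctx->ntags; i++)
        free(ctx->tags[i]);
    free(ctx->tags);
    ctx->tags = NULL;
    ctx->ntags = 0;
}

/* Copy the tag inventory into the context.  The data arrives already
 * parsed: file reading stays in the wrapper executable.  Errors are
 * returned to the caller; the library never calls exit(). */
tagger_status tagger_init(tagger_ctx *ctx, const char *const *tags, size_t ntags)
{
    size_t i;
    if (ctx == NULL || tags == NULL || ntags == 0)
        return TAGGER_ERR_BADARG;
    ctx->tags = calloc(ntags, sizeof *ctx->tags);
    if (ctx->tags == NULL)
        return TAGGER_ERR_NOMEM;
    ctx->ntags = ntags;
    for (i = 0; i < ntags; i++) {
        size_t len = strlen(tags[i]) + 1;
        ctx->tags[i] = malloc(len);
        if (ctx->tags[i] == NULL) {
            tagger_free(ctx);
            return TAGGER_ERR_NOMEM;
        }
        memcpy(ctx->tags[i], tags[i], len);
    }
    return TAGGER_OK;
}

/* Find the index of a tag; reads only the context it is given. */
tagger_status tagger_lookup(const tagger_ctx *ctx, const char *tag, size_t *index_out)
{
    size_t i;
    if (ctx == NULL || tag == NULL || index_out == NULL)
        return TAGGER_ERR_BADARG;
    for (i = 0; i < ctx->ntags; i++) {
        if (strcmp(ctx->tags[i], tag) == 0) {
            *index_out = i;
            return TAGGER_OK;
        }
    }
    return TAGGER_ERR_NOTFOUND;
}
```

A wrapper executable (met, t3, etc.) would then own the file reading and parsing, check each returned status, and decide by itself whether to print a message and exit.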

A lot of work, to be sure, but it would make acopost one of the best alternatives out there.

Best regards,

Tiago



2016-02-08 22:36 GMT-02:00 Giulio Paci <giuliop...@gmail.com>:
>
> On 09/02/2016 00:17, Tiago Tresoldi wrote:
> > > - do you think it is a good idea to move from SourceForge? I guess the best alternative would be GitHub, keeping the homepage and this mailing list for the time being.
> >     As long as git is used, [...] I am perfectly fine with it.
> > Ok, two votes, and I share your opinion on pull-requests. There is absolutely no point in moving from git, but I was more inclined to leaving SF (now that they are even involved in some shady business, like appending their wares to Windows installers) than actually favoring GitHub.
> >
> > Anyway, two votes, but I would only be comfortable with moving from SF if everyone agrees. Ulrik, please feel free to say you prefer not to, if it is the case.
>
>
>
> > > - we have already discussed the "voting system", where multiple taggers are run, their results collected and the best one is selected ("voting" is not the best description, as there could be neural networks, hard-coded rules, etc.); do you think it is best to code it in C or should we explore some alternative, like Lua? While I'd like the flexibility of a scripting language (for the users, too), I am not very favorable to the idea of embedding a full language or having such kind of dependency. I believe that being a pure-C, no-dependency system is one of the strengths of acopost.
> >
> > I would prefer to stick to pure-C, if it is going to be a usable tagger or a framework.
> > My main goal is to have a library that I can use in a program and maintaining a C-library is much more convenient than maintaining anything else (e.g., ABI compatibility is easy to assess, there is no runtime that must be loaded and can interfere with other libraries, memory usage is predictable, ...).
> >
> >
> > I understand. My idea was to leave the "core" in pure-C, with only perfectly reasonable dependencies (standard headers, posix, etc.); however, I still think that it might
> > be interesting to have a) some basic unit testing
>
> I agree, I also would like to see them.
>
> > and b) some default scripting, so that we don't depend on a shell (while bash is pretty much a default choice nowadays),
> > Python or Perl 5.
>
> I do not think I would like to see any "outside" script with an inside scripting.
> Anyway, in my mind there is a clear distinction between programs and library. The library should stick to C and try to not harm its users as much as possible (including a scripting language may easily become harmful); the programs can do whatever they want.
>
> > One thing to note is that, thanks to Ingo's original goal and resources, acopost has the quality of being a good C-project, in the sense that it takes care of everything by itself (not just being "self-contained") and incorporates the bare minimum to work. If it were in GitHub, for example, I suspect people would end up copying hash.c just to have a simple, single source C hash implementation that does not involve dependencies, libraries, etc.
>
> Probably you are right, but there are also solutions like https://github.com/fragglet/c-algorithms that could be used.
> I would see benefits in trying to merge acopost hash.c (if better than what is already in c-algorithms) there and use that c-algorithms in acopost: the code will receive much more attention and testing. If a person uses that code and a bug is found or an improvement is implemented, it could be updated in a central place...
>
> > In a way, this is why I'd favor some simple internal scripting for
> > the voting system: such a system is not essential for tagging, and thus would be a complement to acopost -- if necessary, people could rewrite it in C and could incorporate it. However, I still don't know what language/implementation I would support: Lua is the closest to what I'd like, but it is still far too big. At the same time, I wouldn't like "yet another" poor implementation of a Lisp that serves no one and only bloats the project.
>
> If I had to vote, I would probably pick Lua as well, as it is rather intuitive for imperative-oriented programmers.
> But, in general, I would avoid it if it is not really needed for a strong use case (i.e., something stronger than replacing bash, perl or python) that is really difficult
> to implement in C.
>
> > On the other hand, I think that we should at least think about introducing some dependency or carefully plan alternatives.
> > I understand that it is easier to bootstrap a system without any dependency, but I would like to see UTF-8 support properly implemented, and it is very, very difficult to implement properly, while libICU (http://site.icu-project.org/) is already there (although it is suboptimal that internally it uses UTF-16).
> > Another option, which will probably be more appropriate for a library, is to avoid the "lowercase" transformation and implement a normalization mechanism based on callbacks.
> > In this way the burden of dealing with UTF-8 is delegated to the library user and we do not need to add a libICU dependency.
> >
> >
> > I was looking at your Unicode work earlier.
>
> The only work that I have done in acopost up to now is fixing the encoding of the source files. I have not yet tried to implement anything regarding Unicode support.
>
> > I have to confess that I am a bit torn. Proper Unicode handling is needed, no doubt -- in fact, I had this problem when I started working with acopost for my "tesi" at the Università di Pisa, and handled the text with a poor Python script before and after tagging it.
>
> This is a suboptimal solution, dealing only with the minor problem of UTF-8, that is, charset encoding.
>
> > Acopost is a collection of
> > taggers, however, and I wonder if this kind of handling isn't something that should be up to the user,
>
> If we can let users use proper support, for whatever they prefer, I would favor letting the user deal with it.
> However, this is not something that can be done with a simple script: the user will need to provide some specific callbacks.
>
> > with the system providing some helping tools: shouldn't we decide for
> > a default (utf-8) and give tools for textual transformations (with defaults and examples, such as the language models)?
>
> The issue I am referring to is the fact that Unicode defines several functions to perform lower/upper/case-insensitive transformations (that are language and context dependent) and that there are four equivalent representations for many strings, which should be interpreted as the same string.
> As far as I know, in acopost there is only an attempt to simulate case-insensitive collation by using lower case in dictionaries and a few other parts.
> If this is true, the only thing that we need in order to give library users proper Unicode support is to add a "normalization" callback that is used where lowercase is currently used. This will also allow further normalization (e.g., removing diacritics).
>
> > If I remember correctly, Ulrik used acopost for pos-tagging of Ancient Greek (and I still think it would be a great tool to start extending the Project Perseus treebank),
> > he probably has something to add from direct experience.
>
> In Ancient Greek you will encounter the lower/upper/case-insensitive problem, as SIGMA has one uppercase form and two lowercase forms.
>
> Turkish has different rules: the uppercase of i is İ instead of I, while the lowercase of I is ı instead of i.
>
> In German, the uppercase of muß is MUSS. The lowercase of MUSS can be both muß and muss.
>
> These are just a few examples, but they explain quite well why Unicode explicitly specifies a case-insensitive form.
>
> Anyway, the more I think about it, the more I am convinced that we should abstract these transformations away from the core library and clearly define an interface to use
> whatever is needed by the user.
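
A normalization callback of the kind discussed above might be sketched as follows. This is only an illustration under stated assumptions: the names normalize_fn, normalize_ascii_lower, normalize_identity and words_equal are hypothetical and not part of acopost, and the fixed 256-byte buffers are an arbitrary choice for the sketch. The default callback lowercases ASCII letters only and copies every other byte (including UTF-8 multi-byte sequences) unchanged, so it deliberately does not solve the sigma, Turkish i or ß cases; a user needing those would supply their own callback, for instance one built on a library such as ICU.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical callback type (not an acopost API): writes the normalized
 * form of `in` into `out`, using at most `outsize` bytes including the
 * terminating NUL.  Returns 0 on success, -1 if the result does not fit. */
typedef int (*normalize_fn)(const char *in, char *out, size_t outsize,
                            void *user_data);

/* Possible default: locale-independent ASCII lowercasing.  Bytes outside
 * 'A'..'Z' are copied as they are, so UTF-8 sequences pass through intact. */
int normalize_ascii_lower(const char *in, char *out, size_t outsize,
                          void *user_data)
{
    size_t i, len = strlen(in);
    (void)user_data;
    if (len + 1 > outsize)
        return -1;
    for (i = 0; i < len; i++) {
        char c = in[i];
        out[i] = (c >= 'A' && c <= 'Z') ? (char)(c - 'A' + 'a') : c;
    }
    out[len] = '\0';
    return 0;
}

/* Callback for users who want no normalization at all. */
int normalize_identity(const char *in, char *out, size_t outsize,
                       void *user_data)
{
    size_t len = strlen(in);
    (void)user_data;
    if (len + 1 > outsize)
        return -1;
    memcpy(out, in, len + 1);
    return 0;
}

/* Library-side use: wherever the code currently lowercases before
 * comparing, it would call the user-supplied callback instead.
 * Returns 0 and sets *equal_out, or -1 if normalization failed. */
int words_equal(const char *a, const char *b, normalize_fn norm,
                void *user_data, int *equal_out)
{
    char na[256], nb[256];  /* arbitrary limit for this sketch */
    if (norm(a, na, sizeof na, user_data) != 0 ||
        norm(b, nb, sizeof nb, user_data) != 0)
        return -1;
    *equal_out = (strcmp(na, nb) == 0);
    return 0;
}
```

With this shape the core library never needs to know which encoding or case rules are in use; it only passes strings through whatever function the caller registered.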
>
> Cheers,
>         Giulio.
>
>
>
> > > 2016-02-04 23:23 GMT-02:00 Giulio Paci <giuliop...@gmail.com>:
> >     >
> >     >     Hi to all!
> >     >
> >     >     On 01/02/2016 21:27, Giulio Paci wrote:
> >     >     > On 01/Feb/2016 21:12, "Ulrik Sandborg-Petersen" <ulr...@scripturesys.com> wrote:
> >     >     >> Secondly, I think the proposed changes to the options are good, so from me it is a "go ahead" on the options.
> >     >     >
> >     >     > Perfect.
> >     >     > @Tiago: any opinion?
> >     >
> >     >     I finally decided to push my current master branch.
> >     >     The work is not complete and there are a few bugs, but as I do not know if I will be able to update the code again in the near future, I prefer to share it and tell you
> >     >     what should be fixed.
> >     >
> >     >     1) Updating the CLIs of commands, I decided that the same option should have almost the same meaning in every command, so I had to rename several options in several commands. However, I have not updated the documentation yet.
> >     >
> >     >     Here follows the list of CLI changes, so that it is possible to update old command lines:
> >     >
> >     >     acopost-et, acopost-t3, acopost-tbt, acopost-met:
> >     >             more strict separation between options and other parameters
> >     >             collapsing multiple options into one is not supported anymore
> >     >             -- can be used to explicitly terminate options
> >     >     acopost-et:
> >     >             added -h
> >     >             -t => -o test
> >     >             lexiconfile => -l lexiconfile
> >     >     acopost-tbt:
> >     >             added -h
> >     >             -r => -R
> >     >             -n => -r
> >     >             -o accepts tag, test and train in addition to 0, 1 and 2
> >     >     acopost-t3:
> >     >             -u => -Z
> >     >             lexiconfile => -l lexiconfile
> >     >             -q => -v 0
> >     >             -t => -o test
> >     >             -d => -o debug
> >     >             -m => -o
> >     >             -l => -L
> >     >     acopost-met:
> >     >             -c => -o
> >     >             -s => -C
> >     >             -m => -P
> >     >             -t => -M
> >     >             -p => -K
> >     >     acopost-lex2theta:
> >     >             added -h
> >     >             added -r <int>
> >     >             lexiconfile => -l lexiconfile
> >     >     acopost-complementary-rate:
> >     >             -q => -v 0
> >     >     acopost-evaluate:
> >     >             -v => -v 1 (this is the default now)
> >     >             -i => -C
> >     >     acopost-split-corpus:
> >     >             -v => -v 1 (this is the default now)
> >     >             -m => -k
> >     >             -p => -F
> >     >     acopost-cooked2fntbl:
> >     >             -v => -v 1 (this is the default now)
> >     >     acopost-interchange-matrix:
> >     >             -q => -v 0
> >     >     acopost-mean-and-sd:
> >     >             -s => -D
> >     >     acopost-cooked2wtree:
> >     >             -e => -X
> >     >             -i => -I
> >     >             -d => -o debug
> >     >             -a => -A
> >     >             -b => -B
> >     >
> >     >     2) I introduced a BUG in met that breaks tagging functionality (in viterbi mode). The BUG was introduced after ce310074f1b194f192cec0cf4822bb8ec7b87e78 (I checked, and that commit produces much more reasonable results) and is probably related to the lowercase function replacement;
> >     >
> >     >     3) Using the -n option on met gives a segmentation fault in my environment. I have not yet investigated the cause.
> >     >
> >     >     Bests,
> >     >             Giulio
> >     >
> >     > ------------------------------------------------------------------------------
> >     > Site24x7 APM Insight: Get Deep Visibility into Application Performance
> >     > APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
> >     > Monitor end-to-end web transactions and take corrective actions now
> >     > Troubleshoot faster and improve end-user experience. Signup Now!
> >     > http://pubads.g.doubleclick.net/gampad/clk?id=272487151&iu=/4140
> >     > _______________________________________________
> >     >     acopost-devel mailing list
> >     > acopost-devel@lists.sourceforge.net
> >     > https://lists.sourceforge.net/lists/listinfo/acopost-devel



------------------------------------------------------------------------------
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=272487151&iu=/4140


_______________________________________________
acopost-devel mailing list
acopost-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/acopost-devel
