On 12/02/2016 16:32, Tiago Tresoldi wrote:
> ["generic" taggers]
>
> Ok, if those words are mine, I referred to a different concept: the
> fact that the interface should be general and not impose limits like
> the fact that the input is coming from a file.
> On the other hand I completely agree with you that it would be nice to
> let people tag generic data. However this will involve two changes:
> 1) we need to define a format that is capable of handling binary data.
> If we want to keep compatibility with cooked format, we must think
> about how to properly escape spaces and new lines in the first place
> and how to encode binary strings.
>
>
> Oh yes, about the generality of the interface, I was assuming that was
> obvious to everyone. :)
>
> Regarding generic _data_, maybe your proposal is too generic. I think
> we can expect the data to be tagged to be textual in the sense that it
> is a bag of strings, utf-8 encoded, with printable characters. The
> encoding of data into some utf-8 strings, the tokenization and any
> eventual textual transformation (to be passed as a pointer to a
> function?) should be up to the user -- even if some binary data ends
> up encoded as hexadecimal strings or sequences of zeros and ones (I
> mean, ASCII 0x30 and 0x31). The cooked format, or any format that we
> might come up with to replace it, should always be human readable, in
> my opinion.
If it is going to be textual, I agree. :-)

> [unicode]
>
> Yes we agree. :-) The only default we can reasonably try to implement
> without libICU is latin1 support. With libICU it is possible to
> support all the languages whose texts can be written in UTF-8. I agree
> that we can provide an implementation using libICU and a replacement
> supporting the whole range of latin1.
> If possible, design the code so that this can be changed by a program,
> without touching the code.
>
>
> Yes, latin1 is probably enough for most people. I would also add some
> support for the Greek alphabet, but once more I am thinking about my
> needs and this can wait.

This would be very hard to achieve without libICU. Anyway, thinking
about it a little bit more, probably the default should be ASCII. In
this way we will not create trouble for users.
I isolated the relevant functions and, unfortunately, there were a few
more of them than I expected... But we can deal with them easily
anyway.

> [scripting and python]
>
> > On the other hand I agree that Perl syntax is not very nice. My
> > points in favour of keeping Perl are: 1) it is already used and,
> > unless we are replacing all the tools we have in acopost, adding a
> > script in a different language will also add a new dependency;
> > 2) its behaviour is usually much more predictable than Python's
> > (e.g., no automatic charset conversion happens when writing to
> > stdout or reading from stdin, so a program cannot fail just because
> > you are using a different terminal emulator, a different locale or
> > are redirecting to a file); 3) perl is installed by default on
> > Debian (even in a minimal installation), so it is not an extra
> > dependency there.
>
> I too had my problems with Python's unicode handling, particularly
> when fetching multilanguage data from Wikipedia, but my experience is
> far better.
> Anyway, regarding Perl, I would still vote for its replacement: a
> scripting language only makes sense (for acopost, I mean) if it makes
> the system easier for the end user, and in my opinion Perl is too
> complex and, especially, unknown. In a way, it has become a niche
> language, replaced by Python and to some extent Ruby and JS: it is a
> bit ironic that a language designed by a linguist, which we could
> assume to be familiar to the end user of acopost during Ingo's thesis,
> is now hardly used by the same public (please keep in mind that I am
> talking from my experience in Brazil, yours might be radically
> different).

It is not much different. My opinion with respect to making the end
user's life easy is: 1) avoid dependencies as much as possible;
2) avoid unexpected behaviours as much as possible. I do not expect end
users to edit those scripts, unless there is a bug that they must fix.
It is not very ironic in my opinion, because Perl actually fits better
than other languages only when dealing with text processing, and most
programmers never do that, while people doing text processing mostly
rely on other programmers and examples for programming-related stuff.
Perl has a horrible syntax in my opinion, one that lets anybody write
unreadable programs. Python on the other hand has a nice syntax and
does a lot of magical automatic things that make programs easier to
write. Only apparently, though, because unless you override those
automatic behaviours you cannot be sure that your script will work on
another computer...

> Bash and Perl are a default (but I'd say it is pretty reasonable to
> expect that an acopost developer has Python installed, at least
> version 2.7), so we can keep what we have, no need to translate it
> into Python. Maybe we can agree that new scripts (such as for voting)
> are added in any of these languages, in the order of preference
> C-Bash-(Perl/Python)?

I think this is acceptable.
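To make the charset point above concrete, here is a minimal sketch,
assuming Python 3, of how a script can pin its I/O to UTF-8 so that it
does not depend on the locale, the terminal emulator or a redirection.
The names utf8_streams, copy_tokens and main are hypothetical, made up
for this mail; nothing like this exists in acopost.

```python
import io
import sys


def utf8_streams(raw_in, raw_out):
    """Wrap two binary streams so that text I/O is always UTF-8."""
    # An explicit encoding overrides whatever the locale would select.
    text_in = io.TextIOWrapper(raw_in, encoding="utf-8", errors="strict")
    # newline="\n" disables translation of "\n" to the platform default.
    text_out = io.TextIOWrapper(raw_out, encoding="utf-8",
                                errors="strict", newline="\n")
    return text_in, text_out


def copy_tokens(text_in, text_out):
    """Illustrative filter: rewrite each line as space-separated tokens."""
    for line in text_in:
        tokens = line.split()
        text_out.write(" ".join(tokens) + "\n")
    text_out.flush()


def main():
    # sys.stdin.buffer and sys.stdout.buffer are the raw byte streams.
    text_in, text_out = utf8_streams(sys.stdin.buffer, sys.stdout.buffer)
    copy_tokens(text_in, text_out)
```

A script would call main() at its entry point. With errors="strict",
input that is not valid UTF-8 raises an error instead of being silently
converted, which is closer to the predictable behaviour I am asking for.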
> Regarding Lua, I too would favor it if we decided for an embedded
> scripting engine, but there's already so much to be done that we must
> postpone it to when (if ever) needed.
>
> [language models]
>
>> I am between A and B, closer to A. But given this is not a problem
>> yet, I'd keep them in current repository, moving them to their own
>> when (and if) needed.
>
> If you think there could ever be this need, then A is not a viable
> possibility. Once you store a file in git, it is not possible to
> remove it from the history and every person cloning a repository is
> going to clone the whole history. So, decision B and C can be
> reverted, A is forever.
> A hybrid approach could be to have a separate repository for language
> models and sources and have the language models imported automatically
> in sources using git submodules.
> Sources will maintain only one link in the history and people
> retrieving acopost with git will get everything.
>
> I don't think that the languages in the whole history are a problem,
> the git tree is for developers and it could make it easier to
> integrate a standard language model into any deployed system. Mind
> that the language models themselves are not really large, and if we
> ever collect a large number of models we can always migrate them, even
> if (alas) the previous ones are kept in the history.
>
> However, the hybrid approach seems elegant, and you have my vote here,
> too.

Fine.

> [TAP]
>
> TAP is just a way to define the output of the tests. The main benefit
> is that it is quite general and machine readable.
> It is perfect for unit testing (e.g.: to keep track of "just a
> sequence of asserts to test function calls"), integration testing and
> regression testing.
> Probably it is easier if I set up an example and then revert it if you
> do not like it.
>
> I have seen your example, it is interesting and good.
>
> We can keep the discussion, but I have some important words to say.
> I am not the owner of acopost, and in fact, considering the four
> people that contributed to the code, I am probably the one that did
> the least. I am really glad that you and Ulrik joined, and I think
> that we should move according to good free software practice: those
> who contribute the most (and I really mean contribute, not those who
> try to fit a project to their needs, as I in a way am still doing)
> should usually have their way on the development. I have not been very
> active and I am not sure if I will be in the next months (many things
> going on in my life), so please, have your way. I am only saying that
> because sometimes it seems that my opinion should matter the most. :)

I am also a lover of the do-ocracy, but I think that, as a team, we
should also agree on tasks and directions.

> Two more things. First, I am curious about it -- are you using acopost
> for something "serious" (if you can tell us, of course); second, how
> did you generate the random corpus you pushed? I was thinking more of
> a script that would generate it on the fly, so we could have infinite
> random corpora (generated from user specified seeds, of course), not
> one standard random corpus stored in the tree.

I can tell... Funnily enough I am not using acopost, but hunpos and
opennlp. But my goal is to use it for speech synthesis in order to:
1) identify content words and 2) provide enough context to disambiguate
word pronunciations. With a generic tagger I would also try to predict
intonation labels from a pos-tagged sentence.

Regarding the corpus, I did not push it, it was already there.
According to git you added it in 2013. I added the test1 script the day
after. Now I just converted it to produce TAP output and to no longer
need autoconf.
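For anyone not familiar with TAP, here is a minimal sketch of what such
output looks like: a plan line ("1..N") followed by one "ok" or "not ok"
line per test. It is written in Python only for illustration and is not
the actual test1 script; the function name tap_report and the two checks
are made up.

```python
def tap_report(checks):
    """Return TAP lines for a list of (description, passed) pairs."""
    lines = ["1..%d" % len(checks)]  # the plan: how many tests to expect
    for number, (description, passed) in enumerate(checks, start=1):
        status = "ok" if passed else "not ok"
        lines.append("%s %d - %s" % (status, number, description))
    return lines


# Hypothetical checks, for illustration only.
checks = [
    ("tagger output has one tag per token", True),
    ("accuracy on the random corpus is above the threshold", False),
]
print("\n".join(tap_report(checks)))
```

Any TAP consumer (for example Perl's prove) can then summarise the
results, whatever language the test itself is written in.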
Cheers,
Giulio

_______________________________________________
acopost-devel mailing list
acopost-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/acopost-devel