Re: [Apertium-stuff] Issues with apertium-tagger

Xavi Ivars Tue, 21 Aug 2018 16:41:29 -0700

Found out what the problem was. It's REALLY weird, but documented in
several places, including the standard ([1], [2])


The problem is with this piece of code

#include <map>
int main() {
  std::map<int, int> m;
  m[0] = m.size();  // result depends on which side is evaluated first
}


It seems that with gcc 6.x.x, m[0] got the value 1. The reason is that
the left side of the assignment is evaluated first to obtain a reference,
and [] creates a new element if it doesn't exist. So by the time m.size()
is evaluated, the size is already 1.

Clang, on the other hand, evaluated m.size() first, getting 0, and
then assigned that as the value of m[0].
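
To make the two orders explicit without depending on the compiler, here is
a minimal sketch that forces each one by hand. The function names
(value_first, reference_first, check_orders) are made up for illustration
and are not part of apertium.

```cpp
#include <cassert>
#include <cstddef>
#include <map>

// Value first: read the size before operator[] can insert anything.
// This is the order described above for clang.
std::size_t value_first() {
    std::map<int, std::size_t> m;
    const std::size_t n = m.size();  // 0, the map is still empty
    m[0] = n;
    return m[0];
}

// Reference first: operator[] inserts key 0, then the size is read.
// This is the order described above for gcc 6.
std::size_t reference_first() {
    std::map<int, std::size_t> m;
    std::size_t &slot = m[0];  // inserts key 0, so the size becomes 1
    slot = m.size();
    return m[0];
}

// Small self-check: each forced order gives a fixed, portable result.
void check_orders() {
    assert(value_first() == 0);
    assert(reference_first() == 1);
}
```

Either function gives the same answer on every compiler, because each
statement finishes before the next one starts.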

With gcc 7 [3], a change that is part of C++17, called "Refining
Expression Evaluation Order for Idiomatic C++", finally got supported [4].
It seems that this makes gcc behave like clang even when compiling older
C++, which basically breaks apertium-tagger here [5]:

index[t] = index.size()-1;

I've proposed a fix that keeps the way Collection.cc was implemented but
does not rely on any specific version of C++:

int position = index.size();
index[t] = position;
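
As a rough illustration of why this is safe, here is a self-contained
sketch of the same pattern. SequentialIndex, add and the std::string key
are hypothetical stand-ins, not the real Collection class; the only point
is that the size is read in its own statement before the insertion.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for the index kept in collection.cc.
class SequentialIndex {
public:
    // Returns the position of key, assigning the next free one if it is new.
    int add(const std::string &key) {
        auto it = index.find(key);
        if (it != index.end()) {
            return it->second;  // already known, keep its position
        }
        // Read the size in a separate statement, before operator[] inserts.
        const int position = static_cast<int>(index.size());
        index[key] = position;
        return position;
    }

    int size() const { return static_cast<int>(index.size()); }

private:
    std::map<std::string, int> index;
};

// Small self-check: positions are 0, 1, 2, ... in insertion order.
void check_sequential_index() {
    SequentialIndex s;
    assert(s.add("a") == 0);
    assert(s.add("b") == 1);
    assert(s.add("a") == 0);  // repeated key keeps its position
    assert(s.size() == 2);
}
```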

We could also fix it with something like

#if __GNUC__ >= 7
index[t] = index.size();
#else
index[t] = index.size()-1;
#endif

or with the __cplusplus macro instead. But I personally think the proposed
fix is better.

[1] http://open-std.org/JTC1/SC22/WG21/docs/papers/2014/n4228.pdf
[2] https://blog.jayway.com/2015/09/08/undefined-behaviour-in-c-when-adding-to-map/
[3] https://www.bfilipek.com/2017/12/cpp-status-2017.html#compiler-support-for-c17
[4] http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0145r3.pdf
[5] https://github.com/apertium/apertium/blob/master/apertium/collection.cc#L48

Message from Xavi Ivars <xavi.iv...@gmail.com> on Sat, 18 Aug 2018 at
20:19:

> Ok, quite a lot more (useful) information regarding this.
>
> First of all, I created a branch that prints a lot of debug information
> (probabilities, etc.), which is only useful for this specific
> investigation. It'd be worth doing it properly, though, and keeping some
> of that information for the existing debug mode.
>
> https://github.com/apertium/apertium/tree/logging-hmm
>
> Now, the data.
>
> On a machine with Debian stretch, where it works properly:
>
> echo '^cumbre/cumbre<n><f><sg>$ ^en/en<pr>$
> ^Madrid/Madrid<np><ant>/Madrid<np><loc>$^./.<sent>$' |
> apertium/apertium-tagger -gdmf /src/apertium-spa-cat/spa-cat.prob
> WORD = ({NOMF} Word: cumbre) TAGSET: 22,4 - Prob: 0.0323806
> ^cumbre/cumbre<n><f><sg>$END: Word: {NOMF} Word: cumbre
> WORD = ({PREP} Word: en) TAGSET: 43,22 - Prob: 0.23551
>  ^en/en<pr>$END: Word: {PREP} Word: en
> WORD = ({ANTROPONIM,TOPONIM} Word: Madrid) TAGSET: 36,43 - Prob:
> 0.000393184
> WORD = ({ANTROPONIM,TOPONIM} Word: Madrid) TAGSET: 37,43 - Prob:
> 0.000966503
> END: Word: {ANTROPONIM,TOPONIM} Word: Madrid
> WORD = ({TAG_SENT} Word: .) TAGSET: 4,36 - Prob: 6.08016e-05
> WORD = ({TAG_SENT} Word: .) TAGSET: 4,37 - Prob: 0.000192669
>  ^=Madrid/Madrid<np><loc>/Madrid<np><ant>$^./.<sent>$END: Word: {TAG_SENT} 
> Word:
> .
> WORD = ({TAG_kEOF} Word: ) TAGSET: 5,4 - Prob: 0.00133422
>
>
> In this case, tag 43 is PREP, tag 36 is ANTROPONIM (np.ant), tag 37 is
> TOPONIM (np.loc), and tag 4 is SENT. We can see in the 36,43 and 37,43
> lines that the probability of prep + np.loc is about 2.5x the probability
> of prep + np.ant. Similarly, in the 4,36 and 4,37 lines, np.loc + sent is
> about 3x higher than np.ant + sent.
>
> Overall, this makes apertium-tagger's choice an easy one: np.loc over np.ant
>
> Now, the same run on a machine running Ubuntu 18.04 (bionic). Just to make
> sure, both machines are running the latest lttoolbox (from the nightly
> package), with the latest apertium-tagger (built from source), with the
> same probability file.
>
> $ echo '^cumbre/cumbre<n><f><sg>$ ^en/en<pr>$
> ^Madrid/Madrid<np><ant>/Madrid<np><loc>$^./.<sent>$' |
> apertium/apertium-tagger -gdmf ~/src/apertium/apertium-spa-cat/spa-cat.prob
> WORD = ({NOMF} Word: cumbre) TAGSET: 22,4 - Prob: 4.53636e-12
> ^cumbre/cumbre<n><f><sg>$END: Word: {NOMF} Word: cumbre
> WORD = ({PREP} Word: en) TAGSET: 43,22 - Prob: 2.56179e-11
>  ^en/en<pr>$END: Word: {PREP} Word: en
> WORD = ({ANTROPONIM,TOPONIM} Word: Madrid) TAGSET: 36,43 - Prob:
> 0.000561191
> WORD = ({ANTROPONIM,TOPONIM} Word: Madrid) TAGSET: 37,43 - Prob:
> 5.13191e-12
> END: Word: {ANTROPONIM,TOPONIM} Word: Madrid
> WORD = ({TAG_SENT} Word: .) TAGSET: 4,36 - Prob: 8.67821e-15
> WORD = ({TAG_SENT} Word: .) TAGSET: 4,37 - Prob: 1.02303e-22
>  ^=Madrid/Madrid<np><ant>/Madrid<np><loc>$^./.<sent>$END: Word: {TAG_SENT} 
> Word:
> .
> WORD = ({TAG_kEOF} Word: ) TAGSET: 5,4 - Prob: 1.33422e-13
>
>
> In this case, the rows that make the tagger prefer np.ant instead of
> np.loc are the 36,43 / 37,43 and 4,36 / 4,37 lines. The np.ant
> probabilities are much higher, so the decision is also clear. But it is
> very weird that the probabilities are different with the same input and
> the same .prob file. And not only here: we can see the same thing for
> every single probability computed by the tagger: *all of them are
> different.*
>
> Not sure how we should proceed with this, but IMHO it is quite concerning
> to have this type of instability (can we call it a "bug"?😊) in the core of
> apertium's pipeline.
>
> --
> < Xavi Ivars >
> < http://xavi.ivars.me >
>


-- 
< Xavi Ivars >
< http://xavi.ivars.me >
