Hi,
Now with a backtrace. This is the third time it has failed with 16 cores
on the same phrase table. All runs already had "-encoding None".
#0 0x0000000000421c5d in
Moses::Simple9::Encode<__gnu_cxx::__normal_iterator<unsigned int*,
std::vector<unsigned int, std::allocator<unsigned int> > >,
std::back_insert_iterator<std::vector<unsigned int,
std::allocator<unsigned int> > > > (it=..., end=..., outIt=...,
outIt@entry=...) at moses/TranslationModel/CompactPT/ListCoders.h:339
#1 0x00000000004222b4 in Moses::MonotonicVector<unsigned long, unsigned
int, 32ul, std::allocator>::push_back (this=this@entry=0xbebe3258,
i=3540308603) at moses/TranslationModel/CompactPT/MonotonicVector.h:109
#2 0x000000000042d344 in Moses::StringVector<unsigned char, unsigned
long, Moses::MmapAllocator>::push_back<std::string> (this=0xbebe3240,
s=...) at moses/TranslationModel/CompactPT/StringVector.h:386
#3 0x00000000004179a2 in FlushCompressedQueue (force=false,
this=0x7fffffffc550) at
moses/TranslationModel/CompactPT/PhraseTableCreator.cpp:986
#4 Moses::CompressionTask::operator() (this=0xbebe5378) at
moses/TranslationModel/CompactPT/PhraseTableCreator.cpp:1230
#5 0x00000000004678ea in thread_proxy ()
#6 0x0000003a03007851 in start_thread () from /lib64/libpthread.so.0
#7 0x0000003a024e890d in clone () from /lib64/libc.so.6
Looking at the code:
    double log2 = log(2);
    while(j < 9 && lastpos < 28 && (i+lastpos) < end) {
      if(lastpos >= parts[j])
        j++;
      buffer[lastpos] = *(i + lastpos);
      uint reqbit = ceil(log(buffer[lastpos]+1)/log2);
      assert(reqbit <= 28);
      // CRASH HERE
      uint bit = 28/floor(28/reqbit);
      if(lastbit < bit)
        lastbit = bit;
      if(parts[j] > 28/lastbit)
        break;
      else if(lastpos == parts[j]-1)
        lastyes = lastpos;
      lastpos++;
    }
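reqbit is 0, so 28/reqbit triggers an integer divide by zero. Yes,
"floating point exception" is a misnomer: the signal (SIGFPE) covers
both integer and floating-point arithmetic errors, but in practice it
usually means integer divide by zero, because floating-point operations
normally return inf or a non-signaling NaN instead of trapping.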
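
To illustrate the difference, here is a minimal sketch. It assumes IEEE 754
doubles and the default floating-point environment (no traps enabled). The
function names are made up for this example and are not Moses code.

```cpp
#include <cmath>

// Floating-point division by zero normally yields infinity and does not
// raise SIGFPE. 'volatile' keeps the compiler from folding the division.
bool float_divide_by_zero_is_quiet() {
  volatile double zero = 0.0;
  double r = 28.0 / zero;
  return std::isinf(r);
}

// Integer division by zero is undefined behaviour and typically raises
// SIGFPE on x86, so the divisor has to be checked before dividing.
// Returns false and leaves 'out' untouched when den == 0.
bool checked_divide(unsigned num, unsigned den, unsigned& out) {
  if (den == 0) return false;
  out = num / den;
  return true;
}
```
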
What is the problematic line "uint bit = 28/floor(28/reqbit);" trying to
do? Currently:
1. Integer division 28/reqbit, returning an integer.
2. Cast that integer to a float.
3. Call floor which should do nothing at this small scale.
4. Floating point divide 28.0 by the result.
5. Convert to integer, rounding down. If the floating-point operation
is imprecise, you'll get something lower than 28/(28/reqbit).
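
If the intent is 28/(28/reqbit), i.e. the narrowest Simple-9 slot width that
holds reqbit bits, it can be done entirely in integers. A sketch follows;
slot_width is a hypothetical name, and treating reqbit == 0 as a 1-bit slot
is my assumption about the desired behaviour, not what the current code does.

```cpp
#include <cassert>

// Hypothetical helper: width in bits of the narrowest Simple-9 slot that
// can hold a value needing 'reqbit' bits (28 data bits per word).
unsigned slot_width(unsigned reqbit) {
  assert(reqbit <= 28);
  if (reqbit == 0)
    reqbit = 1;  // Assumption: a zero value occupies a 1-bit slot.
  unsigned per_word = 28 / reqbit;  // how many such values fit in one word
  return 28 / per_word;             // bits actually given to each value
}
```

With this, 6 required bits map to a 7-bit slot (4 values per word) and
anything from 15 to 28 bits maps to the single 28-bit slot.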
Moreover, it looks like there's some floating-point arithmetic to do
integer log2.
uint reqbit = ceil(log(buffer[lastpos]+1)/log2);
How about gcc's builtin, which is one asm instruction (if gcc is the
compiler)?
int __builtin_clz (unsigned int x)
But anyway buffer[lastpos] == 0 so the above integer log2 code is
correctly returning 0 == log2(0 + 1)
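
A sketch of the integer version using the builtin, assuming GCC or Clang.
bit_length is a name I made up. Note that __builtin_clz(0) is undefined, so
zero has to be handled separately.

```cpp
#include <limits>

// Hypothetical helper: number of bits needed to represent x, which equals
// ceil(log2(x + 1)), computed without any floating point.
unsigned bit_length(unsigned int x) {
  if (x == 0)
    return 0;  // __builtin_clz(0) is undefined, so handle zero explicitly
  // Width of unsigned int minus the number of leading zero bits.
  return std::numeric_limits<unsigned int>::digits - __builtin_clz(x);
}
```
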
Tracing back a bit more, the function is attempting to encode a vector
containing the following integers: 0 118 128 72 63 71 64 114 41 74 46
375 374 425 112 502 496 485 474 493 106 110 104 110 115 296 287 105 113
0 0. It's barfing on the 0th entry in that vector, which is a zero.
Sometimes Simple-9 doesn't expect 0s since it's delta encoding for
posting lists etc. Is the bug that 0s are being passed or that the
encoding scheme isn't handling this case?
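
For reference, nothing in the Simple-9 word layout itself rules out zeros:
a zero fits in every slot width. Below is a generic greedy sketch of the
scheme (4-bit selector, 28 data bits, nine layouts from 28x1 to 1x28). It is
not the Moses implementation, and the function names are mine.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// The nine Simple-9 layouts: kCount[s] values of kWidth[s] bits each.
static const unsigned kCount[9] = {28, 14, 9, 7, 5, 4, 3, 2, 1};
static const unsigned kWidth[9] = {1, 2, 3, 4, 5, 7, 9, 14, 28};

// Greedily packs values (each < 2^28) into 32-bit words, trying the
// densest layout first. Every word holds exactly kCount[s] values.
std::vector<uint32_t> simple9_encode(const std::vector<uint32_t>& in) {
  std::vector<uint32_t> out;
  std::size_t pos = 0;
  while (pos < in.size()) {
    bool packed = false;
    for (unsigned s = 0; s < 9 && !packed; ++s) {
      const unsigned n = kCount[s], b = kWidth[s];
      if (in.size() - pos < n) continue;  // too few values left for layout
      bool fits = true;
      for (unsigned k = 0; k < n && fits; ++k)
        fits = in[pos + k] < (1u << b);   // a zero fits any width
      if (!fits) continue;
      uint32_t word = s << 28;            // selector in the top 4 bits
      for (unsigned k = 0; k < n; ++k)
        word |= in[pos + k] << (k * b);
      out.push_back(word);
      pos += n;
      packed = true;
    }
    if (!packed) throw std::out_of_range("value needs more than 28 bits");
  }
  return out;
}

// Inverse of simple9_encode.
std::vector<uint32_t> simple9_decode(const std::vector<uint32_t>& words) {
  std::vector<uint32_t> out;
  for (std::size_t w = 0; w < words.size(); ++w) {
    const unsigned s = words[w] >> 28;
    if (s > 8) throw std::invalid_argument("bad selector");
    const unsigned n = kCount[s], b = kWidth[s];
    const uint32_t mask = (1u << b) - 1;
    for (unsigned k = 0; k < n; ++k)
      out.push_back((words[w] >> (k * b)) & mask);
  }
  return out;
}
```

In this sketch the vector above, leading and trailing zeros included,
round-trips through encode and decode unchanged.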
Kenneth
On 01/13/2015 02:25 AM, Marcin Junczys-Dowmunt wrote:
> Hi Kenneth.
> Recently I am encountering an increased number of crashes, too. I guess
> there are some heisenbugs in the binarization that manifest maybe due to
> a new boost version or something. A workaround is usually to use less
> threads, only one or up to 4 (it's actually not much faster with 16
> anyway). If it still crashes try -encoding None . I am planning to write
> a new binarization tool from scratch, this one is giving me too much
> headache.
>
> On 13.01.2015 at 04:20, Kenneth Heafield wrote:
>> Dear Moses/Marcin,
>>
>> I'm getting a Floating point exception in processPhraseTableMin from
>> Moses d0807c.
>>
>> Arguments, minus the absolute paths, are:
>>
>> processPhraseTableMin -in phrase-table.gz -out phrase-table -nscores 4
>> -threads 16 -T /tmp -encoding None
>>
>> The phrase table is rather large and it runs for several hours before
>> crashing. Log output is below.
>>
>> Used options:
>> Text phrase table will be read from: phrase-table.gz
>> Output phrase table will be written to: phrase-table.minphr
>> Step size for source landmark phrases: 2^10=1024
>> Source phrase fingerprint size: 16 bits / P(fp)=1.52588e-05
>> Selected target phrase encoding: Huffman
>> Number of score components in phrase table: 4
>> Single Huffman code set for score components: no
>> Using score quantization: no
>> Explicitly included alignment information: yes
>> Running with 16 threads
>>
>> Pass 1/2: Creating source phrase index + Encoding target phrases
>> ..................................................[5000000]
>> ..................................................[10000000]
>> ..................................................[15000000]
>> ..................................................[20000000]
>> ..................................................[25000000]
>> ..................................................[30000000]
>> ..................................................[35000000]
>> ..................................................[40000000]
>> ..................................................[45000000]
>> ..................................................[50000000]
>> ..................................................[55000000]
>> ..................................................[60000000]
>> ..................................................[65000000]
>> ..................................................[70000000]
>> ..................................................[75000000]
>> ..................................................[80000000]
>> ..................................................[85000000]
>> ..................................................[90000000]
>> ..................................................[95000000]
>> ..................................................[100000000]
>> ..................................................[105000000]
>> ..................................................[110000000]
>> ..................................................[115000000]
>> ..................................................[120000000]
>> ..................................................[125000000]
>> ..................................................[130000000]
>> ..................................................[135000000]
>> ..................................................[140000000]
>> ..................................................[145000000]
>> ..................................................[150000000]
>> ..................................................[155000000]
>> ..................................................[160000000]
>> ..................................................[165000000]
>> ..................................................[170000000]
>> ..................................................[175000000]
>> ..................................................[180000000]
>> ..............................................
>>
>> Intermezzo: Calculating Huffman code sets
>> Creating Huffman codes for 624564 target phrase symbols
>> Creating Huffman codes for 551381 scores
>> Creating Huffman codes for 15296482 scores
>> Creating Huffman codes for 582875 scores
>> Creating Huffman codes for 15806633 scores
>> Creating Huffman codes for 50 alignment points
>>
>> Pass 2/2: Compressing target phrases
>> ..................................................[5000000]
>> ..................................................[10000000]
>>
>> Kenneth
>> _______________________________________________
>> Moses-support mailing list
>> [email protected]
>> http://mailman.mit.edu/mailman/listinfo/moses-support