Hi Kennon,
On Tue, Feb 24, 2026 at 10:38:01AM -0800, cygwin wrote:
> Hello,
>
> I am having a problem that is apparently related to memmove and am
> looking for some advice on how to investigate further. This winter I have
> been working to simplify GLZA source code and make it more readable. GLZA is
> an advanced open-source straight-line grammar compressor first released
> in 2015. Among these changes was replacing some rather bloated code with
> memmove and memset in various locations. The program started crashing
> occasionally and after extensively reviewing the changes, I was unable to
> find a cause for these crashes. So I installed gdb to try to find out what
> was going on and was apparently able to find the cause of the problem. As a
> new gdb user, I am not very comfortable with trusting what gdb is
> showing, but it is pointing directly to one of the code changes I made. I
> backed out of this code change and the program has not crashed after 3 days
> of nearly continuous testing.
>
> So here is what gdb reports when backtrace is run immediately after
> reporting a "SIGTRAP":
>
> (gdb) bt full
> #0 0x00007ff9dd8aa98b in KERNELBASE!DebugBreak () from
> /cygdrive/c/Windows/system32/KERNELBASE.dll
> No symbol table info available.
> #1 0x00007ff9ca3b6417 in cygwin1!.assert () from
> /cygdrive/c/Windows/cygwin1.dll
> No symbol table info available.
> #2 0x00007ff9ca3cfb18 in secure_getenv () from /cygdrive/c/Windows/cygwin1.dll
> No symbol table info available.
> #3 0x00007ff9e03dd82d in ntdll!.chkstk () from
> /cygdrive/c/Windows/SYSTEM32/ntdll.dll
> No symbol table info available.
> #4 0x00007ff9e038916b in ntdll!RtlRaiseException () from
> /cygdrive/c/Windows/SYSTEM32/ntdll.dll
> No symbol table info available.
> #5 0x00007ff9e03dc9ee in ntdll!KiUserExceptionDispatcher () from
> /cygdrive/c/Windows/SYSTEM32/ntdll.dll
> No symbol table info available.
> #6 0x00007ff9ca3b12a9 in memmove () from /cygdrive/c/Windows/cygwin1.dll
> No symbol table info available.
> #7 0x0000000100409a7c in rank_scores_thread (arg=0x6ffece890010) at
> GLZAcompress.c:904
> new_score_rank = 2633
> new_score_lmi2 = 183964750
> new_score_pmi2 = 183964725
> rank = 4380
> max_rank = 2633
> num_symbols = 25
> new_score_lmi = 92079851
> new_score_pmi = 92079826
> thread_data_ptr = 0x6ffece890010
> max_scores = 4883
> candidates_index = 0xa00034470
> score_index = 4380
> node_score_num_symbols = 7
> num_candidates = 4381
> node_ptrs_num = 12224
> local_write_index = 12225
> rank_scores_buffer = 0x6ffece890020
> candidates = 0x6ffece990020
> score = 47.6283531
> #8 0x00007ff9ca412eec in cygwin1!.getreent () from
> /cygdrive/c/Windows/cygwin1.dll
> No symbol table info available.
> #9 0x00007ff9ca3b47d3 in cygwin1!.assert () from
> /cygdrive/c/Windows/cygwin1.dll
> No symbol table info available.
> #10 0x0000000000000000 in ?? ()
> No symbol table info available.
>
> GLZAcompress.c line 904 is as follows and is in code that runs as a separate
> thread created in main:
> memmove(&candidates_index[new_score_rank+1],
> &candidates_index[new_score_rank], 2 * (rank - new_score_rank));
> This does point directly to where a code change was made.
>
> candidates_index is allocated in main and not ever intentionally changed
> until deallocated at the end of program execution:
> if (0 == (candidates_index = (uint16_t *)malloc(max_scores *
> sizeof(uint16_t))))
> fprintf(stderr, "ERROR - memory allocation failed\n");
> This value is passed to the thread in a structure pointed to by the thread
> arg. The value 0xa00034470 for candidates_index is similar to what is
> reported on subsequent runs with added code to print this value so I don't
> think it's corrupted, but would need to duplicate the crash after checking
> the initial value to be 100% certain. With gdb reporting that rank = 4380
> and new_score_rank = 2633 at the time of the SIGTRAP, this should be a
> backward move of 1747 uint16_t values by 2 bytes with a 2 byte difference
> between the source and destination addresses.
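
For what it's worth, those numbers look in range to me. Here is a minimal sketch of the check, assuming the locals gdb printed are the values the call actually used (with optimization on they may not be). Both helpers are made up for illustration and are not GLZA code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not GLZA code. Shifting elements
   [new_score_rank, rank) up one slot writes indices
   new_score_rank+1 .. rank, so rank itself must be a valid index. */
int shift_in_bounds(size_t new_score_rank, size_t rank, size_t max_scores) {
  return new_score_rank <= rank && rank < max_scores;
}

/* Hypothetical helper: byte count handed to memmove for that shift. */
size_t shift_bytes(size_t new_score_rank, size_t rank) {
  assert(new_score_rank <= rank);  /* guard against unsigned wrap-around */
  return sizeof(uint16_t) * (rank - new_score_rank);
}
```

With new_score_rank = 2633, rank = 4380 and max_scores = 4883 the shift is in bounds and moves 3494 bytes (1747 values), which matches your description.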
>
> Prior to this code change and for the last 3 days I have been using this code
> instead and not seen any crashes:
> uint16_t * score_ptr = &candidates_index[new_score_rank];
> uint16_t * candidate_ptr = &candidates_index[rank];
> while (candidate_ptr >= score_ptr + 8) {
>   *candidate_ptr = *(candidate_ptr - 1);
>   *(candidate_ptr - 1) = *(candidate_ptr - 2);
>   *(candidate_ptr - 2) = *(candidate_ptr - 3);
>   *(candidate_ptr - 3) = *(candidate_ptr - 4);
>   *(candidate_ptr - 4) = *(candidate_ptr - 5);
>   *(candidate_ptr - 5) = *(candidate_ptr - 6);
>   *(candidate_ptr - 6) = *(candidate_ptr - 7);
>   *(candidate_ptr - 7) = *(candidate_ptr - 8);
>   candidate_ptr -= 8;
> }
> while (candidate_ptr > score_ptr) {
>   *candidate_ptr = *(candidate_ptr - 1);
>   candidate_ptr--;
> }
> Yes, it's bloated code that should do the same thing as the memmove, but most
> importantly the code has never caused any problems. Interestingly, even this
> code shows memmove in the assembly code (gcc -S), but only for the second
> while loop. The looping code for the first while loop looks like this and
> moves 8 uint16_t's in just 5 instructions so it is perhaps not as inefficient
> as the source code looks:
> .L25:
> movdqu -16(%rax), %xmm1
> subq $16, %rax
> movups %xmm1, 2(%rax)
> cmpq %rdx, %rax
> jnb .L25
>
> It may or may not matter, but the code this is happening on is very CPU
> intensive - there can be up to 8 threads running at the same time when this
> problem occurs. The problem doesn't occur consistently, it seems to be
> rather random. The program runs about 500 iterations of ranking up to the
> top 30,000 new grammar rule candidates over nearly 4 hours on my test case
> and has crashed on different iterations each time it has crashed, even though
> the thread that seems to be crashing should be seeing exactly the same data
> each time the program is run. The malloc'ed array address could be changing,
> I haven't checked that out.
>
> I find it really hard to believe there is a bug in memmove but that seems to
> be what gdb and my testing are indicating. So I am looking for advice on how
> to better understand what is causing the program to crash. I would like to
> review the code memset is using, but have not been able to figure out how to
> track that down. Any help in understanding what code the compiler is using
> for memmove would be helpful. Are there other things I could possibly be
> overlooking? Are there any other things I should review or report that would
> be helpful? I could try to write a simplified test case if that would be
> useful.
>
> Best Regards,
>
> Kennon Conrad
>
>
>
> --
> Problem reports: https://cygwin.com/problems.html
> FAQ: https://cygwin.com/faq/
> Documentation: https://cygwin.com/docs.html
> Unsubscribe info: https://cygwin.com/ml/#unsubscribe-simple
The memmove() call accesses new_score_rank 3 times while the old code only
accessed it once. Is it possible that another CPU alters new_score_rank between
these accesses?
You could eliminate that possibility by making a local copy of new_score_rank
and using that in the memmove() call. Worth a try?
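
Something along these lines is what I have in mind. It is only a sketch: shift_up is a made-up name, and I am guessing at the index types. Passing the indices by value takes the snapshot once at the call, so all three uses see the same number:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical wrapper, not GLZA code: opens a gap at index first by
   shifting elements [first, last) up one slot. first and last are
   by-value copies of new_score_rank and rank, so nothing outside this
   function can change them between the uses below. */
void shift_up(uint16_t *candidates_index, size_t first, size_t last,
              size_t max_scores) {
  /* The highest index written is last, so it must be inside the array. */
  assert(first <= last && last < max_scores);
  memmove(&candidates_index[first + 1], &candidates_index[first],
          sizeof(uint16_t) * (last - first));
}
```

Calling shift_up(candidates_index, new_score_rank, rank, max_scores) in place of line 904 would also turn an out-of-range index into an assert failure rather than a fault inside memmove, as long as NDEBUG is not defined.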
Cheers ... Duncan.