(gdb) p m_fbb
$5 = (flatbuffers::FlatBufferBuilder &) @0x7fffffff9070: {static kFileIdentifierLength = 4,
  buf_ = {allocator_ = 0x0, own_allocator_ = false, initial_size_ = 1024, buffer_minalign_ = 8,
    reserved_ = 0, size_ = 0, buf_ = 0x0, cur_ = 0x0,
    scratch_ = 0x1000000000000 <error: Cannot access memory at address 0x1000000000000>},
  num_field_loc = 8, max_voffset_ = 0, nested = false, finished = false, minalign_ = 8,
  force_defaults_ = false, dedup_vtables_ = true, string_pool = 0x0}

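For reference, the bad scratch_ value in that dump is exactly 1 << 48, i.e. a single set bit. Below is a minimal C++ sketch of that arithmetic. The helper names (byte_at, nonzero_bytes, check_scratch_pattern) are hypothetical, and reading the value as a stray one-byte write of 1 into the pointer is only one possible interpretation, assuming a little-endian target such as x86-64.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// The scratch_ value gdb printed.
const std::uint64_t kScratch = 0x1000000000000ULL;

// Hypothetical helper: byte i of a 64-bit value, counting from the least
// significant byte. On a little-endian machine this is also the byte at
// memory offset i within the stored pointer.
inline std::uint8_t byte_at(std::uint64_t v, unsigned i) {
    return static_cast<std::uint8_t>((v >> (8 * i)) & 0xffu);
}

// Count how many of the eight bytes are non-zero.
inline unsigned nonzero_bytes(std::uint64_t v) {
    unsigned n = 0;
    for (unsigned i = 0; i < 8; ++i) {
        if (byte_at(v, i) != 0) ++n;
    }
    return n;
}

// Print each byte so the pattern is visible.
inline void print_scratch_bytes() {
    for (unsigned i = 0; i < 8; ++i) {
        std::printf("byte %u = 0x%02x\n", i,
                    static_cast<unsigned>(byte_at(kScratch, i)));
    }
}

// Sanity checks on the pattern: one bit set, sitting in byte 6.
inline void check_scratch_pattern() {
    assert(kScratch == (1ULL << 48));
    assert(nonzero_bytes(kScratch) == 1);
    assert(byte_at(kScratch, 6) == 0x01);
}
```

Calling check_scratch_pattern() and print_scratch_bytes() from a small main() shows that only byte 6 is non-zero, which would fit something writing a small value a few bytes away from where it should, rather than a wild pointer being stored.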
On Tue, Feb 20, 2024 at 1:10 PM Simon Eves <simon.e...@heavy.ai> wrote:
> Here's the stack trace for the original assert. Something is stepping on
> scratch_ to make it 0x1000000000000 instead of null, which it starts out as
> when the flatbuffer object is created, but by the time it gets to
> allocating memory, it's broken.
>
> On Tue, Feb 20, 2024 at 1:05 PM Simon Eves <simon.e...@heavy.ai> wrote:
>
>> (starting a new thread to avoid derailing the static-build one any
>> further)
>>
>> Totally agreed on the mismatch idea, but the code in question is all
>> self-contained down in *ogr/ogrsf_frmts/flatgeobuf* and the *flatbuffers*
>> sub-project (which is a snapshot of a Google OSS project) so I'm struggling
>> to see how there could be a mismatch.
>>
>> Also, although we're building on CentOS 7, we're using relatively new
>> compilers (GCC 11.4 and Clang 14.0.6), and we bundle the matching newer
>> runtimes.
>>
>> We don't have a full static build stack on our normal dev platform
>> (Ubuntu 22.04) so I haven't been able to repro the problem there.
>>
>> I should have mentioned the first time that we have tried using ASAN, and
>> it definitely catches something wrong, but the behavior is different, and
>> varies if you add more debug printfs. For example:
>>
>> DEBUG: vector_downward::push() num = 16
>> DEBUG: about to reallocate, buf_ = 0, cur_ = 0, scratch = 0
>> DEBUG: reallocated, buf_ = 0x61900062d380, cur_ = 0x61900062cf80, scratch = 0
>> DEBUG: vector_downward::push() ptr = 0x61900062cf70, about to do memcpy
>> =================================================================
>> ==25459==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x61900062cf70 at pc 0x7f8933eb87f6 bp 0x7fffa7aa0e70 sp 0x7fffa7aa0620
>> WRITE of size 16 at 0x61900062cf70 thread T0
>>
>> ...but it's still not obvious what exactly is going wrong. The code and
>> data flow makes perfect sense when you step through it in a dynamic build
>> that doesn't fail.
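One thing the DEBUG output above does pin down: after the reallocation, cur_ is 0x400 (1024) bytes below buf_, so the 16-byte push lands before the start of the allocation. Here is a minimal C++ sketch of that arithmetic using the addresses from the log. The assumption (not confirmed anywhere in this thread) is that a downward-growing buffer should keep cur_ within [buf_, buf_ + reserved], with reserved taken as the initial_size_ of 1024 from the gdb dump. The struct and function names are hypothetical and this is not the flatbuffers implementation.

```cpp
#include <cassert>
#include <cstdint>

// Addresses copied from the DEBUG/ASAN output; everything else is illustrative.
struct LoggedState {
    std::uint64_t buf;       // buf_ after reallocation
    std::uint64_t cur;       // cur_ after reallocation
    std::uint64_t reserved;  // ASSUMED allocation size (initial_size_ = 1024)
};

inline LoggedState logged_state() {
    return {0x61900062d380ULL, 0x61900062cf80ULL, 1024};
}

// Signed distance from buf_ to cur_; negative means cur_ is below the allocation.
inline std::int64_t cur_offset(const LoggedState& s) {
    return static_cast<std::int64_t>(s.cur) - static_cast<std::int64_t>(s.buf);
}

// Address a push of `len` bytes writes to: cur_ moves down by `len` first.
inline std::uint64_t push_target(const LoggedState& s, std::uint64_t len) {
    return s.cur - len;
}

// True if cur_ is inside [buf, buf + reserved] and there is room for `len`
// more bytes between buf and cur.
inline bool push_in_bounds(const LoggedState& s, std::uint64_t len) {
    if (s.cur < s.buf || s.cur > s.buf + s.reserved) return false;
    return s.cur - s.buf >= len;
}

// Checks that the numbers in the log are mutually consistent.
inline void check_logged_push() {
    const LoggedState s = logged_state();
    assert(cur_offset(s) == -1024);                    // cur_ is 0x400 below buf_
    assert(push_target(s, 16) == 0x61900062cf70ULL);   // matches the ASAN address
    assert(!push_in_bounds(s, 16));                    // write is outside the block
}
```

Under that assumption, cur_ should have ended up at buf_ + 1024, not buf_ - 1024, which would at least be consistent with the reallocation step computing cur_ from values it does not expect.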
>>
>> Like I said, the frustrating part is that a simple test program
>> (attached) compiled against the same set of static libs works fine.
>>
>> S
>>
>> On Tue, Feb 20, 2024 at 12:33 PM Robert Coup <robert.c...@koordinates.com>
>> wrote:
>>
>>> Hi Simon,
>>>
>>> On Tue, 20 Feb 2024 at 18:58, Simon Eves via gdal-dev <
>>> gdal-dev@lists.osgeo.org> wrote:
>>>
>>>> We still have one VERY strange issue whereby FlatGeobuf export fails in
>>>> a very consistent and reproducible form down in the flatbuffer code, but
>>>> only in the static build, and only in the full system. I have written a
>>>> simple test harness that links the very same static libgdal and does a
>>>> simple GDAL startup and FGB export of a single feature and that works fine.
>>>> It's some kind of data/stack corruption when it first tries to write to the
>>>> flatbuffer on the first feature, which results in a pointer member of the
>>>> buffer class becoming 0x1000000000000 (always) instead of null, and then it
>>>> stops on an assert. There is also one private function in the
>>>> vector_downward class which the debugger won't even step into in that
>>>> build. I can even put printfs in that function and they don't come out.
>>>> I've tried it on CentOS and on Ubuntu, with GCC and Clang, and it's always
>>>> the same. Everything else in GDAL works just fine (we have LOTS of
>>>> import/export unit tests). This makes zero sense as all the FGB code is
>>>> internal to GDAL and compiled together. I've been poking at it for over a
>>>> week and it's doing my head in.
>>>>
>>>
>>> One cause of this sort of crash is a header/library mismatch somewhere
>>> where a function is expecting different parameters/types than the caller is
>>> actually providing. Otherwise, maybe a bug in glibc/libstdc++/gcc/something
>>> that's been fixed in the intervening ten years since CentOS 7 was released?
>>>
>>>
>>> If you run your *build* on a modern distro/libc/gcc/etc does it change
>>> things? If it's the same, maybe that hints more towards the former.
>>>
>>> ASAN (https://github.com/google/sanitizers/wiki/AddressSanitizer) might
>>> help track down stack/heap corruption.
>>>
>>> Rob :)
>>>
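To make the header/library mismatch idea concrete, here is a minimal C++ sketch of how two translation units that disagree on a class layout end up touching the same member at different offsets. The two structs are hypothetical stand-ins, not the real flatbuffers definitions; the member names only loosely echo the gdb dump, and nothing here is claimed to be the actual cause.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Layout as seen by one translation unit (hypothetical).
struct BufferV1 {
    void*         allocator_;
    bool          own_allocator_;
    std::size_t   initial_size_;
    std::size_t   reserved_;
    std::uint8_t* buf_;
    std::uint8_t* cur_;
    std::uint8_t* scratch_;
};

// "The same" class as seen by a translation unit built from different
// headers: one extra member shifts everything after it (hypothetical).
struct BufferV2 {
    void*         allocator_;
    bool          own_allocator_;
    std::size_t   initial_size_;
    std::size_t   reserved_;
    std::size_t   size_;
    std::uint8_t* buf_;
    std::uint8_t* cur_;
    std::uint8_t* scratch_;
};

// How far scratch_ moves between the two views of the class.
inline std::ptrdiff_t scratch_offset_skew() {
    return static_cast<std::ptrdiff_t>(offsetof(BufferV2, scratch_)) -
           static_cast<std::ptrdiff_t>(offsetof(BufferV1, scratch_));
}

// True if code compiled against V2 that writes cur_ would actually be
// writing the bytes that V1 code treats as scratch_.
inline bool v2_cur_aliases_v1_scratch() {
    return offsetof(BufferV2, cur_) == offsetof(BufferV1, scratch_);
}

inline void check_layout_skew() {
    assert(scratch_offset_skew() ==
           static_cast<std::ptrdiff_t>(sizeof(std::size_t)));
}
```

If two such definitions ever met in one binary, a function compiled against one layout would silently read or write the wrong member of an object created by code compiled against the other, with no compiler or linker diagnostic. Comparing sizeof and offsetof values printed from each side of a suspected boundary is a cheap way to rule this in or out.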