Simon,

did you try to update to the latest https://github.com/google/flatbuffers/releases to see if that would solve the issue ? If that worked, that would be the best way forward...

Otherwise if the issue persists with the latest flatbuffers release, a (admitedly rather tedious) option would be to do a git bisect on the flatbuffers code to identify the culprit commit. With some luck, the root cause might be obvious if a single culptrit commit can be exhibited (perhaps some subtle C++ undefined behaviour triggered? also it is a bit mysterious that it hits only for static builds), or otherwise raise to the upstream flatbuffers project to ask for their expertise

Even

Le 23/02/2024 à 23:54, Simon Eves via gdal-dev a écrit :
I was able to create a fork of 3.7.3 with just the *flatbuffers* replaced with the pre-3.6.x version (2.0.0).

This seemed to only require changes to the version asserts and adding an *align* parameter to *Table::VerifyField()* to match the newer API.

https://github.com/heavyai/gdal/tree/simon.eves/release/3.7/downgrade_to_flatbuffers_2.0.0

Our system works correctly and passes all GDAL I/O tests with that version. Obviously this isn't an ideal solution, but this is otherwise a release blocker for us.

I would still very much like to discuss the original problem more deeply, and hopefully come up with a better solution.

Yours hopefully,

Simon



On Thu, Feb 22, 2024 at 10:22 PM Simon Eves <simon.e...@heavy.ai> wrote:

    Thank you, Robert, for the RR tip. I shall try it.

    I have new findings to report, however.

    First of all, I confirmed that a build against GDAL 3.4.1 (the
    version we were on before) still works. I also confirmed that
    builds against 3.7.3 and 3.8.4 still failed even with no
    additional library dependencies (just sqlite3 and proj), in case
    it was a side-effect of us also adding more of those. I then tried
    3.5.3, with the CMake build (same config as we use for 3.7.3) and
    that worked. I then tried 3.6.4 (again, same CMake config) and
    that failed. These were all from bundles.

    I then started delving through the GDAL repo itself. I found the
    common root commit of 3.5.3 and 3.6.4, and all the commits in the
    *ogr/ogrsf_frmts/flatgeobuf* sub-project between that one and the
    final of each. For 3.5.3, this was only two. I built and tested
    both, and they were fine. I then tried the very first one that was
    new in the 3.6.4 chain (not in the history of 3.5.3), which was
    actually a bulk update to the *flatbuffers* sub-library,
    committed by Bjorn Harrtell on May 8 2022 (SHA f7d8876). That one
    had the issue. I then tried the immediately-preceding commit (an
    unrelated docs change) and that one was fine.

    My current hypothesis, therefore, is that the *flatbuffers* update
    introduced the issue, or at least, the susceptibility of the issue.

    I still cannot explain why it only occurs in an all-static build,
    and even less able to explain why it only occurs in our full
    system and not with the simple test program against the very same
    static lib build that does the very same sequence of GDAL API
    calls, but I repeated the build tests of the commits either side
    and a few other random ones a bit further away in each direction,
    and the results were consistent. Again, it happens with both GCC
    11 and Clang 14 builds, Debug or Release.

    I will continue tomorrow to look at the actual changes to
    *flatbuffers* in that update, although they are quite significant.
    Certainly, the *vector_downward* class, which is directly
    involved, was a new file in that update (although on inspection of
    that file's history in the *google/flatbuffers* repo, it seems it
    was just split out of another header).

    Bjorn, I don't mean to call you out directly, but I am CC'ing you
    to ensure you see this, as you appear to be a significant
    contributor to the *flatbuffers* project itself. Any insight you
    may have would be very welcome. I am of course happy to describe
    my debugging findings in more detail, privately if you wish,
    rather than spamming the list.

    Simon






    On Tue, Feb 20, 2024 at 1:49 PM Robert Coup
    <robert.c...@koordinates.com> wrote:

        Hi,

        On Tue, 20 Feb 2024 at 21:44, Robert Coup
        <robert.c...@koordinates.com> wrote:

            Hi Simon,

            On Tue, 20 Feb 2024 at 21:11, Simon Eves
            <simon.e...@heavy.ai> wrote:

                Here's the stack trace for the original assert.
                Something is stepping on scratch_ to make it
                0x1000000000 instead of null, which it starts out as
                when the flatbuffer object is created, but by the time
                it gets to allocating memory, it's broken.


            What happens if you set a watchpoint in gdb when the
            flatbuffer is created?

            watch -l myfb->scratch
            or watch *0x1234c0ffee


        Or I've also had success with Mozilla's rr:
        https://rr-project.org/ — you can run to a point where scratch
        is wrong, set a watchpoint on it, and then run the program
        backwards to find out what touched it.

        Rob :)


_______________________________________________
gdal-dev mailing list
gdal-dev@lists.osgeo.org
https://lists.osgeo.org/mailman/listinfo/gdal-dev

--
http://www.spatialys.com
My software is free, but my time generally not.
_______________________________________________
gdal-dev mailing list
gdal-dev@lists.osgeo.org
https://lists.osgeo.org/mailman/listinfo/gdal-dev

Reply via email to