It's not obvious why that change would have broken anything, and I certainly still have absolutely no idea why it only happens in a full static build.

At that point, I would tentatively bet that your whole application has another component using flatbuffers at a different version, one which wouldn't have the new vector_downward::size_ member. Although I would have expected static linking to be in a better position than dynamic linking to detect duplicated symbols... (though C++ inline and template symbols are typically emitted as weak symbols, so the linker just keeps one copy rather than reporting a conflict)

One thing we didn't do in GDAL is to add a GDAL-specific namespace around its vendored flatbuffers copy (we did that in MapServer, to avoid potential conflicts between MapServer's flatbuffers copy and GDAL's).

An interesting experiment would be to revert https://github.com/google/flatbuffers/commit/9e4ca857b6dadf116703f612187e33b7d4bb6688 but add an unused size_ member, to see if that's enough to break things. Or just scramble the order of the members of vector_downward a bit.

Or try replacing the "flatbuffers" namespace with something like "gdal_flatbuffers".


Simon



On Sat, Feb 24, 2024 at 5:27 PM Simon Eves <simon.e...@heavy.ai> wrote:

    OK, so I tried a custom build of 3.7.3 with the latest
    *flatbuffers* (23.5.26), which was a drop-in replacement for 2.0.6
    other than the version asserts.

    This does not exhibit the original problem either.

    However, while it produces files that the stock static build, the
    static build with the older *flatbuffers* (2.0.0), and the Ubuntu
    dynamic build can all read just fine, it is unable to read ANY
    files back in again (in the context of our server geo importer,
    anyway).

    GDAL throws a *CE_Failure* of *Header failed consistency
    verification (1)*, which is from *OGRFlatGeobufLayer::Open()*, and
    the dataset reports no layers (or at least, no vector layers).

    This also appears to be a side-effect of it being a static build,
    as *ogrinfo* built from the same source (with *flatbuffers*
    2.0.0), but in regular shared libs mode, can read all three files
    just fine. I have been unable to achieve a full-static tools
    build, so I can't try that right now.

    This either means that the problem is still there in some form in
    the latest *flatbuffers*, but has moved, or that the higher-level
    FGB file schema verification can be affected by the *flatbuffers*
    version. Both are equally concerning.

    Anyway, the build with the older *flatbuffers* 2.0.0 extracted
    from the v3.5.3 tree (with the *Table::VerifyField* mod) seems to
    work fine in all ways, so we're probably gonna go with that, in
    the absence of anything else.

    One other weirdness is that, of the three files, the two produced
    by the static builds (*flatbuffers* 2.0.0 and *flatbuffers*
    23.5.26) are 16 bytes longer than the one from the Ubuntu dynamic
    build. All three read just fine with *ogrinfo* and our server geo
    importer, and result in the same table. Here is a link to a bundle
    with all three files plus the GeoJSON equivalent (*MULTIPOLYGON*
    US states with some metadata).

    https://drive.google.com/file/d/1ETRuV63gvUL4aNAT_4KvjrtK1uiCrFun/view?usp=sharing

    As ever, happy to get into the weeds with more details of the
    original problem, but pretty sure that 95% of the readers of this
    list don't want this thread to get any longer! :)

    Simon


    On Fri, Feb 23, 2024 at 3:46 PM Simon Eves <simon.e...@heavy.ai>
    wrote:

        Our emails crossed. I am indeed testing with the latest
        flatbuffers now too.

        Agreed on the rest.

        On Fri, Feb 23, 2024 at 3:42 PM Even Rouault
        <even.roua...@spatialys.com> wrote:

            Simon,

            did you try updating to the latest
            https://github.com/google/flatbuffers/releases to see if
            that would solve the issue? If that worked, that would be
            the best way forward...

            Otherwise, if the issue persists with the latest
            flatbuffers release, an (admittedly rather tedious) option
            would be to do a git bisect on the flatbuffers code to
            identify the culprit commit. With some luck, the root
            cause might be obvious once a single culprit commit can be
            exhibited (perhaps some subtle C++ undefined behaviour is
            triggered? it is also a bit mysterious that it hits only
            for static builds); otherwise, raise it with the upstream
            flatbuffers project to ask for their expertise.
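That bisect can be largely automated with `git bisect run`. A toy, self-contained sketch of the workflow (in the real case the endpoints would be real flatbuffers tags, and the run script would rebuild GDAL against the checked-out flatbuffers and exercise the failing import; here a file containing "bug" stands in for the failure):

```shell
# Build a throwaway repo where the "bug" first appears at commit 4,
# then let `git bisect run` find that commit automatically.
set -e
dir=$(mktemp -d)
cd "$dir"
git init -q .
git config user.email bisect@example.com
git config user.name bisect
for i in 1 2 3 4 5 6; do
  if [ "$i" -ge 4 ]; then echo "bug $i" > state; else echo "ok $i" > state; fi
  git add state
  git commit -q -m "commit $i"
done
# bad = HEAD, good = root commit
git bisect start HEAD "$(git rev-list --max-parents=0 HEAD)"
# the run script exits 0 for a good revision, non-zero for a bad one
git bisect run sh -c '! grep -q bug state'
# the last log line names the first bad commit
git bisect log | tail -n 1
```

Each bisect step would be one GDAL rebuild plus one test run, so the whole search is logarithmic in the number of candidate commits.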

            Even

            On 23/02/2024 at 23:54, Simon Eves via gdal-dev wrote:
            I was able to create a fork of 3.7.3 with just the
            *flatbuffers* replaced with the pre-3.6.x version (2.0.0).

            This seemed to only require changes to the
            version asserts and adding an *align* parameter to
            *Table::VerifyField()* to match the newer API.

            https://github.com/heavyai/gdal/tree/simon.eves/release/3.7/downgrade_to_flatbuffers_2.0.0

            Our system works correctly and passes all GDAL I/O tests
            with that version. Obviously this isn't an ideal
            solution, but this is otherwise a release blocker for us.

            I would still very much like to discuss the original
            problem more deeply, and hopefully come up with a better
            solution.

            Yours hopefully,

            Simon



            On Thu, Feb 22, 2024 at 10:22 PM Simon Eves
            <simon.e...@heavy.ai> wrote:

                Thank you, Robert, for the RR tip. I shall try it.

                I have new findings to report, however.

                First of all, I confirmed that a build against GDAL
                3.4.1 (the version we were on before) still works. I
                also confirmed that builds against 3.7.3 and 3.8.4
                still failed even with no additional library
                dependencies (just sqlite3 and proj), in case it was
                a side-effect of us also adding more of those. I then
                tried 3.5.3, with the CMake build (same config as we
                use for 3.7.3) and that worked. I then tried 3.6.4
                (again, same CMake config) and that failed. These
                were all from bundles.

                I then started delving through the GDAL repo itself.
                I found the common root commit of 3.5.3 and 3.6.4,
                and all the commits in the
                *ogr/ogrsf_frmts/flatgeobuf* sub-project between that
                one and the final of each. For 3.5.3, this was only
                two. I built and tested both, and they were fine. I
                then tried the very first one that was new in the
                3.6.4 chain (not in the history of 3.5.3), which was
                actually a bulk update to the *flatbuffers*
                sub-library, committed by Bjorn Harrtell on May 8
                2022 (SHA f7d8876). That one had the issue. I then
                tried the immediately-preceding commit (an unrelated
                docs change) and that one was fine.

                My current hypothesis, therefore, is that the
                *flatbuffers* update introduced the issue, or at
                least, the susceptibility of the issue.

                I still cannot explain why it only occurs in an
                all-static build, and am even less able to explain why
                it only occurs in our full system and not with the
                simple test program (run against the very same static
                lib build) that does the very same sequence of GDAL
                API calls. However, I repeated the build tests of the
                commits on either side, and of a few other random ones
                a bit further away in each direction, and the results
                were consistent. Again, it happens with both GCC 11 and
                Clang 14 builds, Debug or Release.

                I will continue tomorrow to look at the actual
                changes to *flatbuffers* in that update, although
                they are quite significant. Certainly, the
                *vector_downward* class, which is directly involved,
                was a new file in that update (although on inspection
                of that file's history in the *google/flatbuffers*
                repo, it seems it was just split out of another header).

                Bjorn, I don't mean to call you out directly, but I
                am CC'ing you to ensure you see this, as you appear
                to be a significant contributor to the *flatbuffers*
                project itself. Any insight you may have would be
                very welcome. I am of course happy to describe my
                debugging findings in more detail, privately if you
                wish, rather than spamming the list.

                Simon






                On Tue, Feb 20, 2024 at 1:49 PM Robert Coup
                <robert.c...@koordinates.com> wrote:

                    Hi,

                    On Tue, 20 Feb 2024 at 21:44, Robert Coup
                    <robert.c...@koordinates.com> wrote:

                        Hi Simon,

                        On Tue, 20 Feb 2024 at 21:11, Simon Eves
                        <simon.e...@heavy.ai> wrote:

                            Here's the stack trace for the original
                            assert. Something is stepping on scratch_,
                            making it 0x1000000000 instead of null
                            (which is what it starts out as when the
                            flatbuffer object is created); by the time
                            it gets to allocating memory, it's broken.


                        What happens if you set a watchpoint in gdb
                        when the flatbuffer is created?

                        watch -l myfb->scratch
                        or watch *0x1234c0ffee


                    Or I've also had success with Mozilla's rr:
                    https://rr-project.org/ — you can run to a point
                    where scratch is wrong, set a watchpoint on it,
                    and then run the program backwards to find out
                    what touched it.
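A sketch of that rr workflow (the binary name and the exact watched expression are placeholders; `rr replay` opens a gdb session on the recording, where `reverse-continue` executes backwards to the previous watchpoint hit):

```
rr record ./our_server --run-failing-geo-import   # record one failing run
rr replay                                         # replay it under gdb
(gdb) break OGRFlatGeobufLayer::Open
(gdb) continue
# once scratch_ holds the bogus 0x1000000000 value:
(gdb) watch -l fbb.buf_.scratch_
(gdb) reverse-continue    # runs backwards to the write that clobbered it
```

Because the recording is deterministic, the same corruption is reproduced on every replay, which sidesteps the flakiness of chasing it live.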

                    Rob :)


            _______________________________________________
            gdal-dev mailing list
            gdal-dev@lists.osgeo.org
            https://lists.osgeo.org/mailman/listinfo/gdal-dev

--
http://www.spatialys.com
My software is free, but my time generally not.

