Simon
On Sat, Feb 24, 2024 at 5:27 PM Simon Eves <simon.e...@heavy.ai> wrote:
OK, so I tried a custom build of 3.7.3 with the latest
*flatbuffers* (23.5.26), which was a drop-in replacement for 2.0.6
other than the version asserts.
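(For reference, the version asserts in question are of this general shape in recent flatc-generated headers; the snippet below is only an illustration of the pattern, with example version numbers, not the exact lines in the GDAL tree.)

    // Illustration of the "version asserts": each flatc-generated header pins
    // the flatbuffers library version it was generated against, so dropping
    // in a different flatbuffers fails to compile until these checks are
    // updated (or the headers are regenerated).
    #include "flatbuffers/base.h"  // defines FLATBUFFERS_VERSION_MAJOR etc.

    static_assert(FLATBUFFERS_VERSION_MAJOR == 23 &&
                  FLATBUFFERS_VERSION_MINOR == 5 &&
                  FLATBUFFERS_VERSION_REVISION == 26,
                  "Non-compatible flatbuffers version included");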
This does not exhibit the original problem either.
However, while it produces files that the stock static build, the static build with the older *flatbuffers* (2.0.0), and the Ubuntu dynamic build can all read just fine, it is unable to read ANY files back in again (in the context of our server geo importer, anyway).
GDAL throws a *CE_Failure* of *Header failed consistency verification (1)*, which is from *OGRFlatGeobufLayer::Open()*, and the dataset reports no layers (or at least, no vector layers).
This also appears to be a side-effect of it being a static build,
as *ogrinfo* built from the same source (with *flatbuffers*
2.0.0), but in regular shared libs mode, can read all three files
just fine. I have been unable to achieve a full-static tools
build, so I can't try that right now.
This either means that the problem is still there in some form in
the latest *flatbuffers*, but has moved, or that the higher-level
FGB file schema verification can be affected by the *flatbuffers*
version. Both are equally concerning.
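(For context, the failing check is the flatbuffers-level verification of the FGB header block; the sketch below shows the general pattern only, under the assumption that the driver verifies the header with the generated verifier, and is not the exact code in *OGRFlatGeobufLayer::Open()*.)

    // Simplified sketch: the FlatGeobuf header is itself a flatbuffer and is
    // run through the flatbuffers Verifier before being trusted, so the
    // outcome of this check depends on the flatbuffers library linked in.
    #include "flatbuffers/flatbuffers.h"
    #include "header_generated.h"  // flatc-generated FlatGeobuf header schema

    static bool HeaderLooksSane(const uint8_t *buf, size_t len)
    {
        flatbuffers::Verifier verifier(buf, len);
        // Returns false on any structural inconsistency, which is what
        // surfaces as the "Header failed consistency verification" CE_Failure.
        return FlatGeobuf::VerifyHeaderBuffer(verifier);
    }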
Anyway, the build with the older *flatbuffers* 2.0.0 extracted
from the v3.5.3 tree (with the *Table::VerifyField* mod) seems to
work fine in all ways, so we're probably gonna go with that, in
the absence of anything else.
One other weirdness is that, of the three files, the two produced
by the static builds (*flatbuffers* 2.0.0 and *flatbuffers*
23.5.26) are 16 bytes longer than the one from the Ubuntu dynamic
build. All three read just fine with *ogrinfo* and our server geo
importer, and result in the same table. Here is a link to a bundle
with all three files plus the GeoJSON equivalent (*MULTIPOLYGON*
US states with some metadata).
https://drive.google.com/file/d/1ETRuV63gvUL4aNAT_4KvjrtK1uiCrFun/view?usp=sharing
As ever, happy to get into the weeds with more details of the
original problem, but pretty sure that 95% of the readers of this
list don't want this thread to get any longer! :)
Simon
On Fri, Feb 23, 2024 at 3:46 PM Simon Eves <simon.e...@heavy.ai>
wrote:
Our emails crossed. I am indeed testing with the latest
flatbuffers now too.
Agreed on the rest.
On Fri, Feb 23, 2024 at 3:42 PM Even Rouault
<even.roua...@spatialys.com> wrote:
Simon,
Did you try to update to the latest
https://github.com/google/flatbuffers/releases to see if
that would solve the issue? If that worked, that would be
the best way forward...
Otherwise, if the issue persists with the latest
flatbuffers release, an (admittedly rather tedious) option
would be to do a git bisect on the flatbuffers code to
identify the culprit commit. With some luck, the root
cause might be obvious if a single culprit commit can be
exhibited (perhaps some subtle C++ undefined behaviour
triggered? it is also a bit mysterious that it hits only
for static builds); otherwise, raise it with the upstream
flatbuffers project to ask for their expertise.
Even
On 23/02/2024 at 23:54, Simon Eves via gdal-dev wrote:
I was able to create a fork of 3.7.3 with just the
*flatbuffers* replaced with the pre-3.6.x version (2.0.0).
This seemed to only require changes to the
version asserts and adding an *align* parameter to
*Table::VerifyField()* to match the newer API.
https://github.com/heavyai/gdal/tree/simon.eves/release/3.7/downgrade_to_flatbuffers_2.0.0
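(For reference, the API mismatch being papered over is roughly the following; this is a sketch of the shape of the change only, and the real patch is in the branch above.)

    // Inside flatbuffers 2.0.0's class Table, VerifyField() gains an
    // alignment parameter so that the newer generated code, which passes
    // three arguments, still compiles; the 2.0.0 implementation simply
    // ignores it.
    template<typename T>
    bool VerifyField(const Verifier &verifier, voffset_t field,
                     size_t /*align*/) const {
      auto field_offset = GetOptionalFieldOffset(field);
      // Same body as the stock 2.0.0 helper; only the signature changed.
      return !field_offset || verifier.Verify<T>(data_, field_offset);
    }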
Our system works correctly and passes all GDAL I/O tests
with that version. Obviously this isn't an ideal
solution, but this is otherwise a release blocker for us.
I would still very much like to discuss the original
problem more deeply, and hopefully come up with a better
solution.
Yours hopefully,
Simon
On Thu, Feb 22, 2024 at 10:22 PM Simon Eves
<simon.e...@heavy.ai> wrote:
Thank you, Robert, for the RR tip. I shall try it.
I have new findings to report, however.
First of all, I confirmed that a build against GDAL
3.4.1 (the version we were on before) still works. I
also confirmed that builds against 3.7.3 and 3.8.4
still failed even with no additional library
dependencies (just sqlite3 and proj), in case it was
a side-effect of us also adding more of those. I then
tried 3.5.3, with the CMake build (same config as we
use for 3.7.3) and that worked. I then tried 3.6.4
(again, same CMake config) and that failed. These
were all from bundles.
I then started delving through the GDAL repo itself.
I found the common root commit of 3.5.3 and 3.6.4, and
all the commits in the *ogr/ogrsf_frmts/flatgeobuf*
sub-project between that one and the final release of
each. For 3.5.3, there were only two. I built and tested
both, and they were fine. I
then tried the very first one that was new in the
3.6.4 chain (not in the history of 3.5.3), which was
actually a bulk update to the *flatbuffers*
sub-library, committed by Bjorn Harrtell on May 8
2022 (SHA f7d8876). That one had the issue. I then
tried the immediately-preceding commit (an unrelated
docs change) and that one was fine.
My current hypothesis, therefore, is that the
*flatbuffers* update introduced the issue, or at least
the susceptibility to it.
I still cannot explain why it only occurs in an
all-static build, and am even less able to explain why
it only occurs in our full system and not with the
simple test program that runs against the very same
static lib build and makes the very same sequence of
GDAL API calls; but I repeated the build tests of the
commits on either side, and a few other random ones a
bit further away in each direction, and the results were
consistent. Again, it happens with both GCC 11 and
Clang 14 builds, Debug or Release.
I will continue tomorrow to look at the actual
changes to *flatbuffers* in that update, although
they are quite significant. Certainly, the
*vector_downward* class, which is directly involved,
was a new file in that update (although on inspection
of that file's history in the *google/flatbuffers*
repo, it seems it was just split out of another header).
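(For anyone not familiar with it, the class looks roughly like this; a heavily simplified model, not the real header.)

    // Rough model of flatbuffers' vector_downward: the builder's buffer grows
    // downward from the end of the allocation, while a small scratchpad used
    // for vtable/offset bookkeeping grows upward from the start. The pointers
    // start out null and are only set once the first allocation happens,
    // which is why a garbage value in scratch_ before any allocation points
    // at something stomping on the object.
    #include <cstdint>

    class vector_downward_model {
      uint8_t *buf_ = nullptr;      // start of the allocation
      uint8_t *cur_ = nullptr;      // write cursor, moves down toward buf_
      uint8_t *scratch_ = nullptr;  // scratchpad cursor, moves up from buf_
      // ... sizes, alignment, allocator, etc.
    };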
Bjorn, I don't mean to call you out directly, but I
am CC'ing you to ensure you see this, as you appear
to be a significant contributor to the *flatbuffers*
project itself. Any insight you may have would be
very welcome. I am of course happy to describe my
debugging findings in more detail, privately if you
wish, rather than spamming the list.
Simon
On Tue, Feb 20, 2024 at 1:49 PM Robert Coup
<robert.c...@koordinates.com> wrote:
Hi,
On Tue, 20 Feb 2024 at 21:44, Robert Coup
<robert.c...@koordinates.com> wrote:
Hi Simon,
On Tue, 20 Feb 2024 at 21:11, Simon Eves
<simon.e...@heavy.ai> wrote:
Here's the stack trace for the original
assert. Something is stepping on scratch_,
changing it from null (which it starts out
as when the flatbuffer object is created)
to 0x1000000000, so by the time it gets to
allocating memory, it's broken.
What happens if you set a watchpoint in gdb
when the flatbuffer is created?
watch -l myfb->scratch
or watch *0x1234c0ffee
Or I've also had success with Mozilla's rr:
https://rr-project.org/ — you can run to a point
where scratch is wrong, set a watchpoint on it,
and then run the program backwards to find out
what touched it.
Rob :)
--
http://www.spatialys.com
My software is free, but my time generally not.