Glynn Clements wrote:
lseek() always uses off_t. Originally it used long (hence the name
"l"seek), but that's ancient history; you won't find such a system
outside of a museum.

_FILE_OFFSET_BITS determines whether off_t is 32 or 64 bits. If it's
64 bits, many of the POSIX I/O functions (open, read, write, lseek)
are redirected to 64-bit equivalents (open64, read64, etc).
That's why I asked whether fseek, fread, fwrite etc. can be replaced with lseek, read, write etc. :-) With lseek etc. there would be no need to check HAVE_LARGEFILES, just compile with -D_FILE_OFFSET_BITS=64. Do I understand correctly that fseeko and ftello are only needed on 32-bit systems that want _FILE_OFFSET_BITS=64? fseek/ftell use long, which is 64 bits on my 64-bit Linux; I guess that's why I can already write coor files > 2GB with the current vector libs.
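Just to convince myself, a tiny test program (only a sketch, not part of GRASS) shows whether -D_FILE_OFFSET_BITS=64 really makes off_t 64 bits:

/* Sketch: compile once without and once with -D_FILE_OFFSET_BITS=64,
 * e.g. gcc -D_FILE_OFFSET_BITS=64 -o offcheck offcheck.c
 * With the define, off_t should be 8 bytes even on a 32-bit system,
 * while long stays at the platform's native size. */
#include <stdio.h>
#include <sys/types.h>

int main(void)
{
    printf("sizeof(off_t) = %zu bytes\n", sizeof(off_t));
    printf("sizeof(long)  = %zu bytes\n", sizeof(long));
    return 0;
}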
It's not worth using "raw" I/O just to avoid this issue. Apart from
anything else, there's a potentially huge performance hit, as the
vector library tends to use many small read/write operations. Using
low-level I/O requires a system call for each operation, while the
stdio interface will coalesce these, reading/writing whole blocks.
Interesting and good to know. So we do need G_fseek() and G_ftell().
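Something along these lines, perhaps (just a sketch of how such wrappers might look, not the actual implementation; HAVE_LARGEFILES stands for whatever the configure check defines):

/* Sketch of possible G_fseek()/G_ftell() wrappers: keep the buffered
 * stdio interface, but use fseeko()/ftello() with a 64-bit off_t
 * where LFS is available, and fall back to fseek()/ftell() otherwise. */
#include <stdio.h>
#include <sys/types.h>

int G_fseek(FILE *fp, off_t offset, int whence)
{
#ifdef HAVE_LARGEFILES
    return fseeko(fp, offset, whence);
#else
    return fseek(fp, (long)offset, whence);
#endif
}

off_t G_ftell(FILE *fp)
{
#ifdef HAVE_LARGEFILES
    return ftello(fp);
#else
    return (off_t)ftell(fp);
#endif
}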
The problem I see is that offset values are stored in topo and cidx (e.g. the topo file knows that line i is in the coor file at offset o). So if the topo file was written with a 64-bit off_t but the currently compiled library uses a 32-bit off_t, can this 32-bit library somehow get these 64-bit offset values out of the topo file?

In the worst case, it can just perform 2 32-bit reads, and check that
the high word is zero and the low word is positive.
Uff. Some more safety checks in the code. From a coding perspective it would be easier to just request a topology rebuild, annoying for the user though. OTOH, the coor file size check is done before anything is read from the coor file, so the libs could say something like "Sorry, that vector is too big for you. Please recompile GRASS with LFS" (more friendly phrasing needed). Also potentially annoying. But if the coor file size check passes (<= 2GB), the high word must always be zero, otherwise the offset would point beyond EOF, so you could just use the low word value.

Would the high word and low word have to be swapped if the byte order of the vector differs from the byte order of the current system? That can happen when e.g. a whole GRASS location is copied to another system. I think not, because the vector libs use their own fixed byte order. I would really just request a topology rebuild to avoid all this hassle.
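To spell out Glynn's two-word suggestion for myself (only a sketch; it assumes the two 32-bit words have already been read in the file's fixed byte order by the portable read routines, so no extra word swapping is done here):

#include <stdint.h>

/* Accept a 64-bit offset stored as two 32-bit words only if it fits
 * into a 32-bit off_t: the high word must be zero and the low word
 * must be positive. Returns 0 on success, -1 if the offset is too
 * large for a non-LFS build (request a topology rebuild instead). */
static int offset_from_words(uint32_t hi, uint32_t lo, int32_t *offset)
{
    if (hi != 0 || (lo & 0x80000000UL))
        return -1;

    *offset = (int32_t)lo;
    return 0;
}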
If the topo file contains any offsets which exceed the 2GiB range,
then the coor file will be larger than 2GiB. If you aren't using
_FILE_OFFSET_BITS=64, open()ing the coor file will likely fail.
Opening the coor file is not even attempted with the current code in this situation, because the coor file size stored in the topo header cannot be larger than 2GB, and this size is used for a safety check before the coor file is opened. Actually, I don't know what would happen on a 32-bit system. If the new vector libs are compiled without LFS, does a 32-bit system have a chance to find out that the coor file is too large? To be precise, when calling stat(path, &stat_buf), what is the maximum possible value of stat_buf.st_size in 32-bit? Likely LONG_MAX.
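If I understand the POSIX/glibc behaviour correctly (not verified on a real 32-bit system), stat() does not silently truncate st_size; it fails with EOVERFLOW when the size does not fit into off_t, so a friendly error message would still be possible. A sketch:

#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Sketch: in a non-LFS build, stat() on a coor file > 2 GiB should
 * fail with errno == EOVERFLOW instead of returning a truncated size. */
static int check_coor_size(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0) {
        if (errno == EOVERFLOW) {
            fprintf(stderr, "Sorry, that vector is too large for this "
                            "build; please recompile GRASS with LFS\n");
            return -1;
        }
        perror(path);
        return -1;
    }
    printf("coor file size: %lld bytes\n", (long long)st.st_size);
    return 0;
}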
OTOH, this amounts to a format change, so you may as well just add a
new field to the header. Either way, the version number needs to be
increased.
Increasing the minor version number of topo should be sufficient, but the backwards-compatibility minor version number of topo must also be increased to enforce rebuilding of topology when vectors written with the new libs are opened with old libs (which will then write new topo and cidx files). I would try to keep the coor version numbers as they are; that would at least give backwards/forwards portability of vector files. The cidx version numbers could also stay unchanged, only the offset values could be stored as 64 bits. But topo is read first, so the information in the header of the topo file can (must?) be used for safety checks. I guess we are lost if someone produces a topo file > 2GB, but a vector with such a large topo file would be a nightmare to work with anyway. No idea whether this still holds true in, say, 5 years from now (I already got topo files of up to 600MB, unworkable though, because there is no LFS in the vector libs and coor was > 2GB).
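Roughly what I have in mind for the check (all names and numbers below are placeholders, not the actual diglib identifiers):

/* Sketch of the version check: a topo file is rejected, i.e. a
 * topology rebuild is requested, if its backwards-compatibility
 * version is newer than what this library understands. */
struct topo_head_sketch {
    int version_major, version_minor;   /* version that wrote the file    */
    int back_major, back_minor;         /* oldest version able to read it */
};

#define LIB_TOPO_VERSION_MAJOR 5   /* placeholder values */
#define LIB_TOPO_VERSION_MINOR 1

static int topo_is_readable(const struct topo_head_sketch *h)
{
    if (h->back_major > LIB_TOPO_VERSION_MAJOR ||
        (h->back_major == LIB_TOPO_VERSION_MAJOR &&
         h->back_minor > LIB_TOPO_VERSION_MINOR))
        return 0;   /* too new: request topology rebuild */

    return 1;
}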

I think we could soon come up with a detailed plan of action: what are the currently known caveats, and what should be done where, in what order, to get LFS into the vector libs. Anybody taking on this task would profit from such a guideline, with a big warning that the suggested changes may not be sufficient, that something may have been missed, and that the list of caveats is most likely not complete.

Lots of "if"s and "but"s and "?" in this post of mine.

PS: Thanks for your patience, Glynn.
