Re: [gdal-dev] Errors when reading large xlsx files

Even Rouault Tue, 29 Mar 2022 12:25:23 -0700


On 29/03/2022 at 20:29, Dirk Vanden Boer wrote:

> The effect will at least be to ignore any rows for which this message was raised - the function is unconditionally exited after the error is raised, before a new feature is added to the current layer.
So do I understand correctly that for files containing roughly more than 100000 lines, rows that contain more columns of data than the detected headers are not readable? Because if that is the case I will be required to patch my gdal version to not skip these lines.

Please file an issue about that at https://github.com/OSGeo/gdal/issues


Regards,
Dirk

On Tue, Mar 29, 2022 at 8:09 PM Daniel Evans <[email protected]> wrote:


    > does the error impact the returned data?

    The effect will at least be to ignore any rows for which this
    message was raised - the function is unconditionally exited after
    the error is raised, before a new feature is added to the current
    layer.

    > Is there a way to suppress this error without disabling the gdal
    log handling? My logs are flooded with these messages; modifying
    the xlsx files is not an option because there are many and they
    are supplied by clients and regularly updated.

    I suspect the only way is by providing GDAL with a custom error
    handler, which ignores this specific message and otherwise
    delegates back to CPLDefaultErrorHandler() (or prints to stderr
    itself).
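
A minimal sketch of the filtering idea described above, written so that it compiles without the GDAL headers. The names `ErrorSink`, `IsNoisyXlsxMessage` and `FilterAndForward` are hypothetical and not part of GDAL, and plain ints stand in for GDAL's error class and error number. In a real build the function installed with CPLPushErrorHandler() (or CPLSetErrorHandler()) would take GDAL's own handler signature and forward to CPLDefaultErrorHandler(). Note that this would only quiet the log; the rows that triggered the message are still skipped.

```cpp
#include <cassert>
#include <cstring>

// Message text copied from the xlsx driver snippet quoted in this thread.
static const char kNoisyMessage[] =
    "Adding too many columns to too many existing features";

// Shape of the delegate. Plain ints stand in for GDAL's error class and
// error number so that this sketch compiles without the GDAL headers.
typedef void (*ErrorSink)(int errClass, int errNo, const char* msg);

// Hypothetical helper: true when the message is the one to drop.
inline bool IsNoisyXlsxMessage(const char* msg)
{
    return msg != nullptr && std::strstr(msg, kNoisyMessage) != nullptr;
}

// Filtering step: drop the noisy message, forward everything else.
// Returns true when the message was forwarded to the delegate.
inline bool FilterAndForward(int errClass, int errNo, const char* msg,
                             ErrorSink delegate)
{
    assert(delegate != nullptr);
    if (IsNoisyXlsxMessage(msg))
        return false;
    delegate(errClass, errNo, msg);
    return true;
}
```

Matching on a substring rather than on the error number keeps other CPLE_NotSupported errors visible, at the cost of breaking silently if the driver's wording ever changes.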

    Regards,
    Daniel

    On Tue, 29 Mar 2022 at 09:20, Dirk Vanden Boer
    <[email protected]> wrote:

        Scanning through the file, it turns out 2 lines actually have
        a value in the eighth column. That's why the column is present,
        even though it doesn't have a header.

        So I have 2 questions:
        - does the error impact the returned data?
        - Is there a way to suppress this error without disabling the
        gdal log handling? My logs are flooded with these messages;
        modifying the xlsx files is not an option because there are
        many and they are supplied by clients and regularly updated.

        Regards,
        Dirk

        On Tue, Mar 29, 2022 at 10:06 AM Daniel Evans
        <[email protected]> wrote:

            Hi Dirk,

            > I do notice when I open the file in Excel and select
            everything, the eighth column in the file is empty but also
            gets selected.

            It looks like that's the key here.

            The code you identified gets hit if GDAL encounters a row
            with more populated columns than the layer has fields so
            far. If the product of (number of rows already read) x
            (number of columns to be added) is too high (>100,000),
            GDAL gives the error you're getting. That functionality
            was added in commit 4f3f1fa [1], in response to an OSSFuzz
            vulnerability report noting that GDAL becomes very slow if
            an Excel file adds many extra columns after reading many
            rows already (presumably as it has to modify every feature
            already read). I think this is where Even would start
            pointing out that there are downsides to such automated
            security scanners, as the distinction between "it's just
            slow for large files" (>25s in the report) and "an actual
            DoS attack" is awkward when dealing with typical GIS data
            volumes.
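
To make the arithmetic concrete, here is a minimal standalone sketch of the condition from the driver snippet quoted further down the thread. `RowWouldBeRejected` and `kMaxCellsToBackfill` are hypothetical names, not GDAL identifiers, and the extra guard for rows that add no columns is an assumption added so the helper is safe to call on its own.

```cpp
#include <cstddef>
#include <cstdint>

// Limit hard-coded in the driver snippet quoted in this thread.
constexpr std::int64_t kMaxCellsToBackfill = 100000;

// Hypothetical helper (not a GDAL function) mirroring the quoted condition:
// the row is rejected when the number of new columns exceeds
// 100000 / featureCount, using integer division as the driver does.
// The valuesInRow > fieldCount guard is an assumption of this sketch; it
// avoids unsigned wrap-around when the row adds no columns.
constexpr bool RowWouldBeRejected(std::int64_t featureCount,
                                  std::size_t valuesInRow,
                                  std::size_t fieldCount)
{
    return featureCount > 0 &&
           valuesInRow > fieldCount &&
           static_cast<std::size_t>(valuesInRow - fieldCount) >
               static_cast<std::size_t>(kMaxCellsToBackfill / featureCount);
}
```

With the values from the original report (128741 features, 8 values in the row, 7 fields), the integer division 100000 / 128741 yields 0, so a row adding even a single column is rejected. With 100000 features or fewer the quotient is at least 1, and one extra column is accepted.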

            Are you sure the 8th column contains no data at all? Even
            if it is empty, my experience is that Excel can be pretty
            stubborn about saving empty columns that have contained
            data at some point in the file's history. From memory,
            selecting the whole column, deleting it, and saving again
            usually convinces Excel to no longer save it.

            Regards,
            Daniel

            [1] https://github.com/OSGeo/gdal/commit/4f3f1facc5da0eeac71f6b1ba946b7618386ee7d

            On Tue, 29 Mar 2022 at 08:41, Dirk Vanden Boer
            <[email protected]> wrote:

                Hi,

                When reading xlsx files that contain a lot of lines,
                gdal reports the following error multiple times:
                | Adding too many columns to too many existing features

                It comes from the xlsx driver:
                GIntBig nFeatureCount = poCurLayer->GetFeatureCount(false);
                if( nFeatureCount > 0 &&
                    static_cast<size_t>(apoCurLineValues.size() -
                        poCurLayer->GetLayerDefn()->GetFieldCount()) >
                            static_cast<size_t>(100000 / nFeatureCount) )
                {
                    CPLError(CE_Failure, CPLE_NotSupported,
                             "Adding too many columns to too many "
                             "existing features");
                    return;
                }

                The featureCount in my case is 128741
                apoCurLineValues.size() = 8
                fieldCount = 7

                Why is this error reported? Does it impact the data
                that is actually read?
                I do notice that when I open the file in Excel and
                select everything, the eighth column in the file is
                empty but also gets selected.

                Kind regards,
                Dirk
                _______________________________________________
                gdal-dev mailing list
                [email protected]
                https://lists.osgeo.org/mailman/listinfo/gdal-dev



--
http://www.spatialys.com
My software is free, but my time generally not.
