Re: [PR] [GLUTEN-8846][CH] [Part1] Support reading Iceberg equality delete files [incubator-gluten]

via GitHub Thu, 27 Feb 2025 18:22:39 -0800


CodiumAI-Agent commented on PR #8847:
URL: 
https://github.com/apache/incubator-gluten/pull/8847#issuecomment-2689554641


   ## PR Code Suggestions ✨
   
   <!-- b14cfe7 -->
   
   <table><thead><tr><td><strong>Category</strong></td><td 
align=left><strong>Suggestion&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 
&nbsp; &nbsp; &nbsp; &nbsp; </strong></td><td 
align=center><strong>Impact</strong></td></tr><tbody><tr><td rowspan=5>Possible 
issue</td>
   <td>
   
   
   
   <details><summary>Proper filter index initialization</summary>
   
   ___
   
   
   **Avoid initializing a size_t index with -1; initialize it properly to 
prevent <br>unexpected overflow.**
   
   [cpp-ch/local-engine/Storages/SubstraitSource/iceberg/IcebergReader.cpp 
[112]](https://github.com/apache/incubator-gluten/pull/8847/files#diff-b74de69bb5b8b743a6a74fafe0bbb47238b7a7c0cc10c00a3f9e30977a5922ffR112-R112)
   
   ```diff
   -size_t filter_column_position = -1;
   +size_t filter_column_position;
   +{
   +    Block block = readHeader.cloneWithColumns(columns);
   +    columns.clear();
   +    delete_expr->execute(block, num_rows_before_filtration);
   +    filter_column_position = 
block.getPositionByName(delete_expr_column_name);
   +}
   ```
   <details><summary>Suggestion importance[1-10]: 7</summary>
   
   __
   
   Why: Replacing "size_t filter_column_position = -1;" with an uninitialized 
declaration that is later assigned avoids undefined behavior due to unsigned 
overflow and makes the intent clearer.
   
   
   </details></details></td><td align=center>Medium
   
   </td></tr><tr><td>
   
   
   
   <details><summary>null check before use</summary>
   
   ___
   
   
   **Add a null-pointer check in the new <code>toString</code> function to 
guard against potential <br>dereference of a null <code>col</code>.**
   
   [cpp-ch/local-engine/Common/DebugUtils.cpp 
[146-165]](https://github.com/apache/incubator-gluten/pull/8847/files#diff-cffe5919d2a79bef20429596b0b13b6fc348327b48b82c0a6f9915a3dc8f34daR146-R165)
   
   ```diff
    static std::string toString(const DB::IColumn * const col, size_t row, 
size_t width)
    {
   +    if (!col)
   +        return "null";
        auto getDataType = [](const DB::IColumn * col)
        {
            if (const auto * column_nullable = 
DB::checkAndGetColumn<DB::ColumnNullable>(col))
                return column_nullable->getNestedColumn().getDataType();
            …
        };
   ```
   <details><summary>Suggestion importance[1-10]: 5</summary>
   
   __
   
   Why: The suggestion adds a defensive null-pointer check to prevent potential 
crashes, which is useful though the overall impact is moderate since callers 
may already guarantee a non-null pointer.
   
   
   </details></details></td><td align=center>Low
   
   </td></tr><tr><td>
   
   
   
   <details><summary>Implement destructor cleanup</summary>
   
   ___
   
   
   **Ensure that the destructor (~SubstraitFileSource) is properly implemented 
to release <br>resources and avoid potential memory leaks.**
   
   [cpp-ch/local-engine/Storages/SubstraitSource/SubstraitFileSource.h 
[36]](https://github.com/apache/incubator-gluten/pull/8847/files#diff-f85e4088d49f425fe748c462e00f63e6d12280bac4420222fe79d7345a09fd36R36-R36)
   
   ```diff
   -~SubstraitFileSource() override;
   +~SubstraitFileSource() override { /* Implement proper resource cleanup */ }
   ```
   <details><summary>Suggestion importance[1-10]: 5</summary>
   
   __
   
   Why: Providing an implementation for the destructor to handle resource 
cleanup could prevent potential memory leaks, offering moderate improvement 
provided that the actual cleanup is necessary.
   
   
   </details></details></td><td align=center>Low
   
   </td></tr><tr><td>
   
   
   
   <details><summary>Use unsigned loop indices</summary>
   
   ___
   
   
   **Change loop indices from int to size_t in the notEquals method to avoid 
<br>signed/unsigned mismatch and potential issues.**
   
   
[cpp-ch/local-engine/Storages/SubstraitSource/iceberg/EqualityDeleteFileReader.cpp
 
[127-133]](https://github.com/apache/incubator-gluten/pull/8847/files#diff-f4625e8060e15b27210f8d23ce44c6e214e6433d2151100419b64483f42205e0R127-R133)
   
   ```diff
   -for (int i = 0; i < numDeletedValues; i++)
   +for (size_t i = 0; i < numDeletedValues; i++)
    {
   -    for (int j = 0; j < numDeleteFields; j++)
   +    for (size_t j = 0; j < numDeleteFields; j++)
        {
            // processing logic
        }
    }
   ```
   <details><summary>Suggestion importance[1-10]: 4</summary>
   
   __
   
   Why: Changing the loop index types from int to size_t prevents potential 
signed/unsigned mismatch; however, the impact is minor and addresses mostly a 
warning rather than a critical bug.
   
   
   </details></details></td><td align=center>Low
   
   </td></tr><tr><td>
   
   
   
   <details><summary>Verify type mapping conversion</summary>
   
   ___
   
   
   **Verify that the updated lambda using toColumnType provides the correct 
type mapping <br>as expected compared to the previous use of toBlockFieldType.**
   
   [cpp-ch/local-engine/tests/utils/gluten_test_util.cpp 
[55]](https://github.com/apache/incubator-gluten/pull/8847/files#diff-480ff6489e0a13d60940a91cec5a997336c321f2a00a06266c833c57029368ffR55-R55)
   
   ```diff
   -[](const auto & name_and_type) { return std::make_pair(name_and_type.name, 
toColumnType(name_and_type)); }
   +[](const auto & name_and_type) { return std::make_pair(name_and_type.name, 
toColumnType(name_and_type)); } // Confirm mapping semantics
   ```
   <details><summary>Suggestion importance[1-10]: 3</summary>
   
   __
   
   Why: The suggestion only adds a clarifying comment to confirm the semantics 
of the type mapping change without modifying functionality, resulting in 
minimal impact.
   
   
   </details></details></td><td align=center>Low
   
   </td></tr><tr><td rowspan=3>General</td>
   <td>
   
   
   
   <details><summary>Use safer copy for vector</summary>
   
   ___
   
   
   **Consider replacing memcpy with std::copy or equivalent vector assignment 
to safely <br>copy values into the container.**
   
   [cpp-ch/local-engine/Common/BlockTypeUtils.h 
[164]](https://github.com/apache/incubator-gluten/pull/8847/files#diff-40f748ebcab82f4b8807ea636b1127a51c3921fda8c0899b175b251da7b49002R164-R164)
   
   ```diff
   -memcpy(vec.data(), data.data(), data.size() * sizeof(T));
   +std::copy(data.begin(), data.end(), vec.begin());
   ```
   <details><summary>Suggestion importance[1-10]: 6</summary>
   
   __
   
   Why: Replacing memcpy with std::copy improves safety and readability when 
copying vector elements, which is a useful enhancement without altering 
functional behavior.
   
   
   </details></details></td><td align=center>Low
   
   </td></tr><tr><td>
   
   
   
   <details><summary>robust numeric conversion</summary>
   
   ___
   
   
   **Enhance error handling in <code>buildFieldFromString</code> to validate 
numeric string inputs and <br>handle conversion failures explicitly.**
   
   [cpp-ch/local-engine/Storages/SubstraitSource/FileReader.cpp 
[114-200]](https://github.com/apache/incubator-gluten/pull/8847/files#diff-c2f70aa7d4ef125510c3dee6e8980dedfc5f7b2c829c10c62549fd37ecbe5377R114-R200)
   
   ```diff
    DB::Field BaseReader::buildFieldFromString(const String & str_value, 
DB::DataTypePtr type)
    {
        …
        DB::ReadBufferFromString read_buffer(str_value);
        auto it = field_builders.find(nested_type->getName());
        if (it == field_builders.end())
        {
            …
            throw DB::Exception(DB::ErrorCodes::UNKNOWN_TYPE, "Unsupported data 
type {}", nested_type->getName());
        }
   -    return it->second(read_buffer, str_value);
   +    try
   +    {
   +        return it->second(read_buffer, str_value);
   +    }
   +    catch (const std::exception & e)
   +    {
   +        throw DB::Exception(DB::ErrorCodes::LOGICAL_ERROR, "Error 
converting '{}' for type {}: {}", str_value, nested_type->getName(), e.what());
   +    }
    }
   ```
   <details><summary>Suggestion importance[1-10]: 4</summary>
   
   __
   
   Why: Wrapping the conversion call in a try-catch improves error diagnostics 
for conversion failures, but the change is not critical and may be seen as an 
optional robustness enhancement.
   
   
   </details></details></td><td align=center>Low
   
   </td></tr><tr><td>
   
   
   
   <details><summary>reset read buffer position</summary>
   
   ___
   
   
   **Ensure the state of the provided <code>read_buffer</code> is reset or 
validated after it is used <br>in <code>openInputParquetFile</code> to prevent 
unexpected read positions.**
   
   [cpp-ch/local-engine/Storages/Parquet/ParquetMeta.cpp 
[41-49]](https://github.com/apache/incubator-gluten/pull/8847/files#diff-c0ba20d2c22e4c0ce16127ae24fe17286344607211d7c361345f56e30001077eR41-R49)
   
   ```diff
    std::unique_ptr<parquet::ParquetFileReader> 
ParquetMetaBuilder::openInputParquetFile(DB::ReadBuffer & read_buffer)
    {
        const DB::FormatSettings format_settings{
            .seekable_read = true,
        };
        std::atomic<int> is_stopped{0};
        auto arrow_file = asArrowFile(read_buffer, format_settings, is_stopped, 
"Parquet", PARQUET_MAGIC_BYTES);
   +    // Reset the read buffer position if required.
   +    read_buffer.seek(0, SEEK_SET);
        return parquet::ParquetFileReader::Open(arrow_file, 
parquet::default_reader_properties(), nullptr);
    }
   ```
   <details><summary>Suggestion importance[1-10]: 3</summary>
   
   __
   
   Why: The suggestion to reset the read buffer position could prevent 
unexpected read positions; however, it may be unnecessary depending on 
downstream usage, making the improvement useful but of lower impact.
   
   
   </details></details></td><td align=center>Low
   
   </td></tr></tr></tbody></table>
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [GLUTEN-8846][CH] [Part1] Support reading Iceberg equality delete files [incubator-gluten]

Reply via email to