CodiumAI-Agent commented on PR #9609:
URL: 
https://github.com/apache/incubator-gluten/pull/9609#issuecomment-2875586965

   ## PR Code Suggestions ✨
   
   <!-- 1a39d8a -->
   
   <table><thead><tr><td><strong>Category</strong></td><td align=left><strong>Suggestion</strong></td><td align=center><strong>Impact</strong></td></tr><tbody><tr><td rowspan=2>Possible issue</td>
   <td>
   
   
   
   <details><summary>Loop over actual row count</summary>
   
   ___
   
   
   **<code>part_name_column_data.size()</code> reflects the dictionary cardinality, not the <br>number of rows. Iterate up to <code>num_rows</code> so every row is handled exactly once.**
   
   [cpp-ch/local-engine/Operator/FillingDeltaInternalRowDeletedStep.cpp 
[107-116]](https://github.com/apache/incubator-gluten/pull/9609/files#diff-f1fec2ab5d884a56f8b3bdfbe185d22c8237556fd1203a4dce2e6e738f97790fR107-R116)
   
   ```diff
   -for (size_t i = 0; i < part_name_column_data.size(); ++i)
   +for (size_t i = 0; i < num_rows; ++i)
    {
        std::string part_name = part_name_column_data.getDataAt(i).toString();
        if (dv_map.contains(part_name))
        {
            vec[i] = dv_map.at(part_name)->rb_contains(row_index_column_data.get64(i));
        }
        else
        {
            vec[i] = false;
        }
    }
   ```
   <details><summary>Suggestion importance[1-10]: 8</summary>
   
   __
   
   Why: Iterating to `part_name_column_data.size()` bounds the loop by the dictionary cardinality instead of `num_rows`, so rows are silently skipped whenever the dictionary is smaller than the row count. Using `num_rows` ensures every row is processed.
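
   To make the failure mode concrete, here is a minimal self-contained sketch. `DictColumn` is a toy stand-in for a dictionary-encoded (low-cardinality) column, not the ClickHouse API: distinct values are stored once, with one dictionary index per row, so the value vector's size can be far smaller than the row count.

   ```cpp
   #include <cassert>
   #include <cstddef>
   #include <string>
   #include <vector>

   // Toy dictionary-encoded column: distinct values stored once,
   // plus one dictionary index per row. Illustrative only.
   struct DictColumn
   {
       std::vector<std::string> dictionary; // distinct values only
       std::vector<size_t> indexes;         // one entry per row
       size_t num_rows() const { return indexes.size(); }
       const std::string & value_at(size_t row) const { return dictionary[indexes[row]]; }
   };

   int main()
   {
       // 5 rows, but only 2 distinct part names.
       DictColumn part_name{{"part_a", "part_b"}, {0, 0, 1, 1, 0}};

       std::vector<bool> buggy(part_name.num_rows(), false);
       std::vector<bool> fixed(part_name.num_rows(), false);

       // Buggy bound: dictionary.size() == 2, so rows 2..4 are never visited.
       for (size_t i = 0; i < part_name.dictionary.size(); ++i)
           buggy[i] = true;

       // Correct bound: num_rows() == 5, so every row is processed.
       for (size_t i = 0; i < part_name.num_rows(); ++i)
           fixed[i] = true;

       assert(buggy[4] == false); // row 4 was silently skipped
       assert(fixed[4] == true);  // every row handled
       return 0;
   }
   ```

   The bug is silent because the short loop still compiles and runs; only rows beyond the dictionary cardinality keep their default value.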
   
   
   </details></details></td><td align=center>Medium
   
   </td></tr><tr><td>
   
   
   
   <details><summary>Prevent missing-tag crashes</summary>
   
   ___
   
   
   **Accessing tags with <code>apply</code> throws <code>NoSuchElementException</code> if a key is missing. Use <code>getOrElse</code> or pattern <br>matching to provide defaults, or fail fast with a clear message.**
   
   
[backends-clickhouse/src-delta-23/main/scala/org/apache/spark/sql/execution/datasources/v2/clickhouse/metadata/AddFileTags.scala
 
[171-198]](https://github.com/apache/incubator-gluten/pull/9609/files#diff-b81a1776c3a0828e5e6e38757f6cfa76d7a2bfa99d744f6c3a2c16504ee7e79bR171-R198)
   
   ```diff
    new AddMergeTreeParts(
   -  addFile.tags("database"),
   -  addFile.tags("table"),
   -  addFile.tags("engine"),
   -  addFile.tags("path"),
   -  addFile.tags("targetNode"),
   +  addFile.tags.getOrElse("database", throw new IllegalArgumentException("missing database tag")),
   +  addFile.tags.getOrElse("table", throw new IllegalArgumentException("missing table tag")),
   +  addFile.tags.getOrElse("engine", "MergeTree"),
   +  addFile.tags.getOrElse("path", ""),
   +  addFile.tags.getOrElse("targetNode", ""),
      addFile.path,
   -  addFile.tags("uuid"),
   -  addFile.tags("rows").toLong,
   +  addFile.tags.getOrElse("uuid", ""),
   +  addFile.tags.get("rows").map(_.toLong).getOrElse(0L),
      addFile.size,
   -  addFile.tags("dataCompressedBytes").toLong,
   -  addFile.tags("dataUncompressedBytes").toLong,
   +  addFile.tags.get("dataCompressedBytes").map(_.toLong).getOrElse(0L),
   +  addFile.tags.get("dataUncompressedBytes").map(_.toLong).getOrElse(0L),
      addFile.modificationTime,
   -  addFile.tags("partitionId"),
   -  addFile.tags("minBlockNumber").toLong,
   -  addFile.tags("maxBlockNumber").toLong,
   -  addFile.tags("level").toInt,
   -  addFile.tags("dataVersion").toLong,
   -  addFile.tags("bucketNum"),
   -  addFile.tags("dirName"),
   +  addFile.tags.getOrElse("partitionId", ""),
   +  addFile.tags.get("minBlockNumber").map(_.toLong).getOrElse(0L),
   +  addFile.tags.get("maxBlockNumber").map(_.toLong).getOrElse(0L),
   +  addFile.tags.get("level").map(_.toInt).getOrElse(0),
   +  addFile.tags.get("dataVersion").map(_.toLong).getOrElse(0L),
   +  addFile.tags.getOrElse("bucketNum", ""),
   +  addFile.tags.getOrElse("dirName", ""),
      addFile.dataChange,
   -  addFile.tags("partition"),
   -  addFile.tags("defaultCompressionCodec"),
   +  addFile.tags.getOrElse("partition", ""),
   +  addFile.tags.getOrElse("defaultCompressionCodec", "LZ4"),
      addFile.stats,
      addFile.partitionValues,
   -  marks = addFile.tags("marks").toLong,
   +  marks = addFile.tags.get("marks").map(_.toLong).getOrElse(-1L),
      tags = addFile.tags,
      deletionVector = addFile.deletionVector
    )
   ```
   <details><summary>Suggestion importance[1-10]: 4</summary>
   
   __
   
   Why: Using safe lookups on `tags` avoids runtime `NoSuchElementException`, but defaulting instead of failing fast may mask configuration errors. It adds moderate safety while potentially hiding missing-tag bugs.
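
   The same trade-off can be sketched language-neutrally; the snippet below uses C++ (the PR's other language). `require_tag` and `tag_or` are hypothetical helpers, not project code: the first fails fast with a clear message (like `getOrElse(..., throw ...)`), the second substitutes a default (like `getOrElse(..., fallback)`).

   ```cpp
   #include <cassert>
   #include <map>
   #include <stdexcept>
   #include <string>

   // Fail fast: a missing required tag raises a descriptive error.
   std::string require_tag(const std::map<std::string, std::string> & tags,
                           const std::string & key)
   {
       auto it = tags.find(key);
       if (it == tags.end())
           throw std::invalid_argument("missing " + key + " tag");
       return it->second;
   }

   // Defaulting: a missing optional tag is replaced by a fallback value.
   std::string tag_or(const std::map<std::string, std::string> & tags,
                      const std::string & key, const std::string & fallback)
   {
       auto it = tags.find(key);
       return it != tags.end() ? it->second : fallback;
   }

   int main()
   {
       std::map<std::string, std::string> tags{{"database", "db1"}};

       assert(require_tag(tags, "database") == "db1");             // present: returned as-is
       assert(tag_or(tags, "engine", "MergeTree") == "MergeTree"); // absent: defaulted

       bool threw = false;
       try { require_tag(tags, "table"); } // absent required key: fails fast
       catch (const std::invalid_argument &) { threw = true; }
       assert(threw);
       return 0;
   }
   ```

   Which keys are "required" versus "defaultable" is a judgment call per tag; defaulting a key that must exist is exactly the masking risk noted above.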
   
   
   </details></details></td><td align=center>Low
   
   </td></tr></tbody></table>
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

