liaoxin01 opened a new pull request, #60920:
URL: https://github.com/apache/doris/pull/60920

   ## Proposed changes
   
   Optimize stream load CSV read performance for nullable string columns by 
eliminating per-row overhead from the SerDe abstraction layer.
   
   ### Changes
   
   1. **Cache nullable string column pointers per-batch**: Compute the 
`assert_cast` results (ColumnStr and NullMap pointers) once per batch instead 
of once per row per column, and store them in `NullableStringColumnCache`.
   
   2. **Inline nullable string write path**: Bypass 
`_deserialize_nullable_string` and `StringSerDe::deserialize_one_cell_from_csv` 
in the hot loop and perform the null checks, escape handling, and 
`insert_data`/`push_back` directly.
   
   3. **Pre-reserve column capacity**: Reserve `offsets`, `chars`, and 
`null_map` capacity at batch start to reduce PODArray realloc overhead during 
the row loop.
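   
   The three changes above can be sketched roughly as follows. This is a 
simplified illustration, not the code in this PR: `IColumn`, `StrColumn`, 
`NullableColumn`, `build_cache`, and `write_field` are hypothetical stand-ins 
for the real Doris types and functions, `std::vector` stands in for PODArray, 
`dynamic_cast` stands in for `assert_cast`, the `avg_len` sizing hint is an 
assumption, and escape handling is omitted. Only the `NullableStringColumnCache` 
name comes from the description above.
   
```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string_view>
#include <vector>

// Hypothetical stand-ins for the column classes (not the real Doris types).
struct IColumn {
    virtual ~IColumn() = default;
};

struct StrColumn : IColumn {
    std::vector<char> chars;            // concatenated string bytes
    std::vector<std::uint32_t> offsets; // end offset of each row in `chars`

    void insert_data(const char* p, std::size_t n) {
        chars.insert(chars.end(), p, p + n);
        offsets.push_back(static_cast<std::uint32_t>(chars.size()));
    }
    void insert_default() {
        offsets.push_back(static_cast<std::uint32_t>(chars.size()));
    }
};

struct NullableColumn : IColumn {
    StrColumn nested;
    std::vector<std::uint8_t> null_map; // 1 = NULL, 0 = not NULL
};

// Per-batch cache: the casts are resolved once and reused for every row.
struct NullableStringColumnCache {
    StrColumn* str = nullptr;
    std::vector<std::uint8_t>* null_map = nullptr;
};

// Changes 1 and 3: cast each column once per batch and reserve capacity up front.
// `avg_len` is an assumed sizing hint for the character buffer.
std::vector<NullableStringColumnCache> build_cache(
        const std::vector<std::unique_ptr<IColumn>>& cols,
        std::size_t batch_rows, std::size_t avg_len) {
    std::vector<NullableStringColumnCache> cache;
    cache.reserve(cols.size());
    for (const auto& col : cols) {
        auto* nullable = dynamic_cast<NullableColumn*>(col.get());
        assert(nullable != nullptr);
        nullable->nested.offsets.reserve(nullable->nested.offsets.size() + batch_rows);
        nullable->nested.chars.reserve(nullable->nested.chars.size() + batch_rows * avg_len);
        nullable->null_map.reserve(nullable->null_map.size() + batch_rows);
        cache.push_back({&nullable->nested, &nullable->null_map});
    }
    return cache;
}

// Change 2: the hot-loop write goes straight to the cached pointers.
// Assumes the CSV null literal is "\N"; escape handling is left out.
inline void write_field(NullableStringColumnCache& c, std::string_view field) {
    if (field == "\\N") {
        c.str->insert_default();
        c.null_map->push_back(1);
    } else {
        c.str->insert_data(field.data(), field.size());
        c.null_map->push_back(0);
    }
}
```
   
   In this sketch the row loop would call `build_cache` once at batch start and 
then `write_field(cache[col_idx], field)` for each cell, so no cast or virtual 
dispatch happens per row.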
   
   ### Performance
   
   Tested with ClickBench dataset stream load:
   - Import time reduced from 571s to 476s (**16.6% less time**)
   - Compared to the 2.1.7 baseline (650s), import time is now **26.8% lower**
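   
   For reference, the percentages are reductions in elapsed time relative to 
the earlier run. A minimal sketch of that arithmetic (the helper name 
`time_reduction_pct` is hypothetical, used only for illustration):
   
```cpp
#include <cassert>
#include <cmath>

// Percentage reduction in elapsed time when going from `before_s` to `after_s`.
inline double time_reduction_pct(double before_s, double after_s) {
    assert(before_s > 0.0);
    return (before_s - after_s) / before_s * 100.0;
}
```
   
   With the numbers above, `time_reduction_pct(571, 476)` is about 16.6 and 
`time_reduction_pct(650, 476)` is about 26.8. Expressed as throughput ratios, 
these are roughly 1.20x (571/476) and 1.37x (650/476).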
   
   ### Flame graph analysis
   
   Before the optimization, the `_deserialize_nullable_string` path dominated 
the profile, with +96s of self-time attributed to:
   - Per-row `assert_cast<ColumnNullable&>` (+65s)
   - `StringSerDe::deserialize_one_cell_from_csv` intermediate layer (+54s)
   - Repeated PODArray reserve/realloc during column growth
   
   After the optimization, these costs are either eliminated or amortized to once per batch.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

