acking-you commented on issue #15177:
URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2824964842

   > Relevant: 
https://clickhouse.com/blog/clickhouse-gets-lazier-and-faster-introducing-lazy-materialization
   
   Thank you for sharing this blog link; it is an excellent learning resource! I hadn't come across it before. I had just been reading the source code of the related optimization [PR](https://github.com/ClickHouse/ClickHouse/pull/55518) and running performance tests on the hits dataset. They introduced a new type called [ColumnLazy](https://github.com/wudidapaopao/ClickHouse/blob/0dc1c3fdba7c1a1fab932e87d6da06c17d49427e/src/Columns/ColumnLazy.h#L14-L31) to implement lazy materialization of columns. However, in my testing, performance starts to degrade once the `limit` value becomes large: with the page cache cleared, degradation begins around `limit 100000`; in fully cached scenarios, it appears as early as `limit 256`. This is why they set [query_plan_max_limit_for_lazy_materialization=10](https://github.com/wudidapaopao/ClickHouse/blob/0dc1c3fdba7c1a1fab932e87d6da06c17d49427e/src/Core/SettingsChangesHistory.cpp#L72) as the default threshold for deciding whether to apply the optimization.
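   
   For context, a minimal sketch of how that threshold gates the optimization on the ClickHouse side (setting names taken from the linked PR and blog post; exact defaults may differ by version):
   ```sql
   -- Lazy materialization is planner-gated: it only kicks in when the query's
   -- LIMIT does not exceed query_plan_max_limit_for_lazy_materialization.
   SET query_plan_optimize_lazy_materialization = 1;
   SET query_plan_max_limit_for_lazy_materialization = 10;
   
   -- Eligible: LIMIT 10 is within the threshold, so non-sort columns are read lazily.
   SELECT * FROM hits ORDER BY "EventTime" LIMIT 10;
   
   -- Not eligible: LIMIT 100 exceeds the threshold, so all columns are read eagerly.
   SELECT * FROM hits ORDER BY "EventTime" LIMIT 100;
   ```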
   
   However, I noticed that the following SQL rewrite approach shows hardly any performance degradation. I even tested it with `limit 400000` with the page cache cleared, and it still maintained excellent performance. Below are the performance test results for a limit of 100,000:
   ```sql
   -- Q1:
   SELECT * FROM hits ORDER BY "EventTime" LIMIT 100000;
   -- 100000 rows in set. Elapsed: 70.314 sec. Processed 103.89 million rows,
   -- 70.28 GB (1.48 million rows/s., 999.50 MB/s.)
   -- Peak memory usage: 27.66 GiB.
   
   -- Q2:
   SELECT * FROM hits WHERE (_part, _part_offset) IN
   (SELECT _part, _part_offset FROM hits ORDER BY "EventTime" LIMIT 100000);
   -- 100000 rows in set. Elapsed: 5.639 sec. Processed 82.02 million rows,
   -- 3.39 GB (14.55 million rows/s., 601.81 MB/s.)
   -- Peak memory usage: 715.57 MiB.
   ```
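   The shape of the rewrite generalizes beyond this query. A sketch of the pattern, where `t`, `sort_key`, and `n` are placeholders and `_part`/`_part_offset` are ClickHouse's virtual columns identifying a row's data part and its offset within that part:
   ```sql
   -- Phase 1 (inner query): sort only the cheap virtual row identifiers by the
   -- sort key and apply the LIMIT, so the wide columns are never read here.
   -- Phase 2 (outer query): fetch full rows for just the n surviving identifiers.
   SELECT *
   FROM t
   WHERE (_part, _part_offset) IN
   (
       SELECT _part, _part_offset
       FROM t
       ORDER BY sort_key
       LIMIT n
   );
   ```
   This presumably explains the numbers above: Q2 processes 3.39 GB instead of Q1's 70.28 GB because only the sort column and the two lightweight virtual columns flow through the sort.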
   The direct SQL rewrite approach above may still have some issues in ClickHouse at the moment. They are currently exploring combining it with projections (a ClickHouse feature akin to materialized views) to build secondary indexes and similar functionality; a sketch of what that might look like follows. Relevant PR: https://github.com/ClickHouse/ClickHouse/pull/78429#issuecomment-2777130157.
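   
   Purely as an illustration (hypothetical projection name; whether a projection may select only `_part_offset` this way is exactly what that PR explores, so treat this as an assumption rather than working syntax):
   ```sql
   -- Hypothetical: a projection that stores only row offsets ordered by
   -- EventTime, acting as a secondary index feeding the
   -- (_part, _part_offset) rewrite above.
   ALTER TABLE hits ADD PROJECTION event_time_idx
   (
       SELECT _part_offset ORDER BY EventTime
   );
   -- Build the projection for parts that already exist.
   ALTER TABLE hits MATERIALIZE PROJECTION event_time_idx;
   ```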
   
   I find this truly remarkable, and I’m also investigating why the direct SQL 
rewrite performs faster than ColumnLazy.

