acking-you commented on issue #15177: URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2824964842
> Relevant: https://clickhouse.com/blog/clickhouse-gets-lazier-and-faster-introducing-lazy-materialization

Thank you so much for sharing this blog link; it is an excellent learning resource! I hadn't come across it before. I was just looking at the source code of the related optimization [PR](https://github.com/ClickHouse/ClickHouse/pull/55518) and running performance tests on the hits dataset.

They introduced a new type called [`ColumnLazy`](https://github.com/wudidapaopao/ClickHouse/blob/0dc1c3fdba7c1a1fab932e87d6da06c17d49427e/src/Columns/ColumnLazy.h#L14-L31) to implement lazy materialization of columns. In my testing, however, performance starts to degrade once the `limit` value becomes large: with the page cache cleared, degradation begins at `limit 100000`; in fully cached scenarios it appears as early as `limit 256`. This is why they set [query_plan_max_limit_for_lazy_materialization=10](https://github.com/wudidapaopao/ClickHouse/blob/0dc1c3fdba7c1a1fab932e87d6da06c17d49427e/src/Core/SettingsChangesHistory.cpp#L72) as the default threshold for deciding whether to apply the optimization.

However, I noticed that the following SQL rewrite shows almost no performance degradation. I even tested it with `limit 400000` while clearing the page cache, and it still performed well. Below are the performance test results for a limit of 100,000:

```sql
-- Q1:
SELECT * FROM hits ORDER BY "EventTime" LIMIT 100000;

100000 rows in set. Elapsed: 70.314 sec. Processed 103.89 million rows, 70.28 GB (1.48 million rows/s., 999.50 MB/s.)
Peak memory usage: 27.66 GiB.

-- Q2:
SELECT * FROM hits WHERE (_part, _part_offset) IN (SELECT _part, _part_offset FROM hits ORDER BY "EventTime" LIMIT 100000);

100000 rows in set. Elapsed: 5.639 sec. Processed 82.02 million rows, 3.39 GB (14.55 million rows/s., 601.81 MB/s.)
Peak memory usage: 715.57 MiB.
```

The direct SQL rewrite above may still have some issues in ClickHouse at the moment. They are currently exploring using it together with projections (a ClickHouse feature akin to materialized views) to build secondary indexes and similar functionality. Relevant PR: https://github.com/ClickHouse/ClickHouse/pull/78429#issuecomment-2777130157. I find this truly remarkable, and I am also investigating why the direct SQL rewrite performs faster than `ColumnLazy`.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
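For readers following along, the intuition behind the Q2 rewrite can be pictured as two-phase row selection: first scan only the lightweight sort-key column to find the top-N row ids, then materialize the wide columns for just those rows. Below is a minimal Python sketch of that idea; it is not ClickHouse or DataFusion code, and all names and data are illustrative:

```python
import heapq

def lazy_top_n(sort_key, wide_columns, n):
    """Two-phase top-N over columnar data.

    Phase 1 touches only the cheap sort-key column; phase 2
    materializes the expensive columns for the selected rows only.
    """
    # Phase 1: find the row ids of the n smallest sort-key values.
    top_ids = heapq.nsmallest(n, range(len(sort_key)), key=lambda i: sort_key[i])
    # Phase 2: fetch the wide columns only for those row ids.
    return [
        {"key": sort_key[i], **{name: col[i] for name, col in wide_columns.items()}}
        for i in top_ids
    ]

# Toy data: a small sort key and one "heavy" payload column.
event_time = [5, 1, 4, 2, 3]
payload = {"url": ["e", "a", "d", "b", "c"]}
rows = lazy_top_n(event_time, payload, 2)
# → [{'key': 1, 'url': 'a'}, {'key': 2, 'url': 'b'}]
```

In the real queries above, `(_part, _part_offset)` plays the role of the row ids, which is why Q2 reads far less data (3.39 GB vs. 70.28 GB) for the same 100,000 result rows.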