2010YOUY01 opened a new pull request, #14142:
URL: https://github.com/apache/datafusion/pull/14142

   ## Which issue does this PR close?
   
   Part of https://github.com/apache/datafusion/issues/13431
   
   ## Rationale for this change
   
   DataFusion supports memory-limited queries by tracking its internal memory consumption and capping total usage at the configured limit.
   This feature needs to be verified externally: the memory usage observed by a profiler should be consistent with the specified limit.
   
   ### Idea
   Here is an example: compile and run datafusion-cli with a memory limit, and profile the physical memory consumption:
   ```
   /usr/bin/time -l cargo run --release -- --mem-pool-type fair -m 400M -c 'select * from generate_series(1,100000000) as t1(c1) order by c1'
   ```
   In theory, the source relation in this query should consume about 800M of memory (100M rows of 8-byte int64 values), which can be checked by running the same query without `order by`.
   An ideal sort implementation uses O(N) space, so without a memory limit the query should use roughly 800M plus a small amount for other internal data structures. With a 400M memory limit, the query should run in around 400M of physical memory.
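
The arithmetic behind these figures can be made explicit. The following is a minimal sketch and not code from this PR: `column_bytes` and `within_limit` are hypothetical helpers, and the tolerance parameter is an assumption added for illustration only.

```rust
/// Bytes needed to hold `rows` values of type `T` in a dense column.
fn column_bytes<T>(rows: u64) -> u64 {
    rows * std::mem::size_of::<T>() as u64
}

/// Hypothetical check: is the observed peak RSS within `tolerance`
/// (a fraction, e.g. 0.1 for 10%) above the configured limit?
fn within_limit(peak_rss: u64, limit: u64, tolerance: f64) -> bool {
    peak_rss as f64 <= limit as f64 * (1.0 + tolerance)
}
```

For the query above, `column_bytes::<i64>(100_000_000)` evaluates to 800,000,000 bytes, which matches the 800M estimate for the source relation.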
   
   This test module implements this kind of validation. (It also revealed that the memory consumption of sorting is not ideal: it consumes 2X-3X memory or worse. I plan to investigate this later.)
   
   ### Implementation
   Implementing such a test is a bit tricky. The utility functions for measuring RSS can only report the current process's RSS, so each test case has to run in a separate process. However, Rust's test harness runs the tests of a module in the same process, on different threads.
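
To illustrate why the measurement is tied to the current process, here is one way RSS can be read on Linux using only the standard library. This is a stand-in sketch, not the utility used in this PR: `parse_vm_rss_kb` and `current_rss_bytes` are hypothetical names, and the approach assumes a Linux `/proc` filesystem (it returns `None` elsewhere).

```rust
use std::fs;

/// Extract the `VmRSS` value (reported in kB) from the text of
/// `/proc/<pid>/status`. Returns `None` if the field is missing.
fn parse_vm_rss_kb(status: &str) -> Option<u64> {
    status
        .lines()
        .find_map(|line| line.strip_prefix("VmRSS:"))
        .and_then(|rest| rest.split_whitespace().next())
        .and_then(|kb| kb.parse().ok())
}

/// Resident set size of the *current* process in bytes.
/// Linux only: reads `/proc/self/status`, so it cannot see other processes
/// and it counts memory used by every thread in this process.
fn current_rss_bytes() -> Option<u64> {
    let status = fs::read_to_string("/proc/self/status").ok()?;
    parse_vm_rss_kb(&status).map(|kb| kb * 1024)
}
```

Because the value covers the whole process, two tests running on different threads of the same process would each see the other's allocations.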
   
   This PR uses the following workaround.
   ```
   #[test]
   fn sort_mem_test_1() {
       // Return immediately if the environment variable
       // `DATAFUSION_TEST_MEM_LIMIT_VALIDATION` is not set
       // ...
   }
   
   #[test]
   fn test_runner() {
       // Set the env var and execute a command like `cargo test sort_mem_test_1`
       // for each test, so that every test runs in its own process
       // ...
   }
   ```
   If one of these test cases is run directly with `cargo test`, it returns without doing anything. The runner serves as the actual entry point for all related tests.
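
The workaround above could be fleshed out roughly as follows. This is a sketch under assumptions rather than the code in this PR: only the environment variable name and the `cargo test sort_mem_test_1` command come from the description, while `validation_enabled`, `child_test_command`, and `run_isolated` are hypothetical helpers.

```rust
use std::env;
use std::process::Command;

/// Environment variable name taken from the PR description.
const ENV_VAR: &str = "DATAFUSION_TEST_MEM_LIMIT_VALIDATION";

/// True only when the runner has opted this process in.
fn validation_enabled() -> bool {
    env::var_os(ENV_VAR).is_some()
}

/// Build (but do not run) the child `cargo test <name>` invocation.
/// The name filter does not match `test_runner`, so the child process
/// does not start the runner again.
fn child_test_command(test_name: &str) -> Command {
    let mut cmd = Command::new("cargo");
    cmd.arg("test").arg(test_name).env(ENV_VAR, "1");
    cmd
}

#[test]
fn sort_mem_test_1() {
    if !validation_enabled() {
        return; // invoked directly by `cargo test`: do nothing
    }
    // ... run the memory-limited query and check RSS here ...
}

/// Body of the runner: a `#[test] fn test_runner()` would call this once
/// per guarded test, so each one gets a fresh process and its own RSS.
fn run_isolated(test_name: &str) {
    let status = child_test_command(test_name)
        .status()
        .expect("failed to spawn cargo");
    assert!(status.success(), "{test_name} failed in its child process");
}
```

One consequence of this design is that a guarded test reports success when run on its own, so a failure only surfaces through the runner.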
   
   ## What changes are included in this PR?
   
   1. Added test utilities for memory limit validation (similar tests for external aggregate and join can be implemented later)
   2. Added tests for simple sort queries
   
   ## Are these changes tested?
   
   
   ## Are there any user-facing changes?
   
   

