Hi Henson,
> Sorry, I forgot to attach the patch file in my previous email.
Thanks for the patch! It applied cleanly and the regression test
passed here.
>> On a side note, I feel that the automata characteristics of RPR --
>> state transitions, context management, absorbability, etc. -- tend to
>> require a much larger volume of test cases than typical RDBMS/SQL
>> features. This seems unavoidable given the combinatorial nature of
>> NFA execution.
I agree. A complex feature requires complex tests. We cannot avoid it.
BTW, I have tested RPR with a large partition/reduced frame (up to 100k
rows). The following query took over 3 minutes on my laptop.
EXPLAIN (ANALYZE)
SELECT aid, count(*) OVER w
FROM generate_series(1,100000) AS g(aid) WHERE aid <= 100000
WINDOW w AS (
ORDER BY aid
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
AFTER MATCH SKIP PAST LAST ROW
INITIAL
PATTERN (START UP+)
DEFINE START AS TRUE, UP AS aid > PREV(aid)
);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
WindowAgg (cost=4337.40..4504.08 rows=33333 width=14) (actual
time=224171.589..224320.055 rows=100000.00 loops=1)
Window: w AS (ORDER BY aid ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
Pattern: start up+
Storage: Disk Maximum Storage: 4096kB
NFA States: 200000 peak, 5000050001 total, 0 merged
NFA Contexts: 100001 peak, 100001 total, 1 pruned
NFA: 1 matched (len 100000/100000/100000.0), 0 mismatched
NFA: 0 absorbed, 99998 skipped (len 2/99999/50000.5)
Buffers: shared hit=3, temp read=400968 written=391
-> Sort (cost=3754.09..3837.42 rows=33333 width=4) (actual
time=14.938..38.911 rows=100000.00 loops=1)
Sort Key: aid
Sort Method: quicksort Memory: 3073kB
Buffers: shared hit=3, temp read=171 written=171
-> Function Scan on generate_series g (cost=0.00..1250.00 rows=33333
width=4) (actual time=5.799..11.672 rows=100000.00 loops=1)
Filter: (aid <= 100000)
Buffers: temp read=171 written=171
Planning:
Buffers: shared hit=6 read=7
Planning Time: 0.185 ms
Execution Time: 224324.532 ms
It ran without eating all the memory on my box (according to the top
command, it used around 200MB). Great! However, it took nearly 4
minutes to complete the query. I think this is because the
partition/reduced frame contains 100k rows, and RPR needs to keep 100k
contexts plus 200k states at the peak. I also ran "perf top" and took
the following profile.
19.87% postgres [.] nfa_match
18.92% postgres [.] nfa_advance_state
13.60% postgres [.] nfa_cleanup_dead_contexts
11.25% postgres [.] row_is_in_reduced_frame
11.06% postgres [.] nfa_state_clone
5.52% postgres [.] nfa_route_to_elem
4.39% postgres [.] nfa_add_state_unique.isra.0
4.32% postgres [.] nfa_state_alloc
2.36% libc.so.6 [.] __memset_avx2_unaligned_erms
0.63% postgres [.] memset@plt
The reason nfa_match is called so many times is that it's in a loop
which iterates over all the contexts:
for (ctx = winstate->nfaContext; ctx != NULL; ctx = ctx->next)
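Here is a quick back-of-the-envelope check (a toy model of my own, not
the actual nfa code): if one new context is opened per row and every
live context has to be advanced once per row, the total number of
state advances for n rows is roughly n(n+1)/2, which for n = 100000
lines up with the "5000050001 total" NFA states reported in the plan
above.

/*
 * Toy model only -- an assumption about the scaling, not the actual
 * RPR/nfa implementation: one new context per row, none retired until
 * the end of the frame, one state advance per live context per row.
 */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	const int64_t nrows = 100000;
	int64_t advances = 0;

	for (int64_t row = 1; row <= nrows; row++)
		advances += row;	/* 'row' live contexts at this point */

	/* prints 5000050000 = n(n+1)/2, close to the plan's 5000050001 */
	printf("%lld\n", (long long) advances);
	return 0;
}

So whatever the exact mechanism, the work seems to grow quadratically
with the reduced frame size.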
I'm not trying to say we should enhance the nfa modules for the case
where there is a large reduced frame, because I don't think this is a
realistic use case and I doubt it's worth fighting against an
unrealistic scenario.
Note that even if the partition size is large, if we change the PATTERN to:
PATTERN (START UP{,3})
which generates reduced frames of only 4 rows, we get very good
performance:
EXPLAIN (ANALYZE)
SELECT aid, count(*) OVER w
FROM generate_series(1,100000) AS g(aid) WHERE aid <= 100000
WINDOW w AS (
ORDER BY aid
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
AFTER MATCH SKIP PAST LAST ROW
INITIAL
PATTERN (START UP{,3})
DEFINE START AS TRUE, UP AS aid > PREV(aid)
);
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
WindowAgg (cost=4337.40..4504.08 rows=33333 width=14) (actual
time=36.586..91.988 rows=100000.00 loops=1)
Window: w AS (ORDER BY aid ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
Pattern: start up{0,3}
Storage: Memory Maximum Storage: 17kB
NFA States: 8 peak, 325001 total, 0 merged
NFA Contexts: 5 peak, 100001 total, 0 pruned
NFA: 25000 matched (len 4/4/4.0), 0 mismatched
NFA: 0 absorbed, 75000 skipped (len 1/3/2.0)
Buffers: temp read=171 written=171
-> Sort (cost=3754.09..3837.42 rows=33333 width=4) (actual
time=36.561..40.933 rows=100000.00 loops=1)
Sort Key: aid
Sort Method: quicksort Memory: 3073kB
Buffers: temp read=171 written=171
-> Function Scan on generate_series g (cost=0.00..1250.00 rows=33333
width=4) (actual time=14.297..28.692 rows=100000.00 loops=1)
Filter: (aid <= 100000)
Buffers: temp read=171 written=171
Planning Time: 0.087 ms
Execution Time: 94.810 ms
(18 rows)
I think this is a more realistic use case than the former. If I'm
correct, we don't need to work on the current nfa code to squeeze
better performance out of the unrealistic 100k-row reduced frame case.
I may be wrong, and I would love to hear from others, especially
Henson, Vik, and Jacob.
Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese: http://www.sraoss.co.jp