Re: [I] Major Performance Difference: SQL vs. Cypher for Aggregation/Ordering [age]

via GitHub Tue, 22 Jul 2025 11:05:00 -0700


serdarmicrosoft commented on issue #2194:
URL: https://github.com/apache/age/issues/2194#issuecomment-3104155140


   Sorry @jrgemignani I missed your question.
   
   Here is the output of the queries:
   
   EXPLAIN select count(*), u.id from "User" u, "HAS_INTERACTION" h, "Book" b 
where u.id = h.start_id and b.id = h.end_id GROUP BY u.id ORDER BY 1 DESC LIMIT 
10;
   
   
   `QUERY PLAN                                             
   
---------------------------------------------------------------------------------------------------------------------------------------------------
    Limit  (cost=728449.72..728449.74 rows=10 width=16)
      ->  Sort  (cost=728449.72..730061.43 rows=644686 width=16)
            Sort Key: (count(*)) DESC
            ->  Finalize HashAggregate  (cost=695479.90..714518.28 rows=644686 
width=16)
                  Group Key: u.id
                  Planned Partitions: 16
                  ->  Gather  (cost=470126.84..638062.55 rows=1289372 width=16)
                        Workers Planned: 2
                        ->  Partial HashAggregate  (cost=469126.84..508125.35 
rows=644686 width=16)
                              Group Key: u.id
                              Planned Partitions: 8
                              ->  Parallel Hash Join  (cost=15770.68..299858.22 
rows=4166612 width=8)
                                    Hash Cond: (h.start_id = u.id)
                                    ->  Parallel Hash Join  
(cost=1875.90..219375.78 rows=4166612 width=8)
                                          Hash Cond: (h.end_id = b.id)
                                          ->  Parallel Seq Scan on 
"HAS_INTERACTION" h  (cost=0.00..184514.12 rows=4166612 width=16)
                                          ->  Parallel Hash  
(cost=1389.45..1389.45 rows=38916 width=8)
                                                ->  Parallel Index Only Scan 
using "Book_id_idx" on "Book" b  (cost=0.29..1389.45 rows=38916 width=8)
                                    ->  Parallel Hash  (cost=9487.05..9487.05 
rows=268619 width=8)
                                          ->  Parallel Index Only Scan using 
"User_id_idx" on "User" u  (cost=0.42..9487.05 rows=268619 width=8)
   (20 rows)`
   
   SELECT * FROM cypher('goodreads_graph', $$ EXPLAIN MATCH 
(u:User)-[:HAS_INTERACTION]->() RETURN u.user_id, count(*) ORDER BY count(*) 
DESC LIMIT 10 $$) AS (user_id agtype, interaction_count agtype);
   
   `QUERY PLAN                           
   
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    Limit  (cost=34254481.74..34254481.76 rows=10 width=64)
      ->  Sort  (cost=34254481.74..34256093.45 rows=644686 width=64)
            Sort Key: ((count(*))::agtype) DESC
            ->  GroupAggregate  (cost=1109643.97..34240550.31 rows=644686 
width=64)
                  Group Key: agtype_access_operator(VARIADIC 
ARRAY[_agtype_build_vertex(u.id, _label_name('1332417'::oid, u.id), 
u.properties), '"user_id"'::agtype])
                  ->  Nested Loop  (cost=1109643.97..21754040.28 
rows=2494723262 width=32)
                        ->  Gather Merge  (cost=1109643.96..2274293.50 
rows=9999869 width=118)
                              Workers Planned: 2
                              ->  Sort  (cost=1108643.94..1119060.47 
rows=4166612 width=118)
                                    Sort Key: (agtype_access_operator(VARIADIC 
ARRAY[_agtype_build_vertex(u.id, _label_name('1332417'::oid, u.id), 
u.properties), '"user_id"'::agtype]))
                                    ->  Parallel Hash Join  
(cost=21655.93..284305.72 rows=4166612 width=118)
                                          Hash Cond: 
(_age_default_alias_0.start_id = u.id)
                                          ->  Parallel Seq Scan on 
"HAS_INTERACTION" _age_default_alias_0  (cost=0.00..184514.12 rows=4166612 
width=16)
                                          ->  Parallel Hash  
(cost=13838.19..13838.19 rows=268619 width=110)
                                                ->  Parallel Seq Scan on "User" 
u  (cost=0.00..13838.19 rows=268619 width=110)
                        ->  Memoize  (cost=0.01..1.54 rows=5 width=8)
                              Cache Key: _age_default_alias_0.end_id
                              Cache Mode: logical
                              ->  Append  (cost=0.00..1.53 rows=5 width=8)
                                    ->  Seq Scan on _ag_label_vertex 
_age_default_alias_1_1  (cost=0.00..0.00 rows=1 width=8)
                                          Filter: (id = 
_age_default_alias_0.end_id)
                                    ->  Index Only Scan using "User_id_idx" on 
"User" _age_default_alias_1_2  (cost=0.42..0.44 rows=1 width=8)
                                          Index Cond: (id = 
_age_default_alias_0.end_id)
                                    ->  Index Only Scan using "Book_id_idx" on 
"Book" _age_default_alias_1_3  (cost=0.29..0.31 rows=1 width=8)
                                          Index Cond: (id = 
_age_default_alias_0.end_id)
                                    ->  Index Only Scan using "Author_id_idx" 
on "Author" _age_default_alias_1_4  (cost=0.29..0.31 rows=1 width=8)
                                          Index Cond: (id = 
_age_default_alias_0.end_id)
                                    ->  Index Only Scan using "Review_id_idx" 
on "Review" _age_default_alias_1_5  (cost=0.43..0.45 rows=1 width=8)
                                          Index Cond: (id = 
_age_default_alias_0.end_id)
   (29 rows)`
   
   Here is the indexes I used:
   ```sql
   -- Index on 'id' for all vertex labels
   CREATE INDEX IF NOT EXISTS idx_user_id ON goodreads_graph."User" USING BTREE 
(id);
   CREATE INDEX IF NOT EXISTS idx_review_id ON goodreads_graph."Review" USING 
BTREE (id);
   CREATE INDEX IF NOT EXISTS idx_book_id ON goodreads_graph."Book" USING BTREE 
(id);
   CREATE INDEX IF NOT EXISTS idx_author_id ON goodreads_graph."Author" USING 
BTREE (id);
   
   CREATE INDEX ON goodreads_graph."User" USING GIN (properties);
   CREATE INDEX ON goodreads_graph."Review" USING GIN (properties);
   CREATE INDEX ON goodreads_graph."Book" USING GIN (properties);
   CREATE INDEX ON goodreads_graph."Author" USING GIN (properties);
   
   -- Index for WHERE clause: u.user_id = '...'
   CREATE INDEX IF NOT EXISTS idx_user_userid_op ON goodreads_graph."User" 
USING BTREE (agtype_access_operator(VARIADIC ARRAY[properties, 
'"user_id"'::agtype]));
   
   -- Index for WHERE clause: b.book_id = '...'
   CREATE INDEX IF NOT EXISTS idx_book_bookid_op ON goodreads_graph."Book" 
USING BTREE (agtype_access_operator(VARIADIC ARRAY[properties, 
'"book_id"'::agtype]));
   
   -- WRITTEN_BY edge: Review -[:WRITTEN_BY]-> User
   CREATE INDEX IF NOT EXISTS idx_written_by_start ON 
goodreads_graph."WRITTEN_BY" USING BTREE (start_id);
   CREATE INDEX IF NOT EXISTS idx_written_by_end ON 
goodreads_graph."WRITTEN_BY" USING BTREE (end_id);
   
   -- HAS_REVIEW edge: Book -[:HAS_REVIEW]-> Review
   CREATE INDEX IF NOT EXISTS idx_has_review_start ON 
goodreads_graph."HAS_REVIEW" USING BTREE (start_id);
   CREATE INDEX IF NOT EXISTS idx_has_review_end ON 
goodreads_graph."HAS_REVIEW" USING BTREE (end_id);
   
   -- Index for WHERE clause: r.review_id = '...'
   CREATE INDEX IF NOT EXISTS idx_review_reviewid_op ON 
goodreads_graph."Review" USING BTREE (agtype_access_operator(VARIADIC 
ARRAY[properties, '"review_id"'::agtype]));
   
   -- AUTHORED_BY edge: Book -[:AUTHORED_BY]-> Author
   CREATE INDEX IF NOT EXISTS idx_authored_by_start ON 
goodreads_graph."AUTHORED_BY" USING BTREE (start_id);
   CREATE INDEX IF NOT EXISTS idx_authored_by_end ON 
goodreads_graph."AUTHORED_BY" USING BTREE (end_id);
   
   -- HAS_INTERACTION edge: User -[:HAS_INTERACTION]-> Book
   CREATE INDEX IF NOT EXISTS idx_has_interaction_start ON 
goodreads_graph."HAS_INTERACTION" USING BTREE (start_id);
   CREATE INDEX IF NOT EXISTS idx_has_interaction_end ON 
goodreads_graph."HAS_INTERACTION" USING BTREE (end_id);
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@age.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Re: [I] Major Performance Difference: SQL vs. Cypher for Aggregation/Ordering [age]

Reply via email to