rafsun42 commented on issue #1021:
URL: https://github.com/apache/age/issues/1021#issuecomment-1634660959

   ## Approach 1 - Using a trimmed and indexed version of label table for join
   
   ### The query
   Cypher query that extracts label ID (see the QPT just below):
   
   ```sql
   SELECT * FROM cypher('imdb',
   $$
       EXPLAIN MATCH (p:Person{primaryName:'Christian 
Bale'})-[e:IS_IN]->(:Title) RETURN e
   $$) as (a agtype);
   ```
   
   ```txt
                                                     QUERY PLAN                 
                                  
   
---------------------------------------------------------------------------------------------------------------
    Gather  (cost=457881.41..742804.15 rows=105 width=32)
      Workers Planned: 2
      ->  Parallel Hash Join  (cost=456881.41..741793.65 rows=44 width=32)
            Hash Cond: (e.start_id = p.id)
            ->  Parallel Seq Scan on "IS_IN" e  (cost=0.00..284748.30 
rows=43610 width=29)
                  Filter: ((_extract_label_id(end_id))::integer = 3)
            ->  Parallel Hash  (cost=456815.59..456815.59 rows=5266 width=8)
                  ->  Parallel Seq Scan on "Person" p  (cost=0.00..456815.59 
rows=5266 width=8)
                        Filter: (properties @> 
agtype_build_map('primaryName'::text, '"Christian Bale"'::agtype))
   (9 rows)
   ```
   
   My goal is to replace the use of `_extract_label_id` in the Filter node. 
This line is filtering out edges where end node is not a `Title`.
   
   ### Building the solution that does not extract label ID
   My solution is to create a trimmed and indexed table of Title table. Calling 
it `Title_hash`. It has only ID column and indexed by hash method.
   ```sql
   CREATE SCHEMA imdb_dev;
   CREATE TABLE imdb_dev."Title_hash" (id graphid);
   CREATE INDEX imdb_dev_title_hash_id ON imdb_dev."Title_hash" USING hash (id);
   INSERT INTO imdb_dev."Title_hash" SELECT id FROM imdb."Title"; -- load the 
new table
   ```
   
   ### Query on the new solution
   The SQL query that uses the new table. It is equivalent to the above cypher 
query.
   ```sql
   EXPLAIN
   SELECT *
   FROM imdb."IS_IN" e
       JOIN imdb."Person" p
       ON e.start_id = p.id
   WHERE
       (p.properties @> agtype_build_map('primaryName'::text, '"Christian 
Bale"'::agtype))
       AND
       e.end_id IN (SELECT id FROM imdb_dev."Title_hash")
   ;
   ```
   ```txt
                                                        QUERY PLAN              
                                        
   
---------------------------------------------------------------------------------------------------------------------
    Gather  (cost=457881.41..745430.91 rows=20933 width=238)
      Workers Planned: 2
      ->  Nested Loop Semi Join  (cost=456881.41..742337.61 rows=8722 width=238)
            ->  Parallel Hash Join  (cost=456881.41..730763.53 rows=8722 
width=238)
                  Hash Cond: (e.start_id = p.id)
                  ->  Parallel Seq Scan on "IS_IN" e  (cost=0.00..241138.20 
rows=8722020 width=29)
                  ->  Parallel Hash  (cost=456815.59..456815.59 rows=5266 
width=209)
                        ->  Parallel Seq Scan on "Person" p  
(cost=0.00..456815.59 rows=5266 width=209)
                              Filter: (properties @> 
agtype_build_map('primaryName'::text, '"Christian Bale"'::agtype))
            ->  Index Scan using imdb_dev_title_hash_id on "Title_hash"  
(cost=0.00..1.32 rows=1 width=8)
                  Index Cond: (id = e.end_id)
   (11 rows)
   ```
   
   ### Rationale
   Because Title_hash has less data per row and it is indexed, joining it would 
be faster than joining with the original Title table. The cost of these query 
is almost similar.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to