[GitHub] [age] panosfol commented on issue #995: Research functions that extracts label ID

via GitHub Sun, 16 Jul 2023 09:38:54 -0700


panosfol commented on issue #995:
URL: https://github.com/apache/age/issues/995#issuecomment-1637134723


   > **3. filter_vertices_on_label_id**
   > 
   > This function is used internally by the following query:
   > 
   > ```sql
   > MATCH (:Person)-[e:IN]->(:Title{primaryTitle:'Inception'}) RETURN e 
   > ```
   > 
   > The QPT is:
   > 
   > ```
   >  Gather  (cost=569028.74..853951.48 rows=105 width=32)
   >    Workers Planned: 2
   >    ->  Parallel Hash Join  (cost=568028.74..852940.98 rows=44 width=32)
   >          Hash Cond: (e.end_id = _age_default_alias_0.id)
   >          ->  Parallel Seq Scan on "IN" e  (cost=0.00..284748.30 rows=43610 width=29)
   >                Filter: ((_extract_label_id(start_id))::integer = 4)
   >          ->  Parallel Hash  (cost=567965.30..567965.30 rows=5075 width=8)
   >                ->  Parallel Seq Scan on "Title" _age_default_alias_0  (cost=0.00..567965.30 rows=5075 width=8)
   >                      Filter: (properties @> agtype_build_map('primaryTitle'::text, '"Inception"'::agtype))
   > ```
   > 
   > The function adds a filter condition to the query plan. In the above QPT, the 
line `Filter: ((_extract_label_id(start_id))::integer = 4)` is built by it.
   > 
   > Because `Person` is filtered only by label (i.e. `(:Person)`) and no 
property filter or variable is used, the `Person` table is not joined with 
the `IN` table internally. `_extract_label_id` can tell which label 
`start_id` belongs to, which eliminates the join.
   > 
   > In order to drop the concept of `graphid`, we will need to stop using the 
function `_extract_label_id`. One alternative is to actually do the join, but 
not with the `Person` table itself. A duplicate of the `Person` table can be 
used instead. It can be trimmed to only the ID column and indexed 
strategically to reduce the join time.
   > 
   > So the `Person` table will have a duplicate, `Person_hash`, containing only 
the ID column. Dropping the properties column will make the join faster, since 
each disk read can load more rows. The ID column will then be indexed with the 
hash method so that a hash join can be performed. With less data to load from 
disk and a hash index, the join can be made faster than a regular join 
(which is performed from Title).
   > 
   > @panosfol @Zainab-Saad @WendelLana @CapnSpek What do you guys think of 
this solution? Can the other two functions that you researched adopt this 
solution?
   
   From my understanding, this solution would require us to know the 
`label_name` in order to find the correct duplicate table. Therefore I've 
researched which of the above 3 functions (`entity_exists()`, 
`get_label_name()`, `filter_vertices_on_label_id()`) have access to the 
`label_name` and in what context they are called.
   
   First, `entity_exists()` is called by `merge_vertex()` and 
`create_vertex()`, both of which take a `cypher_target_node` struct as an 
argument. That struct has a `char *label_name` field, so the `label_name` is 
accessible in the context where `entity_exists()` is called.
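   
   As a minimal sketch of how the duplicate table could be located once the 
`label_name` is known: the `_hash` suffix follows the `Person_hash` example in 
the proposal quoted above, and the helper name `build_hash_table_name` is 
hypothetical, not an existing AGE function.
   
   ```c
   #include <stdio.h>
   #include <string.h>

   /*
    * Hypothetical helper (not part of AGE): derive the name of the ID-only
    * duplicate table from a label name, e.g. "Person" -> "Person_hash".
    * Returns 0 on success, -1 on bad input or if the buffer is too small.
    */
   static int build_hash_table_name(const char *label_name, char *out,
                                    size_t out_size)
   {
       int written;

       if (label_name == NULL || label_name[0] == '\0' ||
           out == NULL || out_size == 0)
           return -1;

       /* snprintf reports the length it wanted to write, so truncation
        * shows up as written >= out_size. */
       written = snprintf(out, out_size, "%s_hash", label_name);
       if (written < 0 || (size_t) written >= out_size)
           return -1;

       return 0;
   }
   ```
   
   A real implementation would also have to make sure the suffixed name still 
fits within PostgreSQL's identifier length limit.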
   
   `filter_vertices_on_label_id()` takes `char *label` as an argument, so the 
`label_name` is already available inside the function. Finally, 
`get_label_name()` is called by `age_startnode()` and `age_endnode()` in the 
executor stage, and it has no way to access the `label_name` without using the 
`graphid`. The problem is that `age_startnode()` and `age_endnode()` are not 
internal C functions: their purpose is to take an `edge` and return either its 
start node or its end node. The only way to do that right now is through the 
node's `ID`, because that is the only information the `edge` holds about its 
2 vertices.
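   
   To illustrate why the node's `ID` alone is enough today, here is a rough 
sketch of how a label ID can be packed into and recovered from a 64-bit ID. It 
assumes the label ID sits in the top 16 bits and the per-label entry ID in the 
low 48 bits, which is how I understand the current `graphid` layout. The type 
and function names are illustrative only, not AGE's actual macros.
   
   ```c
   #include <stdint.h>

   /* Assumed layout: [ 16-bit label id | 48-bit entry id ] */
   #define SKETCH_ENTRY_ID_BITS 48
   #define SKETCH_ENTRY_ID_MASK UINT64_C(0x0000ffffffffffff)

   typedef int64_t graphid_sketch;

   /* Pack a (non-negative) label id and entry id into one 64-bit value. */
   static graphid_sketch make_graphid_sketch(int32_t label_id, int64_t entry_id)
   {
       uint64_t hi = ((uint64_t) label_id) << SKETCH_ENTRY_ID_BITS;
       uint64_t lo = ((uint64_t) entry_id) & SKETCH_ENTRY_ID_MASK;

       return (graphid_sketch) (hi | lo);
   }

   /* Recover the label id: this is the information an edge's start_id or
    * end_id carries about the vertex's label. */
   static int32_t label_id_of(graphid_sketch id)
   {
       return (int32_t) (((uint64_t) id) >> SKETCH_ENTRY_ID_BITS);
   }

   /* Recover the per-label entry id. */
   static int64_t entry_id_of(graphid_sketch id)
   {
       return (int64_t) (((uint64_t) id) & SKETCH_ENTRY_ID_MASK);
   }
   ```
   
   If `graphid` is dropped, an edge's `start_id` and `end_id` no longer carry 
this label information, so it would have to come from somewhere else.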
   
   In order to completely remove the `graphid` utility, we need to come up 
with a solution specifically for the `edge` structure, because 
`get_label_name()` is called by the 2 functions (`age_startnode()`, 
`age_endnode()`) that take only an edge as argument. Otherwise, we need to 
change or remove those 2 functions completely.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

