kaknikhil commented on code in PR #594: URL: https://github.com/apache/madlib/pull/594#discussion_r1114882309
########## src/ports/postgres/modules/graph/wcc.py_in: ########## @@ -47,14 +48,20 @@ from graph_utils import validate_output_and_summary_tables def validate_wcc_args(schema_madlib, vertex_table, vertex_id, edge_table, edge_params, out_table, out_table_summary, - grouping_cols_list, module_name): + grouping_cols_list, warm_start, out_table_message, module_name): """ Function to validate input parameters for wcc """ validate_graph_coding(vertex_table, vertex_id, edge_table, edge_params, - out_table, module_name) - _assert(not table_exists(out_table_summary), - "Graph {module_name}: Output summary table already exists!".format(**locals())) + out_table, module_name, warm_start) + if not warm_start: + _assert(not table_exists(out_table_summary), Review Comment: Maybe we could also assert that the msg table doesn't exist for non warm start scenarios ? ########## src/ports/postgres/modules/graph/graph_utils.py_in: ########## @@ -74,14 +74,18 @@ def validate_output_and_summary_tables(model_out_table, module_name, "Graph WCC: Output table {0} already exists.".format(out_table)) def validate_graph_coding(vertex_table, vertex_id, edge_table, edge_params, - out_table, func_name, **kwargs): + out_table, func_name, warm_start = False, **kwargs): """ Validates graph tables (vertex and edge) as well as the output table. """ _assert(out_table and out_table.strip().lower() not in ('null', ''), - "Graph {func_name}: Invalid output table name!".format(**locals())) - _assert(not table_exists(out_table), - "Graph {func_name}: Output table already exists!".format(**locals())) + "Graph {func_name}: Invalid output table name!".format(**locals())) + if warm_start: + _assert(table_exists(out_table), Review Comment: Shouldn't we also assert that the msg table exists ? ########## src/ports/postgres/modules/graph/wcc.py_in: ########## @@ -47,14 +48,20 @@ from graph_utils import validate_output_and_summary_tables def validate_wcc_args(schema_madlib, vertex_table, vertex_id, edge_table, edge_params, out_table, out_table_summary, - grouping_cols_list, module_name): + grouping_cols_list, warm_start, out_table_message, module_name): """ Function to validate input parameters for wcc """ validate_graph_coding(vertex_table, vertex_id, edge_table, edge_params, - out_table, module_name) - _assert(not table_exists(out_table_summary), - "Graph {module_name}: Output summary table already exists!".format(**locals())) + out_table, module_name, warm_start) + if not warm_start: + _assert(not table_exists(out_table_summary), + "Graph {module_name}: Output summary table already exists!".format(**locals())) + else: + _assert(table_exists(out_table_summary), + "Graph {module_name}: Output summary table is missing for warm start!".format(**locals())) + _assert(table_exists(out_table_message), + "Graph {module_name}: Output message table is missing for warm start! Either the wcc was completed in the last run or the table got dropped/renamed.".format(**locals())) Review Comment: s/the wcc/wcc ########## src/ports/postgres/modules/graph/wcc.py_in: ########## @@ -211,6 +227,14 @@ def wcc(schema_madlib, vertex_table, vertex_id, edge_table, edge_args, else: edge_inverse = edge_table + if warm_start: + out_table_sql = "" + msg_sql = "" + if vertex_type != "BIGINT[]" and vertex_id_in and vertex_id_in != 'id': Review Comment: We do the same check later again, I'm assuming we need both ? ########## src/ports/postgres/modules/graph/test/wcc.sql_in: ########## @@ -276,3 +276,12 @@ SELECT graph_wcc_num_cpts( SELECT assert(relative_error(num_components, 3) < 0.00001, 'Weakly Connected Components: Incorrect largest component value.' ) FROM count_table WHERE user_id1=1; + +DROP TABLE IF EXISTS wcc_warm_start_out, wcc_warm_start_out_summary; +SELECT weakly_connected_components('v2',NULL,'e2',NULL,'wcc_warm_start_out', 'user_id', 2); Review Comment: 1. Can there be any issues if the user upgrades madlib and then uses the new function with warm start set to true on the previous out tables ? We might have to explicitly call this out as well 2. We should also add a test case for this scenario (This is based on the assumption that it takes 5 iterations to update all the nodes) * Run 2 iterations without warm start * Run 2 more with warm start set to true * Run 1 or 2 more with warm start set to true. For the first two runs, we should at the very least assert that nodes_to_update > 0 and for the final run, we should assert the contents of the out table and also that nodes_to_update = 0 3. There are still tables leftover after each warm start wcc run(with or without grouping). I think it might be a good idea to add a test to make sure that there aren't any extra tables left behind 4. Another scenario to consider: * Let's say it takes 10 iterations to update all the nodes in the graph * User runs wcc with iteration limit of 2 * User runs wcc with iteration limit of 3 and warm start * User runs wcc with iteration limit of 3 and warm start but removes the grouping col What is the ideal behavior here ? (same applies if we start with no grouping col and then start using one before all the nodes have been updated) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@madlib.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org