[GitHub] madlib pull request #195: Feature: Add grouping support to HITS

jingyimei Wed, 15 Nov 2017 00:56:48 -0800

Github user jingyimei commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/195#discussion_r151064800
  
    --- Diff: src/ports/postgres/modules/graph/hits.py_in ---
    @@ -95,234 +109,391 @@ def hits(schema_madlib, vertex_table, vertex_id, 
edge_table, edge_args,
                                 FROM {1}
                             """.format(vertex_id, vertex_table))[0]["cnt"]
     
    -        # Assign default threshold value based on number of nodes in the 
graph.
             if threshold is None:
    -            threshold = 1.0 / (n_vertices * 1000)
    +            threshold = 
get_default_threshold_for_centrality_measures(n_vertices)
     
    +        # table/column names used when grouping_cols is set.
             edge_temp_table = unique_string(desp='temp_edge')
    +        grouping_cols_comma = grouping_cols + ',' if grouping_cols else ''
             distribution = ('' if is_platform_pg() else
    -                        "DISTRIBUTED BY ({0})".format(dest))
    -        plpy.execute("DROP TABLE IF EXISTS {0}".format(edge_temp_table))
    +                        "DISTRIBUTED BY ({0}{1})".format(
    +                            grouping_cols_comma, dest))
    +        drop_tables([edge_temp_table])
             plpy.execute("""
                     CREATE TEMP TABLE {edge_temp_table} AS
                     SELECT * FROM {edge_table}
                     {distribution}
                 """.format(**locals()))
     
    -        # GPDB and HAWQ have distributed by clauses to help them with 
indexing.
    -        # For Postgres we add the index explicitly.
    -        if is_platform_pg():
    -            plpy.execute("CREATE INDEX ON {0}({1})".format(
    -                edge_temp_table, dest))
    +        
######################################################################
    +        # Set initial authority_norm and hub_norm as 1, so that later the 
final
    +        # norm should be positive number
    +        authority_init_value = 1.0
    +        hub_init_value = 1.0
    +
    +        subquery1 = unique_string(desp='subquery1')
    +
    +        distinct_grp_table = ''
    +        select_grouping_cols_comma = ''
    +        select_subquery1_grouping_cols_comma = ''
    --- End diff --
    
    I understood the incentive to rename those variables to make them more 
meaningful, however, seems this doesn't help simplify/reduce code but extends 
the length of many lines. When I read the code I still feel hard to keep track 
of them and understand the query. In this case I feel a shorter variable name 
is even simpler to read.

---

[GitHub] madlib pull request #195: Feature: Add grouping support to HITS

Reply via email to