[
https://issues.apache.org/jira/browse/MADLIB-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16165416#comment-16165416
]
Frank McQuillan edited comment on MADLIB-1124 at 9/13/17 10:56 PM:
-------------------------------------------------------------------
I had a look at the PR and checked the following:
1) user doc examples work OK as shown
2) I tried the toy example from slide 8 of
http://www.cis.hut.fi/Opinnot/T-61.6020/2008/pagerank_hits.pdf
{code}
DROP TABLE IF EXISTS vertex, edge;
CREATE TABLE vertex(
    id INTEGER
);
CREATE TABLE edge(
    src INTEGER,
    dest INTEGER,
    user_id INTEGER
);
INSERT INTO vertex VALUES
    (0),
    (1),
    (2),
    (3);
INSERT INTO edge VALUES
    (0, 1, 1),
    (0, 2, 1),
    (0, 3, 1),
    (1, 2, 1),
    (1, 3, 1),
    (2, 1, 1);
SELECT * FROM edge ORDER BY src, dest;
{code}
produces
{code}
 src | dest | user_id
-----+------+---------
   0 |    1 |       1
   0 |    2 |       1
   0 |    3 |       1
   1 |    2 |       1
   1 |    3 |       1
   2 |    1 |       1
(6 rows)
{code}
Run HITS
{code}
DROP TABLE IF EXISTS hits_out, hits_out_summary;
SELECT madlib.hits(
    'vertex',              -- Vertex table
    'id',                  -- Vertex id column
    'edge',                -- Edge table
    'src=src, dest=dest',  -- Comma-delimited string of edge arguments
    'hits_out',            -- Output table of HITS
    100);                  -- Max iterations
SELECT * FROM hits_out ORDER BY id;
{code}
produces
{code}
 id |     authority     |        hub
----+-------------------+-------------------
  0 |                 0 | 0.788680749581252
  1 | 0.459746429928187 | 0.577334927798041
  2 | 0.627946343316548 | 0.211345821783211
  3 | 0.627946343316548 |                 0
(4 rows)
{code}
which matches the reference (see attached picture)
——————
Here are my comments on the user docs:
1) Please reference the original paper by Kleinberg in addition to Wikipedia.
2) Please fix the note format under grouping_cols (the yellow bar is missing). See
the PageRank docs for what I mean.
3) There is a formatting issue below example 2; it occurs 3 times with
__iterations__
4) The docs for out_table say:
TEXT. Name of the table to store the result of HITS. It will contain a row for
every vertex from 'vertex_table' with the following columns:
vertex_id : The id of a vertex. Will use the input parameter 'vertex_id' for
column naming.
auth : The vertex's Authority score.
hub : The vertex's Hub score.
but the output column is actually named "authority", not "auth", so please change
the docs to match the output:
{code}
id  authority          hub
0   8.43871829095e-07  0.338306115082
1   0.158459587238     0.527865350448
2   0.40562796969      0.675800764727
3   0.721775835523     3.95111934817e-07
4   0.158459587238     3.95111934817e-07
5   0.316385413094     0.189719957843
6   0.405199928762     0.337944978189
{code}
5) Please indicate that these parameters are optional:
max_iter (optional)
threshold (optional)
> Graph - HITS algorithm
> ----------------------
>
> Key: MADLIB-1124
> URL: https://issues.apache.org/jira/browse/MADLIB-1124
> Project: Apache MADlib
> Issue Type: New Feature
> Components: Module: Graph
> Reporter: Frank McQuillan
> Assignee: Jingyi Mei
> Fix For: v2.0
>
> Attachments: pagerank_hits.png
>
>