[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm

Greg Hogan (JIRA) Wed, 11 May 2016 14:00:35 -0700

    [ 
https://issues.apache.org/jira/browse/FLINK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15280826#comment-15280826
 ]


Greg Hogan commented on FLINK-3879:
-----------------------------------

The current expectation for FLINK-2044 is that the user will have some 
intuition on how many iterations to run. If there is a specific best value then 
it should be hard-coded in the program; otherwise, we should provide guidance 
on how to select the number of iterations. I do not see discussion of a 
convergence threshold in the HITS paper, but borrowing this aspect of PageRank 
seems appropriate. A threshold of 1E-9 requires that the average score is 
changing by less than one part in a billion. That seems easier to conceptualize 
than setting the number of iterations.

I think a benchmark is always good for context when comparing implementations.

When does Scatter-Gather outperform GSA? I see that some algorithms cannot be 
implemented using GSA, but that is clearly not the case for these algorithms.

The only delta iteration I am familiar with is Stephan's implementation of 
PageRank. Are you thinking of a similar modification to HITS? It should be 
straightforward to implement this and compare techniques for HITS and PageRank. 
Convergence should not be affected, but vertex scores below a given threshold 
will stop propagating deltas.
  https://gist.github.com/StephanEwen/2b1a4c9812ac46abc8f0

> Native implementation of HITS algorithm
> ---------------------------------------
>
>                 Key: FLINK-3879
>                 URL: https://issues.apache.org/jira/browse/FLINK-3879
>             Project: Flink
>          Issue Type: New Feature
>          Components: Gelly
>    Affects Versions: 1.1.0
>            Reporter: Greg Hogan
>            Assignee: Greg Hogan
>             Fix For: 1.1.0
>
>
> Hyperlink-Induced Topic Search (HITS, also "hubs and authorities") is 
> presented in [0] and described in [1].
> "[HITS] is a very popular and effective algorithm to rank documents based on 
> the link information among a set of documents. The algorithm presumes that a 
> good hub is a document that points to many others, and a good authority is a 
> document that many documents point to." 
> [https://pdfs.semanticscholar.org/a8d7/c7a4c53a9102c4239356f9072ec62ca5e62f.pdf]
> This implementation differs from FLINK-2044 by providing for convergence, 
> outputting both hub and authority scores, and completing in half the number 
> of iterations.
> [0] http://www.cs.cornell.edu/home/kleinber/auth.pdf
> [1] https://en.wikipedia.org/wiki/HITS_algorithm



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (FLINK-3879) Native implementation of HITS algorithm

Reply via email to