[ 
https://issues.apache.org/jira/browse/FLINK-8041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Knauf updated FLINK-8041:
------------------------------------
      Labels:   (was: auto-closed)
    Priority: Not a Priority  (was: Minor)

> Recommended Improvements for Gelly's Connected Components Algorithm
> -------------------------------------------------------------------
>
>                 Key: FLINK-8041
>                 URL: https://issues.apache.org/jira/browse/FLINK-8041
>             Project: Flink
>          Issue Type: Improvement
>          Components: Library / Graph Processing (Gelly)
>    Affects Versions: 1.3.2
>         Environment: Linux, IntelliJ IDEA
>            Reporter: Christos Hadjinikolis
>            Priority: Not a Priority
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> At the moment, the ConnectedComponents algorithm that comes with Flink's 
> native Graph API (Gelly) has two issues:
> 1. It relies on the user to provide correct values for in the vertices 
> DataSet. Based on how the algorithm works, these values must be of type Long 
> and be unique for every vertex. If the user provides the same values for 
> every vertex (e.g. 1) the algorithm still works but as those values are used 
> for the identification of the different connected components, one will end up 
> with a single connected component and will have no clue as to why this 
> happened. This can be easily fixed in two ways: either by checking that the 
> values that appear alongside vertex-ids are unique and informing the user if 
> not, or by generating those values for every vertex before the algorithm is 
> ran. I have a running implementation of the second way and I really think it 
> is an appropriate solution to this problem.
> 2. Once the connected components are identified, one has to apply additional 
> transformations and actions to find out which is the biggest or the order of 
> the connected components in terms of their size. Alternatively, the algorithm 
> can be implemented so that numerical ids that are given to each component 
> reflect their ranking when ordered based on size, e.g. connected component 1 
> will be the biggest, connected component 2 should be the second biggest and 
> so on.  I have also solved this and I think it would make a nice addition. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to