[
https://issues.apache.org/jira/browse/FLINK-8041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Christos Hadjinikolis updated FLINK-8041:
-----------------------------------------
Description:
At the moment, the ConnectedComponents algorithm that comes with Flink's native
Graph API (Gelly) has two issues:
1. It relies on the user to provide correct values for in the vertices DataSet.
Based on how the algorithm works, these values must be of type Long and be
unique for every vertex. If the user provides the same values for every vertex
(e.g. 1) the algorithm still works but as those values are used for the
identification of the different connected components, one will end up with a
single connected component and will have no clue as to why this happened. This
can be easily fixed in two ways: either by checking that the values that appear
alongside vertex-ids are unique and informing the user if not, or by generating
those values for every vertex before the algorithm is ran. I have a running
implementation of the second way and I really think it is an appropriate
solution to this problem.
2. Once the connected components are identified, one has to apply additional
transformations and actions to find out which is the biggest or the order of
the connected components in terms of their size. Alternatively, the algorithm
can be implemented so that numerical ids that are given to each component
reflect their ranking when ordered based on size, e.g. connected component 1
will be the biggest, connected component 2 should be the second biggest and so
on. I have also solved this and I think it would make a nice addition.
was:
At the moment, the ConnectedComponents algorithm that comes with Flink's native
Graph API (Gelly) has two issues:
1. It relies on the user to provide correct values for in the vertices DataSet.
Based on how the algorithm works, these values must be of type Long and be
unique for every vertex. If the user provides the same values for every vertex
(e.g. 1) the algorithm still works but as those values are used for the
identification of the different connected components, one will end up with a
single connected component and will have no clue as to why this happened. This
can be easily fixed in two ways: either by checking that the values that appear
alongside vertex-ids are unique and informing the user if not, or by generating
those values for every vertex before the algorithm is ran. I have a running
implementation of the second way and I really think it is an appropriate
solution to this problem.
2. Once the connected components are identified, one has to apply additional
transformations and actions to find out which is the biggest or the order of
the connected components in terms of their size. Alternatively, the algorithm
can be implemented so that numerical ids that are given to each component
reflect their ranking when ordered based on size, e.g. connected component 1
will be the biggest, connected component 2 should be the second biggest and so
on.
> Recommended Improvements for Gelly's Connected Components Algorithm
> -------------------------------------------------------------------
>
> Key: FLINK-8041
> URL: https://issues.apache.org/jira/browse/FLINK-8041
> Project: Flink
> Issue Type: Improvement
> Components: Gelly
> Affects Versions: 1.3.2
> Environment: Linux, IntelliJ IDEA
> Reporter: Christos Hadjinikolis
> Priority: Minor
> Fix For: 1.4.0
>
> Original Estimate: 336h
> Remaining Estimate: 336h
>
> At the moment, the ConnectedComponents algorithm that comes with Flink's
> native Graph API (Gelly) has two issues:
> 1. It relies on the user to provide correct values for in the vertices
> DataSet. Based on how the algorithm works, these values must be of type Long
> and be unique for every vertex. If the user provides the same values for
> every vertex (e.g. 1) the algorithm still works but as those values are used
> for the identification of the different connected components, one will end up
> with a single connected component and will have no clue as to why this
> happened. This can be easily fixed in two ways: either by checking that the
> values that appear alongside vertex-ids are unique and informing the user if
> not, or by generating those values for every vertex before the algorithm is
> ran. I have a running implementation of the second way and I really think it
> is an appropriate solution to this problem.
> 2. Once the connected components are identified, one has to apply additional
> transformations and actions to find out which is the biggest or the order of
> the connected components in terms of their size. Alternatively, the algorithm
> can be implemented so that numerical ids that are given to each component
> reflect their ranking when ordered based on size, e.g. connected component 1
> will be the biggest, connected component 2 should be the second biggest and
> so on. I have also solved this and I think it would make a nice addition.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)