Re: [DISCUSS] Gelly planning for release 1.3 and roadmap

Greg Hogan Wed, 01 Mar 2017 09:02:26 -0800

On Fri, Feb 24, 2017 at 1:43 PM, Vasiliki Kalavri <vasilikikala...@gmail.com> wrote:
Hi Greg,

On 24 February 2017 at 18:09, Greg Hogan <c...@greghogan.com> wrote:

> Thanks, Vasia, for starting the discussion.
>
> I was expecting more changes from the recent discussion on restructuring
> the project, in particular regarding the libraries. Gelly has always
> collected algorithms and I have personally taken an algorithms-first
> approach for contributions. Is that manageable and maintainable? I'd prefer
> to see no limit to good contributions, and if necessary split the codebase
> or the project.

I don't think there should be a limit either. I do think though that
development should be community-driven, i.e. not making contributions just
for the sake of it, but evaluating their benefit first.
The library already has a quite long list of algorithms. Shall we keep on
extending it? And if yes, how do we choose which algorithms to add? Do we
accept any algorithm even if nobody has asked for it? So far, we've
added algorithms that we thought were useful and common. But continuing to
extend the library like this doesn't seem maintainable to me, because we
might end up with a lot of code to maintain that nobody uses. On the other
hand, adding more algorithms might attract more users, so I see a trade-off
there.

I only count 10 algorithms (with a few new in review). I’m not envisioning 
users thinking of their algorithm then scouring the web looking for a great 
implementation. I think Gelly reaches a larger audience as a diverse collection 
of well-implemented large scale algorithms on a distributed streaming 
processor. We do want to provide algorithms for which Flink is well-suited 
(parallel algorithms; large, sparse datasets; variable length results).

Flink has a stable API so maintenance should be minimized.

AdamicAdar
ClusteringCoefficient
CommunityDetection
ConnectedComponents
HITS
JaccardIndex
LabelPropagation
PageRank
SSSP
Summarization

> If so, then a secondary goal is to make the algorithms user-accessible and
> easier to review (especially at scale!). FLINK-4949 rewrites
> flink-gelly-examples with modular inputs and algorithms, allows users to
> run all existing algorithms, and makes it trivial to create a driver for
> new algorithms (and when comparing different implementations).

I'm +1 for anything that makes using existing functionality easier.
FLINK-4949 sounds like a great addition. Could you maybe extend the JIRA
and/or PR description a bit? I understand the rationale but it would be
nice to have a high-level description of the changes and the new
functionality that the PR adds or the interfaces it modifies. Otherwise, it
will be difficult to review a PR with +5k line changes :)

I’ve broken this into sub-tasks. I was anticipating a chop-and-rebase review.

> Regarding BipartiteGraphs, without algorithms or ideas for algorithms it's
> not possible to review the structure of the open pull requests.

I'm not sure I understand this point. There was a design document and an
extensive discussion on this issue. Do you think we should revisit it? Some
common algorithms for bipartite graphs that I am aware of are SALSA for
recommendations and relevance search for anomaly detection.

Gelly supports both directed and undirected graphs using a single `Graph` 
class. Do these new bipartite algorithms require a new BipartiteGraph class or 
can they reuse just an Edge set? I’ve added some comments to the open PRs that 
would reduce the amount of code, and my concerns are only maintainability and 
interoperability.

> +1 to evaluating performance and promoting Flink!
>
> Gelly has two shepherds whereas CEP and ML share one committer. New
> algorithms in Gelly require new features in the Batch API (Gelly may also
> start doing streaming, we're cool kids, too)

^^

> so we need to find a process
> for snuffing ideas early and for the right balance in dependence on core
> committers' time. For example, reworking the iteration scheduler to allow
> for intermediate outputs and nested iterations. Can this feature be
> developed and reviewed within Gelly? Does it need the blessing of a Stephan
> or Fabian? I'd like to see contributors and committers less dependent on
> the core team and more autonomous.
>

What do you mean by developed and reviewed "within Gelly"?
This feature would require changes in the batch iterations code and will
probably need to be proposed and reviewed as a FLIP, so it would need the
blessing of the community :)

Having someone who is more familiar with this part of the code help is of
course favorable, but I don't think it's absolutely necessary.

I don’t know that this under-the-hood improvement would necessitate a FLIP but 
I also do not yet understand the changes required. My thought was only that if 
this is a very important feature for Gelly, we could plan, develop, and do the 
initial reviews among the Gelly contributors. Another example would be 
FLINK-3695, where I initially envisioned adding ValueArray types to flink-core 
but now think it would be better to integrate them into Gelly first.
