Hi all,

I want to clarify my role first to avoid misunderstanding. I'm an
individual contributor here. My work on the graph SPIP, as well as on
other Spark features I have contributed to, is not associated with my
employer. It has become quite challenging for me to keep track of the
graph SPIP work because I have less time available at home.

In retrospect, we should have involved more Spark devs and committers
early on so that there would be no single point of failure, i.e., me.
Hopefully it is not too late to fix this. I summarize my thoughts here to
help onboard other reviewers:

1. On the technical side, my main concern is the runtime dependency on
org.opencypher:okapi-shade. okapi depends on several Scala libraries. We
came up with the solution of shading a few Scala libraries to avoid
pollution. However, I'm not confident that the approach is sustainable,
for two reasons: a) there is no proper shading library for Scala, and
b) we will have to wait for upgrades from those Scala libraries before we
can upgrade Spark to use a newer Scala version. So it would be great if
some Scala experts could help review the current implementation and help
assess the risk.

2. Overloading helper methods. MLlib used to have several overloaded helper
methods for each algorithm, which later became a major maintenance burden.
Builders and setters/getters are more maintainable. I will comment again on
the PR.

3. The proposed API partitions a graph into sub-graphs, as described in
the property graph model. It is unclear to me how this would affect query
performance, because it requires the SQL optimizer to correctly recognize
data from the same source and make execution efficient.

4. The feature, although originally targeted for Spark 3.0, should not be a
Spark 3.0 release blocker because it doesn't require breaking changes. If
we miss the code freeze deadline, we can introduce a build flag to exclude
the module from the official release/distribution, and then include it by
default once the module is ready.

5. If, unfortunately, we still don't see sufficient committer reviews, I
think the best option would be to submit the work to the Apache Incubator
instead to unblock it. But maybe it is too early to discuss this option.
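
To illustrate point 2, here is a minimal sketch of the two styles. It is
written in Python for brevity, and every name in it is hypothetical; none
of them are Spark, MLlib, or SPIP APIs. It only shows why each new
parameter forces another helper in the overloaded style, while the builder
style needs just one more setter/getter pair.

```python
# Hypothetical illustration only: these names are NOT Spark or MLlib APIs.
from dataclasses import dataclass, replace


@dataclass
class AlgoParams:
    """Parameters of an imaginary algorithm, with defaults."""
    max_iter: int = 20
    tol: float = 1e-6
    reset_prob: float = 0.15


# Overloaded-helper style: one helper per parameter combination.
# Adding a fourth parameter means adding yet more helpers, and every
# helper that was ever released has to be kept for compatibility.
def run_with_max_iter(max_iter):
    return AlgoParams(max_iter=max_iter)


def run_with_max_iter_and_tol(max_iter, tol):
    return AlgoParams(max_iter=max_iter, tol=tol)


def run_with_all(max_iter, tol, reset_prob):
    return AlgoParams(max_iter=max_iter, tol=tol, reset_prob=reset_prob)


# Builder style: a new parameter only adds one setter/getter pair,
# and existing call sites keep working unchanged.
class AlgoBuilder:
    def __init__(self):
        self._params = AlgoParams()

    def set_max_iter(self, value):
        self._params.max_iter = value
        return self  # returning self allows chaining

    def get_max_iter(self):
        return self._params.max_iter

    def set_tol(self, value):
        self._params.tol = value
        return self

    def get_tol(self):
        return self._params.tol

    def set_reset_prob(self, value):
        self._params.reset_prob = value
        return self

    def get_reset_prob(self):
        return self._params.reset_prob

    def build(self):
        # Return a copy so later setter calls do not affect the result.
        return replace(self._params)


params = AlgoBuilder().set_max_iter(5).set_tol(1e-3).build()
```

In this sketch, callers set only the parameters they care about and pick up
defaults for the rest, which is the property that makes the builder style
easier to evolve.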

It would be great if other committers can offer help on the review! Really
appreciated!

Best,
Xiangrui

On Fri, Oct 4, 2019 at 1:32 AM Mats Rydberg <m...@neo4j.org.invalid> wrote:

> Hello dear Spark community
>
> We are the developers behind the SparkGraph SPIP, which is a project
> created out of our work on openCypher Morpheus (
> https://github.com/opencypher/morpheus). During this year we have
> collaborated with mainly Xiangrui Meng of Databricks to define and develop
> a new SparkGraph module based on our experience from working on Morpheus.
> Morpheus - formerly known as "Cypher for Apache Spark" - has been in
> development for over 3 years and matured in its API and implementation.
>
> The SPIP work has been on hold for a period of time now, as priorities at
> Databricks have changed which has occupied Xiangrui's time (as well as
> other happenings). As you may know, the latest API PR (
> https://github.com/apache/spark/pull/24851) is blocking us from moving
> forward with the implementation.
>
> In an attempt to not lose track of this project we now reach out to you to
> ask whether there are any Spark committers in the community who would be
> prepared to commit to helping us review and merge our code contributions to
> Apache Spark? We are not asking for lots of direct development support, as
> we believe we have the implementation more or less completed already since
> early this year. There is a proof-of-concept PR (
> https://github.com/apache/spark/pull/24297) which contains the
> functionality.
>
> If you could offer such aid it would be greatly appreciated. None of us
> are Spark committers, which is hindering our ability to deliver this
> project in time for Spark 3.0.
>
> Sincerely
> the Neo4j Graph Analytics team
> Mats, Martin, Max, Sören, Jonatan
>
>
