Hello!

The current PySpark implementation has one serious problem: it won't
work with the new Spark Connect because it relies on `py4j` and the
internal `_jvm` variable.
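
To illustrate the issue, here is a minimal sketch of the pattern; the
class path is made up purely to show the wrapper style, not the exact
GraphAr API:

```python
from pyspark.sql import SparkSession

# --- Classic py4j-backed session (plain `pyspark`, spark-submit, ...) ---
# The client holds a gateway into the driver JVM, which is what the
# current wrappers rely on. The class path below is illustrative only.
spark = SparkSession.builder.getOrCreate()
graph_info_cls = spark._jvm.org.example.graphar.GraphInfo  # py4j only

# --- Spark Connect session (would run in a separate environment) ---
# The client talks to the server over gRPC and has no local JVM at all,
# so the same attribute access raises an error: JVM attributes are not
# supported under Spark Connect.
# spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
# spark._jvm  # -> error: no py4j gateway here
```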

My suggestion is to rewrite the PySpark API from scratch in the
following way:

1. We will have pure Python GraphInfo, EdgeInfo and VertexInfo
2. We will have pure PySpark utils (index generators)
3. We will use the Spark Scala data sources for writing and reading the
GAR format from PySpark (a rough sketch follows this list)
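
For point 3, the idea is roughly the following (the format name and
option names below are assumptions, to be replaced with whatever the
Scala data source actually registers): the Python side only builds
DataFrame reads and writes, which Spark Connect can serialize into its
logical plan, so no `_jvm` access is needed at all.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumed format name; the real registered name/alias would go here.
GAR_FORMAT = "org.apache.graphar.datasources.GarDataSource"

# Reading vertex chunks: pure DataFrame API, no direct JVM access,
# so the same code works over both classic py4j and Spark Connect.
vertices = (
    spark.read.format(GAR_FORMAT)
    .option("fileType", "parquet")  # option name is illustrative
    .load("/path/to/vertex/person/")
)

# Writing is symmetrical: the heavy lifting stays in the Scala data source.
(
    vertices.write.format(GAR_FORMAT)
    .option("fileType", "parquet")
    .mode("overwrite")
    .save("/path/to/output/person/")
)
```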

It is a lot of work, but I'm committing to do it and to support it in
the future as a PMC member of the project. Decoupling PySpark from
Scala will also simplify Scala/Java development. Another benefit is
that the actual logic in PySpark will live mostly in Python code, which
simplifies reading the source code and debugging for everyone who wants
to work with the library.

A couple of additional dependencies will be introduced (see the sketch
after this list for how they would work together):
1. Pydantic for working with YAML models of the Info objects (MIT
License, pure Python)
2. PyYAML for the same reason (MIT License, pure Python/Cython)
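
A sketch of how the two would be combined; the field names are
assumptions loosely following the GAR YAML layout, not the final
schema: PyYAML parses the `*.yml` file into plain dicts, and Pydantic
validates them into typed Info objects.

```python
from typing import List, Optional

import yaml
from pydantic import BaseModel


# Field names are assumptions sketching the idea, not the final GAR schema.
class PropertyModel(BaseModel):
    name: str
    data_type: str
    is_primary: bool = False


class PropertyGroupModel(BaseModel):
    prefix: Optional[str] = None
    file_type: str
    properties: List[PropertyModel]


class VertexInfoModel(BaseModel):
    label: str
    chunk_size: int
    prefix: Optional[str] = None
    property_groups: List[PropertyGroupModel]
    version: str


def load_vertex_info(path: str) -> VertexInfoModel:
    """Parse a vertex info YAML file into a validated, typed object."""
    with open(path) as f:
        raw = yaml.safe_load(f)                 # PyYAML: YAML -> dicts
    return VertexInfoModel.model_validate(raw)  # Pydantic v2 validation
```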

Overall it should be good for the project, because it will simplify
testing of both parts (Spark and PySpark).

I see GraphAr PySpark being used mostly not in production ETL jobs but
in interactive development and ad-hoc analytics on graph data.
Typically such analytics happens in Databricks notebooks (which do not
provide access to `_jvm` on shared clusters) or in other tools (like
the VS Code Spark Connect integration) that rely on Spark Connect. So,
for that use case, Spark Connect support may be more important than for
the Spark Scala part, which should be used for jobs rather than
interactive development.

Thoughts?
