Hello! The current PySpark implementation has one serious problem: it won't work with the new Spark Connect, because it relies on `py4j` and the internal `_jvm` variable.
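To make the problem concrete, here is a rough illustration of the pattern that breaks (the GraphAr class and method names are only illustrative, not copied from the current code):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()

    # Classic PySpark exposes the py4j gateway, so the current wrappers can
    # call into the Scala library roughly like this (illustrative names):
    jvm_graph_info = spark._jvm.org.apache.graphar.GraphInfo.loadGraphInfo(
        "/tmp/graphar/graph.yml", spark._jsparkSession
    )

    # With Spark Connect there is no JVM on the client side at all:
    # spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
    # spark._jvm  # AttributeError: a Connect session has no py4j gateway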
My suggestion is to rewrite the PySpark API from scratch in the following way:

1. Pure Python GraphInfo, EdgeInfo and VertexInfo classes
2. Pure PySpark utils (index generators)
3. Use the Spark Scala datasources for writing and reading the GAR format from PySpark

It is a lot of work, but I'm committing to do it and to support it in the future as a PMC member of the project. Decoupling PySpark from Scala will also simplify Scala/Java development. Another good point is that the actual logic in PySpark will be mostly plain Python code, which simplifies reading the source and debugging for everyone who wants to work with the library.

A couple of additional dependencies would be introduced:

1. Pydantic, for working with the YAML models of the Info objects (MIT License, pure Python)
2. PyYAML, for the same reason (MIT License, pure Python/Cython)

Overall it should be good for the project, because it will simplify testing of both parts (Spark and PySpark).

I see GraphAr PySpark used mostly not in production ETL jobs but in interactive development and ad-hoc analytics on graph data. Typically such analytics happens in Databricks notebooks (which do not provide access to `_jvm` on shared clusters) or in other tools (like VS Code with Spark Connect) that rely on Spark Connect. So for that use case, Spark Connect support may matter more for PySpark than it does for the Spark Scala part, which is meant for jobs rather than interactive development.

Rough sketches of points 1 and 3 are below.

Thoughts?
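For point 1, a minimal sketch of what a pure Python VertexInfo could look like with Pydantic + PyYAML (the field names here are a simplified guess at the GAR vertex YAML layout, not the final schema):

    import yaml
    from pydantic import BaseModel


    class PropertyGroupModel(BaseModel):
        prefix: str
        file_type: str
        properties: list[dict]


    class VertexInfoModel(BaseModel):
        label: str
        chunk_size: int
        prefix: str
        property_groups: list[PropertyGroupModel]
        version: str


    # Parse the YAML with PyYAML and validate it with Pydantic in one step.
    with open("person.vertex.yml") as f:
        vertex_info = VertexInfoModel(**yaml.safe_load(f))

    print(vertex_info.label, vertex_info.chunk_size)

For point 3, reading and writing GAR data from PySpark would go through the regular DataFrame reader/writer, so it works the same on classic Spark and on Spark Connect, because the Scala datasource runs on the server side. The format name and option below are placeholders for whatever the Scala datasource actually registers:

    from pyspark.sql import SparkSession

    # Spark Connect session; a classic builder.getOrCreate() works the same way.
    spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

    GAR_FORMAT = "org.apache.graphar.datasources.GarDataSource"  # placeholder name

    # Read one vertex chunk directory written in GAR format.
    df = (
        spark.read
        .format(GAR_FORMAT)
        .option("fileType", "parquet")  # placeholder option
        .load("/tmp/graphar/vertex/person/")
    )

    # Write it back through the same datasource.
    (
        df.write
        .format(GAR_FORMAT)
        .option("fileType", "parquet")
        .mode("overwrite")
        .save("/tmp/graphar/out/person/")
    )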