SemyonSinchenko commented on code in PR #448:
URL: https://github.com/apache/incubator-graphar/pull/448#discussion_r1560586502
##########
docs/libraries/pyspark/how-to.md:
##########
@@ -0,0 +1,212 @@
+---
+id: how-to
+title: How to use GraphAr PySpark package
+sidebar_position: 1
+---
+
+
+## GraphAr PySpark
+
+``graphar_pyspark`` is implemented as bindings to GraphAr spark scala
+library. You should have ``graphar-0.1.0-SNAPSHOT.jar`` in your
+Apache Spark JVM classpath. Otherwise you will get an exception. To
+add it spceify ``config("spark.jars", "path-to-graphar-jar")`` when
+you create a SparkSession:
+
+```python
+from pyspark.sql import SparkSession
+
+spark = (
+ SparkSession
+ .builder
+ .master("local[1]")
+ .appName("graphar-local-tests")
+ .config("spark.jars", "../../spark/target/graphar-0.1.0-SNAPSHOT.jar")
+ .config("spark.log.level", "INFO")
+ .getOrCreate()
+)
+```
+
+
+ .. rubric:: GraphAr PySpark initialize
+ :name: graphar-pyspark-initialize
+
+## GraphAr PySpark initialize
+
+PySpark bindings are heavily relying on JVM-calls via ``py4j``. To
+initiate all the neccessary things for it just call
+``graphar_pyspark.initialize()``:
+
+```python
+from graphar_pyspark import initialize
+
+initialize(spark)
+```
+
+## GraphAr objects
+
+Now you can import, create and modify all the classes you can work
Review Comment:
```suggestion
Now you can import, create and modify all the classes you can
```
##########
docs/libraries/pyspark/how-to.md:
##########
@@ -0,0 +1,212 @@
+---
+id: how-to
+title: How to use GraphAr PySpark package
+sidebar_position: 1
+---
+
+
+## GraphAr PySpark
+
+``graphar_pyspark`` is implemented as bindings to GraphAr spark scala
+library. You should have ``graphar-0.1.0-SNAPSHOT.jar`` in your
+Apache Spark JVM classpath. Otherwise you will get an exception. To
+add it spceify ``config("spark.jars", "path-to-graphar-jar")`` when
+you create a SparkSession:
+
+```python
+from pyspark.sql import SparkSession
+
+spark = (
+ SparkSession
+ .builder
+ .master("local[1]")
+ .appName("graphar-local-tests")
+ .config("spark.jars", "../../spark/target/graphar-0.1.0-SNAPSHOT.jar")
+ .config("spark.log.level", "INFO")
+ .getOrCreate()
+)
+```
+
+
+ .. rubric:: GraphAr PySpark initialize
+ :name: graphar-pyspark-initialize
+
+## GraphAr PySpark initialize
+
+PySpark bindings are heavily relying on JVM-calls via ``py4j``. To
+initiate all the neccessary things for it just call
+``graphar_pyspark.initialize()``:
+
+```python
+from graphar_pyspark import initialize
+
+initialize(spark)
+```
+
+## GraphAr objects
+
+Now you can import, create and modify all the classes you can work
+call from [scala API of GraphAr](https://graphar.apache.org/docs/libraries/spark).
+For simplify using of graphar from python constants, like GAR-types,
+supported file-types, etc. are placed in ``graphar_pyspark.enums``.
+
+```python
+from graphar_pyspark.info import Property, PropertyGroup, AdjList, AdjListType, VertexInfo, EdgeInfo, GraphInfo
+from graphar_pyspark.enums import GarType, FileType
+```
+
+Main objects of GraphAr are the following:
+
+- GraphInfo
+- VertexInfo
+- EdgeInfo
+
+You can check [Scala library documentation](https://graphar.apache.org/docs/spark#information-classes)
Review Comment:
Is it possible to use relative paths instead? That would be better than absolute links. Do relative links work on Apache project sites like this one?
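For example, something like `[Scala library documentation](../spark/spark.md#information-classes)` should render the same way, assuming the docs site resolves relative paths and the `information-classes` anchor stays unchanged (I have not verified either).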
##########
docs/libraries/pyspark/how-to.md:
##########
@@ -0,0 +1,212 @@
+---
+id: how-to
+title: How to use GraphAr PySpark package
+sidebar_position: 1
+---
+
+
+## GraphAr PySpark
+
+``graphar_pyspark`` is implemented as bindings to GraphAr spark scala
+library. You should have ``graphar-0.1.0-SNAPSHOT.jar`` in your
+Apache Spark JVM classpath. Otherwise you will get an exception. To
+add it spceify ``config("spark.jars", "path-to-graphar-jar")`` when
+you create a SparkSession:
+
+```python
+from pyspark.sql import SparkSession
+
+spark = (
+ SparkSession
+ .builder
+ .master("local[1]")
+ .appName("graphar-local-tests")
+ .config("spark.jars", "../../spark/target/graphar-0.1.0-SNAPSHOT.jar")
Review Comment:
```suggestion
.config("spark.jars",
"../../spark/graphar/target/graphar-0.1.0-SNAPSHOT.jar")
```
Because of the submodules: the jar is now built under *spark/graphar/target/*, not *spark/target/*.
##########
docs/libraries/pyspark/how-to.md:
##########
@@ -0,0 +1,212 @@
+---
+id: how-to
+title: How to use GraphAr PySpark package
+sidebar_position: 1
+---
+
+
+## GraphAr PySpark
+
+``graphar_pyspark`` is implemented as bindings to GraphAr spark scala
+library. You should have ``graphar-0.1.0-SNAPSHOT.jar`` in your
+Apache Spark JVM classpath. Otherwise you will get an exception. To
+add it spceify ``config("spark.jars", "path-to-graphar-jar")`` when
+you create a SparkSession:
+
+```python
+from pyspark.sql import SparkSession
+
+spark = (
+ SparkSession
+ .builder
+ .master("local[1]")
+ .appName("graphar-local-tests")
+ .config("spark.jars", "../../spark/target/graphar-0.1.0-SNAPSHOT.jar")
+ .config("spark.log.level", "INFO")
+ .getOrCreate()
+)
+```
+
+
+ .. rubric:: GraphAr PySpark initialize
+ :name: graphar-pyspark-initialize
+
+## GraphAr PySpark initialize
+
+PySpark bindings are heavily relying on JVM-calls via ``py4j``. To
+initiate all the neccessary things for it just call
+``graphar_pyspark.initialize()``:
+
+```python
+from graphar_pyspark import initialize
+
+initialize(spark)
+```
+
+## GraphAr objects
+
+Now you can import, create and modify all the classes you can work
+call from [scala API of GraphAr](https://graphar.apache.org/docs/libraries/spark).
+For simplify using of graphar from python constants, like GAR-types,
+supported file-types, etc. are placed in ``graphar_pyspark.enums``.
+
+```python
+from graphar_pyspark.info import Property, PropertyGroup, AdjList, AdjListType, VertexInfo, EdgeInfo, GraphInfo
+from graphar_pyspark.enums import GarType, FileType
+```
+
+Main objects of GraphAr are the following:
+
+- GraphInfo
+- VertexInfo
+- EdgeInfo
+
+You can check [Scala library documentation](https://graphar.apache.org/docs/spark#information-classes)
+for the more detailed information.
+
+
+## Creating objects in graphar_pyspark
+
+GraphAr PySpark package provide two main ways how to initiate
+objects, like ``GraphInfo``:
+
+#. ``from_python(**args)`` when you create an object based on
+ python-arguments
+#. ``from_scala(jvm_ref)`` when you create an object from the
+ corresponded JVM-object (``py4j.java_gateway.JavaObject``)
+
+
+```python
+help(Property.from_python)
+
+Help on method from_python in module graphar_pyspark.info:
+
+from_python(name: 'str', data_type: 'GarType', is_primary: 'bool') -> 'PropertyType' method of builtins.type instance
+ Create an instance of the Class from Python arguments.
+
+ :param name: property name
+ :param data_type: property data type
+ :param is_primary: flag that property is primary
+ :returns: instance of Python Class.
+```
+
+```python
+python_property = Property.from_python(name="my_property", data_type=GarType.INT64, is_primary=False)
+print(type(python_property))
+
+<class 'graphar_pyspark.info.Property'>
+```
+
+You can always get a reference to the corresponding JVM object. For
+example, you want to use it in your own code and need a direct link
Review Comment:
```suggestion
example, if you want to use it in your own code and need a direct link
```
##########
docs/libraries/spark/spark.md:
##########
@@ -0,0 +1,235 @@
+---
+id: spark
+title: Spark Library
+sidebar_position: 3
+---
+
+## Overview
+
+The GraphAr Spark library is provided for generating, loading and transforming GAR files with Apache Spark easy. It consists of several components:
+
+- **Information Classes**: As same with in the C++ library, the information classes are implemented as a part of the Spark library for constructing and accessing the meta information about the graphs, vertices and edges in GraphAr.
+- **IndexGenerator**: The IndexGenerator helps to generate the indices for vertex/edge DataFrames. In most cases, IndexGenerator is first utilized to generate the indices for a DataFrame (e.g., from primary keys), and then this DataFrame can be written into GAR files through the writer.
+- **Writer**: The GraphAr Spark writer provides a set of interfaces that can be used to write Spark DataFrames into GAR files. Every time it takes a DataFrame as the logical table for a type of vertices or edges, assembles the data in specified format (e.g., reorganize the edges in the CSR way) and then dumps it to standard GAR files (CSV, ORC or Parquet files) under the specific directory path.
+- **Reader**: The GraphAr Spark reader provides a set of interfaces that can be used to read GAR files. It reads a collection of vertices or edges at a time and assembles the result into the Spark DataFrame. Similar with the reader in the C++ library, it supports the users to specify the data they need, e.g., reading a single property group instead of all properties.
+
+## Use Cases
+
+The GraphAr Spark library can be used in a range of scenarios:
+
+- Taking GAR as a data source to execute SQL queries or do graph processing (e.g., using GraphX).
+- Transforming data between GAR and other data sources (e.g., Hive, Neo4j, NebulaGraph, ...).
+- Transforming GAR data between different file types (e.g., from ORC to Parquet).
+- Transforming GAR data between different adjList types (e.g., from COO to CSR).
+- Modifying existing GAR data (e.g., adding new vertices/edges).
+
+For more information on its usage, please refer to the `Examples <examples/spark.html>`_.
+
+
+## Get GraphAr Spark Library
+
+### Building from source
+
+Make the graphar-spark-library directory as the current working directory:
+
+```bash
+cd GraphAr/spark/
+```
+
+Compile package:
+
+```bash
+mvn clean package -DskipTests
+```
+
+GraphAr supports two Apache Spark versions for now and uses Maven Profiles to work with it. The command above built GraphAr with Spark 3.2.2 by default. To built GraphAr with Spark 3.3.4 use the following command:
+
+```bash
+mvn clean package -DskipTests -P datasources-33
+```
+
+After compilation, a similar file *graphar-x.x.x-SNAPSHOT-shaded.jar* is generated in the directory *spark/graphar/target/*.
+
+Please refer to the `building steps <https://github.com/apache/incubator-graphar/tree/main/spark>`_ for more details.
Review Comment:
```suggestion
Please refer to the [building steps](https://github.com/apache/incubator-graphar/tree/main/spark) for more details.
```
##########
docs/libraries/pyspark/how-to.md:
##########
@@ -0,0 +1,212 @@
+---
+id: how-to
+title: How to use GraphAr PySpark package
+sidebar_position: 1
+---
+
+
+## GraphAr PySpark
+
+``graphar_pyspark`` is implemented as bindings to GraphAr spark scala
+library. You should have ``graphar-0.1.0-SNAPSHOT.jar`` in your
+Apache Spark JVM classpath. Otherwise you will get an exception. To
+add it spceify ``config("spark.jars", "path-to-graphar-jar")`` when
+you create a SparkSession:
+
+```python
+from pyspark.sql import SparkSession
+
+spark = (
+ SparkSession
+ .builder
+ .master("local[1]")
+ .appName("graphar-local-tests")
+ .config("spark.jars", "../../spark/target/graphar-0.1.0-SNAPSHOT.jar")
+ .config("spark.log.level", "INFO")
+ .getOrCreate()
+)
+```
+
+
+ .. rubric:: GraphAr PySpark initialize
+ :name: graphar-pyspark-initialize
+
+## GraphAr PySpark initialize
+
+PySpark bindings are heavily relying on JVM-calls via ``py4j``. To
+initiate all the neccessary things for it just call
+``graphar_pyspark.initialize()``:
+
+```python
+from graphar_pyspark import initialize
+
+initialize(spark)
+```
+
+## GraphAr objects
+
+Now you can import, create and modify all the classes you can work
+call from [scala API of GraphAr](https://graphar.apache.org/docs/libraries/spark).
+For simplify using of graphar from python constants, like GAR-types,
+supported file-types, etc. are placed in ``graphar_pyspark.enums``.
+
+```python
+from graphar_pyspark.info import Property, PropertyGroup, AdjList, AdjListType, VertexInfo, EdgeInfo, GraphInfo
+from graphar_pyspark.enums import GarType, FileType
+```
+
+Main objects of GraphAr are the following:
+
+- GraphInfo
+- VertexInfo
+- EdgeInfo
+
+You can check [Scala library documentation](https://graphar.apache.org/docs/spark#information-classes)
+for the more detailed information.
+
+
+## Creating objects in graphar_pyspark
+
+GraphAr PySpark package provide two main ways how to initiate
+objects, like ``GraphInfo``:
+
+#. ``from_python(**args)`` when you create an object based on
+ python-arguments
+#. ``from_scala(jvm_ref)`` when you create an object from the
+ corresponded JVM-object (``py4j.java_gateway.JavaObject``)
+
+
+```python
+help(Property.from_python)
+
+Help on method from_python in module graphar_pyspark.info:
+
+from_python(name: 'str', data_type: 'GarType', is_primary: 'bool') -> 'PropertyType' method of builtins.type instance
+ Create an instance of the Class from Python arguments.
+
+ :param name: property name
+ :param data_type: property data type
+ :param is_primary: flag that property is primary
+ :returns: instance of Python Class.
+```
+
+```python
+python_property = Property.from_python(name="my_property", data_type=GarType.INT64, is_primary=False)
+print(type(python_property))
+
+<class 'graphar_pyspark.info.Property'>
+```
+
+You can always get a reference to the corresponding JVM object. For
+example, you want to use it in your own code and need a direct link
+to the underlaying instance of Scala Class, you can just call
+``to_scala()`` method:
+
+```python
+scala_obj = python_property.to_scala()
+print(type(scala_obj))
+
+<class 'py4j.java_gateway.JavaObject'>
+```
+
+As we already mentioned, you can initialize an instance of the Python
+class from the JVM object:
+
+```python
+help(Property.from_scala)
+
+Help on method from_scala in module graphar_pyspark.info:
+
+ from_scala(jvm_obj: 'JavaObject') -> 'PropertyType' method of builtins.type instance
+ Create an instance of the Class from the corresponding JVM object.
+
+ :param jvm_obj: scala object in JVM.
+ :returns: instance of Python Class.
+```
+
+```python
+python_property = Property.from_scala(scala_obj)
+```
+
+Each public property and method of the Scala API is provided in
+python, but in a pythonic-naming convention. For example, in Scala,
+``Property`` has the following fields:
+
+- name
+- data_type
+- is_primary
+
+For each of such a field in Scala API there is a getter and setter
+methods. You can call them from the Python too:
+
+```python
+python_property.get_name()
+
+'my_property'
+```
+
+You can also modify fields, but be careful: when you modify field of
+instance of the Python class, you modify the underlaying Scala Object
+in the same moment!
Review Comment:
```suggestion
at the same moment!
```
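Maybe it is also worth adding a tiny example right after this sentence to make the shared state explicit. A rough sketch (assuming a `set_name()` setter exists alongside `get_name()`, which I have not checked):

```python
# Hypothetical sketch, continuing from the snippet above where python_property was created.
# Assumption: set_name() mirrors get_name(); mutating the Python wrapper
# mutates the underlying JVM object as well.
python_property.set_name("renamed")

# A new wrapper built from the same JVM reference sees the change.
same_property = Property.from_scala(python_property.to_scala())
print(same_property.get_name())  # 'renamed'
```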
##########
docs/libraries/spark/spark.md:
##########
@@ -0,0 +1,235 @@
+---
+id: spark
+title: Spark Library
+sidebar_position: 3
+---
+
+## Overview
+
+The GraphAr Spark library is provided for generating, loading and transforming GAR files with Apache Spark easy. It consists of several components:
+
+- **Information Classes**: As same with in the C++ library, the information classes are implemented as a part of the Spark library for constructing and accessing the meta information about the graphs, vertices and edges in GraphAr.
+- **IndexGenerator**: The IndexGenerator helps to generate the indices for vertex/edge DataFrames. In most cases, IndexGenerator is first utilized to generate the indices for a DataFrame (e.g., from primary keys), and then this DataFrame can be written into GAR files through the writer.
+- **Writer**: The GraphAr Spark writer provides a set of interfaces that can be used to write Spark DataFrames into GAR files. Every time it takes a DataFrame as the logical table for a type of vertices or edges, assembles the data in specified format (e.g., reorganize the edges in the CSR way) and then dumps it to standard GAR files (CSV, ORC or Parquet files) under the specific directory path.
+- **Reader**: The GraphAr Spark reader provides a set of interfaces that can be used to read GAR files. It reads a collection of vertices or edges at a time and assembles the result into the Spark DataFrame. Similar with the reader in the C++ library, it supports the users to specify the data they need, e.g., reading a single property group instead of all properties.
+
+## Use Cases
+
+The GraphAr Spark library can be used in a range of scenarios:
+
+- Taking GAR as a data source to execute SQL queries or do graph processing (e.g., using GraphX).
+- Transforming data between GAR and other data sources (e.g., Hive, Neo4j, NebulaGraph, ...).
+- Transforming GAR data between different file types (e.g., from ORC to Parquet).
+- Transforming GAR data between different adjList types (e.g., from COO to CSR).
+- Modifying existing GAR data (e.g., adding new vertices/edges).
+
+For more information on its usage, please refer to the `Examples <examples/spark.html>`_.
Review Comment:
```suggestion
For more information on its usage, please refer to the [Examples](examples/spark.md).
```
##########
docs/libraries/pyspark/pyspark.md:
##########
@@ -0,0 +1,107 @@
+---
+id: pyspark
+title: PySpark Library
+sidebar_position: 4
+---
+
+
+> **note:** The current policy of GraphAr project is that for Apache Spark
+> the main API is Scala Spark API. PySpark API follows scala Spark API.
+> Please refer to [GraphAr Spark library](../spark/spark.md)
+> for more detailed information about how to use GraphAr with Apache
+> Spark.
+
+## Overview
+
+The GraphAr PySpark library is provided for generating, loading and
+transforming GAR files with PySpark.
+
+- **Information Classes**: As same with in the C++ library, the
+ information classes are implemented as a part of the PySpark library
+ for constructing and accessing the meta information about the graphs,
+ vertices and edges in GraphAr.
+- **IndexGenerator**: The IndexGenerator helps to generate the indices
+ for vertex/edge DataFrames. In most cases, IndexGenerator is first
+ utilized to generate the indices for a DataFrame (e.g., from primary
+ keys), and then this DataFrame can be written into GAR files through
+ the writer.
+- **Writer**: The GraphAr PySpark writer provides a set of interfaces
+ that can be used to write Spark DataFrames into GAR files. Every time
+ it takes a DataFrame as the logical table for a type of vertices or
+ edges, assembles the data in specified format (e.g., reorganize the
+ edges in the CSR way) and then dumps it to standard GAR files (CSV,
+ ORC or Parquet files) under the specific directory path.
+- **Reader**: The GraphAr PySpark reader provides a set of interfaces
+ that can be used to read GAR files. It reads a collection of vertices
+ or edges at a time and assembles the result into the Spark DataFrame.
+ Similar with the reader in the C++ library, it supports the users to
+ specify the data they need, e.g., reading a single property group
+ instead of all properties.
+
+## Use Cases
+
+The GraphAr Spark library can be used in a range of scenarios:
+
+- Taking GAR as a data source to execute SQL queries or do graph
+ processing (e.g., using GraphX).
+- Transforming data between GAR and other data sources (e.g., Hive,
+ Neo4j, NebulaGraph, …).
+- Transforming GAR data between different file types (e.g., from ORC to
+ Parquet).
+- Transforming GAR data between different adjList types (e.g., from COO
+ to CSR).
+- Modifying existing GAR data (e.g., adding new vertices/edges).
+
+## Get GraphAr Spark Library
+
+### Building from source
+
+GraphAr PySpark uses poetry as a build system. Please refer to
+[Poetry documentation](https://python-poetry.org/docs/#installation)
+to find the manual how to install this tool. Currently GraphAr PySpark
+is build with Python 3.9 and PySpark 3.2
+
+
+Make the graphar-pyspark-library directory as the current working
+directory:
+
+```bash
+cd GraphAr/pyspark
+```
+
+Build package:
+
+
+```bash
+poetry build
+```
+
+After compilation, a similar file *graphar_pyspark-0.0.1.tar.gz* is
+generated in the directory *pyspark/dist/*.
+
+### Get from PyPI
+
+You cannot install graphar-pyspark from PyPi for now.
+
+
+## How to Use
+
+### Initialization
+
+GraphAr PySpark is not a standalone library but bindings to GraphAr
+Scala. You need to have *graphar-spark-x.x.x.jar* in your *spark-jars*.
+Please refer to `GraphAr scala documentation <../spark/index>`\_ to get
Review Comment:
It looks like RST syntax, not MD.
```suggestion
Please refer to [GraphAr scala documentation](../spark/spark.md) to get
```
##########
docs/libraries/pyspark/how-to.md:
##########
@@ -0,0 +1,212 @@
+---
+id: how-to
+title: How to use GraphAr PySpark package
+sidebar_position: 1
+---
+
+
+## GraphAr PySpark
+
+``graphar_pyspark`` is implemented as bindings to GraphAr spark scala
+library. You should have ``graphar-0.1.0-SNAPSHOT.jar`` in your
+Apache Spark JVM classpath. Otherwise you will get an exception. To
+add it spceify ``config("spark.jars", "path-to-graphar-jar")`` when
Review Comment:
```suggestion
add it specify ``config("spark.jars", "path-to-graphar-jar")`` when
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]