This is an automated email from the ASF dual-hosted git repository.
weibin pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/incubator-graphar.git
The following commit(s) were added to refs/heads/main by this push:
new 8bb741e3 feat(CI):enable markdownlint and typos in docs.yml (#508)
8bb741e3 is described below
commit 8bb741e3897019b5ed654923dc066053bbc93a5c
Author: teapot1de <[email protected]>
AuthorDate: Mon Aug 12 16:40:43 2024 +0800
feat(CI):enable markdownlint and typos in docs.yml (#508)
---
.github/workflows/docs.yml | 14 +++++++++--
docs/.markdownlint.yaml | 8 ++++++
docs/index.md | 5 +++-
docs/libraries/cpp/examples/graphscope.md | 2 +-
docs/libraries/cpp/examples/out-of-core.md | 1 -
docs/libraries/cpp/getting-started.md | 4 +--
docs/libraries/java/how_to_develop_java.md | 10 ++++----
docs/libraries/java/java.md | 39 ++++++++++++++---------------
docs/libraries/pyspark/how-to.md | 27 +++++++++-----------
docs/libraries/pyspark/pyspark.md | 4 ---
docs/libraries/spark/examples.md | 5 +---
docs/libraries/spark/spark.md | 13 +++-------
docs/overview/concepts.md | 12 ++++-----
docs/overview/motivation.md | 8 +++---
docs/overview/overview.md | 1 +
docs/specification/format.md | 21 ++++++++--------
docs/specification/implementation-status.md | 16 +++---------
licenserc.toml | 1 +
18 files changed, 94 insertions(+), 97 deletions(-)
diff --git a/.github/workflows/docs.yml b/.github/workflows/docs.yml
index 60b7b97f..44639d93 100644
--- a/.github/workflows/docs.yml
+++ b/.github/workflows/docs.yml
@@ -54,6 +54,18 @@ jobs:
with:
node-version: '18'
+ - name: Run markdownlint
+ run: |
+ npm install -g markdownlint-cli
+ markdownlint 'docs/**/*.md' --fix --config 'docs/.markdownlint.yaml'
+
+ - name: Run typos
+ run: |
+ curl -sSL https://github.com/crate-ci/typos/releases/download/v1.23.6/typos-v1.23.6-x86_64-unknown-linux-musl.tar.gz -o typos.tar.gz
+ tar -xzf typos.tar.gz
+ chmod +x typos
+ ./typos docs
+
- name: Checkout Website
uses: actions/checkout@v4
with:
@@ -74,5 +86,3 @@ jobs:
- name: Build
working-directory: website
run: pnpm build
-
-# TODO: enable markdownlint & typos
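For contributors who want to reproduce these checks locally before pushing, the commands below mirror the two new workflow steps (a minimal sketch, assuming npm and curl are available on an x86_64 Linux host, since the workflow pins the musl build of typos v1.23.6):

```bash
# Install and run markdownlint with the repository's config, auto-fixing where possible
npm install -g markdownlint-cli
markdownlint 'docs/**/*.md' --fix --config 'docs/.markdownlint.yaml'

# Fetch the same pinned typos release the workflow uses and scan the docs tree
curl -sSL https://github.com/crate-ci/typos/releases/download/v1.23.6/typos-v1.23.6-x86_64-unknown-linux-musl.tar.gz -o typos.tar.gz
tar -xzf typos.tar.gz
chmod +x typos
./typos docs
```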
diff --git a/docs/.markdownlint.yaml b/docs/.markdownlint.yaml
new file mode 100644
index 00000000..c8432bf9
--- /dev/null
+++ b/docs/.markdownlint.yaml
@@ -0,0 +1,8 @@
+# Ignore MD013 because the document requires long lines to keep code examples intact
+MD013: false
+
+# Ignore MD033 because inline HTML is necessary in some cases, such as specific formatting needs
+MD033: false
+
+# Ignore MD025 because the document structure requires multiple top-level headings to reflect different chapters or sections
+MD025: false
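To see the effect of these overrides, one can lint a single document with and without the config; with it, MD013 (line length), MD033 (inline HTML), and MD025 (multiple top-level headings) findings are suppressed. A quick local check, assuming markdownlint-cli is installed:

```bash
# Default rules: long lines in docs/index.md would be reported as MD013 violations
markdownlint docs/index.md

# With the new config: MD013, MD033 and MD025 are disabled, so only other rules fire
markdownlint --config docs/.markdownlint.yaml docs/index.md
```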
diff --git a/docs/index.md b/docs/index.md
index e033a6f4..37150953 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -8,10 +8,13 @@ sidebar_position: 0
Welcome to the documentation for Apache GraphAr. Here, you can find
information about the GraphAr File Format, including specification and
libraries.
### [Overview](/docs/overview)
+
Overview of the Apache GraphAr project.
### [Specification](/docs/category/specification)
+
Documentation about the Apache GraphAr file format.
### [Libraries](/docs/category/libraries)
-Documentation about the libraries of Apache GraphAr.
+
+Documentation about the libraries of Apache GraphAr.
diff --git a/docs/libraries/cpp/examples/graphscope.md b/docs/libraries/cpp/examples/graphscope.md
index 8afcc96d..784f628e 100644
--- a/docs/libraries/cpp/examples/graphscope.md
+++ b/docs/libraries/cpp/examples/graphscope.md
@@ -30,7 +30,7 @@ The time performance of *ArrowFragmentBuilder* and *ArrowFragmentWriter*
in GraphScope is heavily dependent on the partitioning of the graph into
GraphAr format files, that is, the *vertex chunk size* and *edge chunk size*, which
are specified in the vertex information file and in the edge information
-file, respectively.
+file, respectively.
Generally speaking, fewer chunks are created if the file size is large.
On small graphs, this can be disadvantageous as it reduces the degree of
diff --git a/docs/libraries/cpp/examples/out-of-core.md b/docs/libraries/cpp/examples/out-of-core.md
index 08cf509d..7b97426b 100644
--- a/docs/libraries/cpp/examples/out-of-core.md
+++ b/docs/libraries/cpp/examples/out-of-core.md
@@ -89,7 +89,6 @@ neighbors. Please refer to
[cc_push_example.cc](https://github.com/apache/incubator-graphar/blob/main/cpp/examples/cc_push_example.cc)
for the complete code.
-
:::tip
In this example, two kinds of edges are used. The
diff --git a/docs/libraries/cpp/getting-started.md b/docs/libraries/cpp/getting-started.md
index 7265026d..cf93f75b 100644
--- a/docs/libraries/cpp/getting-started.md
+++ b/docs/libraries/cpp/getting-started.md
@@ -202,7 +202,7 @@ the above graph and outputs the end vertices for each edge.
```cpp
graph_info = ...
-auto expect = graphar::EdgesCollection::Make(graph_info, "person", "konws", "person", graphar::AdjListType::ordered_by_source);
+auto expect = graphar::EdgesCollection::Make(graph_info, "person", "knows", "person", graphar::AdjListType::ordered_by_source);
auto edges = expect.value();
for (auto it = edges->begin(); it != edges->end(); ++it) {
@@ -287,4 +287,4 @@ with URI schema, e.g., "s3://bucket-name/path/to/data" or "s3://\[access-key:sec
[Code example](https://github.com/apache/incubator-graphar/blob/main/cpp/test/test_info.cc#L777-L792)
demonstrates how to read data from S3.
-Note that once you use cloud storage, you need to call `graphar::InitalizeS3` to initialize S3 APIs before starting the work and call`graphar::FinalizeS3()` to shut down the APIs after the work finish.
+Note that once you use cloud storage, you need to call `graphar::InitializeS3` to initialize S3 APIs before starting the work and call`graphar::FinalizeS3()` to shut down the APIs after the work finish.
diff --git a/docs/libraries/java/how_to_develop_java.md b/docs/libraries/java/how_to_develop_java.md
index 4a4aedcd..e279c91e 100644
--- a/docs/libraries/java/how_to_develop_java.md
+++ b/docs/libraries/java/how_to_develop_java.md
@@ -10,7 +10,7 @@ GraphAr Java library based on GraphAr C++ library and an efficient FFI
for Java and C++ called
[FastFFI](https://github.com/alibaba/fastFFI).
-### Source Code Level
+### Source Code Level
- Interface
- Class
@@ -80,8 +80,8 @@ Please refer to
## How To Test
```bash
-$ export GAR_TEST_DATA=$PWD/../../testing/
-$ mvn clean test
+export GAR_TEST_DATA=$PWD/../../testing/
+mvn clean test
```
This will build GraphAr C++ library internally for Java. If you already
@@ -96,11 +96,11 @@ To ensure CI for checking code style will pass, please ensure check
below is success:
```bash
-$ mvn spotless:check
+mvn spotless:check
```
If there are violations, running command below to automatically format:
```bash
-$ mvn spotless:apply
+mvn spotless:apply
```
diff --git a/docs/libraries/java/java.md b/docs/libraries/java/java.md
index 352a36a6..fdc90718 100644
--- a/docs/libraries/java/java.md
+++ b/docs/libraries/java/java.md
@@ -11,19 +11,19 @@ Based on an efficient FFI for Java and C++ called
library allows users to write Java for generating, loading and
transforming GraphAr format files. It consists of several components:
-- **Information Classes**: As same with in the C++ library, the
+- **Information Classes**: As same with in the C++ library, the
information classes are implemented to construct and access the meta
information about the **graphs**, **vertices** and **edges** in
GraphAr.
-- **Writers**: The GraphAr Java writer provides a set of interfaces
+- **Writers**: The GraphAr Java writer provides a set of interfaces
that can be used to write Apache Arrow VectorSchemaRoot into GraphAr format
files. Every time it takes a VectorSchemaRoot as the logical table
for a type of vertices or edges, then convert it to ArrowTable, and
then dumps it to standard GraphAr format files (CSV, ORC or Parquet files) under
the specific directory path.
-- **Readers**: The GraphAr Java reader provides a set of interfaces
+- **Readers**: The GraphAr Java reader provides a set of interfaces
that can be used to read GraphAr format files. It reads a collection of vertices
or edges at a time and assembles the result into the ArrowTable.
Similar with the reader in the C++ library, it supports the users to
@@ -41,41 +41,41 @@ Firstly, install llvm-11. `LLVM11_HOME` should point to the home of
LLVM 11. In Ubuntu, it is at `/usr/lib/llvm-11`. Basically, the build
procedure the following binary:
-- `$LLVM11_HOME/bin/clang++`
-- `$LLVM11_HOME/bin/ld.lld`
-- `$LLVM11_HOME/lib/cmake/llvm`
+- `$LLVM11_HOME/bin/clang++`
+- `$LLVM11_HOME/bin/ld.lld`
+- `$LLVM11_HOME/lib/cmake/llvm`
Tips:
-- Use Ubuntu as example:
+- Use Ubuntu as example:
```bash
-$ sudo apt-get install llvm-11 clang-11 lld-11 libclang-11-dev libz-dev -y
-$ export LLVM11_HOME=/usr/lib/llvm-11
+sudo apt-get install llvm-11 clang-11 lld-11 libclang-11-dev libz-dev -y
+export LLVM11_HOME=/usr/lib/llvm-11
```
-- Or compile from source with this [script](https://github.com/alibaba/fastFFI/blob/main/docker/install-llvm11.sh):
+- Or compile from source with this [script](https://github.com/alibaba/fastFFI/blob/main/docker/install-llvm11.sh):
```bash
-$ export LLVM11_HOME=/usr/lib/llvm-11
-$ export LLVM_VAR=11.0.0
-$ sudo ./install-llvm11.sh
+export LLVM11_HOME=/usr/lib/llvm-11
+export LLVM_VAR=11.0.0
+sudo ./install-llvm11.sh
```
Make the graphar-java-library directory as the current working
directory:
```bash
-$ git clone https://github.com/apache/incubator-graphar.git
-$ cd incubator-graphar
-$ git submodule update --init
-$ cd maven-projects/java
+git clone https://github.com/apache/incubator-graphar.git
+cd incubator-graphar
+git submodule update --init
+cd maven-projects/java
```
Compile package:
```bash
-$ mvn clean install -DskipTests
+mvn clean install -DskipTests
```
This will build GraphAr C++ library internally for Java. If you already
installed GraphAr C++ library in your system,
@@ -83,7 +83,6 @@ you can append this option to skip: `-DbuildGarCPP=OFF`.
Then set GraphAr as a dependency in maven project:
-
```xml
<dependencies>
<dependency>
@@ -212,4 +211,4 @@ StdPair<Long, Long> range = reader.getRange().value();
See [test for readers](https://github.com/apache/incubator-graphar/blob/main/maven-projects/java/src/test/java/org/apache/graphar/readers)
-for the complete example.
\ No newline at end of file
+for the complete example.
diff --git a/docs/libraries/pyspark/how-to.md b/docs/libraries/pyspark/how-to.md
index aebeb325..4e7e3a64 100644
--- a/docs/libraries/pyspark/how-to.md
+++ b/docs/libraries/pyspark/how-to.md
@@ -30,7 +30,7 @@ spark = (
## GraphAr PySpark initialize
PySpark bindings are heavily relying on JVM-calls via ``py4j``. To
-initiate all the neccessary things for it just call
+initiate all the necessary things for it just call
``graphar_pyspark.initialize()``:
```python
@@ -53,15 +53,14 @@ from graphar_pyspark.enums import GarType, FileType
Main objects of GraphAr are the following:
-- GraphInfo
-- VertexInfo
-- EdgeInfo
+- GraphInfo
+- VertexInfo
+- EdgeInfo
You can check [Scala library documentation](../spark/spark.md)
for the more detailed information.
-
-## Creating objects in graphar_pyspark
+## Creating objects in graphar_pyspark
GraphAr PySpark package provide two main ways how to initiate
objects, like ``GraphInfo``:
@@ -71,7 +70,6 @@ objects, like ``GraphInfo``:
- ``from_scala(jvm_ref)`` when you create an object from the
corresponded JVM-object (``py4j.java_gateway.JavaObject``)
-
```python
help(Property.from_python)
@@ -95,7 +93,7 @@ print(type(python_property))
You can always get a reference to the corresponding JVM object. For
example, if you want to use it in your own code and need a direct link
-to the underlaying instance of Scala Class, you can just call
+to the underlying instance of Scala Class, you can just call
``to_scala()`` method:
```python
@@ -128,9 +126,9 @@ Each public property and method of the Scala API is provided in
python, but in a pythonic-naming convention. For example, in Scala,
``Property`` has the following fields:
-- name
-- data_type
-- is_primary
+- name
+- data_type
+- is_primary
For each of such a field in Scala API there is a getter and setter
methods. You can call them from the Python too:
@@ -142,7 +140,7 @@ python_property.get_name()
```
You can also modify fields, but be careful: when you modify field of
-instance of the Python class, you modify the underlaying Scala Object
+instance of the Python class, you modify the underlying Scala Object
at the same moment!
```python
@@ -168,7 +166,6 @@ modern_graph = GraphInfo.load_graph_info("../../testing/modern_graph/modern_grap
After that you can work with such an objects like regular python
objects:
-
```python
print(modern_graph_v_person.dump())
@@ -195,7 +192,7 @@ label: person
version: gar/v1
"
```
-
+
```python
print(modern_graph_v_person.contain_property("id") is True)
print(modern_graph_v_person.contain_property("bad_id?") is False)
@@ -203,6 +200,6 @@ print(modern_graph_v_person.contain_property("bad_id?") is False)
True
True
```
-
+
Please, refer to Scala API and examples of GraphAr Spark Scala
library to see detailed and business-case oriented examples!
diff --git a/docs/libraries/pyspark/pyspark.md b/docs/libraries/pyspark/pyspark.md
index 502eac39..cf0f8ddc 100644
--- a/docs/libraries/pyspark/pyspark.md
+++ b/docs/libraries/pyspark/pyspark.md
@@ -65,7 +65,6 @@ GraphAr PySpark uses poetry as a build system. Please refer to
to find the manual how to install this tool. Currently GraphAr PySpark
is build with Python 3.9 and PySpark 3.2
-
Make the graphar-pyspark-library directory as the current working
directory:
@@ -75,7 +74,6 @@ cd incubator-graphar/pyspark
Build package:
-
```bash
poetry build
```
@@ -87,7 +85,6 @@ generated in the directory *pyspark/dist/*.
You cannot install graphar-pyspark from PyPi for now.
-
## How to Use
### Initialization
@@ -97,7 +94,6 @@ Scala. You need to have *spark-x.x.x.jar* in your *spark-jars*.
Please refer to [GraphAr scala documentation](../spark/spark.md) to get
this JAR.
-
```python
// create a SparkSession from pyspark.sql import SparkSession
diff --git a/docs/libraries/spark/examples.md b/docs/libraries/spark/examples.md
index 144b5ca7..bd0d5d06 100644
--- a/docs/libraries/spark/examples.md
+++ b/docs/libraries/spark/examples.md
@@ -11,7 +11,6 @@ sidebar_position: 1
Examples of this co-working integration have been provided as showcases.
-
### Examples
### Transform GraphAr format files
@@ -24,7 +23,6 @@ the original data is first loaded into a Spark DataFrame using the GraphAr Spark
Then, the DataFrame is written into generated GraphAr format files through a GraphAr Spark Writer,
following the meta data defined in a new information file.
-
### Compute with GraphX
Another important use case of GraphAr is to use it as a data source for graph
@@ -33,7 +31,6 @@ a GraphX graph from reading GraphAr format files and executing a connected-compo
Also, executing queries with Spark SQL and running other graph analytic algorithms
can be implemented in a similar fashion.
-
### Import/Export graphs of Neo4j
[Neo4j](https://neo4j.com/product/neo4j-graph-database) graph database provides
@@ -210,4 +207,4 @@ See [GraphAr2Neo4j.scala][graphar2neo4j] for the complete example.
[transformer-example]: https://github.com/apache/incubator-graphar/blob/main/maven-projects/spark/graphar/src/test/scala/org/apache/graphar/TransformExample.scala
[compute-example]: https://github.com/apache/incubator-graphar/blob/main/maven-projects/spark/graphar/src/test/scala/org/apache/graphar/ComputeExample.scala
[neo4j2graphar]: https://github.com/apache/incubator-graphar/blob/main/maven-projects/spark/graphar/src/main/scala/org/apache/graphar/example/Neo4j2GraphAr.scala
-[graphar2neo4j]: https://github.com/apache/incubator-graphar/blob/main/maven-projects/spark/graphar/src/main/scala/org/apache/graphar/example/GraphAr2Neo4j.scala
\ No newline at end of file
+[graphar2neo4j]: https://github.com/apache/incubator-graphar/blob/main/maven-projects/spark/graphar/src/main/scala/org/apache/graphar/example/GraphAr2Neo4j.scala
diff --git a/docs/libraries/spark/spark.md b/docs/libraries/spark/spark.md
index eaf79b8b..76a5993b 100644
--- a/docs/libraries/spark/spark.md
+++ b/docs/libraries/spark/spark.md
@@ -25,7 +25,6 @@ The GraphAr Spark library can be used in a range of scenarios:
For more information on its usage, please refer to the [Examples](examples.md).
-
## Get GraphAr Spark Library
### Building from source
@@ -52,7 +51,6 @@ After compilation, a similar file *graphar-x.x.x-SNAPSHOT-shaded.jar* is generat
Please refer to the [building steps](https://github.com/apache/incubator-graphar/tree/main/spark) for more details.
-
## How to Use
### Information classes
@@ -75,7 +73,6 @@ val version = graph_info.getVersion
See [TestGraphInfo.scala][test-graph-info] for the complete example.
-
### IndexGenerator
The GraphAr file format assigns each vertex with a unique index inside the vertex type (which called internal vertex id) starting from 0 and increasing continuously for each type of vertex (i.e., with the same vertex label). However, the vertex/edge tables in Spark often lack this information, requiring special attention. For example, an edge table typically uses the primary key (e.g., "id", which is a string) to identify its source and destination vertices.
@@ -106,7 +103,6 @@ val edge_df_src_dst_index = IndexGenerator.generateDstIndexForEdgesFromMapping(e
See [TestIndexGenerator.scala][test-index-generator] for the complete example.
-
### Writer
The GraphAr Spark writer provides the necessary Spark interfaces to write DataFrames into GraphAr formatted files in a batch-import fashion. With the VertexWriter, users can specify a particular property group to be written into its corresponding chunks, or choose to write all property groups. For edge chunks, besides the meta data (edge info), the adjList type should also be specified. The adjList/properties can be written alone, or alternatively, all adjList, properties, and the offset [...]
@@ -145,7 +141,6 @@ writer.writeEdges()
See [TestWriter.scala][test-writer] for the complete example.
-
### Reader
The GraphAr Spark reader provides an extensive set of interfaces to read GraphAr format files. It reads a collection of vertices or edges at a time and assembles the result into the Spark DataFrame. Similar with the reader in C++ library, it supports the users to specify the data they need, e.g., a single property group.
@@ -181,7 +176,6 @@ val edge_df = reader.readEdges()
See [TestReader.scala][test-reader] for the complete example.
-
### Graph-level APIs
To improve the usability of the GraphAr Spark library, a set of APIs are provided to allow users to easily perform operations such as reading, writing, and transforming data at the graph level. These APIs are fairly easy to use, while the previous methods of using reader, writer and information classes are more flexibly and can be highly customized.
@@ -210,8 +204,9 @@ The Graph Transformer can be used for various purposes, including transforming G
:::note
There are certain limitations while using the Graph Transformer:
- - The vertices (or edges) of the source and destination graphs are aligned by labels, meaning each vertex/edge label included in the destination graph must have an equivalent in the source graph, in order for the related chunks to be loaded as the data source.
- - For each group of vertices/edges (i.e., each single label), each property included in the destination graph (defined in the relevant VertexInfo/EdgeInfo) must also be present in the source graph.
+
+- The vertices (or edges) of the source and destination graphs are aligned by labels, meaning each vertex/edge label included in the destination graph must have an equivalent in the source graph, in order for the related chunks to be loaded as the data source.
+- For each group of vertices/edges (i.e., each single label), each property included in the destination graph (defined in the relevant VertexInfo/EdgeInfo) must also be present in the source graph.
In addition, users can use the GraphAr Spark Reader/Writer to conduct data transformation more flexibly at the vertex/edge table level, as opposed to the graph level. This allows for a more granular approach to transforming data, as [TransformExample.scala][transform-example] shows.
@@ -241,4 +236,4 @@ The Spark library for GraphAr supports reading and writing data from/to cloud st
[compute-example]: https://github.com/apache/incubator-graphar/blob/main/maven-projects/spark/graphar/src/test/scala/org/apache/graphar/ComputeExample.scala
[transform-example]: https://github.com/apache/incubator-graphar/blob/main/maven-projects/spark/graphar/src/test/scala/org/apache/graphar/TransformExample.scala
[neo4j2graphar]: https://github.com/apache/incubator-graphar/blob/main/maven-projects/spark/graphar/src/main/scala/org/apache/graphar/example/Neo4j2GraphAr.scala
-[graphar2neo4j]: https://github.com/apache/incubator-graphar/blob/main/maven-projects/spark/graphar/src/main/scala/org/apache/graphar/example/GraphAr2Neo4j.scala
\ No newline at end of file
+[graphar2neo4j]: https://github.com/apache/incubator-graphar/blob/main/maven-projects/spark/graphar/src/main/scala/org/apache/graphar/example/GraphAr2Neo4j.scala
diff --git a/docs/overview/concepts.md b/docs/overview/concepts.md
index 2039e8dc..7a080cb6 100644
--- a/docs/overview/concepts.md
+++ b/docs/overview/concepts.md
@@ -11,12 +11,12 @@ Glossary of relevant concepts and terms.
group is the unit of storage and is stored in a separate directory.
- **Adjacency List**: The storage method to store the edges of certain vertex type. Which include:
- - *ordered by source vertex id*: the edges are ordered and aligned by the source vertex
- - *ordered by destination vertex id*: the edges are ordered and aligned by the destination vertex
- - *unordered by source vertex id*: the edges are unordered but aligned by the source vertex
- - *unordered by destination vertex id*: the edges are unordered but aligned by the destination vertex
+ - *ordered by source vertex id*: the edges are ordered and aligned by the source vertex
+ - *ordered by destination vertex id*: the edges are ordered and aligned by the destination vertex
+ - *unordered by source vertex id*: the edges are unordered but aligned by the source vertex
+ - *unordered by destination vertex id*: the edges are unordered but aligned by the destination vertex
-- **Compressed Sparse Row (CSR)**: The storage layout the edges of certain vertex type. Corresponding to the
+- **Compressed Sparse Row (CSR)**: The storage layout the edges of certain vertex type. Corresponding to the
ordered by source vertex id adjacency list, the edges are stored in a single array and the offsets of the
edges of each vertex are stored in a separate array.
@@ -29,7 +29,7 @@ Glossary of relevant concepts and terms.
no offsets are stored.
- **Vertex Chunk**: The storage unit of vertex. Each vertex chunk contains a fixed number of vertices and is stored
- in a separate file.
+ in a separate file.
- **Edge Chunk**: The storage unit of edge. Each edge chunk contains a fixed number of edges and is stored in a separate file.
diff --git a/docs/overview/motivation.md b/docs/overview/motivation.md
index a262c87b..5551aa89 100644
--- a/docs/overview/motivation.md
+++ b/docs/overview/motivation.md
@@ -4,11 +4,11 @@ title: Motivation
sidebar_position: 2
---
-Numerous graph systems,
-such as Neo4j, Nebula Graph, and Apache HugeGraph, have been developed in recent years.
-Each of these systems has its own graph data storage format, complicating the exchange of graph data between different systems.
+Numerous graph systems,
+such as Neo4j, Nebula Graph, and Apache HugeGraph, have been developed in recent years.
+Each of these systems has its own graph data storage format, complicating the exchange of graph data between different systems.
The need for a standard data file format for large-scale graph data storage and processing that can be used by diverse existing systems is evident, as it would reduce overhead when various systems work together.
Our aim is to fill this gap and contribute to the open-source community by providing a standard data file format for graph data storage and exchange, as well as for out-of-core querying.
This format, which we have named GraphAr, is engineered to be efficient, cross-language compatible, and to support out-of-core processing scenarios, such as those commonly found in data lakes.
-Furthermore, GraphAr's flexible design ensures that it can be easily extended to accommodate a broader array of graph data storage and exchange use cases in the future.
\ No newline at end of file
+Furthermore, GraphAr's flexible design ensures that it can be easily extended to accommodate a broader array of graph data storage and exchange use cases in the future.
diff --git a/docs/overview/overview.md b/docs/overview/overview.md
index 637278ba..3ed3ad72 100644
--- a/docs/overview/overview.md
+++ b/docs/overview/overview.md
@@ -13,4 +13,5 @@ It is intended to serve as the standard file format for importing/exporting and
Additionally, it can also serve as the direct data source for graph processing applications.
### [Motivation](/docs/overview/motivation)
+
### [Concepts](/docs/overview/concepts)
diff --git a/docs/specification/format.md b/docs/specification/format.md
index b456227f..3859b5f1 100644
--- a/docs/specification/format.md
+++ b/docs/specification/format.md
@@ -5,8 +5,8 @@ sidebar_position: 1
## Property Graph
-GraphAr is designed for representing and storing the property graphs. Graph (in discrete mathematics) is a structure made of vertices and edges.
-Property graph is then a type of graph model where the vertices/edges could carry a name (also called as type or label) and some properties.
+GraphAr is designed for representing and storing the property graphs. Graph (in discrete mathematics) is a structure made of vertices and edges.
+Property graph is then a type of graph model where the vertices/edges could carry a name (also called as type or label) and some properties.
Since carrying additional information than non-property graphs, the property graph is able to represent
connections among data scattered across diverse data databases and with different schemas.
Compared with the relational database schema, the property graph excels at showing data dependencies.
@@ -16,7 +16,7 @@ network routing, scientific computing and so on.
A property graph consists of vertices and edges, with each vertex contains a unique identifier and:
- A text label that describes the vertex type.
-- A collection of properties, with each property can be represented by a key-value pair.
+- A collection of properties, with each property can be represented by a key-value pair.
Each edge contains a unique identifier and:
@@ -33,7 +33,7 @@ The following is an example property graph containing two types of vertices ("pe
GraphAr support a set of built-in property data types that are common in real use cases and supported by most file types (CSV, ORC, Parquet), includes:
-- **Boolean**
+- **Boolean**
- **Int32**: Integer with 32 bits
- **Int64**: Integer with 64 bits
- **Float**: 32-bit floating point values
@@ -45,7 +45,7 @@ GraphAr support a set of built-in property data types that are common in real us
- **List**: A list of values of the same type
GraphAr also supports the user-defined data types, which can be used to represent complex data structures,
-such as the struct, map, and union types.
+such as the struct, map, and union types.
## Configurations
@@ -85,10 +85,9 @@ Adjacency list is a data structure used to represent the edges of a graph. Graph
- **unordered_by_source**: the internal id of the source vertex is used as the partition key to divide the edges into different sub-logical-tables, and the edges in each sub-logical-table are unordered, which can be seen as the COO format.
- **unordered_by_dest**: the internal id of the destination vertex is used as the partition key to divide the edges into different sub-logical-tables, and the edges in each sub-logical-table are unordered, which can also be seen as the COO format.
-
## Vertex Chunks in GraphAr
-### Logical table of vertices
+### Logical table of vertices
Each type of vertices (with the same label) constructs a logical vertex table, with each vertex assigned with a global index inside this type (called internal vertex id) starting from 0, corresponding to the row number of the vertex in the logical vertex table. An example layout for a logical table of vertices under the label "person" is provided for reference.
@@ -102,7 +101,7 @@ In the logical vertex table, some property can be marked as the primary key, suc
:::
-### Physical table of vertices
+### Physical table of vertices
The logical vertex table will be partitioned into multiple continuous vertex chunks for enhancing the reading/writing efficiency. To maintain the ability of random access, the size of vertex chunks for the same label is fixed. To support to access required properties avoiding reading all properties from the files, and to add properties for vertices without modifying the existing files, the columns of the logical table will be divided into several column groups.
@@ -116,7 +115,7 @@ For efficiently utilize the filter push-down of the payload file format like Par
:::
-## Edge Chunks in GraphAr
+## Edge Chunks in GraphAr
### Logical table of edges
@@ -187,11 +186,11 @@ See also [Gar Information Files](https://graphar.apache.org/docs/libraries/cpp/g
As previously mentioned, each logical vertex/edge table is divided into multiple physical tables stored in one of the following file formats:
- [Apache ORC](https://orc.apache.org/)
-- [Apache Parquet](https://parquet.apache.org/)
+- [Apache Parquet](https://parquet.apache.org/)
- CSV
- JSON
-Both of Apache ORC and Apache Parquet are column-oriented data storage formats. In practice of graph processing, it is common to only query a subset of columns of the properties. Thus, the column-oriented formats are more efficient, which eliminate the need to read columns that are not relevant. They are also used by a large number of data processing frameworks like [Apache Spark](https://spark.apache.org/), [Apache Hive](https://hive.apache.org/), [Apache Flink](https://flink.apache.org [...]
+Both of Apache ORC and Apache Parquet are column-oriented data storage formats. In practice of graph processing, it is common to only query a subset of columns of the properties. Thus, the column-oriented formats are more efficient, which eliminate the need to read columns that are not relevant. They are also used by a large number of data processing frameworks like [Apache Spark](https://spark.apache.org/), [Apache Hive](https://hive.apache.org/), [Apache Flink](https://flink.apache.org [...]
See also [GraphAr Data Files](https://graphar.apache.org/docs/libraries/cpp/getting-started#gar-data-files) for an example.
diff --git a/docs/specification/implementation-status.md b/docs/specification/implementation-status.md
index df19d54d..6e634714 100644
--- a/docs/specification/implementation-status.md
+++ b/docs/specification/implementation-status.md
@@ -60,7 +60,6 @@ Supported compression methods for the file formats:
:::
-
## Property
| Property feature | C++ | Java | Scala | Python |
@@ -68,7 +67,6 @@ Supported compression methods for the file formats:
| primary key | ✓ | ✓ | ✓ | ✓ |
| nullable | ✓ | | ✓ | ✓ |
-
Supported operations in Property:
| Property operation| C++ | Java | Scala | Python |
@@ -78,13 +76,12 @@ Supported operations in Property:
| is_primary_key | ✓ | ✓ (1) | ✓ | ✓ (2) |
| is_nullable | ✓ | | ✓ | ✓ (2) |
-
## Property Group
| Property Group (operation) | C++ |Java (1)| Scala | Python (2)|
|-------------------|-------|--------|-------|------------|
| create | ✓ | ✓ | ✓ | ✓ |
-| add property | ✓ | ✓ | ✓ | ✓ |
+| add property | ✓ | ✓ | ✓ | ✓ |
| remove property | | | | |
| get properties | ✓ | ✓ | ✓ | ✓ |
| check property | ✓ | ✓ | | |
@@ -92,7 +89,6 @@ Supported operations in Property:
| get path prefix | ✓ | ✓ | ✓ | ✓ |
| check validation | ✓ | | | |
-
## Adjacency List
| Adjacency List (type) | C++ | Java | Scala | Python |
@@ -111,7 +107,6 @@ Supported operations in Adjacency List:
| get path prefix | ✓ | | ✓ | ✓ |
| check validation | ✓ | | | |
-
## Vertex
Vertex features:
@@ -125,8 +120,8 @@ Vertex features:
:::note
-* *label* is the vertex label, which is a unique identifier for the vertex.
-* *tag* is the vertex tag, which is tag or category for the vertex.
+- *label* is the vertex label, which is a unique identifier for the vertex.
+- *tag* is the vertex tag, which is tag or category for the vertex.
:::
@@ -146,7 +141,6 @@ Supported operations in Vertex Info:
| serialize | ✓ | ✓ | ✓ | ✓ |
| deserialize | ✓ | ✓ | ✓ | ✓ |
-
## Edge
Edge features:
@@ -190,11 +184,10 @@ Supported operations in Edge Info:
:::
-
## Graph
| Graph | C++ | Java | Scala | Python |
-|-------------------|-------|-------|-------|------------|
+|-------------------|-------|-------|-------|------------|
| labeled vertex (with property) | ✓ | ✓ | ✓ | ✓ |
| labeled edge (with property) | ✓ | ✓ | ✓ | ✓ |
| extra info | ✓ | | | |
@@ -226,7 +219,6 @@ Supported operations in Graph Info:
:::
-
## Libraries Version Compatibility
| GraphAr C++ Version | C++ | CMake | Format Version |
diff --git a/licenserc.toml b/licenserc.toml
index 56db55c8..5353b95b 100644
--- a/licenserc.toml
+++ b/licenserc.toml
@@ -25,6 +25,7 @@ excludes = [
# Documents
"**/*.md",
"**/*.mdx",
+ "docs/.markdownlint.yaml",
# Meta files
"NOTICE",
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]