This is an automated email from the ASF dual-hosted git repository.
lixueclaire pushed a commit to branch research
in repository https://gitbox.apache.org/repos/asf/incubator-graphar.git
The following commit(s) were added to refs/heads/research by this push:
new 6606e138 revise the readme in research branch. (#636)
6606e138 is described below
commit 6606e138e46f28ae5a7689adebdf4bbe6309130a
Author: Jingbo Xu <[email protected]>
AuthorDate: Tue Sep 24 16:39:32 2024 +0800
revise the readme in research branch. (#636)
---
README.md | 171 +++++++++++++++++++++-----------------------------------------
1 file changed, 57 insertions(+), 114 deletions(-)
diff --git a/README.md b/README.md
index e482394f..52c257e5 100644
--- a/README.md
+++ b/README.md
@@ -1,166 +1,115 @@
<h1 align="center" style="clear: both;">
- <img src="docs/images/graphar-logo.svg" width="350" alt="GraphAr">
+ <img src="docs/images/graphar-logo.svg" width="100" alt="GraphAr">
</h1>
-<p align="center">
- An open source, standard data file format for graph data storage and
retrieval
-</p>
-This project is a research initiative by GraphAr (short for "Graph Archive")
aimed at providing an efficient storage scheme for graph data in data lakes. It
is designed to enhance the efficiency of data lakes utilizing the capabilities
of existing formats, with a specific focus on [Apache
Parquet](https://github.com/apache/parquet-format). GraphAr ensures seamless
integration with existing tools and introduces innovative additions
specifically tailored to handle LPGs (Labeled Property Graphs).
+# Artifacts for GraphAr VLDB2025 Submission
-Leveraging the strengths of Parquet, GraphAr captures LPG semantics precisely
and facilitates graph-specific operations such as neighbor retrieval and label
filtering.
-See [GraphAr
Format](https://github.com/apache/incubator-graphar/blob/research/GRAPHAR.md)
for more details about the GraphAr format. And refer to the [Research
Paper](https://arxiv.org/abs/2312.09577) for the detailed design and
implementation of our encoding/decoding techniques.
+> [!NOTE]
+> This branch is provided as artifacts for VLDB2025.
+> For the latest version of GraphAr, please refer to the
[main](https://github.com/apache/incubator-graphar) branch.
+This repository contains the artifacts for the VLDB2025 submission of GraphAr,
with all source code and a guide to reproducing the results presented in the paper:
-## Dependencies
-
-**GraphAr** is developed and tested on Ubuntu 20.04.5 LTS. It should also work
on other unix-like distributions. Building GraphAr requires the following
software installed as dependencies:
-
-- A C++17-enabled compiler. On Linux, GCC 7.1 or higher should be sufficient.
For macOS, at least Clang 5 is required
-- CMake 3.16 or higher
-- On Linux and macOS, ``make`` build utilities
-- curl-devel with SSL (Linux) or curl (macOS) for s3 filesystem support
-
-
-## Building Steps
-
-### Step 1: Clone the Repository
-
-```bash
- $ git clone https://github.com/apache/incubator-graphar.git
- $ cd incubator-graphar
- $ git checkout research
- $ git submodule update --init
-```
-
-### Step 2: Build the Project
-```bash
- $ mkdir build
- $ cd build
- $ chmod +x ../script/build.sh
- $ ../script/build.sh
-```
-
-## Preparing Graph Data
-
-Before running the benchmarking components, you need to prepare the graph
datasets. You can download from public graph datasets or generate synthetic
graph datasets using our data generator.
-
-### Preparing Topology Graphs
-
-#### Transforming Public Graphs
-
-Suppose we want to use the Facebook dataset. First, download the dataset from
the [SNAP](https://snap.stanford.edu/data/egonets-Facebook.html) website and
extract it.
-As an example, we have already included this Facebook dataset in the `dataset`
directory.
-
-Then, convert the dataset into Parquet format:
-
-```bash
- $ cd incubator-graphar/build
- $ ./release/Csv2Parquet {input_path} {output_path} {header_line_num}
-```
-Or, you could use the following command to convert the dataset into the
GraphAr format:
+- Xue Li, Weibin Zeng, Zhibin Wang, Diwen Zhu, Jingbo Xu, Wenyuan Yu,
+  Jingren Zhou. [Enhancing Data Lakes with GraphAr: Efficient Graph Data
+  Management with a Specialized Storage
+  Scheme](https://arxiv.org/abs/2312.09577). arXiv preprint
+  arXiv:2312.09577, 2023.
-```bash
- $ ./release/data-generator {input_path} {output_path} {vertex_num}
{is_directed} {is_weighted} {is_sorted} {is_reversed} {delimiter}
{header_line_num}
-```
-For example, running the command for the Facebook dataset:
-```bash
- $ ./release/data-generator {path_to_graphar}/dataset/facebook/facebook.txt
{path_to_graphar}/dataset/facebook/facebook 4039 false false true false space 0
-```
+## Dependencies
-The above commands will convert the facebook dataset into the Parquet and
GraphAr format and store the output in the `dataset/facebook` directory.
+**GraphAr** is developed and tested on Ubuntu 20.04.5 LTS. Building GraphAr
requires the following software to be installed:
-#### Generating Synthetic Graphs
+- A C++17-enabled compiler and build-essential tools
+- CMake 3.16 or higher
+- curl-devel with SSL (Linux) for S3 filesystem support
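As a quick sanity check of the CMake requirement, a version string can be compared against the 3.16 minimum. The helper below is an illustrative sketch, not part of the repository:

```shell
# Illustrative helper (not part of the repository): check a CMake-style
# version string against the 3.16 minimum listed above.
meets_min() {
  major=${1%%.*}
  rest=${1#*.}
  minor=${rest%%.*}
  [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 16 ]; }
}

# e.g. compare the output of `cmake --version | awk 'NR==1{print $3}'`
meets_min "3.16" && echo "CMake 3.16 is new enough"
meets_min "3.10" || echo "CMake 3.10 is too old"
```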
-We also provide a data generator to generate synthetic graph datasets. The
data generator is located in the `synthetic` directory. You can use the
following command to generate a synthetic graph dataset:
+## Building Artifacts
```bash
- $ cd synthetic
- $ mkdir build
- $ cd build
- $ cmake ..
- $ make
- $ ./DataGenerator {vertex_num} {output_path} # e.g., ./DataGenerator 100
example-synthetic-graph
+# Clone the repository and checkout the research branch
+git clone https://github.com/apache/incubator-graphar.git
+cd incubator-graphar
+git checkout research
+git submodule update --init
+
+# Build the artifacts
+mkdir build
+cd build
+chmod +x ../script/build.sh
+../script/build.sh
```
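If the build succeeds, the benchmark tools used in the following sections should appear under `build/release` (the path assumed by the example commands below). A small sketch to verify this, where `check_built` is a hypothetical helper:

```shell
# Hypothetical post-build check: verify that the expected benchmark binaries
# exist and are executable. The binary names are taken from the example
# commands later in this README.
check_built() {
  dir="$1"; shift
  for bin in "$@"; do
    if [ -x "$dir/$bin" ]; then
      echo "ok: $bin"
    else
      echo "missing: $bin"
    fi
  done
}

check_built release Csv2Parquet data-generator run-work-load
```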
-It will generate a synthetic graph with the specified number of vertices in
CSV format. Afterward, you can convert this CSV file into Parquet or GraphAr
format using the `Csv2Parquet` or `data-generator` tool, as described above.
-
-### Preparing Labeled Graphs
-
-### Preparing Label Data
-
-To enable the label filtering benchmarking component, original label data must
be extracted from graphs obtained from various sources. We use a CSV file to
store the original label data, where each row represents a vertex and each
column represents a label, formatted as a binary matrix:
-
-| Label 1 | Label 2 | ... | Label N |
-|---------|---------|-----|---------|
-| 1 | 0 | ... | 0 |
-| 1 | 1 | ... | 0 |
-| ... | ... | ... | ... |
-
-For example, the `dataset/bloom` directory contains the label data for the
[Bloom](https://github.com/neo4j-graph-examples/bloom/tree/main) dataset. This
dataset includes 32,960 vertices and 18 labels. The `dataset/label` directory
contains extracted label data for various datasets outlined in
`script/label_filtering.md`, excluding extremely large datasets.
-
-
-### Graphs from the LDBC Benchmark
-
-Graphs from the LDBC benchmark are generated using the [LDBC SNB Data
Generator](https://ldbcouncil.org/post/snb-data-generator-getting-started/)
tool in CSV format. Each dataset consists of multiple CSV files, where each
file represents a specific edge or vertex type, e.g.,
[person_knows_person_0_0.csv](https://github.com/apache/incubator-graphar-testing/blob/main/ldbc_sample/person_knows_person_0_0.csv)
and
[person_0_0.csv](https://github.com/apache/incubator-graphar-testing/blob/main/
[...]
-Once the original dataset is generated, you can convert it into
Parquet/GraphAr format as described above.
+## Getting Graph Data
-The following command will generate the Parquet and GraphAr files for
`person_knows_person` data of the SF30 dataset:
+TODO: make the data publicly available.
+All graph data listed in Table 1 of the paper is stored in an Aliyun OSS bucket.
+To download the graphs, use the command below:
```bash
- $ ../script/generate_ldbc.sh
{path_to_dataset}/sf30/social_network/dynamic/person_knows_person_0_0.csv
{path_to_dataset}/sf30/social_network/dynamic/person_0_0.csv
{path_to_dataset}/sf30/person_knows_person
+./script/dl-data.sh
```
+The data will be downloaded to the `dataset` directory.
-Please refer to `script/generate_ldbc_all.sh` for more details on this
preparation process.
-## Running Benchmarking Components
+## Micro-Benchmark of Neighbor Retrieval
-### Neighbor Retrieval
-
-To run the neighbor retrieval benchmarking component, you can use the
following command:
+This section outlines the steps to reproduce the neighbor retrieval
benchmarking results reported in Section 6.2 of the paper. Use the following
command:
```bash
- $ ../script/run_neighbor_retrieval.sh {graph_path} {vertex_num}
{source_vertex}
+../script/run_neighbor_retrieval.sh {graph_path} {vertex_num} {source_vertex}
```
For example:
```bash
- $ ../script/run_neighbor_retrieval.sh
{path_to_graphar}/dataset/facebook/facebook 4039 1642
+../script/run_neighbor_retrieval.sh
{path_to_graphar}/dataset/facebook/facebook 4039 1642
```
Other datasets can be used in the same way, with the corresponding parameters
specified as needed. We also provide a script in
`script/run_neighbor_retrieval_all.sh` for reference.
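To benchmark several graphs in one go, the per-dataset invocations can be wrapped in a small loop. The sketch below only echoes each command (a dry run); the dataset paths and parameters are illustrative placeholders, and the real values live in `script/run_neighbor_retrieval_all.sh`:

```shell
# Dry-run sketch: print the neighbor-retrieval command for each dataset
# instead of executing it. Paths and parameters below are placeholders.
run_one() {
  graph_path="$1"; vertex_num="$2"; source_vertex="$3"
  echo "would run: ../script/run_neighbor_retrieval.sh $graph_path $vertex_num $source_vertex"
}

run_one "dataset/facebook/facebook" 4039 1642
run_one "dataset/example/example" 100 1
```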
-### Label Filtering
+## Micro-Benchmark of Label Filtering
+
+This section outlines the steps to reproduce the label filtering benchmarking
results reported in Section 6.3 of the paper.
To run the label filtering benchmarking component, please adjust the
parameters according to the dataset (refer to `script/label_filtering.md`) for
both [simple condition
test](https://github.com/lixueclaire/arrow/blob/encoding-graphar/cpp/examples/parquet/graphar/test-all.cc)
and [complex condition
test](https://github.com/lixueclaire/arrow/blob/encoding-graphar/cpp/examples/parquet/graphar/test.cc).
Then, run the tests using the following commands:
```bash
- $ ./release/parquet-graphar-label-all-example < {graph_path} #
simple-condition filtering
- $ ./release/parquet-graphar-label-example < {graph_path} #
complex-condition filtering
+# simple-condition filtering
+./release/parquet-graphar-label-all-example < {graph_path}
+
+# complex-condition filtering
+./release/parquet-graphar-label-example < {graph_path}
```
For example:
```bash
- $ ./release/parquet-graphar-label-all-example <
{path_to_graphar}/dataset/bloom/bloom-43-nodes.csv
- $ ./release/parquet-graphar-label-example <
{path_to_graphar}/dataset/bloom/bloom-43-nodes.csv
+./release/parquet-graphar-label-all-example <
{path_to_graphar}/dataset/bloom/bloom-43-nodes.csv
+
+./release/parquet-graphar-label-example <
{path_to_graphar}/dataset/bloom/bloom-43-nodes.csv
```
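The label-filtering inputs (such as `dataset/bloom/bloom-43-nodes.csv`) store label data as a binary matrix, one row per vertex and one 0/1 column per label. The snippet below writes a toy three-vertex, three-label matrix in that shape; the comma delimiter and absence of a header row are assumptions for illustration, so inspect the real files under `dataset` for the exact layout:

```shell
# Write a toy binary label matrix: 3 vertices x 3 labels.
# Delimiter and header handling are illustrative assumptions; inspect the
# real inputs under dataset/ for the exact format.
cat > toy-labels.csv <<'EOF'
1,0,0
1,1,0
0,0,1
EOF

wc -l < toy-labels.csv
```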
-### End-to-End Workload
+## End-to-End Graph Query Workloads
+
+This section contains the scripts to reproduce the end-to-end graph query
results reported in Section 6.5 of the paper.
Once the LDBC dataset is converted into Parquet and GraphAr format, you can
run the LDBC workload using a command like the following:
```bash
- $ ./release/run-work-load {path_to_dataset}/sf-30/person_knows_person
{path_to_dataset}/sf-30/person_knows_person-vertex-base 165430 70220 delta
+./release/run-work-load {path_to_dataset}/sf-30/person_knows_person
{path_to_dataset}/sf-30/person_knows_person-vertex-base 165430 70220 delta
```
+
This command will run the LDBC workload IS-3 on the SF-30 dataset, formatted
in GraphAr. The total number of person vertices is 165,430, and the query
vertex ID is 70,220. The `delta` parameter enables the delta encoding
technique. For complete end-to-end LDBC workload execution, please refer to
`script/run-is3.sh`, `script/run-ic8.sh`, and `script/run-bi2.sh`.
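The three end-to-end workloads share the same binary with different graphs and parameters. The wrapper below is a dry-run sketch only (it echoes the command rather than executing it); apart from the IS-3 values quoted above, the real parameters live in the scripts just mentioned:

```shell
# Dry-run sketch: print a run-work-load invocation per workload instead of
# executing it. Only the IS-3 parameters are taken from the example above;
# see script/run-is3.sh, script/run-ic8.sh and script/run-bi2.sh for the
# real values of the other workloads.
run_workload() {
  name="$1"; shift
  echo "[$name] ./release/run-work-load $*"
}

run_workload "IS-3" \
  "sf-30/person_knows_person" \
  "sf-30/person_knows_person-vertex-base" \
  165430 70220 delta
```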
## Integration with GraphScope
+This section contains a brief guide on how to reproduce the integration
results with GraphScope, as reported in Section 6.6 of the paper.
+
### Serving as the Archive Format
To run the graph loading benchmarking:
@@ -183,13 +132,7 @@ For running the BI execution benchmarking, please:
- Finally, run the generic benchmark tool for GIE, following the steps
outlined in the
[documentation](https://5165d22e.graphscope-docs-preview.pages.dev/interactive_engine/benchmark_tool).
-## Publication
-
-- Xue Li, Weibin Zeng, Zhibin Wang, Diwen Zhu, Jingbo Xu, Wenyuan Yu,
- Jingren Zhou. [Enhancing Data Lakes with GraphAr: Efficient Graph Data
- Management with a Specialized Storage
- Scheme\[J\]](https://arxiv.org/abs/2312.09577). arXiv preprint
- arXiv:2312.09577, 2023.
+## Citation
Please cite the paper in your publications if our work helps your research.
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]