This is an automated email from the ASF dual-hosted git repository.
lixueclaire pushed a commit to branch research
in repository https://gitbox.apache.org/repos/asf/incubator-graphar.git
The following commit(s) were added to refs/heads/research by this push:
new 0e34736d update instructions, scripts and citation (#638)
0e34736d is described below
commit 0e34736dc8b374771cd0dae6d22d96f538c3fefa
Author: lixueclaire <[email protected]>
AuthorDate: Thu Sep 26 15:17:14 2024 +0800
update instructions, scripts and citation (#638)
---
README.md | 44 ++++++++++++++++-------------
dataset/README.md | 75 +++++++++++++++++++++++++++++++++++++++++++++++++
script/download_data.sh | 23 +++++++++++++++
3 files changed, 122 insertions(+), 20 deletions(-)
diff --git a/README.md b/README.md
index 52c257e5..5ce546b8 100644
--- a/README.md
+++ b/README.md
@@ -10,11 +10,7 @@
This repository contains the artifacts for the VLDB2025 submission of GraphAr,
with all source code and a guide to reproduce the results presented in the paper:
-- Xue Li, Weibin Zeng, Zhibin Wang, Diwen Zhu, Jingbo Xu, Wenyuan Yu,
- Jingren Zhou. [Enhancing Data Lakes with GraphAr: Efficient Graph Data
- Management with a Specialized Storage
- Scheme\[J\]](https://arxiv.org/abs/2312.09577). arXiv preprint
- arXiv:2312.09577, 2023.
+- Xue Li, Weibin Zeng, Zhibin Wang, Diwen Zhu, Jingbo Xu, Wenyuan Yu, Jingren
Zhou. [GraphAr: An Efficient Storage Scheme for Graph Data in Data
Lakes\[J\]](https://arxiv.org/abs/2312.09577). arXiv preprint arXiv:2312.09577,
2024.
@@ -44,21 +40,25 @@ chmod +x ../script/build.sh
## Getting Graph Data
-TODO: make it available.
-We stored all graph data listed in the paper Table 1 on an Aliyun OSS bucket.
-To download the graphs, please use the command below:
+Table 1 of the paper lists the graphs used in the evaluation, sourced from
various datasets. We offer
[instructions](https://github.com/apache/incubator-graphar/tree/research/dataset)
on how to prepare the data for the evaluation, either for public datasets or
synthetic datasets.
+
+Additionally, we have stored all graph data for benchmarking in an Aliyun OSS
bucket.
+To download the graphs, please use the following command:
```bash
-./script/dl-data.sh
+../script/download_data.sh {path_to_dataset}
```
-The data will be downloaded to the `dataset` directory.
+The data will be downloaded to the specified directory.
+Please be aware that the total size of the data exceeds 2TB, and the download
may take a long time.
+Alternatively, we also provide some small datasets located in the `dataset`
directory for testing purposes.
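Since the full download exceeds 2TB, it may be worth checking free space on the target volume first. A minimal pre-check sketch (this helper is not part of the repository's scripts; the path and threshold shown are illustrative):

```bash
# Hypothetical pre-check, not part of the repository's scripts: report the
# available space (in KiB) on the volume that will hold the dataset.
DATA_PATH=/tmp                          # illustrative target; use your own path
need_kib=$((2 * 1024 * 1024 * 1024))    # ~2 TiB expressed in KiB
avail_kib=$(df -Pk "$DATA_PATH" | awk 'NR==2 {print $4}')
echo "available: ${avail_kib} KiB, needed: ${need_kib} KiB"
if [ "$avail_kib" -lt "$need_kib" ]; then
  echo "warning: volume likely too small for the full dataset"
fi
```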
## Micro-Benchmark of Neighbor Retrieval
This section outlines the steps to reproduce the neighbor retrieval
benchmarking results reported in Section 6.2 of the paper. Use the
following commands:
```bash
+cd incubator-graphar/build
../script/run_neighbor_retrieval.sh {graph_path} {vertex_num} {source_vertex}
```
@@ -94,6 +94,12 @@ For example:
./release/parquet-graphar-label-example <
{path_to_graphar}/dataset/bloom/bloom-43-nodes.csv
```
+## Storage Media
+
+The evaluation of different storage media is reported in Section 6.4 of the
paper. This test employs the same methodology as the previously mentioned
micro-benchmarks, using graph data stored across various storage options.
+The storage medium can be specified in the path, e.g.,
`oss://bucket/dataset/facebook/facebook`, to indicate that the data is stored
on OSS rather than on the local file system.
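The path-prefix convention can be illustrated with a small dispatch sketch (assumed logic for illustration only, not the benchmark's actual implementation):

```bash
# Sketch (assumption, not the repo's actual code): decide which storage
# backend a dataset path refers to, based on its scheme prefix.
classify_path() {
  case "$1" in
    oss://*) echo "oss" ;;      # Aliyun OSS object storage
    *)       echo "local" ;;    # default: local file system
  esac
}
classify_path "oss://bucket/dataset/facebook/facebook"   # prints "oss"
classify_path "/data/dataset/facebook/facebook"          # prints "local"
```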
+
## End-to-End Graph Query Workloads
This section contains the scripts to reproduce the end-to-end graph query
results reported in Section 6.5 of the paper.
@@ -137,15 +143,13 @@ For running the BI execution benchmarking, please:
Please cite the paper in your publications if our work helps your research.
``` bibtex
-@article{li2023enhancing,
- author = {Xue Li and Weibin Zeng and Zhibin Wang and Diwen Zhu and Jingbo Xu
and Wenyuan Yu and Jingren Zhou},
- title = {Enhancing Data Lakes with GraphAr: Efficient Graph Data Management
with a Specialized Storage Scheme},
- year = {2023},
- url = {https://doi.org/10.48550/arXiv.2312.09577},
- doi = {10.48550/ARXIV.2312.09577},
- eprinttype = {arXiv},
- eprint = {2312.09577},
- biburl = {https://dblp.org/rec/journals/corr/abs-2312-09577.bib},
- bibsource = {dblp computer science bibliography, https://dblp.org}
+@misc{li2024grapharefficientstoragescheme,
+ title={GraphAr: An Efficient Storage Scheme for Graph Data in Data
Lakes},
+ author={Xue Li and Weibin Zeng and Zhibin Wang and Diwen Zhu and Jingbo
Xu and Wenyuan Yu and Jingren Zhou},
+ year={2024},
+ eprint={2312.09577},
+ archivePrefix={arXiv},
+ primaryClass={cs.DB},
+ url={https://arxiv.org/abs/2312.09577},
}
```
diff --git a/dataset/README.md b/dataset/README.md
new file mode 100644
index 00000000..596ac339
--- /dev/null
+++ b/dataset/README.md
@@ -0,0 +1,75 @@
+This page contains instructions to prepare the graph datasets outlined in the
paper.
+
+## Preparing Graph Data
+
+Before running the benchmarking components, you need to prepare the graph
datasets. You can download public graph datasets or generate synthetic
graph datasets using our data generator.
+
+### Preparing Topology Graphs
+
+#### Transforming Public Graphs
+
+Suppose we want to use the Facebook dataset. First, download the dataset from
the [SNAP](https://snap.stanford.edu/data/egonets-Facebook.html) website and
extract it.
+As an example, we have already included this Facebook dataset in the `dataset`
directory.
+
+Then, convert the dataset into Parquet format:
+
+```bash
+ $ cd incubator-graphar/build
+ $ ./release/Csv2Parquet {input_path} {output_path} {header_line_num}
+```
+Alternatively, you can use the following command to convert the dataset into
the GraphAr format:
+
+```bash
+ $ ./release/data-generator {input_path} {output_path} {vertex_num}
{is_directed} {is_weighted} {is_sorted} {is_reversed} {delimiter}
{header_line_num}
+```
+
+For example, running the command for the Facebook dataset:
+
+```bash
+ $ ./release/data-generator {path_to_graphar}/dataset/facebook/facebook.txt
{path_to_graphar}/dataset/facebook/facebook 4039 false false true false space 0
+```
+
+The above commands convert the Facebook dataset into the Parquet and
GraphAr formats and store the output in the `dataset/facebook` directory.
+
+#### Generating Synthetic Graphs
+
+We also provide a data generator for creating synthetic graph datasets. It is
located in the `synthetic` directory. You can use the
following command to generate a synthetic graph dataset:
+
+```bash
+ $ cd synthetic
+ $ mkdir build
+ $ cd build
+ $ cmake ..
+ $ make
+ $ ./DataGenerator {vertex_num} {output_path} # e.g., ./DataGenerator 100
example-synthetic-graph
+```
+
+It will generate a synthetic graph with the specified number of vertices in
CSV format. Afterward, you can convert this CSV file into Parquet or GraphAr
format using the `Csv2Parquet` or `data-generator` tool, as described above.
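As a concrete (illustrative) sanity check before conversion, the generated file can be inspected as a plain edge list. The file below is a stand-in for DataGenerator output, and the space delimiter is an assumption based on the Facebook example above:

```bash
# Stand-in for a generated edge list (real output comes from ./DataGenerator).
printf '0 1\n1 2\n2 0\n' > /tmp/tiny-synthetic.csv
# Quick sanity checks before handing the file to Csv2Parquet/data-generator:
edges=$(wc -l < /tmp/tiny-synthetic.csv)
vertices=$(tr ' ' '\n' < /tmp/tiny-synthetic.csv | sort -n | uniq | wc -l)
echo "edges=${edges}, distinct vertices=${vertices}"
```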
+
+### Preparing Labeled Graphs
+
+#### Preparing Label Data
+
+To enable the label filtering benchmarking component, original label data must
be extracted from graphs obtained from various sources. We use a CSV file to
store the original label data, where each row represents a vertex and each
column represents a label, formatted as a binary matrix:
+
+| Label 1 | Label 2 | ... | Label N |
+|---------|---------|-----|---------|
+| 1 | 0 | ... | 0 |
+| 1 | 1 | ... | 0 |
+| ... | ... | ... | ... |
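A tiny concrete instance of this layout (the values and label count here are illustrative only; the real files under `dataset/label` are much larger):

```bash
# Illustrative binary label matrix: 3 vertices x 2 labels, one row per
# vertex, comma-separated.
cat > /tmp/example-labels.csv <<'EOF'
1,0
1,1
0,1
EOF
# Per-label frequency: sum each column as a quick sanity check.
awk -F',' '{for (i = 1; i <= NF; i++) s[i] += $i}
           END {for (i = 1; i <= NF; i++) printf "label %d: %d vertices\n", i, s[i]}' \
    /tmp/example-labels.csv
```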
+
+For example, the `dataset/bloom` directory contains the label data for the
[Bloom](https://github.com/neo4j-graph-examples/bloom/tree/main) dataset. This
dataset includes 32,960 vertices and 18 labels. The `dataset/label` directory
contains extracted label data for various datasets outlined in
`script/label_filtering.md`, excluding extremely large datasets.
+
+### Graphs from the LDBC Benchmark
+
+Graphs from the LDBC benchmark are generated using the [LDBC SNB Data
Generator](https://ldbcouncil.org/post/snb-data-generator-getting-started/)
tool in CSV format. Each dataset consists of multiple CSV files, where each
file represents a specific edge or vertex type, e.g.,
[person_knows_person_0_0.csv](https://github.com/apache/incubator-graphar-testing/blob/main/ldbc_sample/person_knows_person_0_0.csv)
and
[person_0_0.csv](https://github.com/apache/incubator-graphar-testing/blob/main/
[...]
+Once the original dataset is generated, you can convert it into
Parquet/GraphAr format as described above.
+
+The following command will generate the Parquet and GraphAr files for
`person_knows_person` data of the SF30 dataset:
+
+```bash
+ $ ../script/generate_ldbc.sh
{path_to_dataset}/sf30/social_network/dynamic/person_knows_person_0_0.csv
{path_to_dataset}/sf30/social_network/dynamic/person_0_0.csv
{path_to_dataset}/sf30/person_knows_person
+```
+
+Please refer to `script/generate_ldbc_all.sh` for more details on this
preparation process.
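For reference, LDBC SNB generator output is pipe-delimited CSV with a header row; the column names below are typical but should be verified against your generator version, and this snippet is an illustrative stand-in rather than part of the repository's scripts:

```bash
# Illustrative stand-in for a generated person_knows_person_0_0.csv file.
printf 'Person.id|Person.id|creationDate\n933|4139|2012-01-01T00:00:00Z\n' \
    > /tmp/person_knows_person_sample.csv
# Drop the header and re-delimit to commas before a downstream converter
# (a stand-in step; the repo's generate_ldbc.sh handles the real files).
tail -n +2 /tmp/person_knows_person_sample.csv | tr '|' ',' \
    > /tmp/person_knows_person_clean.csv
cat /tmp/person_knows_person_clean.csv   # prints: 933,4139,2012-01-01T00:00:00Z
```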
diff --git a/script/download_data.sh b/script/download_data.sh
new file mode 100644
index 00000000..40e7a0e1
--- /dev/null
+++ b/script/download_data.sh
@@ -0,0 +1,23 @@
+#!/bin/bash
+
+# Check if the DATA_PATH argument is provided
+if [ "$#" -ne 1 ]; then
+ echo "Usage: $0 <DATA_PATH>"
+ exit 1
+fi
+
+DATA_PATH=$1
+
+# Install ossutil
+echo "Installing ossutil..."
+curl -s https://gosspublic.alicdn.com/ossutil/install.sh | bash
+
+echo "Setting up your config, refer to the documentation:
https://www.alibabacloud.com/help/en/oss/developer-reference/install-ossutil#concept-303829"
+echo "The endpoint is: oss-cn-beijing.aliyuncs.com"
+ossutil config
+
+# Download data from OSS
+echo "Downloading data from OSS to $DATA_PATH..."
+ossutil cp -r oss://graphscope/graphar_artifact "$DATA_PATH"
+
+echo "Download complete."
\ No newline at end of file
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]