This is an automated email from the ASF dual-hosted git repository.
lixueclaire pushed a commit to branch research
in repository https://gitbox.apache.org/repos/asf/incubator-graphar.git
The following commit(s) were added to refs/heads/research by this push:
new 0e34736d update instructions, scripts and citation (#638)
0e34736d is described below
commit 0e34736dc8b374771cd0dae6d22d96f538c3fefa
Author: lixueclaire <[email protected]>
AuthorDate: Thu Sep 26 15:17:14 2024 +0800
update instructions, scripts and citation (#638)
---
README.md | 44 ++++++++++++++++-------------
dataset/README.md | 75 +++++++++++++++++++++++++++++++++++++++++++++++++
script/download_data.sh | 23 +++++++++++++++
3 files changed, 122 insertions(+), 20 deletions(-)
diff --git a/README.md b/README.md
index 52c257e5..5ce546b8 100644
--- a/README.md
+++ b/README.md
@@ -10,11 +10,7 @@
This repository contains the artifacts for the VLDB2025 submission of GraphAr,
with all source code and a guide to reproduce the results presented in the paper:
-- Xue Li, Weibin Zeng, Zhibin Wang, Diwen Zhu, Jingbo Xu, Wenyuan Yu,
- Jingren Zhou. [Enhancing Data Lakes with GraphAr: Efficient Graph Data
- Management with a Specialized Storage
- Scheme\[J\]](https://arxiv.org/abs/2312.09577). arXiv preprint
- arXiv:2312.09577, 2023.
+- Xue Li, Weibin Zeng, Zhibin Wang, Diwen Zhu, Jingbo Xu, Wenyuan Yu, Jingren
Zhou. [GraphAr: An Efficient Storage Scheme for Graph Data in Data
Lakes\[J\]](https://arxiv.org/abs/2312.09577). arXiv preprint arXiv:2312.09577,
2024.
@@ -44,21 +40,25 @@ chmod +x ../script/build.sh
## Getting Graph Data
-TODO: make it available.
-We stored all graph data listed in the paper Table 1 on an Aliyun OSS bucket.
-To download the graphs, please use the command below:
+Table 1 of the paper lists the graphs used in the evaluation, sourced from
various datasets. We offer
[instructions](https://github.com/apache/incubator-graphar/tree/research/dataset)
on how to prepare the data for the evaluation, either for public datasets or
synthetic datasets.
+
+Additionally, we have stored all graph data for benchmarking in an Aliyun OSS
bucket.
+To download the graphs, please use the following command:
```bash
-./script/dl-data.sh
+../script/download_data.sh {path_to_dataset}
```
-The data will be downloaded to the `dataset` directory.
+The data will be downloaded to the specified directory.
+Please be aware that the total size of the data exceeds 2TB, and the download
may take a long time.
+Alternatively, we also provide some small datasets located in the `dataset`
directory for testing purposes.
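Since the full download exceeds 2TB, it may be worth checking free space on the target volume first. A minimal pre-check sketch (this helper is not part of the repository's scripts; the path and threshold shown are illustrative):

```bash
# Hypothetical pre-check, not part of the repository's scripts: report the
# available space (in KiB) on the volume that will hold the dataset.
DATA_PATH=/tmp                          # illustrative target; use your own path
need_kib=$((2 * 1024 * 1024 * 1024))    # ~2 TiB expressed in KiB
avail_kib=$(df -Pk "$DATA_PATH" | awk 'NR==2 {print $4}')
echo "available: ${avail_kib} KiB, needed: ${need_kib} KiB"
if [ "$avail_kib" -lt "$need_kib" ]; then
  echo "warning: volume likely too small for the full dataset"
fi
```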
## Micro-Benchmark of Neighbor Retrieval
This section outlines the steps to reproduce the neighbor retrieval
benchmarking results reported in Section 6.2 of the paper. Use the
following commands:
```bash
+cd incubator-graphar/build
../script/run_neighbor_retrieval.sh {graph_path} {vertex_num} {source_vertex}
```
@@ -94,6 +94,12 @@ For example:
./release/parquet-graphar-label-example <
{path_to_graphar}/dataset/bloom/bloom-43-nodes.csv
```
+## Storage Media
+
+The evaluation of different storage media is reported in Section 6.4 of the
paper. This test employs the same methodology as the previously mentioned
micro-benchmarks, using graph data stored across various storage options.
+The storage medium can be specified in the path, e.g.,
`oss://bucket/dataset/facebook/facebook`, to indicate that the data is stored
on OSS rather than on the local file system.
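The path-prefix convention can be illustrated with a small dispatch sketch (assumed logic for illustration only, not the benchmark's actual implementation):

```bash
# Sketch (assumption, not the repo's actual code): decide which storage
# backend a dataset path refers to, based on its scheme prefix.
classify_path() {
  case "$1" in
    oss://*) echo "oss" ;;      # Aliyun OSS object storage
    *)       echo "local" ;;    # default: local file system
  esac
}
classify_path "oss://bucket/dataset/facebook/facebook"   # prints "oss"
classify_path "/data/dataset/facebook/facebook"          # prints "local"
```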
+
## End-to-End Graph Query Workloads
This section contains the scripts to reproduce the end-to-end graph query
results reported in Section 6.5 of the paper.
@@ -137,15 +143,13 @@ For running the BI execution benchmarking, please:
Please cite the paper in your publications if our work helps your research.
``` bibtex
-@article{li2023enhancing,
- author = {Xue Li and Weibin Zeng and Zhibin Wang and Diwen Zhu and Jingbo Xu
and Wenyuan Yu and Jingren Zhou},
- title = {Enhancing Data Lakes with GraphAr: Efficient Graph Data Management
with a Specialized Storage Scheme},
- year = {2023},
- url = {https://doi.org/10.48550/arXiv.2312.09577},
- doi = {10.48550/ARXIV.2312.09577},
- eprinttype = {arXiv},
- eprint = {2312.09577},
- biburl = {https://dblp.org/rec/journals/corr/abs-2312-09577.bib},
- bibsource = {dblp computer science bibliography, https://dblp.org}
+@misc{li2024grapharefficientstoragescheme,
+ title={GraphAr: An Efficient Storage Scheme for Graph Data in Data
Lakes},
+ author={Xue Li and Weibin Zeng and Zhibin Wang and Diwen Zhu and Jingbo
Xu and Wenyuan Yu and Jingren Zhou},
+ year={2024},
+ eprint={2312.09577},
+ archivePrefix={arXiv},
+ primaryClass={cs.DB},
+ url={https://arxiv.org/abs/2312.09577},
}
```
diff --git a/dataset/README.md b/dataset/README.md
new file mode 100644
index 00000000..596ac339
--- /dev/null
+++ b/dataset/README.md
@@ -0,0 +1,75 @@
+This page contains instructions to prepare the graph datasets outlined in the
paper.
+
+## Preparing Graph Data
+
+Before running the benchmarking components, you need to prepare the graph
datasets. You can download public graph datasets or generate synthetic
graph datasets using our data generator.
+
+### Preparing Topology Graphs
+
+#### Transforming Public Graphs
+
+Suppose we want to use the Facebook dataset. First, download the dataset from
the [SNAP](https://snap.stanford.edu/data/egonets-Facebook.html) website and
extract it.
+As an example, we have already included this Facebook dataset in the `dataset`
directory.
+
+Then, convert the dataset into Parquet format:
+
+```bash
+ $ cd incubator-graphar/build
+ $ ./release/Csv2Parquet {input_path} {output_path} {header_line_num}
+```
+Alternatively, you can use the following command to convert the dataset into
the GraphAr format:
+
+```bash
+ $ ./release/data-generator {input_path} {output_path} {vertex_num}
{is_directed} {is_weighted} {is_sorted} {is_reversed} {delimiter}
{header_line_num}
+```
+
+For example, running the command for the Facebook dataset:
+
+```bash
+ $ ./release/data-generator {path_to_graphar}/dataset/facebook/facebook.txt
{path_to_graphar}/dataset/facebook/facebook 4039 false false true false space 0
+```
+
+The above commands convert the Facebook dataset into the Parquet and
GraphAr formats and store the output in the `dataset/facebook` directory.
+
+#### Generating Synthetic Graphs
+
+We also provide a data generator for creating synthetic graph datasets. It is
located in the `synthetic` directory. You can use the
following command to generate a synthetic graph dataset:
+
+```bash
+ $ cd synthetic
+ $ mkdir build
+ $ cd build
+ $ cmake ..
+ $ make
+ $ ./DataGenerator {vertex_num} {output_path} # e.g., ./DataGenerator 100
example-synthetic-graph
+```
+
+It will generate a synthetic graph with the specified number of vertices in
CSV format. Afterward, you can convert this CSV file into Parquet or GraphAr
format using the `Csv2Parquet` or `data-generator` tool, as described above.
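As a concrete (illustrative) sanity check before conversion, the generated file can be inspected as a plain edge list. The file below is a stand-in for DataGenerator output, and the space delimiter is an assumption based on the Facebook example above:

```bash
# Stand-in for a generated edge list (real output comes from ./DataGenerator).
printf '0 1\n1 2\n2 0\n' > /tmp/tiny-synthetic.csv
# Quick sanity checks before handing the file to Csv2Parquet/data-generator:
edges=$(wc -l < /tmp/tiny-synthetic.csv)
vertices=$(tr ' ' '\n' < /tmp/tiny-synthetic.csv | sort -n | uniq | wc -l)
echo "edges=${edges}, distinct vertices=${vertices}"
```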
+
+### Preparing Labeled Graphs
+
+#### Preparing Label Data
+
+To enable the label filtering benchmarking component, original label data must
be extracted from graphs obtained from various sources. We use a CSV file to
store the original label data, where each row represents a vertex and each
column represents a label, formatted as a binary matrix:
+
+| Label 1 | Label 2 | ... | Label N |
+|---------|---------|-----|---------|
+| 1 | 0 | ... | 0 |
+| 1 | 1 | ... | 0 |
+| ... | ... | ... | ... |
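A tiny concrete instance of this layout (the values and label count here are illustrative only; the real files under `dataset/label` are much larger):

```bash
# Illustrative binary label matrix: 3 vertices x 2 labels, one row per
# vertex, comma-separated.
cat > /tmp/example-labels.csv <<'EOF'
1,0
1,1
0,1
EOF
# Per-label frequency: sum each column as a quick sanity check.
awk -F',' '{for (i = 1; i <= NF; i++) s[i] += $i}
           END {for (i = 1; i <= NF; i++) printf "label %d: %d vertices\n", i, s[i]}' \
    /tmp/example-labels.csv
```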
+
+For example, the `dataset/bloom` directory contains the label data for the
[Bloom](https://github.com/neo4j-graph-examples/bloom/tree/main) dataset. This
dataset includes 32,960 vertices and 18 labels. The `dataset/label` directory
contains extracted label data for various datasets outlined in
`script/label_filtering.md`, excluding extremely large datasets.
+
+### Graphs from the LDBC Benchmark
+
+Graphs from the LDBC benchmark are generated using the [LDBC SNB Data
Generator](https://ldbcouncil.org/post/snb-data-generator-getting-started/)
tool in CSV format. Each dataset consists of multiple CSV files, where each
file represents a specific edge or vertex type, e.g.,
[person_knows_person_0_0.csv](https://github.com/apache/incubator-graphar-testing/blob/main/ldbc_sample/person_knows_person_0_0.csv)
and
[person_0_0.csv](https://github.com/apache/incubator-graphar-testing/blob/main/
[...]
+Once the original dataset is generated, you can convert it into
Parquet/GraphAr format as described above.
+
+The following command will generate the Parquet and GraphAr files for
`person_knows_person` data of the SF30 dataset:
+
+```bash
+ $ ../script/generate_ldbc.sh
{path_to_dataset}/sf30/social_network/dynamic/person_knows_person_0_0.csv
{path_to_dataset}/sf30/social_network/dynamic/person_0_0.csv
{path_to_dataset}/sf30/person_knows_person
+```
+
+Please refer to `script/generate_ldbc_all.sh` for more details on this
preparation process.
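For reference, LDBC SNB generator output is pipe-delimited CSV with a header row; the column names below are typical but should be verified against your generator version, and this snippet is an illustrative stand-in rather than part of the repository's scripts:

```bash
# Illustrative stand-in for a generated person_knows_person_0_0.csv file.
printf 'Person.id|Person.id|creationDate\n933|4139|2012-01-01T00:00:00Z\n' \
    > /tmp/person_knows_person_sample.csv
# Drop the header and re-delimit to commas before a downstream converter
# (a stand-in step; the repo's generate_ldbc.sh handles the real files).
tail -n +2 /tmp/person_knows_person_sample.csv | tr '|' ',' \
    > /tmp/person_knows_person_clean.csv
cat /tmp/person_knows_person_clean.csv   # prints: 933,4139,2012-01-01T00:00:00Z
```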
diff --git a/script/download_data.sh b/script/download_data.sh
new file mode 100644
index 00000000..40e7a0e1
--- /dev/null
+++ b/script/download_data.sh
@@ -0,0 +1,23 @@
+#!/bin/bash
+
+# Check if the DATA_PATH argument is provided
+if [ "$#" -ne 1 ]; then
+ echo "Usage: $0 <DATA_PATH>"
+ exit 1
+fi
+
+DATA_PATH=$1
+
+# Install ossutil
+echo "Installing ossutil..."
+curl -s https://gosspublic.alicdn.com/ossutil/install.sh | bash
+
+echo "Setting up your config, refer to the documentation:
https://www.alibabacloud.com/help/en/oss/developer-reference/install-ossutil#concept-303829"
+echo "The endpoint is: oss-cn-beijing.aliyuncs.com"
+ossutil config
+
+# Download data from OSS
+echo "Downloading data from OSS to $DATA_PATH..."
+ossutil cp -r oss://graphscope/graphar_artifact "$DATA_PATH"
+
+echo "Download complete."
\ No newline at end of file
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]