Re: [PR] docs: Add README to tpch directory [datafusion-ray]

via GitHub Sat, 08 Mar 2025 13:47:51 -0800


robtandy commented on code in PR #79:
URL: https://github.com/apache/datafusion-ray/pull/79#discussion_r1986161055



##########
tpch/README.md:
##########
@@ -0,0 +1,120 @@
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
+# Benchmarking DataFusion Ray on Kubernetes
+
+This is a rough guide to deploying and benchmarking DataFusion Ray on 
Kubernetes as part of the development process.
+
+## Building Wheels
+
+Follow the instructions in the [contributor guide] to set up a development 
environment and then build the project 
+using the following command. Note that the `--strip` argument is important for 
keeping the wheel size below 100MB
+
+[contributor guide]: ../docs/contributing.md
+
+```shell
+maturin build --strip
+```
+
+## Create a Ray Cluster
+
+Follow the instructions in the [Ray on Kubernetes] documentation to install 
the KubeRay operator.
+
+[Ray on Kubernetes]: 
https://docs.ray.io/en/latest/cluster/kubernetes/index.html
+
+Create a `datafusion-ray.yaml` file using the following template. It is 
important that the Ray Docker image uses the 
+same Python version that was used to build the wheels. This example yaml 
assumes that the TPC-H data files are 
+available locally on each node in the cluster at the path `/mnt/bigdata`. If 
the data is stored on object storage then 
+the `volume` and `volumeMount` sections can be removed.
+
+```yaml
+apiVersion: ray.io/v1alpha1
+kind: RayCluster
+metadata:
+  name: datafusion-ray-cluster
+spec:
+  headGroupSpec:
+    rayStartParams:
+      num-cpus: "0"
+    template:
+      spec:
+        containers:
+          - name: ray-head
+            image: rayproject/ray:2.43.0-py312-cpu
+            imagePullPolicy: IfNotPresent
+            resources:
+              limits:
+                cpu: 2
+                memory: 8Gi
+              requests:
+                cpu: 2
+                memory: 8Gi
+            volumeMounts:
+              - mountPath: /mnt/bigdata  # Mount path inside the container
+                name: ray-storage
+        volumes:
+          - name: ray-storage
+            persistentVolumeClaim:
+              claimName: ray-pvc
+  workerGroupSpecs:
+    - replicas: 2
+      groupName: "datafusion-ray"
+      rayStartParams:
+        num-cpus: "4"
+      template:
+        spec:
+          containers:
+            - name: ray-worker
+              image: rayproject/ray:2.43.0-py312-cpu
+              imagePullPolicy: IfNotPresent
+              resources:
+                limits:
+                  cpu: 5
+                  memory: 64Gi
+                requests:
+                  cpu: 5
+                  memory: 64Gi
+              volumeMounts:
+                - mountPath: /mnt/bigdata
+                  name: ray-storage
+          volumes:
+            - name: ray-storage
+              persistentVolumeClaim:
+                claimName: ray-pvc
+```
+
+Run the following command to create the cluster:
+
+```shell
+kubectl apply -f datafusion-ray.yaml
+```
+
+Once the cluster is running, set up port forwarding on port 8265 on the head 
node and then run the following 
+command to run the benchmarks.
+
+```shell
+ray job submit --address='http://localhost:8265' \
+  --runtime-env-json='{"pip":["datafusion", "tabulate", "boto3", "duckdb"], 
"py_modules":["/home/andy/git/apache/datafusion-ray/target/wheels/datafusion_ray-0.1.0-cp38-abi3-manylinux_2_35_x86_64.whl"],
 "working_dir":"./", "env_vars":{"RAY_DEDUP_LOGS":"O", 
"RAY_COLOR_PREFIX":"1"}}' -- \
+  python tpcbench.py \
+  --data /mnt/bigdata/tpch/sf100 \

Review Comment:
   Do you want want to include the k8s resource definition for the 
`ray-storage` pvc as well?



##########
tpch/README.md:
##########
@@ -0,0 +1,120 @@
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
+# Benchmarking DataFusion Ray on Kubernetes
+
+This is a rough guide to deploying and benchmarking DataFusion Ray on 
Kubernetes as part of the development process.
+
+## Building Wheels
+
+Follow the instructions in the [contributor guide] to set up a development 
environment and then build the project 
+using the following command. Note that the `--strip` argument is important for 
keeping the wheel size below 100MB
+
+[contributor guide]: ../docs/contributing.md
+
+```shell
+maturin build --strip
+```
+
+## Create a Ray Cluster
+
+Follow the instructions in the [Ray on Kubernetes] documentation to install 
the KubeRay operator.
+
+[Ray on Kubernetes]: 
https://docs.ray.io/en/latest/cluster/kubernetes/index.html
+
+Create a `datafusion-ray.yaml` file using the following template. It is 
important that the Ray Docker image uses the 
+same Python version that was used to build the wheels. This example yaml 
assumes that the TPC-H data files are 
+available locally on each node in the cluster at the path `/mnt/bigdata`. If 
the data is stored on object storage then 
+the `volume` and `volumeMount` sections can be removed.
+
+```yaml
+apiVersion: ray.io/v1alpha1
+kind: RayCluster
+metadata:
+  name: datafusion-ray-cluster
+spec:
+  headGroupSpec:
+    rayStartParams:
+      num-cpus: "0"
+    template:
+      spec:
+        containers:
+          - name: ray-head
+            image: rayproject/ray:2.43.0-py312-cpu
+            imagePullPolicy: IfNotPresent
+            resources:
+              limits:
+                cpu: 2
+                memory: 8Gi
+              requests:
+                cpu: 2
+                memory: 8Gi
+            volumeMounts:
+              - mountPath: /mnt/bigdata  # Mount path inside the container
+                name: ray-storage
+        volumes:
+          - name: ray-storage
+            persistentVolumeClaim:
+              claimName: ray-pvc
+  workerGroupSpecs:
+    - replicas: 2
+      groupName: "datafusion-ray"
+      rayStartParams:
+        num-cpus: "4"
+      template:
+        spec:
+          containers:
+            - name: ray-worker
+              image: rayproject/ray:2.43.0-py312-cpu
+              imagePullPolicy: IfNotPresent
+              resources:
+                limits:
+                  cpu: 5
+                  memory: 64Gi
+                requests:
+                  cpu: 5
+                  memory: 64Gi
+              volumeMounts:
+                - mountPath: /mnt/bigdata
+                  name: ray-storage
+          volumes:
+            - name: ray-storage
+              persistentVolumeClaim:
+                claimName: ray-pvc
+```
+
+Run the following command to create the cluster:
+
+```shell
+kubectl apply -f datafusion-ray.yaml
+```
+
+Once the cluster is running, set up port forwarding on port 8265 on the head 
node and then run the following 
+command to run the benchmarks.
+
+```shell
+ray job submit --address='http://localhost:8265' \
+  --runtime-env-json='{"pip":["datafusion", "tabulate", "boto3", "duckdb"], 
"py_modules":["/home/andy/git/apache/datafusion-ray/target/wheels/datafusion_ray-0.1.0-cp38-abi3-manylinux_2_35_x86_64.whl"],
 "working_dir":"./", "env_vars":{"RAY_DEDUP_LOGS":"O", 
"RAY_COLOR_PREFIX":"1"}}' -- \

Review Comment:
   The path to the .wheel here should perhaps be changed. 
   
   Also, `tabulate`, `boto3`, and `duckdb` are not required anymore I don't 
think.



##########
tpch/README.md:
##########
@@ -0,0 +1,120 @@
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
+# Benchmarking DataFusion Ray on Kubernetes
+
+This is a rough guide to deploying and benchmarking DataFusion Ray on 
Kubernetes as part of the development process.
+
+## Building Wheels
+
+Follow the instructions in the [contributor guide] to set up a development 
environment and then build the project 
+using the following command. Note that the `--strip` argument is important for 
keeping the wheel size below 100MB
+
+[contributor guide]: ../docs/contributing.md
+
+```shell
+maturin build --strip

Review Comment:
   I would add a comment here reminding users to add `--release` for a release 
wheel build



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] docs: Add README to tpch directory [datafusion-ray]

Reply via email to