jojochuang opened a new pull request, #8556: URL: https://github.com/apache/ozone/pull/8556
## What changes were proposed in this pull request?

HDDS-13165. [Docs] Python client developer guide.

* Added `interface/Python.md`: overall Python client access introduction.
* Recipe: Access Ozone using PyArrow (Docker Quickstart)
* Recipe: Access Ozone using Boto3 (Docker Quickstart)
* Recipe: Access Ozone using HTTPFS REST API (Docker + Python Requests)

For `interface/Python.md`, the draft was generated with ChatGPT 4o using the prompt:

```
Create a user document in Markdown format for Python developers who want to access Apache Ozone. This document will be part of the Ozone Client Interfaces page: https://ozone.apache.org/docs/edge/interface.html.

📌 *Audience*: Python developers familiar with Python integration and Ozone. Skip the introduction.

📌 *Structure*:
Setup and Prerequisites:
  Required libraries (PyArrow, Boto3, WebHDFS)
  Required configurations (e.g., HADOOP_CONF_DIR, Ozone URIs, credentials, authentication)
Access Method 1: PyArrow with libhdfs
  Setup steps (including any system paths or environment variables)
  Python code sample (validate for correctness)
Access Method 2: Boto3 with Ozone S3 Gateway
  Setup steps (including Ozone S3 endpoint format, bucket naming conventions, credentials)
  Python code sample (validate for correctness)
Access Method 3: WebHDFS/HttpFS or REST API
  Setup steps (including endpoint URL, authentication)
  Python code sample (using requests or webhdfs)
Access from PySpark
  Configuration settings in Spark (fs.ozone. settings)
  Python code sample for reading/writing data to Ozone
Troubleshooting Tips
  Common issues (e.g., authentication failures, connection errors)
  Suggested debugging techniques
References and Further Resources
  Links to official Ozone documentation, PyArrow, Boto3, WebHDFS, PySpark

📌 *Markdown Format*:
Use proper headers (##, ###) for each section.
Include Python syntax highlighting in code blocks (```python).
Use clear formatting and spacing for readability.
Include warnings or notes where appropriate (e.g., > *Note:*).
If applicable, include a simple diagram showing connection flows.

📌 *Quality Checks*:
Validate all code samples for correctness.
Ensure the document is clear and concise.
Focus only on actionable instructions and setup information.

Generate the complete Markdown document in response. Include a Hugo header. Include an Apache License header.
```

The PyArrow recipe draft was generated with this ChatGPT 4o prompt:

```
I personally verified the following steps using Ozone's Docker image. Please rewrite in a user tutorial format.

PyArrow to access Ozone

# Download the latest Docker Compose configuration file
curl -O https://raw.githubusercontent.com/apache/ozone-docker/refs/heads/latest/docker-compose.yaml

docker compose up -d --scale datanode=3

Connect to the SCM container:
docker exec -it weichiu-scm-1 bash

ozone sh volume create volume
ozone sh bucket create volume/bucket

pip install pyarrow

curl -L "https://www.apache.org/dyn/closer.lua?action=download&filename=hadoop/common/hadoop-3.4.0/hadoop-3.4.0-aarch64.tar.gz" | tar -xz --wildcards 'hadoop-3.4.0/lib/native/libhdfs.*'
or
curl -L "https://www.apache.org/dyn/closer.lua?action=download&filename=hadoop/common/hadoop-3.4.0/hadoop-3.4.0.tar.gz" | tar -xz --wildcards 'hadoop-3.4.0/lib/native/libhdfs.*'

export ARROW_LIBHDFS_DIR=hadoop-3.4.0/lib/native/
export CLASSPATH=$(ozone classpath ozone-tools)

Add to /etc/hadoop/core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>ofs://om:9862</value>
    <description>Where HDFS NameNode can be found on the network</description>
  </property>
</configuration>

Code:
#!/usr/bin/python
import pyarrow.fs as pafs

# Create Hadoop FileSystem object
fs = pafs.HadoopFileSystem("default", 9864)
fs.create_dir("volume/bucket/aaa")

path = "volume/bucket/file1"
with fs.open_output_stream(path) as stream:
    stream.write(b'data')
```

The Boto3 recipe draft was generated with this ChatGPT 4o prompt:

```
Following the similar PyArrow
using Ozone Docker image tutorial, create a similar one for boto3 using the following instructions:

ozone sh bucket create s3v/bucket

Code:
#!/usr/bin/python
import boto3

# Create a local file to upload
with open("localfile.txt", "w") as f:
    f.write("Hello from Ozone via Boto3!\n")

# Configure Boto3 client
s3 = boto3.client(
    's3',
    endpoint_url='http://weichiu-s3g-1:9878',
    aws_access_key_id='ozone-access-key',
    aws_secret_access_key='ozone-secret-key'
)

# List buckets
response = s3.list_buckets()
print(response['Buckets'])

# Upload a file
s3.upload_file('localfile.txt', 'bucket', 'file.txt')

# Download a file
s3.download_file('bucket', 'file.txt', 'downloaded.txt')
```

The HttpFS recipe draft was generated with ChatGPT 4o using this prompt:

```
Use the following instructions to create a tutorial of accessing Ozone using the HttpFS REST API via the requests library.

Ozone httpfs using Python requests

# Download the latest Docker Compose configuration file
curl -O https://raw.githubusercontent.com/apache/ozone-docker/refs/heads/latest/docker-compose.yaml

Add to docker-compose.yaml:
CORE-SITE.XML_fs.defaultFS: "ofs://om"
CORE-SITE.XML_hadoop.proxyuser.hadoop.hosts: "*"
CORE-SITE.XML_hadoop.proxyuser.hadoop.groups: "*"

docker compose up -d --scale datanode=3

Connect to the SCM container:
ozone sh volume create vol1
ozone sh bucket create vol1/bucket1

pip install requests

#!/usr/bin/python
import requests

# Ozone HTTPFS endpoint and file path
host = "http://weichiu-httpfs-1:14000"
volume = "vol1"
bucket = "bucket1"
filename = "hello.txt"
path = f"/webhdfs/v1/{volume}/{bucket}/{filename}"
user = "ozone"  # can be any value in simple auth mode

# Step 1: Initiate file creation (responds with 307 redirect)
params_create = {
    "op": "CREATE",
    "overwrite": "true",
    "user.name": user
}

print("Creating file...")
resp_create = requests.put(host + path, params=params_create, allow_redirects=False)
if resp_create.status_code != 307:
    print(f"Unexpected response: {resp_create.status_code}")
    print(resp_create.text)
    exit(1)

redirect_url = resp_create.headers['Location']
print(f"Redirected to: {redirect_url}")

# Step 2: Write data to the redirected location with correct headers
headers = {"Content-Type": "application/octet-stream"}
content = b"Hello from Ozone HTTPFS!\n"
resp_upload = requests.put(redirect_url, data=content, headers=headers)
if resp_upload.status_code != 201:
    print(f"Upload failed: {resp_upload.status_code}")
    print(resp_upload.text)
    exit(1)

print("File created successfully.")

# Step 3: Read the file back
params_open = {
    "op": "OPEN",
    "user.name": user
}

print("Reading file...")
resp_read = requests.get(host + path, params=params_open, allow_redirects=True)
if resp_read.ok:
    print("File contents:")
    print(resp_read.text)
else:
    print(f"Read failed: {resp_read.status_code}")
    print(resp_read.text)
```

Note: the draft initially included PySpark content. Due to its length, I decided to leave it out and will cover it in a follow-up task.

## What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-13165?filter=-1

## How was this patch tested?

After Gemini/ChatGPT generated the user doc draft, I manually followed the code samples and verified the steps.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
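A note for anyone trying the PyArrow recipe: `pyarrow.fs.HadoopFileSystem` fails with fairly opaque JNI errors when `ARROW_LIBHDFS_DIR` or `CLASSPATH` is missing. A small preflight check along these lines (a stdlib-only sketch; the environment variable names are the ones the recipe exports, the helper name is made up) can surface the problem before the connection attempt:

```python
import glob
import os

def preflight_libhdfs(env=os.environ):
    """Return a list of problems that would break pyarrow.fs.HadoopFileSystem."""
    problems = []
    libdir = env.get("ARROW_LIBHDFS_DIR")
    if not libdir:
        problems.append("ARROW_LIBHDFS_DIR is not set")
    elif not glob.glob(os.path.join(libdir, "libhdfs.*")):
        # The directory exists in the env but contains no libhdfs library
        problems.append(f"no libhdfs.* found under {libdir}")
    if not env.get("CLASSPATH"):
        problems.append("CLASSPATH is not set; run: export CLASSPATH=$(ozone classpath ozone-tools)")
    return problems

for problem in preflight_libhdfs():
    print("WARNING:", problem)
```

Running this before constructing the filesystem turns a cryptic loader failure into an actionable message.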
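The interface doc's Boto3 section mentions bucket naming conventions; since the S3 gateway rejects names that break the usual S3 rules (3-63 characters, lowercase letters, digits and hyphens, alphanumeric at both ends), a client-side check like this sketch (stdlib only; the function name is mine, and dotted bucket names are omitted for simplicity) gives a clearer error than a failed round trip:

```python
import re

# Common S3 rules: 3-63 chars, lowercase letters/digits/hyphens,
# starting and ending with a letter or digit.
BUCKET_NAME_RE = re.compile(r"^[a-z0-9][a-z0-9-]{1,61}[a-z0-9]$")

def is_valid_s3_bucket_name(name: str) -> bool:
    """Client-side sanity check before calling the Ozone S3 gateway."""
    return bool(BUCKET_NAME_RE.match(name))

for candidate in ("bucket", "My_Bucket", "ab"):
    print(candidate, "->", is_valid_s3_bucket_name(candidate))
```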
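In the HttpFS recipe, the CREATE and OPEN calls each assemble the `/webhdfs/v1` path and query string by hand; a tiny helper like this sketch (stdlib only; the helper name and its shape are mine, while the path layout and the `op`/`user.name` parameters are the ones the recipe uses) keeps further operations such as GETFILESTATUS or DELETE consistent:

```python
from urllib.parse import urlencode

def httpfs_url(host, volume, bucket, key, op, user, **extra):
    """Build an HttpFS URL in the same shape as the recipe's CREATE/OPEN calls."""
    params = {"op": op, "user.name": user, **extra}
    return f"{host}/webhdfs/v1/{volume}/{bucket}/{key}?{urlencode(params)}"

# The same CREATE request as the recipe, expressed through the helper
print(httpfs_url("http://weichiu-httpfs-1:14000", "vol1", "bucket1", "hello.txt",
                 "CREATE", "ozone", overwrite="true"))
```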
