jojochuang opened a new pull request, #8556:
URL: https://github.com/apache/ozone/pull/8556

   ## What changes were proposed in this pull request?
   HDDS-13165. [Docs] Python client developer guide.
   
   Please describe your PR in detail:
   * Added interface/Python.md: overall Python client access introduction.
   * Recipe: Access Ozone using PyArrow (Docker Quickstart)
   * Recipe: Access Ozone using Boto3 (Docker Quickstart)
   * Recipe: Access Ozone using HTTPFS REST API (Docker + Python Requests)
   
   For interface/Python.md, the draft was generated with ChatGPT 4o using the following prompt:
   
   ```
   Create a user document in Markdown format for Python developers who want to 
access Apache Ozone. This document will be part of the Ozone Client Interfaces 
page: https://ozone.apache.org/docs/edge/interface.html.
   
   📌 *Audience*: Python developers familiar with Python integration and Ozone. 
Skip the introduction.
   
   📌 *Structure*:
   
   Setup and Prerequisites:
   Required libraries (PyArrow, Boto3, WebHDFS)
   Required configurations (e.g., HADOOP_CONF_DIR, Ozone URIs, credentials, 
authentication)
   Access Method 1: PyArrow with libhdfs
   Setup steps (including any system paths or environment variables)
   Python code sample (validate for correctness)
   Access Method 2: Boto3 with Ozone S3 Gateway
   Setup steps (including Ozone S3 endpoint format, bucket naming conventions, 
credentials)
   Python code sample (validate for correctness)
   Access Method 3: WebHDFS/HttpFS or REST API
   Setup steps (including endpoint URL, authentication)
   Python code sample (using requests or webhdfs)
   Access from PySpark
   Configuration settings in Spark (fs.ozone. settings)
   Python code sample for reading/writing data to Ozone
   Troubleshooting Tips
   Common issues (e.g., authentication failures, connection errors)
   Suggested debugging techniques
   References and Further Resources
   Links to official Ozone documentation, PyArrow, Boto3, WebHDFS, PySpark
   📌 *Markdown Format*:
   
   Use proper headers (##, ###) for each section.
   Include Python syntax highlighting in code blocks (```python).
   Use clear formatting and spacing for readability.
   Include warnings or notes where appropriate (e.g., > *Note:*).
   If applicable, include a simple diagram showing connection flows.
   📌 *Quality Checks*:
   
   Validate all code samples for correctness.
   Ensure the document is clear and concise.
   Focus only on actionable instructions and setup information.
   Generate the complete Markdown document in response. Include a Hugo header. 
Include Apache License header
   ```
   
   The PyArrow recipe draft was generated using the following ChatGPT 4o prompt:
   ```
   I personally verified the following steps using Ozone's Docker image. Please 
rewrite in a user tutorial format.
   
   PyArrow to access Ozone
   
   # Download the latest Docker Compose configuration file
   curl -O https://raw.githubusercontent.com/apache/ozone-docker/refs/heads/latest/docker-compose.yaml
   
   docker compose up -d --scale datanode=3
   
   
   connect to the SCM container:
   
   docker exec -it weichiu-scm-1 bash
   
   ozone sh volume create volume
   ozone sh bucket create volume/bucket
   
   pip install pyarrow
   
   
   curl -L "https://www.apache.org/dyn/closer.lua?action=download&filename=hadoop/common/hadoop-3.4.0/hadoop-3.4.0-aarch64.tar.gz" | tar -xz --wildcards 'hadoop-3.4.0/lib/native/libhdfs.*'
   or
   curl -L "https://www.apache.org/dyn/closer.lua?action=download&filename=hadoop/common/hadoop-3.4.0/hadoop-3.4.0.tar.gz" | tar -xz --wildcards 'hadoop-3.4.0/lib/native/libhdfs.*'
   
   export ARROW_LIBHDFS_DIR=hadoop-3.4.0/lib/native/
   export CLASSPATH=$(ozone classpath ozone-tools)
   
   Add to /etc/hadoop/core-site.xml
   
   <configuration>
           <property>
                   <name>fs.defaultFS</name>
                   <value>ofs://om:9862</value>
                    <description>Where the Ozone Manager can be found on the network</description>
           </property>
   </configuration>
   
   
   Code:
   
   #!/usr/bin/python
   import pyarrow.fs as pafs
   
   # Create Hadoop FileSystem object
   fs = pafs.HadoopFileSystem("default", 9864)
   
   fs.create_dir("volume/bucket/aaa")
   
   path = "volume/bucket/file1"
   with fs.open_output_stream(path) as stream:
           stream.write(b'data')
   ```
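
   To sanity-check the write, a read-back step can follow the recipe. This is a hedged sketch, not part of the verified steps: the `ofs_path` and `read_back` helper names are hypothetical, and the connection arguments and environment variables mirror the PyArrow recipe above.

```python
def ofs_path(volume, bucket, *parts):
    """Join an Ozone volume, bucket, and optional key parts into one path."""
    return "/".join([volume, bucket, *parts])

def read_back(fs, path):
    """Return the full contents of the file at `path` as bytes."""
    with fs.open_input_stream(path) as stream:
        return stream.read()

if __name__ == "__main__":
    # pyarrow is only needed for the live call, so it is imported here;
    # ARROW_LIBHDFS_DIR and CLASSPATH must be set as in the recipe above.
    import pyarrow.fs as pafs
    fs = pafs.HadoopFileSystem("default", 9864)
    data = read_back(fs, ofs_path("volume", "bucket", "file1"))
    print(data)
```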
   
   The Boto3 recipe draft was generated using the following ChatGPT 4o prompt:
   ```
   Following the similar PyArrow using Ozone Docker image tutorial, create a 
similar one for boto3 using the following instructions:
   
   ozone sh bucket create s3v/bucket
   
   
   
   Code
   
   #!/usr/bin/python
   import boto3
   
   # Create a local file to upload
   with open("localfile.txt", "w") as f:
       f.write("Hello from Ozone via Boto3!\n")
   
   # Configure Boto3 client
   s3 = boto3.client(
       's3',
       endpoint_url='http://weichiu-s3g-1:9878',
       aws_access_key_id='ozone-access-key',
       aws_secret_access_key='ozone-secret-key'
   )
   
   # List buckets
   response = s3.list_buckets()
   print(response['Buckets'])
   
   # Upload a file
   s3.upload_file('localfile.txt', 'bucket', 'file.txt')
   
   # Download a file
   s3.download_file('bucket', 'file.txt', 'downloaded.txt')
   ```
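
   A hedged sketch of reading the uploaded object back could complement the recipe. The helper names (`ozone_s3_endpoint`, `make_s3_client`) are mine, not part of the verified steps; the host name, bucket, and credentials mirror the Boto3 recipe above.

```python
def ozone_s3_endpoint(host, port=9878, secure=False):
    """Build the Ozone S3 Gateway endpoint URL (the gateway listens on 9878 by default)."""
    scheme = "https" if secure else "http"
    return f"{scheme}://{host}:{port}"

def make_s3_client(endpoint, access_key, secret_key):
    """Create a Boto3 S3 client pointed at an Ozone S3 Gateway."""
    import boto3  # imported here so the pure helper above works without boto3 installed
    return boto3.client(
        "s3",
        endpoint_url=endpoint,
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
    )

if __name__ == "__main__":
    s3 = make_s3_client(
        ozone_s3_endpoint("weichiu-s3g-1"),
        "ozone-access-key",
        "ozone-secret-key",
    )
    # Read back the object uploaded in the recipe and print its contents.
    body = s3.get_object(Bucket="bucket", Key="file.txt")["Body"].read()
    print(body.decode())
```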
   
   The HTTPFS recipe draft was generated using ChatGPT 4o with the prompt:
   
   ```
   Use the following instructions to create a tutorial of accessing Ozone using 
HttpFS REST API via requests library
   
   Ozone httpfs using Python requests
   
   
   
   # Download the latest Docker Compose configuration file
   curl -O https://raw.githubusercontent.com/apache/ozone-docker/refs/heads/latest/docker-compose.yaml

   add the following to the environment section in docker-compose.yaml:
      CORE-SITE.XML_fs.defaultFS: "ofs://om"
      CORE-SITE.XML_hadoop.proxyuser.hadoop.hosts: "*"
      CORE-SITE.XML_hadoop.proxyuser.hadoop.groups: "*"
   
   docker compose up -d --scale datanode=3
   
   
   connect to the SCM container:
   
   
   
   ozone sh volume create vol1 
   ozone sh bucket create vol1/bucket1
   
   pip install requests
   
   #!/usr/bin/python
   import requests
   
   # Ozone HTTPFS endpoint and file path
   host = "http://weichiu-httpfs-1:14000"
   volume = "vol1"
   bucket = "bucket1"
   filename = "hello.txt"
   path = f"/webhdfs/v1/{volume}/{bucket}/{filename}"
   user = "ozone"  # can be any value in simple auth mode
   
   # Step 1: Initiate file creation (responds with 307 redirect)
   params_create = {
       "op": "CREATE",
       "overwrite": "true",
       "user.name": user
   }
   
   print("Creating file...")
   resp_create = requests.put(host + path, params=params_create, allow_redirects=False)
   
   if resp_create.status_code != 307:
       print(f"Unexpected response: {resp_create.status_code}")
       print(resp_create.text)
       exit(1)
   
   redirect_url = resp_create.headers['Location']
   print(f"Redirected to: {redirect_url}")
   
   # Step 2: Write data to the redirected location with correct headers
   headers = {"Content-Type": "application/octet-stream"}
   content = b"Hello from Ozone HTTPFS!\n"
   
   resp_upload = requests.put(redirect_url, data=content, headers=headers)
   if resp_upload.status_code != 201:
       print(f"Upload failed: {resp_upload.status_code}")
       print(resp_upload.text)
       exit(1)
   print("File created successfully.")
   
   # Step 3: Read the file back
   params_open = {
       "op": "OPEN",
       "user.name": user
   }
   
   print("Reading file...")
   resp_read = requests.get(host + path, params=params_open, allow_redirects=True)
   if resp_read.ok:
       print("File contents:")
       print(resp_read.text)
   else:
       print(f"Read failed: {resp_read.status_code}")
       print(resp_read.text)
   ```
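
   The URL construction in the recipe could also be factored into a small helper. This is a hedged sketch: `webhdfs_request` is a hypothetical helper name, and the host, volume, and bucket values mirror the HTTPFS recipe above. `GETFILESTATUS` is a standard WebHDFS operation for fetching file metadata.

```python
def webhdfs_request(host, volume, bucket, name, op, user):
    """Build the URL and query parameters for a WebHDFS/HttpFS operation."""
    url = f"{host}/webhdfs/v1/{volume}/{bucket}/{name}"
    params = {"op": op, "user.name": user}
    return url, params

if __name__ == "__main__":
    import requests  # only needed for the live call against the cluster
    # Fetch the metadata of the file created in the recipe above.
    url, params = webhdfs_request(
        "http://weichiu-httpfs-1:14000",
        "vol1", "bucket1", "hello.txt",
        "GETFILESTATUS", "ozone",
    )
    resp = requests.get(url, params=params)
    print(resp.json())
```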
   
   Note: the draft initially included PySpark content. Due to its length, I decided to leave it out and will work on it in a follow-up task.
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/HDDS-13165?filter=-1
   
   ## How was this patch tested?
   
   After Gemini/ChatGPT generated the user doc draft, I manually followed the 
code samples and verified the steps.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

