Re: [PR] [http] Add indirect HTTP GET examples [arrow-experiments]

via GitHub Fri, 29 Nov 2024 11:26:27 -0800


paleolimbot commented on code in PR #44:
URL: https://github.com/apache/arrow-experiments/pull/44#discussion_r1863870656



##########
http/get_indirect/python/client/README.md:
##########
@@ -0,0 +1,32 @@
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
+# HTTP GET Arrow Data: Indirect Python Client Example with `requests`
+
+This directory contains an example of an HTTP client implemented in Python 
using the built-in 
[`urllib.request`](https://docs.python.org/3/library/urllib.request.html) 
module. The client:
+1. Sends a GET request to the server to get a JSON listing of the filenames of 
available `.arrows` files.
+2. Sends GET requests to download each of the `.arrows` files from the server.
+3. Loads the contents of each file into an in-memory PyArrow Table.
+

Review Comment:
   It would be nice to see the actual JSON here. From a mental parse of the 
client I think it's supposed to be:
   
   ```json
   {
     "arrow_stream_files": ["foofy1.arrows", "foofy2.arrows", ...]
   }
   ```
   
   If I were trying to implement an efficient scanner, I would actually want a 
schema and a number of rows right out of the gate:
   
   ```json
   {
     "data" : {
       "arrow_schema": "foofy_schema.arrows",
       "arrow_streams": ["foofy1.stream", "foofy2.stream"],
       "arrow_statistics": "foofy_stats.arrows" // row_count is probably 
sufficient for most cases
     }
   }
   ```
   
   I would also want those to be full URIs so my client doesn't have to guess 
whether the content is an absolute path, a relative path, or a URI. This also 
opens the door for things like the `arrow_schema` to be a data URI 
(`data:application/vnd.apache.arrow.stream;base64,...`), which I believe 
`requests.get()` can handle.
   
   (I know we've chatted about not inlining data into these JSON responses, but 
avoiding client/server roundtrips is a real concern that pretty much every 
database programmer I've ever talked to has, so it might be worth addressing 
this).
   
   All of this is getting dangerously close to standardizing that protocol 
(which would be valuable, but maybe not what you are trying to do).



##########
http/get_indirect/python/client/client.py:
##########
@@ -0,0 +1,50 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import requests
+import json
+import os
+import pyarrow as pa
+
+
+HOST = "http://localhost:8008/";
+
+JSON_FORMAT = "application/json"
+ARROW_STREAM_FORMAT = "application/vnd.apache.arrow.stream"
+
+json_response = requests.get(HOST)
+
+if json_response.status_code == 200 and JSON_FORMAT in 
json_response.headers.get('Content-Type', ''):

Review Comment:
   optional nit: As a matter of personal style, I would probably write this as:
   
   ```python
   if json_response.status_code == 200:
       raise ValueError("Unexpected status")
   if JSON_FORMAT in json_response.headers.get('Content-Type', ''):
       raise ValueError("Unexpected Content-Type")
   ```



##########
http/get_indirect/python/client/client.py:
##########
@@ -0,0 +1,50 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import requests
+import json
+import os
+import pyarrow as pa
+
+
+HOST = "http://localhost:8008/";
+
+JSON_FORMAT = "application/json"
+ARROW_STREAM_FORMAT = "application/vnd.apache.arrow.stream"
+
+json_response = requests.get(HOST)
+
+if json_response.status_code == 200 and JSON_FORMAT in 
json_response.headers.get('Content-Type', ''):
+    parsed_data = json_response.json()
+    filenames = [file['filename'] for file in 
parsed_data['arrow_stream_files']]
+    uris = [HOST + filename for filename in filenames]
+    print("Downloaded and parsed JSON file listing. Found " + 
str(len(filenames)) + " files.")
+else:
+    raise ValueError("Unexpected Content-Type")
+
+tables = {}
+
+for uri in uris:

Review Comment:
   Your curl example has downloading in parallel...I'm not sure you want to add 
that indirection here, but if you do, you could use a 
`concurrent.futures.ThreadPoolExecutor()` or `ProcessPoolExecutor()` (the 
example in the documentation is pretty close to what you are doing here: 
https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor-example
 ).



##########
http/get_indirect/python/server/server.py:
##########
@@ -0,0 +1,50 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from http.server import SimpleHTTPRequestHandler, HTTPServer
+import json
+import os
+import mimetypes
+
+mimetypes.add_type("application/vnd.apache.arrow.stream", ".arrows")
+
+class MyServer(SimpleHTTPRequestHandler):
+    def list_directory(self, path):
+        try:
+            dir_list = [
+                f for f in os.listdir(path)
+                if f.endswith(".arrows") and os.path.isfile(os.path.join(path, 
f))
+            ]
+        except OSError:
+            self.send_error(404, "No permission to list directory")
+            return None
+
+        file_list = {"arrow_stream_files": [{"filename": f} for f in dir_list]}

Review Comment:
   `headers.get('Host')` would get you the host here sufficient for example 
purposes if you do go with the "full URI" suggestion above (no worries if you 
don't!). (In practice it's a little more complicated than that because this 
server could be sitting behind a reverse proxy like nginx and I think it's a 
slightly different header field).



##########
http/get_indirect/python/client/README.md:
##########
@@ -0,0 +1,32 @@
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
+# HTTP GET Arrow Data: Indirect Python Client Example with `requests`
+
+This directory contains an example of an HTTP client implemented in Python 
using the built-in 
[`urllib.request`](https://docs.python.org/3/library/urllib.request.html) 
module. The client:

Review Comment:
   I see below you're using `requests`, which is not the built in 
`urllib.request`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] [http] Add indirect HTTP GET examples [arrow-experiments]

Reply via email to