paleolimbot commented on code in PR #44: URL: https://github.com/apache/arrow-experiments/pull/44#discussion_r1863870656
########## http/get_indirect/python/client/README.md: ########## @@ -0,0 +1,32 @@ +<!--- + Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. +--> + +# HTTP GET Arrow Data: Indirect Python Client Example with `requests` + +This directory contains an example of an HTTP client implemented in Python using the built-in [`urllib.request`](https://docs.python.org/3/library/urllib.request.html) module. The client: +1. Sends a GET request to the server to get a JSON listing of the filenames of available `.arrows` files. +2. Sends GET requests to download each of the `.arrows` files from the server. +3. Loads the contents of each file into an in-memory PyArrow Table. + Review Comment: It would be nice to see the actual JSON here. From a mental parse of the client I think it's supposed to be: ```json { "arrow_stream_files": ["foofy1.arrows", "foofy2.arrows", ...] } ``` If I were trying to implement an efficient scanner, I would actually want a schema and a number of rows right out of the gate: ```json { "data" : { "arrow_schema": "foofy_schema.arrows", "arrow_streams": ["foofy1.stream", "foofy2.stream"], "arrow_statistics": "foofy_stats.arrows" // row_count is probably sufficient for most cases } } ``` I would also want those to be full URIs so my client doesn't have to guess whether the content is an absolute path, a relative path, or a URI. This also opens the door for things like the `arrow_schema` to be a data URI (`data:application/vnd.apache.arrow.stream;base64,...`), which I believe `requests.get()` can handle. (I know we've chatted about not inlining data into these JSON responses, but avoiding client/server roundtrips is a real concern that pretty much every database programmer I've ever talked to has, so it might be worth addressing this). All of this is getting dangerously close to standardizing that protocol (which would be valuable, but maybe not what you are trying to do). ########## http/get_indirect/python/client/client.py: ########## @@ -0,0 +1,50 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +import requests +import json +import os +import pyarrow as pa + + +HOST = "http://localhost:8008/" + +JSON_FORMAT = "application/json" +ARROW_STREAM_FORMAT = "application/vnd.apache.arrow.stream" + +json_response = requests.get(HOST) + +if json_response.status_code == 200 and JSON_FORMAT in json_response.headers.get('Content-Type', ''): Review Comment: optional nit: As a matter of personal style, I would probably write this as: ```python if json_response.status_code == 200: raise ValueError("Unexpected status") if JSON_FORMAT in json_response.headers.get('Content-Type', ''): raise ValueError("Unexpected Content-Type") ``` ########## http/get_indirect/python/client/client.py: ########## @@ -0,0 +1,50 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +import requests +import json +import os +import pyarrow as pa + + +HOST = "http://localhost:8008/" + +JSON_FORMAT = "application/json" +ARROW_STREAM_FORMAT = "application/vnd.apache.arrow.stream" + +json_response = requests.get(HOST) + +if json_response.status_code == 200 and JSON_FORMAT in json_response.headers.get('Content-Type', ''): + parsed_data = json_response.json() + filenames = [file['filename'] for file in parsed_data['arrow_stream_files']] + uris = [HOST + filename for filename in filenames] + print("Downloaded and parsed JSON file listing. Found " + str(len(filenames)) + " files.") +else: + raise ValueError("Unexpected Content-Type") + +tables = {} + +for uri in uris: Review Comment: Your curl example has downloading in parallel...I'm not sure you want to add that indirection here, but if you do, you could use a `concurrent.futures.ThreadPoolExecutor()` or `ProcessPoolExecutor()` (the example in the documentation is pretty close to what you are doing here: https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor-example ). ########## http/get_indirect/python/server/server.py: ########## @@ -0,0 +1,50 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +from http.server import SimpleHTTPRequestHandler, HTTPServer +import json +import os +import mimetypes + +mimetypes.add_type("application/vnd.apache.arrow.stream", ".arrows") + +class MyServer(SimpleHTTPRequestHandler): + def list_directory(self, path): + try: + dir_list = [ + f for f in os.listdir(path) + if f.endswith(".arrows") and os.path.isfile(os.path.join(path, f)) + ] + except OSError: + self.send_error(404, "No permission to list directory") + return None + + file_list = {"arrow_stream_files": [{"filename": f} for f in dir_list]} Review Comment: `headers.get('Host')` would get you the host here sufficient for example purposes if you do go with the "full URI" suggestion above (no worries if you don't!). (In practice it's a little more complicated than that because this server could be sitting behind a reverse proxy like nginx and I think it's a slightly different header field). ########## http/get_indirect/python/client/README.md: ########## @@ -0,0 +1,32 @@ +<!--- + Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. +--> + +# HTTP GET Arrow Data: Indirect Python Client Example with `requests` + +This directory contains an example of an HTTP client implemented in Python using the built-in [`urllib.request`](https://docs.python.org/3/library/urllib.request.html) module. The client: Review Comment: I see below you're using `requests`, which is not the built in `urllib.request`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
