joellubi commented on code in PR #46194:
URL: https://github.com/apache/arrow/pull/46194#discussion_r2062699728


##########
docs/source/format/Flight.rst:
##########
@@ -369,6 +369,61 @@ string, so the obvious candidates are not compatible.  The chosen
 representation can be parsed by both implementations, as well as Go's
 ``net/url`` and Python's ``urllib.parse``.
 
+Extended Location URIs
+----------------------
+
+In addition to alternative transports, a server may also return
+URIs that reference an external service or object storage location.
+This can be useful in cases where intermediate data is cached as
+Apache Parquet files on S3 or is accessible via an HTTP service. In
+these scenarios, it is more efficient to be able to provide a URI
+where the client may simply download the data directly, rather than
+requiring a Flight service to read it back into memory and serve it
+from a ``DoGet`` request. Servers should use the following URI
+schemes for this situation:
+
++--------------------+------------------------+
+| Location           | URI Scheme             |
++====================+========================+
+| Object storage (1) | s3:, gcs:, abfs:, etc. |

Review Comment:
   It seems like handling the different URI standards and implementations for cloud providers forces a tradeoff between:
   - Heavy burden for implementations to support all protocols, so that clients can remain universal, or...
   - No expectation for any particular scheme to be supported, but then interoperability suffers
   
   I wonder if we're approaching this in the best way. We're introducing variability for **protocol** (e.g. `s3`, `gs`) and **format** (e.g. `arrow`, `parquet`), but I don't think that `s3` and `gs` are actually different protocols. These object stores ultimately expose REST APIs, so I think the only **protocols** we're really adding here are HTTP/HTTPS.
   
   I think that if we subtly adjust the proposal we can increase the simplicity and interoperability for clients substantially. Specifically, I would propose that servers respond with `http://` or `https://` URIs, from which the client can retrieve results simply with a GET request. Auth is handled via presigned URLs or mediated separately. We would continue to use `Accept` and `Content-Type` to mediate content negotiation for the data format, which works well with simple HTTP.
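   
   As a rough client-side sketch under those assumptions (the URL below is a placeholder, and the Arrow Go module path may differ by version; nothing here is mandated by the proposal):
   
   ```go
   package main
   
   import (
   	"fmt"
   	"log"
   	"net/http"
   
   	"github.com/apache/arrow-go/v18/arrow/ipc"
   )
   
   // Hypothetical presigned HTTPS location returned by the Flight service.
   const resultURI = "https://example.com/results/part-0"
   
   func main() {
   	req, err := http.NewRequest(http.MethodGet, resultURI, nil)
   	if err != nil {
   		log.Fatal(err)
   	}
   	// Content negotiation: ask for an Arrow IPC stream; a server could just as
   	// well serve Parquet when the client asks for that instead.
   	req.Header.Set("Accept", "application/vnd.apache.arrow.stream")
   
   	resp, err := http.DefaultClient.Do(req)
   	if err != nil {
   		log.Fatal(err)
   	}
   	defer resp.Body.Close()
   	if resp.StatusCode != http.StatusOK {
   		log.Fatalf("unexpected status: %s", resp.Status)
   	}
   
   	// The response body is a plain Arrow IPC stream; no cloud-vendor SDK needed.
   	rdr, err := ipc.NewReader(resp.Body)
   	if err != nil {
   		log.Fatal(err)
   	}
   	defer rdr.Release()
   
   	for rdr.Next() {
   		fmt.Println("rows:", rdr.Record().NumRows())
   	}
   	if err := rdr.Err(); err != nil {
   		log.Fatal(err)
   	}
   }
   ```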
   
   In this case, only the server would need to know the semantics of the object store APIs it supports. For example, instead of responding with `s3://amzn-s3-demo-bucket/test2.arrow`, the server would be responsible for constructing the full URL `https://amzn-s3-demo-bucket.s3.us-west-2.amazonaws.com/...`.
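   
   For instance, a minimal sketch of how a server might construct such a URL for S3 (assuming the AWS SDK for Go v2 presigner; the bucket, key, and expiry are placeholders):
   
   ```go
   package main
   
   import (
   	"context"
   	"fmt"
   	"log"
   	"time"
   
   	"github.com/aws/aws-sdk-go-v2/aws"
   	"github.com/aws/aws-sdk-go-v2/config"
   	"github.com/aws/aws-sdk-go-v2/service/s3"
   )
   
   // presignedLocation turns an internal object reference into a plain HTTPS URL
   // that any generic client can GET, so the s3:// scheme never leaves the server.
   func presignedLocation(ctx context.Context, bucket, key string) (string, error) {
   	cfg, err := config.LoadDefaultConfig(ctx)
   	if err != nil {
   		return "", err
   	}
   	presigner := s3.NewPresignClient(s3.NewFromConfig(cfg))
   	req, err := presigner.PresignGetObject(ctx, &s3.GetObjectInput{
   		Bucket: aws.String(bucket),
   		Key:    aws.String(key),
   	}, s3.WithPresignExpires(15*time.Minute))
   	if err != nil {
   		return "", err
   	}
   	// e.g. https://amzn-s3-demo-bucket.s3.us-west-2.amazonaws.com/test2.arrow?X-Amz-...
   	return req.URL, nil
   }
   
   func main() {
   	url, err := presignedLocation(context.Background(), "amzn-s3-demo-bucket", "test2.arrow")
   	if err != nil {
   		log.Fatal(err)
   	}
   	fmt.Println(url)
   }
   ```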
   
   IMO this keeps us closer to "generic clients" but still provides server implementations with great flexibility for content delivery. It actually gives servers more control to set specific URI parameters, and prevents clients (or Arrow maintainers) from having to handle the complexity of external cloud vendors.


