AnzhiZhang commented on issue #4110:
URL: https://github.com/apache/texera/issues/4110#issuecomment-4039089264

   @xuang7 Thanks for the prompt reply and the additional context on the 
tradeoffs and the performance testing results. A few follow-up thoughts:
   
   1. On presigned URL security: Presigned URLs are cryptographically signed 
and cannot be forged, and we only issue them to authenticated users. 
The window of exposure is also limited by the expiration time. So the risk of 
someone obtaining a usable URL without Texera authentication is low.
   
   2. On the watcher approach and bandwidth consumption: Agree that a watcher 
adds some complexity, but given a 50 MB chunk size and a polling interval of, 
e.g., once per minute, the maximum wasted bandwidth before detection is bounded 
and modest. This seems like a reasonable tradeoff compared to proxying 
everything.
   
   3. On `listParts()` and resumable uploads: I saw some consistency concerns 
in some discussions as well. On resumability, S3 multipart upload natively 
supports resuming by design. As long as the `uploadId` is retained, a client 
can pick up where it left off without any server-managed session. As for upload 
progress tracking, it is only meaningful to the uploader and perhaps future 
admin users, both of whom are well-intentioned. Client-reported progress can be 
trusted for this purpose, so server-side part tracking isn't strictly necessary 
for that use case. That said, the session table is a reasonable design that 
could benefit future admin visibility, and could also support a future 
migration back to direct upload if needed.
   
   4. On making proxy an optional configuration: One design worth considering 
is treating direct upload as the default and server-side proxy as an option. A 
global config could allow fine-grained control, for example enabling proxy 
only for untrusted user tiers, or toggling it site-wide. I assume Texera is 
often self-hosted in intranet data science computing environments that have 
no untrusted users and where CPU and memory are more constrained. For those 
deployments, direct upload is clearly the better tradeoff and makes full use of 
what presigned URLs are designed for.
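
   To make the resumability point in item 3 concrete, here is a minimal Python sketch of how a client that kept its `uploadId` could work out which parts are still missing after an interruption. The function name `remaining_parts` and its parameters are hypothetical, not Texera or S3 APIs. The set of completed part numbers is assumed to come from the client's own record (or from a `ListParts` call), and part numbers are 1-based as in S3.

```python
from __future__ import annotations

import math


def remaining_parts(
    file_size: int, part_size: int, uploaded: set[int]
) -> list[tuple[int, int, int]]:
    """Return (part_number, byte_offset, byte_length) for every part not yet uploaded.

    Hypothetical helper for illustration only; `uploaded` is the set of
    part numbers the client already knows were stored successfully.
    """
    total_parts = math.ceil(file_size / part_size)
    todo = []
    for part_number in range(1, total_parts + 1):
        if part_number in uploaded:
            continue  # already stored under the retained uploadId
        offset = (part_number - 1) * part_size
        # The last part may be shorter than part_size.
        length = min(part_size, file_size - offset)
        todo.append((part_number, offset, length))
    return todo
```

Each returned tuple is a byte range the client would upload next, for example through a freshly issued presigned URL for that part number.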
   
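   The optional-proxy idea in item 4 could look roughly like the following Python sketch. Everything here is an assumption for illustration: the `UploadConfig` fields, the tier names, and the `upload_mode` function do not exist in Texera today.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class UploadConfig:
    """Hypothetical global upload settings; direct upload is the default."""

    proxy_site_wide: bool = False  # force the server-side proxy for everyone
    proxy_tiers: frozenset[str] = frozenset()  # tiers that must use the proxy


def upload_mode(cfg: UploadConfig, user_tier: str) -> str:
    """Return "proxy" or "direct" for a user in the given tier."""
    if cfg.proxy_site_wide or user_tier in cfg.proxy_tiers:
        return "proxy"
    return "direct"
```

With the defaults, an intranet deployment gets direct upload for every user, while a public deployment could set `proxy_tiers=frozenset({"guest"})` to proxy only its untrusted tier.
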
   If there are prior discussions or constraints I'm not aware of, happy to 
hear about them. Hope this is useful if the design is revisited down the line.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
