AnzhiZhang commented on issue #4110:
URL: https://github.com/apache/texera/issues/4110#issuecomment-4039089264

@xuang7 Thanks for the prompt reply and the additional context on the tradeoffs and the performance testing results. A few follow-up thoughts:

1. On presigned URL security: Presigned URLs are cryptographically signed and cannot be forged, and we only issue them to authenticated users. The window of exposure is also limited by the expiration time. So the risk of someone obtaining a usable URL without Texera authentication is low.

2. On the watcher approach and bandwidth consumption: Agreed that a watcher adds some complexity, but given a 50 MB chunk size and a polling interval of, e.g., once per minute, the maximum wasted bandwidth before detection is bounded and modest. This seems like a reasonable tradeoff compared to proxying everything.

3. On `listParts()` and resumable uploads: I have seen consistency concerns raised in some discussions as well. On resumability, S3 multipart upload supports resuming by design: as long as the `uploadId` is retained, a client can pick up where it left off without any server-managed session. As for upload progress tracking, it is only meaningful to the uploader and perhaps future admin users, both of whom are well-intentioned. Client-reported progress can be trusted for this purpose, so server-side part tracking isn't strictly necessary for that use case. That said, the session table is a reasonable design that could benefit future admin visibility, and could also support a future migration back to direct upload if needed.

4. On making proxy an optional configuration: One design worth considering is treating direct upload as the default and server-side proxy as an option. A global config could allow fine-grained control, for example enabling proxy only for untrusted user tiers, or toggling it site-wide. I assume Texera is often self-hosted in intranet data science computing environments, where there are no untrusted users and CPU and memory are more constrained.
For those deployments, direct upload is clearly the better tradeoff and makes full use of what presigned URLs are designed for.

If there are prior discussions or constraints I'm not aware of, I'm happy to hear about them. Hope this is useful if the design is revisited down the line.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
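
To make the resumability argument in point 3 concrete, here is a minimal sketch (not Texera code) of how a client could work out what is left to upload once it knows which parts have already completed. The names `remaining_parts`, `PartRange`, and `PART_SIZE` are hypothetical, and the 50 MB part size is simply the chunk size from point 2. The completed part numbers are assumed to come either from the client's own record or from a `listParts()` call against the retained `uploadId`.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Assumption: 50 MB parts, matching the chunk size discussed in point 2.
PART_SIZE = 50 * 1024 * 1024


@dataclass(frozen=True)
class PartRange:
    part_number: int  # S3 part numbers start at 1
    offset: int       # byte offset into the local file
    length: int       # number of bytes to send for this part


def remaining_parts(file_size: int,
                    completed: Iterable[int],
                    part_size: int = PART_SIZE) -> List[PartRange]:
    """Return the parts still to upload, given the part numbers already done."""
    if file_size <= 0 or part_size <= 0:
        raise ValueError("file_size and part_size must be positive")
    done = set(completed)
    total_parts = (file_size + part_size - 1) // part_size  # ceiling division
    todo = []
    for number in range(1, total_parts + 1):
        if number in done:
            continue  # already uploaded under this uploadId, skip it
        offset = (number - 1) * part_size
        # The last part may be shorter than part_size.
        todo.append(PartRange(number, offset, min(part_size, file_size - offset)))
    return todo
```

For a 120 MB file where only part 1 finished before the interruption, `remaining_parts(120 * 1024 * 1024, [1])` yields parts 2 and 3, with part 3 covering the final 20 MB. The client would then request presigned URLs for just those part numbers, so no server-side session is needed to resume.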
