xuang7 commented on issue #4110: URL: https://github.com/apache/texera/issues/4110#issuecomment-4036622761
@AnzhiZhang Thanks for taking the time to share such a thorough analysis. Here are some responses based on my understanding.

1. On the threat model: You raise a fair point that authenticated users are generally trusted. That said, the presigned URL itself carries no user identity or size restriction. It only limits the target path and expiration time. Anyone who obtains the URL could upload files without Texera authentication, so that's part of the reason for adding another layer. The probability is low, but we felt the attack surface was worth addressing.

2. On post-upload validation: This is a great suggestion, and we did consider it in the first iteration of file size enforcement. The concern was that by the time the finish step runs, a malicious user could have already consumed significant storage and bandwidth by uploading a very large file chunk by chunk. By proxying through `File Service`, we're able to enforce limits in real time during the upload rather than cleaning up after the fact.

3. On `listParts()`: Agreed that `listParts()` is lightweight and useful for verification. There was some earlier discussion online around potential consistency issues with relying on it too heavily, so we chose not to build deeper implementation logic around it. It is a useful function that could help reduce redundant work in certain scenarios. The newly introduced `upload session/part table` also enables resumable uploads with more control compared to relying on `listParts()`, which was another motivation for tracking part state server-side.

4. On the watcher approach: This is an interesting alternative and definitely worth considering. That said, it introduces additional complexity, and during the polling gap, a malicious upload could still consume substantial resources.

We understand these are tradeoffs, and your alternative approaches are certainly reasonable depending on the deployment context.
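To make point 2 more concrete, here is a minimal sketch of what enforcing the limit during the upload could look like. It is not the actual `File Service` code: `copy_with_limit`, `UploadTooLarge`, and the parameter names are hypothetical, and in-memory streams stand in for the request body and the storage backend. The idea is only that the proxy counts bytes as it forwards them and stops as soon as the running total for the session passes the limit.

```python
import io


class UploadTooLarge(Exception):
    """Raised as soon as an upload session exceeds its allowed size (hypothetical name)."""


def copy_with_limit(source, sink, bytes_already_uploaded, max_total_bytes,
                    buffer_size=64 * 1024):
    """Stream one part from source to sink while enforcing a session-wide size cap.

    bytes_already_uploaded is the total from earlier parts of the same session.
    Returns the new session total in bytes.
    """
    total = bytes_already_uploaded
    while True:
        chunk = source.read(buffer_size)
        if not chunk:
            return total
        total += len(chunk)
        if total > max_total_bytes:
            # Abort before forwarding the chunk that crosses the limit.
            raise UploadTooLarge(f"upload exceeds {max_total_bytes} bytes")
        sink.write(chunk)


# Usage: two 100-byte parts against a 150-byte session limit.
storage = io.BytesIO()  # stand-in for the storage backend
session_total = copy_with_limit(io.BytesIO(b"a" * 100), storage, 0, 150)
try:
    copy_with_limit(io.BytesIO(b"b" * 100), storage, session_total, 150)
except UploadTooLarge as err:
    print("rejected:", err)
```

With post-upload validation the second part would be stored in full and only removed at the finish step. In this sketch the proxy never forwards more than the limit, at the cost of routing every byte through the service.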
In our initial testing with files in the hundreds-of-GB range, the performance impact has been reasonable so far, but we plan to do more testing on the server to be sure. This is definitely something we can revisit if upload concurrency becomes a bottleneck in production.

@carloea2 please feel free to correct me if anything is off. Thanks.
