Re: [I] task(dataset): Redirect multipart upload through File Service [texera]

via GitHub Tue, 10 Mar 2026 22:40:46 -0700


xuang7 commented on issue #4110:
URL: https://github.com/apache/texera/issues/4110#issuecomment-4036622761


   @AnzhiZhang Thanks for taking the time to share such a thorough analysis. 
Here are some responses based on my understanding.
   
   1. On the threat model: You raise a fair point that authenticated users are 
generally trusted. That said, the presigned URL itself carries no user identity 
or size restriction. It only limits the target path and expiration time. Anyone 
who obtains the URL could upload files without Texera authentication, so that's 
part of the reason for adding another layer. The probability is low, but we 
felt the attack surface was worth addressing.
   2. On post-upload validation: This is a great suggestion, and we did 
consider it in the first iteration of file size enforcement. The concern was 
that by the time the finish step runs, a malicious user could have already 
consumed significant storage and bandwidth by uploading a very large file chunk 
by chunk. By proxying through `File Service`, we're able to enforce limits in 
real time during the upload rather than cleaning up after the fact.
   3. On `listParts()`: Agreed that `listParts()` is lightweight and useful for 
verification. There was some earlier discussion online around potential 
consistency issues with relying on it too heavily, so we chose not to build 
deeper implementation logic around it. It is still a useful function that could 
help reduce redundant work in certain scenarios. The newly introduced 
`upload session/part table` also enables resumable uploads with more control 
than relying on `listParts()` alone, which was another motivation for tracking 
part state server-side.
   4. On the watcher approach: This is an interesting alternative and 
definitely worth considering. That said, it introduces additional complexity, 
and during the polling gap, a malicious upload could still consume substantial 
resources.
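   To make points 2 and 3 concrete, here is a rough sketch of the idea, not 
the actual `File Service` code. It assumes a per-session byte cap and an 
in-memory stand-in for the session/part table; `UploadSession`, `accept_part`, 
`missing_parts`, and `UploadLimitExceeded` are hypothetical names used only for 
illustration.

```python
from dataclasses import dataclass, field


class UploadLimitExceeded(Exception):
    """Raised as soon as a session's running byte total would exceed its cap."""


@dataclass
class UploadSession:
    # Hypothetical stand-in for a row in the upload session table.
    max_bytes: int
    num_parts: int
    # Hypothetical stand-in for the part table: part number -> bytes accepted.
    parts: dict = field(default_factory=dict)

    def total_bytes(self) -> int:
        return sum(self.parts.values())

    def accept_part(self, part_number: int, chunks) -> int:
        """Stream one part, counting bytes and aborting once the cap is crossed."""
        if not 1 <= part_number <= self.num_parts:
            raise ValueError(f"unknown part number {part_number}")
        # Re-uploading a part replaces it, so exclude its old size from the total.
        already = self.total_bytes() - self.parts.get(part_number, 0)
        received = 0
        for chunk in chunks:
            received += len(chunk)
            if already + received > self.max_bytes:
                # Reject mid-stream; nothing is recorded for this part.
                raise UploadLimitExceeded(
                    f"part {part_number} would exceed {self.max_bytes} bytes"
                )
            # A real proxy would forward `chunk` to object storage here.
        self.parts[part_number] = received
        return received

    def missing_parts(self) -> list:
        """Parts a client still has to send when resuming an upload."""
        return [n for n in range(1, self.num_parts + 1) if n not in self.parts]


session = UploadSession(max_bytes=10, num_parts=3)
session.accept_part(1, [b"abcd"])
session.accept_part(3, [b"ef"])
print(session.missing_parts())  # prints [2]
```

   The check runs on every chunk, so an oversized upload is cut off while it 
is still streaming instead of being discovered at the finish step, and the 
recorded part sizes tell a resuming client which parts are still missing 
without calling `listParts()`.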
   
   We understand these are tradeoffs, and your alternative approaches are 
certainly reasonable depending on the deployment context. In our initial 
testing with files in the hundreds-of-GB range, the performance impact has been 
acceptable so far, but we plan to do more testing on the server to be sure. 
This is definitely something we can revisit if upload concurrency becomes a 
bottleneck in production.
   
   @carloea2 please feel free to correct me if anything is off. Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
