carloea2 commented on PR #4130:
URL: https://github.com/apache/texera/pull/4130#issuecomment-3640648605

   ---
   
   **Why I prefer backing multipart uploads with a DB**
   
   I am proposing to store multipart upload state in a DB instead of keeping 
everything fully stateless, for several reasons:
   
   1. **Clear concurrency guarantees per `(uploadId, partNumber)`**
   
      Both Huawei OBS and Amazon S3 allow multiple uploads to the same `uploadId` + `partNumber` and simply apply last-write-wins semantics. OBS is explicit about concurrent streams: its `UploadPart` API states that repeatedly uploading the same `partNumber` overwrites the previous upload, that concurrent uploads of the same `partNumber` of the same object are resolved last-write-wins on the server, and that clients should lock concurrent uploads of the same part on their side to ensure data accuracy. S3's multipart upload docs similarly say that uploading a new part with a previously used part number overwrites the earlier part. In other words, neither backend prevents concurrent writers to the same part; they just pick a winner.
   
      With a DB row keyed by `(upload_id, part_number)` we can use row-level 
locking to enforce “only one active stream can write this part” in a way that 
is explicit and portable, regardless of which storage backend we are using. 
This is useful both for well-behaved clients (we can give strong semantics) and 
for protecting ourselves from obvious waste (e.g., the same upload repeatedly 
opening the same part). In a multi-node/Kubernetes deployment, in-memory locks 
or sticky sessions are fragile or operationally complex; a shared DB is the 
simplest coordination point we already have. Without a DB we would need to 
complicate the API (for example `/uploadPart/{uploadToken}/{partNumber}` plus 
client-side coordination and disabled concurrency on that endpoint) and we 
still would not have a single canonical place where we enforce this rule. Note 
that this is complementary to, not a replacement for, rate limiting and abuse 
protection at other layers.
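
      The "only one active stream can write this part" rule can be sketched roughly as follows (schema and names are illustrative, not the actual Texera tables). On Postgres this would typically be `SELECT ... FOR UPDATE` inside the request's transaction; here sqlite emulates the same guarantee with an atomic conditional `UPDATE` whose rowcount tells us whether we won the claim:

```python
import sqlite3

# Hypothetical schema: one row per (upload_id, part_number).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE upload_part (
        upload_id   TEXT NOT NULL,
        part_number INTEGER NOT NULL,
        status      TEXT NOT NULL DEFAULT 'pending',  -- pending | in_progress | done
        PRIMARY KEY (upload_id, part_number)
    )
""")

def claim_part(conn, upload_id: str, part_number: int) -> bool:
    """Atomically claim a part so that only one stream may write it."""
    cur = conn.execute(
        "UPDATE upload_part SET status = 'in_progress' "
        "WHERE upload_id = ? AND part_number = ? AND status = 'pending'",
        (upload_id, part_number),
    )
    conn.commit()
    return cur.rowcount == 1  # 1 row updated -> we won the claim

conn.execute("INSERT INTO upload_part (upload_id, part_number) VALUES (?, ?)", ("u1", 1))
conn.commit()
print(claim_part(conn, "u1", 1))  # True: first writer wins the claim
print(claim_part(conn, "u1", 1))  # False: a concurrent/duplicate writer is rejected
```

      The same check works identically on any node in a Kubernetes deployment, since the DB is the single coordination point.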
   
   2. **Efficient and well-scoped listing, aligned with S3 guidance**
   
      If the client stores the raw `uploadId` and we rely purely on the 
S3-compatible APIs, completing an upload means calling `ListParts` or 
`ListMultipartUploads` and filtering:
   
      * `ListMultipartUploads` lists in-progress multipart uploads for the 
entire bucket, up to 1000 at a time.
      * `ListParts` lists the parts for a specific multipart upload, given its 
`uploadId`.
   
      In our lakeFS setup I observed that `ListMultipartUploads` behaves 
effectively repo-wide, and the prefix filtering we would like (for something 
like `/dataset-1/...`) does not give us the narrow, efficient listing we want. 
As more uploads accumulate under the same repo, listing gets slower and less 
predictable. On top of that, the S3 Developer Guide explicitly says that the 
multipart part listing is not supposed to be your source of truth for 
completing an upload and that you should maintain your own list of part numbers 
and ETags. A DB table is exactly that “own list”: an indexed, well-scoped 
record of parts per logical upload that we can query efficiently and 
deterministically, independent of how many uploads exist in the bucket or repo, 
and independent of quirks of the underlying S3/lakeFS implementation.
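
      A minimal sketch of that "own list" (table and names are illustrative): we record the ETag returned by each `UploadPart` call and later build the completion manifest from the table instead of calling `ListParts`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE upload_part (
        upload_id   TEXT NOT NULL,
        part_number INTEGER NOT NULL,
        etag        TEXT,
        PRIMARY KEY (upload_id, part_number)
    )
""")

def record_part(upload_id: str, part_number: int, etag: str) -> None:
    # Upsert keeps last-write-wins semantics consistent with the backend.
    conn.execute(
        "INSERT INTO upload_part (upload_id, part_number, etag) VALUES (?, ?, ?) "
        "ON CONFLICT (upload_id, part_number) DO UPDATE SET etag = excluded.etag",
        (upload_id, part_number, etag),
    )
    conn.commit()

def completion_manifest(upload_id: str) -> list[dict]:
    # The parts list CompleteMultipartUpload expects, scoped to one logical
    # upload via an indexed lookup, regardless of bucket-wide traffic.
    rows = conn.execute(
        "SELECT part_number, etag FROM upload_part "
        "WHERE upload_id = ? ORDER BY part_number",
        (upload_id,),
    ).fetchall()
    return [{"PartNumber": n, "ETag": e} for n, e in rows]

record_part("u1", 2, '"etag-2"')
record_part("u1", 1, '"etag-1"')
print(completion_manifest("u1"))
```

      The lookup cost depends only on the parts of this one upload, never on how many other multipart uploads exist in the repo.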
   
   3. **Opaque upload tokens vs exposing `uploadId` (and the encrypted token 
alternative)**
   
      I would like to avoid exposing the raw S3 `uploadId` directly to clients. 
It is a backend-specific identifier, and once clients depend on it we are 
coupled to a specific storage implementation and data model.
   
      There are two designs we could follow:
   
      * **Encrypted token approach, client as temporary storage**
        We could encode `uploadId` and any metadata (path, repo, limits, 
timestamps, etc.) into an encrypted and signed token, give that to the client, 
and let the client act as our temporary storage. On each request, the client 
sends the token back, we decrypt and verify it, and recover the `uploadId`. 
This works, but it has tradeoffs:
   
        * The token grows as we pack more metadata inside.
        * Any schema change or new metadata field requires token versioning and 
migration logic.
        * Server-side changes such as aborting an upload, tightening limits, or 
invalidating uploads become harder, because the server no longer owns the 
authoritative state and must handle multiple token versions that may be “in the 
wild”.
   
      * **DB-backed opaque token (preferred)**
        With a DB, we can generate a short-lived opaque upload token that 
simply references a row where we store the real `uploadId` plus metadata. The 
client never sees `uploadId` directly, and the token itself can stay small and 
stable. All evolution (new columns, extra constraints, per-user quotas, flags, 
counters) happens in the DB without changing the client contract. Invalidating 
or aborting an upload is also as simple as flipping state in the DB.
   
      Both approaches keep `uploadId` hidden, but the DB-backed option is 
simpler to reason about, more flexible as we add features, and easier to revoke 
or invalidate than encoding everything into an ever-growing token.
   
   4. **Enforcing max file size or resource limits without hammering the DB**
   
      If we want a hard limit on the total bytes an upload can consume, we need 
to enforce that while data is streaming, not only at the very end when we call 
`CompleteMultipartUpload`. The multipart upload docs already recommend 
maintaining our own mapping from part numbers to ETags for completion; that 
same per-upload metadata is a natural place to store limits.
   
      My preferred approach is:
   
      * The frontend tells us the intended total file size and desired number 
of parts.
      * The backend computes a per-part limit `maxPartSize = ceil(fileSize / numParts)` (ceiling division, so that equal-sized parts still fit when `fileSize` does not divide evenly) and stores it in the DB as metadata for that upload.
      * When the client uploads part X, we require `Content-Length` and verify 
that it is less than or equal to `maxPartSize` for this upload before streaming 
to S3. We can also keep a local counter for the current request and immediately 
cut the stream once it exceeds `maxPartSize`, so the enforcement is truly 
“hard” from our perspective.
   
      This reduces DB work to O(number of parts) checks instead of O(number of chunks). For example, 1 GiB split into 10 parts of roughly 100 MiB each, streamed in 8 KiB chunks, would otherwise require 131 072 DB `addAndGet` calls if we tried to maintain a global accumulator per chunk. Here we do at most 10 DB-backed checks, because we only need to validate each part once. Since `maxPartSize` and any retry counters live server-side in the DB (not in a client-controlled token), clients cannot bump their own limit or reset their own attempt counters.
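
      The preferred flow above can be sketched as follows (names are illustrative). Ceiling division keeps equal-sized parts within budget even when the size does not divide evenly, and the local counter cuts the stream mid-request without any per-chunk DB traffic:

```python
import math

def max_part_size(file_size: int, num_parts: int) -> int:
    # Stored once in the DB as per-upload metadata.
    return math.ceil(file_size / num_parts)

class PartTooLarge(Exception):
    pass

def stream_part(chunks, limit: int) -> int:
    """Consume a part's chunk iterator, cutting the stream past `limit`."""
    seen = 0
    for chunk in chunks:
        seen += len(chunk)
        if seen > limit:          # local counter only; one DB check per part
            raise PartTooLarge(f"part exceeded {limit} bytes")
        # ... forward chunk to S3 here ...
    return seen

GiB, KiB = 1 << 30, 1 << 10
limit = max_part_size(1 * GiB, 10)           # ~100 MiB plus rounding
print(limit)                                  # 107374183
ok = stream_part(iter([b"x" * (8 * KiB)] * 4), limit)
print(ok)                                     # 32768: well under the limit
```

      Combined with the upfront `Content-Length` check, a misbehaving client can waste at most one part's worth of bandwidth before we cut it off.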
   
      **If we do not have `fileSize` and `numParts` from the frontend** we have 
strictly worse options:
   
      * Maintain a global per-upload byte counter and update it on every chunk: 
this is a very hot DB row and write-heavy.
      * Only update the counter per completed part: cheaper on the DB, but then 
we only detect a violation after the part has already consumed bandwidth and 
CPU.
      * Use a fixed global `maxPartSize` based on backend limits (for example 5 
GiB S3 part size) and derive a maximum number of parts from an overall limit: 
this is coarse and either rejects otherwise valid uploads or still requires 
tracking remaining budget per upload in a more complex way.
   
      Also, if we enforce a global max part size equal to S3’s limit (5 GiB), 
then many realistic uploads will be restricted to a single part and therefore 
lose the benefit of concurrent part uploads entirely. With the DB-backed 
per-upload `maxPartSize` we can still enforce a hard overall limit while 
allowing the client to choose a higher degree of concurrency for objects that 
are much smaller than 5 GiB.
   
      Finally, since this metadata lives in the DB, we can track additional 
abuse-related signals per upload (e.g., how many times an upload has been 
started and cancelled right before completion) and cap retries or detect 
obviously wasteful patterns without continuously growing a token or 
round-tripping more and more encoded state to the client.
   
   5. **Operational benefits and future features**
   
      A DB row per upload also gives us:
   
      * Better observability: we can see which uploads exist, their status, 
size, parts, timestamps, owners, and retry history without scraping object 
store listings.
      * Easier cleanup: we can implement “abort incomplete uploads older than N 
hours or days” using our DB, instead of scanning all multipart uploads in the 
bucket and reverse-mapping them to our logical repos.
      * A natural place to attach future features such as per-user quotas, 
project-level limits, resumable uploads, server-driven cancellation, or 
finer-grained policies, all without changing the external client-visible API or 
token format.
   
      Rate limiting, WAF rules, and other perimeter protections are still 
necessary for dealing with truly malicious clients, but having upload state 
modeled explicitly in a DB gives us a clean, consistent substrate for 
correctness, per-upload policies, and evolution over time.
   
   ---
   @aicam 

