wwj6591812 opened a new pull request, #8219:
URL: https://github.com/apache/paimon/pull/8219

   ### Background
   
   `blob-write-null-on-missing-file` already allows Flink writes to store 
`NULL` instead of failing when a descriptor BLOB points to a missing file on 
filesystem-based URIs (`file://`, `hdfs://`, `oss://`, etc.).
   
   However, `UriReaderFactory.exists()` previously always returned `true` for 
`http://` / `https://` URIs, so the pre-check never ran for HTTP readers. 
As a result, missing HTTP resources (e.g. CDN image URLs returning 404) were 
only detected during the actual blob read in `BlobFormatWriter`, causing the 
whole write task to fail.
   
   This is a common issue when ingesting external image URLs into Paimon blob 
tables.
   
   ### Purpose
   
   This PR makes `blob-write-null-on-missing-file` work for HTTP/HTTPS 
descriptor URLs.
   
   **Primary path (pre-check):**
   - Add real HTTP existence checks via `HttpClientUtils.exists()` (HEAD first, 
fallback to lightweight `Range: bytes=0-0` GET when HEAD is not supported)
   - Wire `HttpUriReader.exists()` into `UriReaderFactory.exists()`
   - Reuse the existing Flink write path in `FlinkRowWrapper.isNullAt()` so 
missing HTTP descriptors are treated as `NULL` before write
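
The HEAD-then-ranged-GET check described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class `HttpExistsSketch`, the `Probe` interface, and the choice to treat 405 and 501 as "HEAD not supported" are assumptions made for the example, and the real `HttpClientUtils.exists()` may differ in client library, timeouts, and status handling.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

/** Illustrative sketch only; names here are hypothetical, not Paimon APIs. */
final class HttpExistsSketch {

    /** Hypothetical seam: returns the HTTP status for one request. */
    interface Probe {
        int status(String url, String method, boolean singleByteRange) throws IOException;
    }

    /** Assumption: 405 and 501 mean the server does not support HEAD. */
    static boolean headUnsupported(int status) {
        return status == 405 || status == 501;
    }

    /** Any 2xx counts as "exists" (200 for HEAD/GET, 206 for a ranged GET). */
    static boolean isSuccess(int status) {
        return status >= 200 && status < 300;
    }

    /** HEAD first; fall back to a one-byte ranged GET only if HEAD is unsupported. */
    static boolean exists(String url, Probe probe) throws IOException {
        int head = probe.status(url, "HEAD", false);
        if (!headUnsupported(head)) {
            return isSuccess(head);
        }
        return isSuccess(probe.status(url, "GET", true));
    }

    /** A real probe built on the JDK's HttpURLConnection. */
    static int httpStatus(String url, String method, boolean singleByteRange)
            throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        try {
            conn.setRequestMethod(method);
            conn.setConnectTimeout(5_000);
            conn.setReadTimeout(5_000);
            if (singleByteRange) {
                // Ask for the first byte only, so the body is never downloaded in full.
                conn.setRequestProperty("Range", "bytes=0-0");
            }
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }
}
```

A caller would pass the real probe as a method reference, e.g. `HttpExistsSketch.exists(url, HttpExistsSketch::httpStatus)`. Keeping the probe behind an interface lets the HEAD/GET decision be tested without a network.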
   
   **Fallback path (write-time):**
   - In `BlobFormatWriter`, when `writeNullOnMissingFile` is enabled, catch 
HTTP 404 during blob read and write `NULL` with a warning instead of failing 
the task
   - Thread the option through `BlobFileContext` → `MultipleBlobFileWriter` / 
`ExternalStorageBlobWriter` → `BlobFileFormat`
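
The write-time fallback can be pictured with the sketch below. It is only an approximation of the behavior described above: `NullOnMissingSketch`, `BlobReader`, and `HttpStatusException` are hypothetical stand-ins, and how the actual `BlobFormatWriter` and `isNotFoundError()` recognize a 404 is not shown in this description.

```java
import java.io.IOException;
import java.util.logging.Logger;

/** Illustrative sketch only; types here are hypothetical, not Paimon classes. */
final class NullOnMissingSketch {

    private static final Logger LOG =
            Logger.getLogger(NullOnMissingSketch.class.getName());

    /** Hypothetical exception carrying the HTTP status of a failed read. */
    static final class HttpStatusException extends IOException {
        final int status;

        HttpStatusException(int status, String uri) {
            super("HTTP " + status + " for " + uri);
            this.status = status;
        }
    }

    /** Hypothetical stand-in for whatever reads the blob bytes from a URI. */
    interface BlobReader {
        byte[] read(String uri) throws IOException;
    }

    /** Walks the cause chain looking for an HTTP 404. */
    static boolean isNotFoundError(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof HttpStatusException
                    && ((HttpStatusException) c).status == 404) {
                return true;
            }
        }
        return false;
    }

    /**
     * Returns the blob bytes, or null when the option is enabled and the read
     * failed with a 404. Any other failure is rethrown so the task still fails.
     */
    static byte[] readOrNull(String uri, BlobReader reader, boolean writeNullOnMissingFile)
            throws IOException {
        try {
            return reader.read(uri);
        } catch (IOException e) {
            if (writeNullOnMissingFile && isNotFoundError(e)) {
                LOG.warning("Blob not found, writing NULL instead: " + uri);
                return null;
            }
            throw e;
        }
    }
}
```

Only a 404 is swallowed in this sketch; timeouts, 5xx responses, and other I/O errors still propagate, which matches the description's point that the option covers missing resources rather than all HTTP failures.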
   
   With this change, HTTP URLs that point to missing resources are written as 
`NULL` while the job continues, and the failing URL is logged for troubleshooting.
   
   ### Tests
   
   - `HttpClientUtilsTest`
     - HTTP 200 → `exists() == true`
     - HTTP 404 → `exists() == false`
     - HEAD not allowed (405) → fallback GET check
     - `isNotFoundError()` helper
   - `UriReaderFactoryTest`
     - HTTP missing resource → `exists() == false`
     - HTTP available resource → `exists() == true`
   - `BlobFormatWriterTest`
     - write-path 404 fallback writes `NULL` when enabled
     - 404 still fails when the option is disabled
   - `FlinkRowWrapperTest`
     - HTTP 404 descriptor → `isNullAt() == true` when checking is enabled
     - HTTP 200 descriptor → `isNullAt() == false` when checking is enabled

