mormigil opened a new pull request, #19574:
URL: https://github.com/apache/druid/pull/19574

   ### Description
   
   All SQL/MSQ ingestion of the form `INSERT/REPLACE … SELECT … FROM TABLE(EXTERN(...))` that reads a **random-access input format** (Parquet, ORC, Avro-OCF, SQL, Druid-segment) from remote storage (e.g. S3) fails on the worker/peon with:
   
   ```
   Caused by: java.io.IOException: No such file or directory
       at java.base/java.io.UnixFileSystem.createFileExclusively(Native Method)
       at java.base/java.io.File.createTempFile(File.java:2170)
       at org.apache.druid.data.input.InputEntity.fetch(InputEntity.java)
       at org.apache.druid.data.input.parquet.ParquetReader.intermediateRowIterator(ParquetReader.java:86)
       at ... ExternalSegment ... ScanQueryFrameProcessor ...
   ```
   
   Streaming formats (JSON/CSV) and `index_kafka` ingestion are unaffected. This is a regression: the same ingestion works in 32.x and fails in 37.0.0.
   
   ### Root cause
   
   Random-access formats download each remote object to a local temp file via `InputEntity#fetch(temporaryDirectory, …)`, which calls `File.createTempFile(prefix, suffix, temporaryDirectory)`. `createTempFile` does **not** create parent directories. In the MSQ indexer worker the directory is derived lazily and never created:
   
   | Path | Created? | Where |
   |------|----------|-------|
   | `<taskWorkDir>/indexing-tmp` | ✅ | `TaskToolbox#getIndexingTmpDir` (`mkdirp`) |
   | `…/indexing-tmp/stage_NNNNNN` | ❌ | `IndexerFrameContext#tempDir` |
   | `…/stage_NNNNNN/external` | ❌ | `RunWorkOrder` → `frameContext.tempDir("external")` → `ExternalInputSliceReader` |
   
   Output channels work because `FileOutputChannelFactory#openChannel` already calls `FileUtils.mkdirp(...)` before writing; the input fetch path lacked the symmetric call. Streaming formats read via `InputEntity#open()` and never create a temp file, which is why only fetch-based formats regressed. The nested directory layout was introduced by the background-fetch / virtual-storage external-input rewrite (#19539).
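
The failure mode can be reproduced outside Druid with the JDK alone. The following is a minimal sketch, not Druid code: the class name `MissingParentDemo`, the helper `canCreateTempFileIn`, and the `fetch-` / `.tmp` prefix and suffix are hypothetical, and the `stage_000000/external` path only imitates the layout in the table above.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

class MissingParentDemo {
    /**
     * Returns true if File.createTempFile succeeds in the given directory,
     * false if it throws an IOException (for example because the directory
     * does not exist).
     */
    public static boolean canCreateTempFileIn(File dir) {
        try {
            File tmp = File.createTempFile("fetch-", ".tmp", dir);
            tmp.delete(); // clean up the probe file
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // An existing directory, standing in for <taskWorkDir>/indexing-tmp.
        File base = Files.createTempDirectory("indexing-tmp").toFile();
        // A nested directory that nobody created, standing in for
        // .../stage_NNNNNN/external.
        File nested = new File(new File(base, "stage_000000"), "external");

        System.out.println("existing dir: " + canCreateTempFileIn(base));
        System.out.println("missing nested dir: " + canCreateTempFileIn(nested));
    }
}
```

The second call fails because `File.createTempFile` only creates the file itself and expects its target directory to be present already.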
   
   ### Fix
   
   `mkdirp` the directory in `InputEntity#fetch` right before `createTempFile`, mirroring `FileOutputChannelFactory#openChannel`. The call is idempotent and covers every fetch-based input format.
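
The shape of the fix can be sketched with the JDK alone. This is an illustration under stated assumptions rather than the actual patch: `FetchTempFileSketch` and `createTempFileEnsuringDir` are hypothetical names, and `Files.createDirectories` stands in for Druid's `mkdirp` helper.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

class FetchTempFileSketch {
    /**
     * Hypothetical stand-in for the temp-file step of a fetch: make sure the
     * target directory exists, then create the temp file inside it.
     */
    public static File createTempFileEnsuringDir(File temporaryDirectory, String prefix, String suffix)
            throws IOException {
        // Creates the directory and any missing parents; does nothing if the
        // directory already exists, so repeated calls are safe.
        Files.createDirectories(temporaryDirectory.toPath());
        return File.createTempFile(prefix, suffix, temporaryDirectory);
    }

    public static void main(String[] args) throws IOException {
        File base = Files.createTempDirectory("indexing-tmp").toFile();
        File external = new File(new File(base, "stage_000000"), "external");

        // First call creates stage_000000/external; second call reuses it.
        File first = createTempFileEnsuringDir(external, "fetch-", ".tmp");
        File second = createTempFileEnsuringDir(external, "fetch-", ".tmp");

        System.out.println("first in external dir: " + first.getParentFile().equals(external));
        System.out.println("distinct temp files: " + !first.equals(second));
    }
}
```

Placing the directory creation next to `createTempFile` means callers do not need to know which directories were created eagerly and which were only derived.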
   
   ### Verification
   
   Added Parquet coverage to `S3ExternQueryTest` (real embedded cluster + MinIO), exercising the actual indexer fetch path with `backgroundFetchExternalFiles` both on and off.
   
   - With the fix: all 4 cases pass.
   - With the one-line fix reverted: `test_externParquet_backgroundFetchDisabled` fails with `java.io.IOException: No such file or directory`. That case exercises the direct-read path; the background-fetch path stages via a local `FileEntity` that skips `createTempFile`.
   
   <hr>
   
   This PR has:
   
   - [x] been self-reviewed.
   - [x] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
   - [x] been tested in a test Druid cluster.
   
   Made with [Cursor](https://cursor.com)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

