JWuCines opened a new pull request, #19418:
URL: https://github.com/apache/druid/pull/19418

   Fixes #19411.
   
   ### Description
   
   When using `index_parallel` with `HdfsInputSource` on a Kerberized HDFS 
cluster where the NameNode has KMS configured, the ingestion task unnecessarily 
attempts to acquire a KMS delegation token. This happens because 
`HdfsInputSource.getPaths()` uses `FileInputFormat.getSplits()` for path/glob 
expansion, which internally calls `TokenCache.obtainTokensForNamenodes()`, 
cascading into `KMSClientProvider.getDelegationToken()`. Druid's native 
ingestion authenticates directly via Kerberos TGT and never needs these 
delegation tokens.
   
   #### Replaced FileInputFormat with direct FileSystem.globStatus() calls
   
   Replaced the `FileInputFormat`/`Job`-based path expansion in 
`HdfsInputSource.getPaths()` with direct `FileSystem.globStatus()` calls. This 
achieves the same HDFS glob expansion without entering the MapReduce 
`TokenCache` code path, eliminating the unnecessary KMS contact.
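   
   For reviewers, here is a minimal sketch of what glob expansion through 
`FileSystem.globStatus()` can look like. It is an illustration only, not the 
code in this PR: the class name `GlobExpansionSketch` and the method `expand` 
are hypothetical, and it does not try to reproduce other `FileInputFormat` 
behaviour such as listing directory contents or skipping hidden files.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper for illustration; not part of Druid.
public final class GlobExpansionSketch
{
  private GlobExpansionSketch() {}

  // Expands each input path or glob pattern and keeps only non-empty regular files.
  public static List<Path> expand(List<String> inputPaths, Configuration conf) throws IOException
  {
    List<Path> result = new ArrayList<>();
    for (String input : inputPaths) {
      Path pattern = new Path(input);
      // Resolve the FileSystem from the path's own scheme/authority.
      FileSystem fs = pattern.getFileSystem(conf);
      FileStatus[] matches = fs.globStatus(pattern);
      if (matches == null) {
        // globStatus returns null when a non-glob path does not exist.
        continue;
      }
      for (FileStatus status : matches) {
        // Skip directories and zero-length files.
        if (status.isFile() && status.getLen() > 0) {
          result.add(status.getPath());
        }
      }
    }
    return result;
  }
}
```

   Because only `org.apache.hadoop.fs` and `org.apache.hadoop.conf` are 
involved, nothing in this path touches `org.apache.hadoop.mapreduce` or its 
`TokenCache`.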
   
   The inner `HdfsFileInputFormat` helper class and all 
`org.apache.hadoop.mapreduce` imports have been removed. No other file in the 
`druid-hdfs-storage` module references the MapReduce API.
   
   #### Added unit tests for getPaths() edge cases
   
   Added a new `GetPathsTest` inner class to `HdfsInputSourceTest` with three 
tests:
   - `testGetPathsWithGlobMatchingNoFiles` — glob matching no files returns an 
empty collection
   - `testGetPathsFiltersZeroLengthFiles` — zero-length files are excluded, 
non-empty files are included
   - `testGetPathsWithMultipleInputPaths` — multiple distinct glob patterns are 
resolved correctly
   
   #### Release note
   
   Fixed an issue where `HdfsInputSource` with `index_parallel` unnecessarily 
contacted KMS when using Kerberized HDFS, causing task failures if KMS was 
unreachable. The fix replaces the internal use of Hadoop MapReduce 
`FileInputFormat` for path expansion with direct `FileSystem.globStatus()` 
calls.
   
   <hr>
   
   ##### Key changed/added classes in this PR
   * `HdfsInputSource`
   * `HdfsInputSourceTest`
   
   <hr>
   
   This PR has:
   
   - [x] been self-reviewed.
   - [x] added unit tests or modified existing tests to cover new code paths, 
ensuring the threshold for [code 
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
 is met.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]