JWuCines opened a new pull request, #19418:
URL: https://github.com/apache/druid/pull/19418
Fixes #19411.
### Description
When `index_parallel` is used with `HdfsInputSource` on a Kerberized HDFS
cluster where the NameNode has KMS configured, the ingestion task attempts to
acquire a KMS delegation token that it never uses, and fails if KMS is
unreachable. This happens because `HdfsInputSource.getPaths()` uses
`FileInputFormat.getSplits()` for path/glob expansion, which internally calls
`TokenCache.obtainTokensForNamenodes()` and cascades into
`KMSClientProvider.getDelegationToken()`. Druid's native ingestion
authenticates directly with a Kerberos TGT and never needs these delegation
tokens.
#### Replaced FileInputFormat with direct FileSystem.globStatus() calls
Replaced the `FileInputFormat`/`Job`-based path expansion in
`HdfsInputSource.getPaths()` with direct `FileSystem.globStatus()` calls. This
achieves the same HDFS glob expansion without entering the MapReduce
`TokenCache` code path, eliminating the unnecessary KMS contact.
The inner `HdfsFileInputFormat` helper class and all
`org.apache.hadoop.mapreduce` imports have been removed. No other file in the
`druid-hdfs-storage` module references the MapReduce API.
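For reviewers unfamiliar with the API, the approach can be sketched roughly as follows. This is an illustration and not the code in this PR: the class name `GlobExpansionSketch` and the method `expandGlobs` are hypothetical, and the sketch assumes that directories matched by a glob are skipped rather than listed. It relies only on `Path.getFileSystem()`, `FileSystem.globStatus()`, and `FileStatus` accessors, so no `Job` or `TokenCache` is involved.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper for illustration; not the class changed in this PR.
final class GlobExpansionSketch
{
  private GlobExpansionSketch()
  {
  }

  // Expands each input path or glob pattern and returns the non-empty files it matches.
  static Collection<Path> expandGlobs(Collection<String> inputPaths, Configuration conf)
      throws IOException
  {
    final List<Path> result = new ArrayList<>();
    for (String input : inputPaths) {
      final Path pattern = new Path(input);
      // Resolve the FileSystem from the path itself so each scheme/authority is honored.
      final FileSystem fs = pattern.getFileSystem(conf);
      final FileStatus[] matches = fs.globStatus(pattern);
      if (matches == null) {
        // globStatus() returns null for a non-glob path that does not exist.
        continue;
      }
      for (FileStatus status : matches) {
        // Assumption: keep regular files only and drop zero-length files.
        if (status.isFile() && status.getLen() > 0) {
          result.add(status.getPath());
        }
      }
    }
    return result;
  }
}
```

One behavioral difference worth checking in review: `FileInputFormat` also lists the contents of matched directories and hides files whose names start with `_` or `.`, neither of which this sketch does.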
#### Added unit tests for getPaths() edge cases
Added a new `GetPathsTest` inner class to `HdfsInputSourceTest` with three
tests:
- `testGetPathsWithGlobMatchingNoFiles`: a glob matching no files returns an
  empty collection
- `testGetPathsFiltersZeroLengthFiles`: zero-length files are excluded,
  non-empty files are included
- `testGetPathsWithMultipleInputPaths`: multiple distinct glob patterns are
  resolved correctly
#### Release note
Fixed an issue where `HdfsInputSource` with `index_parallel` unnecessarily
contacted KMS when using Kerberized HDFS, causing task failures if KMS was
unreachable. The fix replaces the internal use of Hadoop MapReduce
`FileInputFormat` for path expansion with direct `FileSystem.globStatus()`
calls.
<hr>
##### Key changed/added classes in this PR
* `HdfsInputSource`
* `HdfsInputSourceTest`
<hr>
This PR has:
- [x] been self-reviewed.
- [x] added unit tests or modified existing tests to cover new code paths,
ensuring the threshold for [code
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
is met.