raghavyadav01 opened a new pull request, #18863: URL: https://github.com/apache/pinot/pull/18863
## What [#18806](https://github.com/apache/pinot/pull/18806) ("Replace `segmentDirectoryConfigs` map with `ReadMode` in `SegmentDirectoryLoaderContext`") refactored `DefaultSegmentDirectoryLoader.load` and inadvertently dropped the `segmentLoaderContext` argument when constructing `SegmentLocalFSDirectory`: ```java // Before #18806 return new SegmentLocalFSDirectory(directory, segmentLoaderContext, ReadMode.valueOf(segmentDirectoryConfigs.getProperty(IndexLoadingConfig.READ_MODE_KEY))); // After #18806 (current master) return new SegmentLocalFSDirectory(directory, segmentLoaderContext.getReadMode()); ``` The 2-arg `SegmentLocalFSDirectory(File, ReadMode)` constructor sets the underlying `SingleFileIndexDirectory#_segmentDirectoryLoaderContext` to `null`. This silently breaks the task-config propagation introduced in [#18264](https://github.com/apache/pinot/pull/18264) — `SingleFileIndexDirectory#serializeTaskConfigToJsonFromContext()` short-circuits at its first guard: ```java private String serializeTaskConfigToJsonFromContext() { if (_segmentDirectoryLoaderContext == null) { return null; // ← #18806 makes this always true on the default loader path } ... } ``` As a result, `task.config.json` is never injected into `EmptyIndexBuffer.properties`, and downstream consumers that depend on the table's task configuration to resolve their own configuration (for example, storage credentials for remote forward-index reads at preprocess time) silently fall back to ambient defaults (environment variables / IAM). ## Fix Restore the prior behavior by constructing `SegmentMetadataImpl` from the index directory and using the existing 4-arg `SegmentLocalFSDirectory(File, SegmentMetadataImpl, ReadMode, @Nullable SegmentDirectoryLoaderContext)` constructor. One file, +10/−2, no new constructor or API surface. ## How tested Validated end-to-end against a local cluster with isolated S3 backends (different access keys for source data vs the Pinot tier, so the environment-fallback path lands on the wrong endpoint and is observable): - **Without the fix** (current master): segments with a preprocess-time index update on a remote source column (e.g. a JSON index on a `JSON` column, a TEXT index on a `STRING` column) go to ERROR — `NoSuchBucket` because the parquet reader falls back to environment credentials at the global SDK endpoint instead of the per-source endpoint encoded in the task config. - **With the fix**: same setup, same segments load ONLINE; `JsonIndexHandler`, `TextIndexHandler`, and `ParquetNullValueIndexHandler` resolve credentials from the propagated task config and read source data successfully. Pre-fix stack signature: ``` SegmentOnlineOfflineStateModel.onBecomeOnlineFromOffline → BaseTableDataManager.downloadAndLoadSegment → ImmutableSegmentLoader.preprocessSegment → SegmentPreProcessor.process → {Json,Text,...}IndexHandler.updateIndices → {Struct,...}ParquetForwardIndexReader{V2,V6} → PageReader.readChunkForIndexing → software.amazon.awssdk.services.s3.model.NoSuchBucketException ``` ## Suggested labels `bugfix`, `release-notes` (regression in `1.6.0-SNAPSHOT` between #18264 and #18806). 🤖 Generated with [Claude Code](https://claude.com/claude-code) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
