sourabh-27 opened a new pull request, #18831:
URL: https://github.com/apache/pinot/pull/18831

   
   ## Description
   
   Solves [#18808](https://github.com/apache/pinot/issues/18808)
   
   Apache Pinot's schema-update API has been strictly additive. The `PUT 
/schemas/{schemaName}` endpoint rejects any update that removes a column to 
prevent accidental backward-incompatible changes. The only workaround was using 
`force=true`, which is unsafe as it bypasses all structural validation. 
Furthermore, Pinot lacked a mechanism to reclaim on-disk space occupied by 
removed columns in existing, already-built segments.
   
   This PR introduces guard-railed support for deleting columns from a schema, 
divided into two independent, opt-in tiers:
   
   ### 1. Logical Deletion (Controller / Schema API)
   A new `allowColumnDeletion` query parameter is introduced to the 
schema-update endpoints. 
   * **Behavior:** When set to `true`, callers can intentionally drop columns 
present in the old schema but absent from the new one. 
   * **Safety Guards:** All other backward-compatibility rules remain the same 
(e.g., changes to column types or primary keys are still rejected).
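
   The rule above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: the class `ColumnDeletionCheck`, the method `isBackwardCompatible`, and the map-based schema model (column name to data type) are hypothetical stand-ins for `Schema#isBackwardCompatibleWith`, and only the data-type guard is modeled.

```java
import java.util.Map;

public class ColumnDeletionCheck {
  // Hypothetical stand-in for Schema#isBackwardCompatibleWith(oldSchema, allowColumnDeletion).
  // Each schema is modeled as a map from column name to data type (an assumption for brevity).
  static boolean isBackwardCompatible(Map<String, String> oldSchema, Map<String, String> newSchema,
      boolean allowColumnDeletion) {
    for (Map.Entry<String, String> oldColumn : oldSchema.entrySet()) {
      String newType = newSchema.get(oldColumn.getKey());
      if (newType == null) {
        // Column was removed: only acceptable when the caller opted in.
        if (!allowColumnDeletion) {
          return false;
        }
        continue;
      }
      // Column still exists: a type change is rejected regardless of the flag.
      if (!newType.equals(oldColumn.getValue())) {
        return false;
      }
    }
    // Columns present only in the new schema are additive and always allowed.
    return true;
  }

  public static void main(String[] args) {
    Map<String, String> oldSchema = Map.of("userId", "LONG", "country", "STRING");
    Map<String, String> dropped = Map.of("userId", "LONG");
    Map<String, String> retyped = Map.of("userId", "STRING");

    System.out.println(isBackwardCompatible(oldSchema, dropped, false));
    System.out.println(isBackwardCompatible(oldSchema, dropped, true));
    System.out.println(isBackwardCompatible(oldSchema, retyped, true));
  }
}
```

   In this sketch the flag relaxes only the "column missing" case; every other guard runs unchanged, which mirrors the intent stated above.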
   
   
   ### 2. Physical Reclamation (Server / Segment Reload)
   A new table indexing config flag, `reclaimDeletedColumnsOnReload` (default: 
`false`), controls whether data for ingested columns missing from the schema is 
physically purged from segments during a reload operation.
   * **Behavior:** Previously, only auto-generated default columns were cleaned 
up; ingested column data persisted indefinitely. When this flag is enabled (set 
to `true`), a segment reload explicitly drops the forward index, dictionary, 
and all auxiliary indexes for columns no longer present in the schema, freeing 
up disk space.
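
   As a rough sketch of the selection logic described above (not the actual `BaseDefaultColumnHandler` code): the class `ReclaimSketch`, the method `columnsToRemove`, and the set-based inputs are hypothetical names chosen for illustration.

```java
import java.util.Set;
import java.util.TreeSet;

public class ReclaimSketch {
  // Returns the segment columns that a reload would remove.
  // segmentColumns: all columns physically present in the segment.
  // autoGeneratedColumns: the subset that were auto-generated default columns.
  // schemaColumns: columns in the current schema.
  // reclaimDeletedColumnsOnReload: the opt-in flag for ingested columns.
  static Set<String> columnsToRemove(Set<String> segmentColumns, Set<String> autoGeneratedColumns,
      Set<String> schemaColumns, boolean reclaimDeletedColumnsOnReload) {
    Set<String> toRemove = new TreeSet<>();
    for (String column : segmentColumns) {
      if (schemaColumns.contains(column)) {
        continue; // still in the schema, keep it
      }
      // Auto-generated default columns were already cleaned up before this change;
      // ingested columns are removed only when the flag is enabled.
      if (autoGeneratedColumns.contains(column) || reclaimDeletedColumnsOnReload) {
        toRemove.add(column);
      }
    }
    return toRemove;
  }

  public static void main(String[] args) {
    Set<String> segment = Set.of("userId", "country", "defaultCol");
    Set<String> autoGenerated = Set.of("defaultCol");
    Set<String> schema = Set.of("userId");

    System.out.println(columnsToRemove(segment, autoGenerated, schema, false));
    System.out.println(columnsToRemove(segment, autoGenerated, schema, true));
  }
}
```

   With the flag off, only the auto-generated column is selected; with it on, the ingested `country` column is selected as well.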
   
   
   ### 3. Query Layer Behavior (Unchanged)
   * **Behavior:** Queries referencing a column that has been deleted from the 
schema will **throw an error**. This PR does not change that behavior.
   
   
   ---
   
   ## Changes
   
   ### `pinot-controller`
   * **`PinotSchemaRestletResource`**: Added the `allowColumnDeletion` query 
parameter (default: `false`) to both the multipart and JSON `PUT 
/schemas/{schemaName}` endpoints. 
   * **`PinotHelixResourceManager`**: Passed the `allowColumnDeletion` 
parameter down into `updateSchema(...)`.
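
   For illustration, a client could opt in on the JSON endpoint roughly like this. The sketch only builds the request and does not send it; the controller address `http://localhost:9000`, the schema name `mySchema`, and the helper `buildUpdateRequest` are assumptions made for the example, not part of this PR.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class SchemaUpdateRequestSketch {
  // Builds (but does not send) a PUT request for /schemas/{schemaName}
  // with the allowColumnDeletion query parameter set explicitly.
  static HttpRequest buildUpdateRequest(String controllerBaseUrl, String schemaName, String schemaJson,
      boolean allowColumnDeletion) {
    URI uri = URI.create(controllerBaseUrl + "/schemas/" + schemaName
        + "?allowColumnDeletion=" + allowColumnDeletion);
    return HttpRequest.newBuilder(uri)
        .header("Content-Type", "application/json")
        .PUT(HttpRequest.BodyPublishers.ofString(schemaJson))
        .build();
  }

  public static void main(String[] args) {
    // Placeholder schema body; a real request would carry the full new schema.
    String schemaJson = "{\"schemaName\":\"mySchema\"}";
    HttpRequest request = buildUpdateRequest("http://localhost:9000", "mySchema", schemaJson, true);
    System.out.println(request.method() + " " + request.uri());
  }
}
```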
   
   ### `pinot-spi`
   * **`Schema`**: Overloaded `isBackwardCompatibleWith(Schema oldSchema, 
boolean allowColumnDeletion)`. The original single-argument method signature 
remains intact.
   * **`IndexingConfig`**: Added the `reclaimDeletedColumnsOnReload` option 
(default: `false`).
   
   ### `pinot-segment-local`
   * **`IndexLoadingConfig`**: Exposed the `isReclaimDeletedColumnsOnReload()` 
configuration property.
   * **`BaseDefaultColumnHandler`**: Updated to compute `REMOVE` actions for 
ingested columns absent from the schema when the reclamation flag is active 
(extending the existing auto-generated column removal logic).
   
   ### `pinot-clients`
   * **`SchemaAdminClient`**: Overloaded `updateSchema(..., boolean 
allowColumnDeletion)`.
   
   ---
   
   ## Testing
   - `./mvnw -pl pinot-spi -am -Dtest=SchemaTest 
-Dsurefire.failIfNoSpecifiedTests=false test`
   - `./mvnw -pl pinot-controller -am -Dtest=PinotSchemaRestletResourceTest 
-Dsurefire.failIfNoSpecifiedTests=false test`
   - `./mvnw -pl pinot-segment-local -am 
-Dtest=DefaultColumnHandlerTest,SegmentPreProcessorTest 
-Dsurefire.failIfNoSpecifiedTests=false test`
   - `./mvnw -pl pinot-integration-tests -am 
-Dtest=OfflineClusterIntegrationTest#testSchemaColumnDeletion 
-Dsurefire.failIfNoSpecifiedTests=false test`
   - `./mvnw spotless:apply -pl 
pinot-spi,pinot-controller,pinot-segment-local,pinot-clients,pinot-integration-tests`
   - `./mvnw license:format -pl 
pinot-spi,pinot-controller,pinot-segment-local,pinot-clients,pinot-integration-tests`
   - `./mvnw checkstyle:check -pl 
pinot-spi,pinot-controller,pinot-segment-local,pinot-clients,pinot-integration-tests`
   - `./mvnw license:check -pl 
pinot-spi,pinot-controller,pinot-segment-local,pinot-clients,pinot-integration-tests`
   - `git diff --check`
   
   ---
   
   ## Release Notes
   
   * **New Schema-Update Option:** `PUT /schemas/{schemaName}` endpoints now 
accept an optional `allowColumnDeletion` query parameter (default: `false`). 
When `true`, columns omitted from the new schema are safely dropped, provided 
they are not actively referenced by any table configuration. Structural type 
and primary-key compatibility assertions remain strictly enforced.
   * **New Table Indexing Configuration:** Added 
`indexingConfig.reclaimDeletedColumnsOnReload` (default: `false`). When 
activated, ingested columns omitted from the schema are physically wiped from 
segments upon reload to reclaim storage space.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

