[GitHub] [incubator-pinot] haibow opened a new issue #4610: Schema resolution at query time

GitBox Thu, 12 Sep 2019 15:20:22 -0700

haibow opened a new issue #4610: Schema resolution at query time
URL: https://github.com/apache/incubator-pinot/issues/4610
 
 
   Right now in `BrokerReduceService`, it will do a basic check on whether 
there are conflicting schemas across servers. If the row names don't match or 
some column fails type compatibility, it will just drop some segments.
   
   This has caused issues with dropping segments unexpectedly due to known bugs 
or operational errors, especially with schema evolution:
   - When updating the schema (e.g. add a new column) for a REALTIME table, 
even if we call RELOAD on the table, the segment being consumed at the time of 
schema update will end up with the old schema, while other old and new segments 
will have the new schema, causing that single segment to be dropped at query 
time due to schema mismatch 
(https://github.com/apache/incubator-pinot/issues/4225#issuecomment-521824275)
   - When updating the schema for OFFLINE tables with daily append cadence, if 
the user didn't backfill old data after changing the schema and devs did not 
RELOAD the table/old segments, segments would end up in a mix of old and new 
schemas, causing some segments to be dropped at query time as well.
   
   Possible solutions:
   1. Resolve schema at query time. When segments have different but 
potentially compatible schemas (e.g. found schema v1 and v2 in dataTableMap, 
and v2 has one extra column than v1), try to merge all segments in the result, 
e.g. add default values for the new column for segments with schema v1 on the 
fly. This might add query latency overhead, and need to be done at each query, 
since schema inconsistency in the underlying segments are still not fixed.
   2. Resolve schema in an asyn manner. We still keep the existing behavior of 
dropping segments with conflicting schemas, but when this happens, we will call 
controller to trigger RELOAD on select segments or the whole table, in an 
attempt to fix potential schema issues for future queries. Trigger conditions 
could be:
       - when not all segments have the same schema.
       - when the schema of some segments is different from the table schema. 
(I may have missed something, but seems at query time, the schema is inferred 
from segments, and the table schema in ZK is not referenced at all)
   
   For option 2, we might need some extra check to make sure there aren't 
memory constraints due to reloading unnecessarily/frequently, e.g. we will only 
reload tables in scenarios where different schemas are compatible (e.g. 
backward compatible, type compatible), and are confident reloading would 
resolve the schema inconsistencies as a one-time effort.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [incubator-pinot] haibow opened a new issue #4610: Schema resolution at query time

Reply via email to