funguy-tech opened a new issue, #15191:
URL: https://github.com/apache/druid/issues/15191

   ### Affected Version
   
   V27.0.0
   
   ### Impact
   
   This issue appears to be reliably reproducible by executing a 
single-dimension, single-filter native Druid query on any `string` dimension in 
a `kinesis` ingestion task derived from a `Schema Auto-Discovery` spec, as long 
as the data has not yet been handed off. The issue resolves after hand-off to 
Historicals.
   
   ### Expected Result
   
   GroupBy and Timeseries queries against actively ingested single-dimension 
values are consistently filtered, regardless of data residency (realtime vs. 
fully persisted segments).
   
   ### Actual Result
   
   GroupBy and Timeseries queries against actively ingested single-dimension 
values temporarily ignore or misapply filters until data segments are 
persisted, at which point filters are correctly applied.
   
   ### Description
   
   My team operates multiple large-scale Druid clusters with roughly identical 
base configurations. Pertinent details are as follows:
   
   - Ingestion Method: `kinesis`
   - Segment size: `1 hour`
   - Lookback period: `3 hours` (a small portion of our data is late-arriving)
   - Relevant Middle Manager architecture: ARM processors, statically defined 
hardware, dedicated to kinesis ingestion tasks
     - Other Middle Manager tasks, such as compaction, are delegated to a 
separate Middle Manager tier
   
   As part of Schema Auto-discovery migration, we migrated one of our regions 
to a new schema in which we only define a few legacy lists (to retain them as 
MVDs) and aggregations - the rest of our fields are ingested via discovery. In 
total, we produce records with ~100-150 fields, and the dataTypes do appear to 
align correctly post-migration.
   
   In the process of migrating, we stumbled across a perplexing issue with 
GroupBy and Timeseries queries. Whenever we perform a single dimension query 
that overlaps/involves data on the Middle Managers (in our case, queries that 
touch the most recent 3 hours), the results received are nonsensical - the 
filter appears to be either inconsistently applied or not applied at all, 
resulting in other dimension values 'leaking' into the results despite being 
ruled out by the filter. This behavior is almost reminiscent of some sort of 
MVD edge case, but again, the fields experiencing this issue are strictly 
singular string values (and, as mentioned further down, the behavior changes 
between different points of the segment's lifecycle).
   
   Consider the following minimal reproduction: a GroupBy query that groups 
and filters by an `example_field` dimension. 
   
   ```json
   {
     "queryType": "groupBy",
     "dataSource": "Example_Records",
     "granularity": "all",
     "filter": {
         "type": "selector",
         "dimension": "example_field",
         "value": "expected_value"
      },
     "dimensions": ["example_field"],
     "intervals": [
       "2023-10-17T00:00:00+0000/2023-10-17T20:55:00+0000"
     ]
   }
   ```
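For anyone trying to reproduce this, the query above can be POSTed to the Broker's native query endpoint (`/druid/v2/`). A minimal sketch in Python follows; the Broker address is a placeholder, and the final assertion is the invariant this bug violates on realtime data:

```python
import json
import urllib.request

# Placeholder Broker address; substitute your own Router/Broker.
BROKER_URL = "http://localhost:8082/druid/v2/"

# The same minimal-reproduction query as above.
query = {
    "queryType": "groupBy",
    "dataSource": "Example_Records",
    "granularity": "all",
    "filter": {
        "type": "selector",
        "dimension": "example_field",
        "value": "expected_value",
    },
    "dimensions": ["example_field"],
    "intervals": ["2023-10-17T00:00:00+0000/2023-10-17T20:55:00+0000"],
}

def submit_native_query(q: dict, url: str = BROKER_URL) -> list:
    """POST a native Druid query as JSON and return the parsed response rows."""
    req = urllib.request.Request(
        url,
        data=json.dumps(q).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# With a correctly applied selector filter, every returned row should carry
# the filtered value; any other value demonstrates the leak described above.
# rows = submit_native_query(query)
# assert all(r["event"]["example_field"] == "expected_value" for r in rows)
```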
   Assuming `example_field` is guaranteed to be a simple string value (and is 
identified as such in the schema), this query should return at most one row 
- the value `expected_value`. However, that is not what happens. 
   
   - When executed on a data range that still resides on Middle Managers, this 
query returns between 20-40 different rows with miscellaneous values for 
`example_field`.
   - When executed on a data range that has been successfully handed off to 
Historicals, this query returns the correct / expected value of only 
`expected_value`.
   - When the same query is executed twice with a 3-hour delay between runs, it 
will first return the nonsensical result - and then later return the expected 
result - indicating a behavior change between the comparable Middle Manager and 
Historical queries.
   
   Oddly enough, a modification to the original query appears to fix it. If an 
additional dimension - even one that doesn't exist - is added to the query 
(ordering does not matter), it returns the expected result 100% of the time: 
   
   ```json
   {
     "queryType": "groupBy",
     "dataSource": "Example_Records",
     "granularity": "all",
     "filter": {
         "type": "selector",
         "dimension": "example_field",
         "value": "expected_value"
      },
     "dimensions": ["example_field", "oof"],
     "intervals": [
       "2023-10-17T00:00:00+0000/2023-10-17T20:55:00+0000"
     ]
   }
   ```
   
   The above query will always return one row with an `example_field` value of 
`expected_value` and an `oof` value of `null`, somehow avoiding the nonsensical 
condition of the first query.
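Until the root cause is found, this workaround can be applied mechanically. A small helper, as a sketch (the `_dummy_dim` name is made up for illustration; any dimension name absent from the datasource works):

```python
def apply_dummy_dimension_workaround(query: dict, dummy: str = "_dummy_dim") -> dict:
    """Return a copy of a groupBy query with a nonexistent dimension appended.

    Adding any extra dimension - even one absent from the datasource -
    appears to sidestep the realtime-segment filtering bug described above.
    The extra dimension comes back as null in every row and can be dropped.
    """
    patched = dict(query)            # shallow copy; original left untouched
    dims = list(patched.get("dimensions", []))
    if dummy not in dims:
        dims.append(dummy)
    patched["dimensions"] = dims
    return patched

original = {
    "queryType": "groupBy",
    "dataSource": "Example_Records",
    "dimensions": ["example_field"],
}
patched = apply_dummy_dimension_workaround(original)
# patched now lists ["example_field", "_dummy_dim"]; original is unchanged.
```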
   

