kamaci opened a new issue #7255: [PROPOSAL] Schemeless Dimensions Support
URL: https://github.com/apache/incubator-druid/issues/7255
 
 
   ### Motivation
   
   Currently, when using schemaless ingestion, all dimensions are ingested 
as String-typed dimensions [1]. However, automatic format detection can be 
enabled for the timestamp as follows:
   
   ```
   "timestampSpec" : {
        "format" : "auto",
        "column" : "ts"
   }
   ```
   In contrast, there is no comparable way to detect field types via 
`parseSpec`. For example, given these two rows:
   
   
`{"ts":"2018-01-01T03:35:45Z","app_token":"guid1","eventName":"app-x","properties-key1":"123"}`
   
   
`{"ts":"2018-01-01T03:35:45Z","app_token":"guid2","eventName":"app-x","properties-key2":123}`
   
   Both `properties-key1` and `properties-key2` are indexed as `String`. The 
expected behavior is for Druid to index `properties-key2` as `Integer` or `Long`.
   
   Automatic field type detection should be implemented in Druid for newly 
created fields, similar to [Solr Schemaless 
Mode](https://lucene.apache.org/solr/guide/7_6/schemaless-mode.html)
   
   **PS 1**: I started a conversation about this task on the mailing list and 
then created an issue for it: [Druid Auto Field Type 
Detection](https://lists.apache.org/thread.html/f39b9d8284ab9604bc4dde2ca38c515def730c779f735cb47531506b@<dev.druid.apache.org>)
   
   **PS 2**: I've created a discussion issue about possible solutions to this 
issue: https://github.com/apache/incubator-druid/issues/7027
   
   [1] [Druid Schema 
Design](http://druid.io/docs/latest/ingestion/schema-design.html)
   
   ### Proposed changes
   
   #### Current Implementation
   
   `IncrementalIndex.java` has the following code section:
   
   ```
   if (desc != null) {
     capabilities = desc.getCapabilities();
   } else {
     wasNewDim = true;
     capabilities = columnCapabilities.get(dimension);
     if (capabilities == null) {
       capabilities = new ColumnCapabilitiesImpl();
       // For schemaless type discovery, assume everything is a String for now, can change later.
       capabilities.setType(ValueType.STRING);
       capabilities.setDictionaryEncoded(true);
       capabilities.setHasBitmapIndexes(true);
       columnCapabilities.put(dimension, capabilities);
     }
     DimensionHandler handler =
         DimensionHandlerUtils.getHandlerFromCapabilities(dimension, capabilities, null);
     desc = addNewDimension(dimension, capabilities, handler);
   }
   ```
   
   #### Suggested Implementation
   
   We can define regex-based dimension rules that capture dimensions not 
declared in the schema and map them to types. A configuration such as the 
following could be used for this purpose:
   
   ```
   { "pattern": "*_l", "type": "long" },
   { "pattern": "*_f", "type": "float" }
   ```
   
   The flow would be as follows:
   
   1. If the given dimension name matches a dimension in the defined schema, 
the code works as it does today.
   
   2. If the given dimension name does not match any dimension in the defined 
schema:
   
   - The first pattern checks the dimension name and, if it ends with `_l`, 
maps the dimension to the `Long` type.
   
   - The second pattern checks the dimension name in the same way and, if it 
ends with `_f`, maps the dimension to the `Float` type.
   
   3. If no pattern matches, the dimension is treated as a `String` dimension, 
as it is today.
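
The flow above can be sketched roughly as follows. This is only an illustration, not Druid code: `DimensionTypeResolver`, `ColumnType`, `addRule`, and `resolve` are hypothetical names, and because the pattern format is still undecided, the sketch assumes that `*` in a pattern such as `*_l` is a wildcard for any run of characters and that everything else is matched literally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper (not an existing Druid class) that picks a type for a dimension name.
class DimensionTypeResolver {
  enum ColumnType { STRING, LONG, FLOAT }

  // One configured rule, e.g. { "pattern": "*_l", "type": "long" }.
  private static final class Rule {
    final Pattern pattern;
    final ColumnType type;

    Rule(Pattern pattern, ColumnType type) {
      this.pattern = pattern;
      this.type = type;
    }
  }

  private final Map<String, ColumnType> declared;      // dimensions defined in the schema
  private final List<Rule> rules = new ArrayList<>();  // checked in insertion order

  DimensionTypeResolver(Map<String, ColumnType> declared) {
    this.declared = declared;
  }

  // Assumption: "*" matches any run of characters; all other characters are literal.
  DimensionTypeResolver addRule(String wildcard, ColumnType type) {
    String regex = Arrays.stream(wildcard.split("\\*", -1))
        .map(Pattern::quote)
        .collect(Collectors.joining(".*"));
    rules.add(new Rule(Pattern.compile(regex), type));
    return this;
  }

  ColumnType resolve(String dimension) {
    ColumnType explicit = declared.get(dimension);
    if (explicit != null) {
      return explicit;                 // step 1: the defined schema wins
    }
    for (Rule rule : rules) {
      if (rule.pattern.matcher(dimension).matches()) {
        return rule.type;              // step 2: first matching pattern decides
      }
    }
    return ColumnType.STRING;          // step 3: fall back to String
  }

  public static void main(String[] args) {
    DimensionTypeResolver resolver =
        new DimensionTypeResolver(Collections.singletonMap("eventName", ColumnType.STRING))
            .addRule("*_l", ColumnType.LONG)
            .addRule("*_f", ColumnType.FLOAT);
    for (String name : Arrays.asList("eventName", "count_l", "ratio_f", "app_token")) {
      System.out.println(name + " -> " + resolver.resolve(name));
    }
  }
}
```

Because the decision depends only on the dimension name, the result does not change with the values that happen to arrive first. In the real code path, the resolved type would presumably feed the `capabilities.setType(...)` call in the `else` branch shown under "Current Implementation".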
   
   ### Rationale
   
   Other possible solutions are:
   
   1. If the value can be converted to `Float`, convert it to `Float`; otherwise 
try `Long`; otherwise fall back to `String`.
   
   2. If the field name ends with `_l`, is not defined in the schema, and a 
config value of `schemeless mode` is enabled, convert it to `Long`, and so on.
   
   However, neither solution is generic, and both are vulnerable to `null` 
values appearing in the first row for a particular dimension. Defining a regex 
pattern for this purpose is the best choice among these options.
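
The `null` problem with the first alternative can be illustrated with a small sketch. It is hypothetical code, not taken from Druid: `FirstValueTypeGuesser` and `guess` are made-up names, and the sketch assumes the type is fixed from the first value seen for a dimension. It also tries `Long` before `Float`, which differs from the order listed above, because any integer string also parses as a float.

```java
// Hypothetical sketch of value-based inference; not Druid code.
class FirstValueTypeGuesser {
  enum ColumnType { STRING, LONG, FLOAT }

  // Guesses a column type from the first value seen for a dimension.
  static ColumnType guess(Object firstValue) {
    if (firstValue == null) {
      // Nothing to inspect, so the dimension is stuck with String
      // even if every later row carries a number.
      return ColumnType.STRING;
    }
    String text = firstValue.toString().trim();
    try {
      Long.parseLong(text);
      return ColumnType.LONG;
    } catch (NumberFormatException notALong) {
      // Not an integer; try a floating-point parse next.
    }
    try {
      Float.parseFloat(text);
      return ColumnType.FLOAT;
    } catch (NumberFormatException notAFloat) {
      return ColumnType.STRING;
    }
  }

  public static void main(String[] args) {
    System.out.println(guess(123));      // numeric JSON value
    System.out.println(guess("1.5"));    // numeric string
    System.out.println(guess("app-x"));  // plain string
    System.out.println(guess(null));     // missing value in the first row
  }
}
```

A name-based pattern avoids this because the type is known before any value is inspected.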
   
   _The only open question is the format of the regex pattern._
   
   ### Operational impact
   
   This improvement does not change the existing code flow for dimensions that 
are already defined in the schema.
   
   ### Test plan
   
   Regular unit tests will be implemented.
   
   ### Future work 
   
   We should consider https://github.com/apache/incubator-druid/issues/658 when 
implementing this issue.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
