kamaci opened a new issue #7255: [PROPOSAL] Schemeless Dimensions Support
URL: https://github.com/apache/incubator-druid/issues/7255

### Motivation

Currently, with schemaless ingestion, all dimensions are ingested as String-typed dimensions [1]. However, automatic format detection can be configured for the timestamp as follows:

```
"timestampSpec" : {
  "format" : "auto",
  "column" : "ts"
}
```

There is no comparable way to detect field types via `parseSpec`. For example:

`{"ts":"2018-01-01T03:35:45Z","app_token":"guid1","eventName":"app-x","properties-key1":"123"}`

`{"ts":"2018-01-01T03:35:45Z","app_token":"guid2","eventName":"app-x","properties-key2":123}`

Both `properties-key1` and `properties-key2` are indexed as `String`. The expectation is that Druid indexes `properties-key2` as `Integer` or `Long`.

Druid should implement automatic field type detection for newly created fields, similar to [Solr Schemaless Mode](https://lucene.apache.org/solr/guide/7_6/schemaless-mode.html).

**PS 1**: I started a conversation about this task on the mailing list and then created an issue for it: [Druid Auto Field Type Detection](https://lists.apache.org/thread.html/f39b9d8284ab9604bc4dde2ca38c515def730c779f735cb47531506b@<dev.druid.apache.org>)

**PS 2**: I created a discussion issue for possible solutions to this problem: https://github.com/apache/incubator-druid/issues/7027

[1] [Druid Schema Design](http://druid.io/docs/latest/ingestion/schema-design.html)

### Proposed changes

#### Current Implementation

`IncrementalIndex.java` has the following code section:

```
if (desc != null) {
  capabilities = desc.getCapabilities();
} else {
  wasNewDim = true;
  capabilities = columnCapabilities.get(dimension);
  if (capabilities == null) {
    capabilities = new ColumnCapabilitiesImpl();
    // For schemaless type discovery, assume everything is a String for now, can change later.
    capabilities.setType(ValueType.STRING);
    capabilities.setDictionaryEncoded(true);
    capabilities.setHasBitmapIndexes(true);
    columnCapabilities.put(dimension, capabilities);
  }
  DimensionHandler handler = DimensionHandlerUtils.getHandlerFromCapabilities(dimension, capabilities, null);
  desc = addNewDimension(dimension, capabilities, handler);
}
```

#### Suggested Implementation

We can create regex-defined dimensions that capture undeclared dimensions and map them to types. A configuration such as the following could be defined for this purpose:

```
{
  "pattern": "*_l",
  "type": "long"
},
{
  "pattern": "*_f",
  "type": "float"
}
```

The flow would be as follows:

1. If the given dimension name matches a dimension in the defined schema, the code works as it does today.
2. If the given dimension name does not match any dimension in the defined schema:
   - The first pattern is checked against the dimension name; if the name ends with `_l`, the dimension is mapped to the `Long` type.
   - The second pattern is checked in the same manner; if the name ends with `_f`, the dimension is mapped to the `Float` type.
3. If no pattern matches, the dimension is assigned to a `String` field, as it is today.

### Rationale

Other possible solutions are:

1. If the value is convertible to `Float`, convert it to `Float`; otherwise try `Long`; otherwise fall back to `String`.
2. If the field name ends with `_l`, is not defined in the schema, and a config value of `schemeless mode` is enabled, convert it to `Long`, and so on for other suffixes.

However, neither solution is generic, and both are vulnerable to `null` values in the first row for a particular dimension. Defining a regex pattern for this purpose is the best choice among these.

_The only open question is the format of the regex pattern._

### Operational impact

This improvement does not change the existing code flow.

### Test plan

Regular test code will be implemented.

### Future work

We should consider https://github.com/apache/incubator-druid/issues/658 when implementing this issue.
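
As a rough illustration of steps 2 and 3 of the flow under Suggested Implementation, the sketch below resolves the type of an undeclared dimension from an ordered list of name patterns. It is not Druid code: `PatternTypeResolver`, `DimType`, and `addRule` are hypothetical names, `DimType` merely stands in for Druid's `ValueType`, and the patterns are assumed to be globs in which `*` matches any run of characters (the pattern format is still the open question noted above).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Druid: picks a type for an undeclared
// dimension by checking its name against an ordered list of patterns.
class PatternTypeResolver {
    // Stand-in for Druid's ValueType, limited to the types in the proposal.
    enum DimType { STRING, LONG, FLOAT }

    private static final class Rule {
        final Pattern regex;
        final DimType type;

        Rule(Pattern regex, DimType type) {
            this.regex = regex;
            this.type = type;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    // Assumption: patterns are globs where '*' matches any characters.
    // Each literal piece is quoted so that it cannot act as regex syntax.
    PatternTypeResolver addRule(String glob, DimType type) {
        String[] parts = glob.split("\\*", -1);
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                regex.append(".*");
            }
            regex.append(Pattern.quote(parts[i]));
        }
        rules.add(new Rule(Pattern.compile(regex.toString()), type));
        return this;
    }

    // Rules are tried in insertion order; the first full match wins.
    // If no rule matches, fall back to STRING, as schemaless ingestion does today.
    DimType resolve(String dimension) {
        for (Rule rule : rules) {
            if (rule.regex.matcher(dimension).matches()) {
                return rule.type;
            }
        }
        return DimType.STRING;
    }
}
```

With rules for `*_l` and `*_f` registered in that order, a name such as `count_l` would resolve to `LONG`, `ratio_f` to `FLOAT`, and `eventName` to `STRING`. Because the decision depends only on the dimension name, it is unaffected by whatever value (including `null`) appears in the first row.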
