vogievetsky opened a new issue #7502: [Proposal] Data loader GUI
URL: https://github.com/apache/incubator-druid/issues/7502

### Motivation

Druid’s ingestion system is very versatile; however, the [ingestion specs]() are currently complicated and require a meticulous examination of the documentation to craft. Some of the complexity in the ingestion specs is due to the fact that they evolved naturally over the lifetime of the project, supporting more and more features over time. Another difficulty is that there is no way to know whether an ingestion spec will work as intended without submitting it to Druid and seeing what happens.

A well-made, step-by-step GUI with a helpful wizard would not only make it easier to get data loaded into Druid, but could also serve as an educational tool for all the features of the ingestion system. There would be less frustration and fewer failed tasks if users could get iterative feedback at every step of the process.

### Proposed changes

#### Web interface

The proposed change is to build on top of the [Druid web console](https://github.com/apache/incubator-druid/pull/6923) to create a GUI spec editor / data loader wizard. The specific ‘design direction’ is like that of TurboTax (or similar software): the ingestion spec is always available for viewing and editing, but the user is expected to interact with it through a series of steps arranged in a specific logical flow, each building on top of the previous one.

The web console (GUI) change will also be accompanied by a Druid-based “sampler” that can accept a partial spec and preview the resulting Druid data structure that would be generated. The UI would make repeated calls to the sampler module, showing the user a progressively more refined preview as they progress through the steps (just as TurboTax shows a running refund preview).

At a high level, the data loader would guide the user through the following steps:

- Connect and parse raw data
  - input config - configure the input location (Kafka, S3, local, etc.)
  - parser - configure the parser (JSON, CSV, TSV, regex, custom)
  - timestamp - configure the timestampSpec
- Transform and configure schema
  - transform - configure the transforms
  - filter - configure the ingest-time filter (if any)
  - dimensions + metrics - configure the dimensions and metrics
- Parameters
  - partition - partitioning options such as segment granularity, max segment size, secondary partitioning, and other options
  - tuning - define the tuning config
- Full spec - see the full spec (like seeing the full tax return in TurboTax)

Here are some potential designs for some of the steps:

A user selects the input source (configuring the `ioConfig`) and gets a preview of the raw data in that source.

A best effort is then made to choose a parser. The user can override the parser as needed.

Once the parser is chosen, the timestamp parsing can be configured and previewed.

The transforms apply on the parsed data.

Then the user can preview the schema and add dimensions and metrics (and turn rollup on and off) as needed.

And so on through all the steps. At any point the user can jump to see / directly edit the full spec. When they are satisfied, they click “Submit”, safe in the knowledge that everything will work.

#### Data Sampler

The data loader will be powered by a new sampler implementation that will be added to Druid’s core codebase.
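Before the class-by-class details below, here is a minimal sketch of the request/response cycle. The endpoint path and the `Sampler` / `SamplerResponse` names come from this proposal, but the JAX-RS wiring shown here is an assumption about how the resource might be exposed, not a finalized design:

```
import javax.ws.rs.Consumes;
import javax.ws.rs.POST;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;
import javax.ws.rs.core.Response;

// Hypothetical shape of the proposed resource; the details are assumptions.
@Path("/druid/indexer/v1/sampler")
public class SamplerResource
{
  @POST
  @Consumes(MediaType.APPLICATION_JSON)
  @Produces(MediaType.APPLICATION_JSON)
  public Response post(final Sampler sampler)
  {
    // Jackson resolves the concrete Sampler implementation (e.g. IndexTaskSampler)
    // from the "type" property of the request body, so the console can POST a
    // partial spec at any wizard step and render the returned preview rows.
    return Response.ok(sampler.sample()).build();
  }
}
```

The console would simply re-POST the updated spec after each step and re-render the preview from the response.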
Some of the primary classes/interfaces that will be added/modified:

##### Sampler

```
@JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = "type")
@JsonSubTypes(value = {
    @JsonSubTypes.Type(name = "index", value = IndexTaskSampler.class)
})
public interface Sampler
{
  SamplerResponse sample();
}
```

##### SamplerResource

Adds an endpoint on the Overlord at `POST /druid/indexer/v1/sampler` that receives an object in the request body which implements the `Sampler` interface.

##### IndexTaskSampler

This is an example of a sampler spec that can be used with any input source compatible with a native indexing task (i.e. `Firehose`-based). Note that it takes a complete `IndexIngestionSpec`, meaning that in addition to supporting arbitrary firehoses, it can also apply arbitrary parseSpecs, transformSpecs, and aggregators, and apply query granularity (the latter two are for previewing the effects of rollup).

```
public class IndexTaskSampler implements Sampler
{
  @JsonCreator
  public IndexTaskSampler(
      @JsonProperty("spec") final IndexTask.IndexIngestionSpec ingestionSchema,
      @JsonProperty("samplerConfig") final SamplerConfig samplerConfig
  )
}
```

Non-firehose-based ingestion methods (e.g. Kafka and Kinesis indexing) would add additional `Sampler` implementations.

##### SamplerResponseRow

```
public class SamplerResponseRow
{
  private final String raw;
  private final Map<String, Object> parsed;
  private final Boolean unparseable;
  private final String error;
  private final Boolean excludedByFilter;
}
```

Note that in addition to providing the `parsed` row after it has been parsed and run through an `IncrementalIndex`, the response also includes the `raw` row, which is helpful in the initial stages of the data loader for confirming that you are reading the intended source data and applying the correct parseSpec and timestampSpec.

##### Firehose

The Firehose interface needs to be modified to allow implementations to return the raw row (where applicable) in addition to the InputRow. A default method that does not return the raw row will handle implementations that do not support this:

```
/**
 * Returns an InputRowPlusRaw object containing the InputRow plus the raw, unparsed data corresponding to the next row
 * available. Used in the sampler to provide the caller with information to assist in configuring a parse spec. If a
 * ParseException is thrown by the parser, it should be caught and returned in the InputRowPlusRaw so we will be able
 * to provide information on the raw row which failed to be parsed. Should only be called if hasMore returns true.
 *
 * @return an InputRowPlusRaw which may contain any of: an InputRow, the raw data, or a ParseException
 */
default InputRowPlusRaw nextRowWithRaw()
{
  try {
    return InputRowPlusRaw.of(nextRow(), null);
  }
  catch (ParseException e) {
    return InputRowPlusRaw.of(null, e);
  }
}
```
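As an illustration of how a concrete implementation might override this default, here is a hedged sketch of a simple line-oriented Firehose. The class itself is hypothetical, and the `InputRowPlusRaw.of` factory shapes are assumed from the default method above (with the raw payload passed through as the original string):

```
import java.util.Iterator;

import org.apache.druid.data.input.Firehose;
import org.apache.druid.data.input.InputRow;
import org.apache.druid.data.input.impl.StringInputRowParser;
import org.apache.druid.java.util.common.parsers.ParseException;

// Hypothetical in-memory, line-oriented Firehose; InputRowPlusRaw is the new
// class proposed above (its package is assumed).
public class IteratorBasedFirehose implements Firehose
{
  private final Iterator<String> lines;
  private final StringInputRowParser parser;

  public IteratorBasedFirehose(Iterator<String> lines, StringInputRowParser parser)
  {
    this.lines = lines;
    this.parser = parser;
  }

  @Override
  public boolean hasMore()
  {
    return lines.hasNext();
  }

  @Override
  public InputRow nextRow()
  {
    return parser.parse(lines.next());
  }

  @Override
  public InputRowPlusRaw nextRowWithRaw()
  {
    final String raw = lines.next();
    try {
      return InputRowPlusRaw.of(parser.parse(raw), raw);
    }
    catch (ParseException e) {
      // Return the raw line together with the exception so the data loader can
      // show the user exactly which input failed to parse and why.
      return InputRowPlusRaw.of(raw, e);
    }
  }

  @Override
  public Runnable commit()
  {
    return () -> {};
  }

  @Override
  public void close()
  {
  }
}
```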
##### FirehoseFactory

The FirehoseFactory interface may need to be modified to add a connection method suitable for a sampler. This would indicate to the Firehose that the sampler only needs a limited amount of data, so it should not pre-cache an entire directory of files (`PrefetchableTextFilesFirehoseFactory`, I'm looking at you!). We may be able to skip this and instead document that API consumers should use Firehose configurations appropriate for the sampler:

```
default Firehose connectForSampler(T parser, @Nullable File temporaryDirectory) throws IOException, ParseException
{
  return connect(parser, temporaryDirectory);
}
```

##### Other considerations

A desirable design goal for the sampler is that it should return "processed" results for the same set of raw data every time, and the data should maintain a consistent ordering whenever possible (ideally the order in which it was read by the Firehose). This will make for a better user experience as the user moves through the pages of the data loader (raw data -> parsed -> timestamp column identified -> transformed -> filtered -> column data types applied). The proposal is to add a temporary internal column as a 'metric' that we can use to sort the results read back out of the `IncrementalIndex` (a rough sketch of this idea appears at the end of this proposal).

For streaming-based sources, we would also want to cache the raw data so that we can continually feed the same raw data into the parser, letting the user see the effects of their changes. Our proposal is to use one of Druid's caching implementations to support this, perhaps the CaffeineCache.

### Rationale

This would greatly simplify the complexity of on-boarding data into Druid. A potential alternative would be to instead invest the effort into simplifying the ingestion spec API.

### Operational impact

None.

### Test plan

The data loader will be tested and improved through user testing.

### Future work

Assuming the proposed project is successful, it would be beneficial to adjust the existing data loading documentation to focus more on the data loading flow. Furthermore, if the logical model of the data loader flow above intuitively connects with people, it could foster changes to the ingestion spec API itself.
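For concreteness, here is a rough sketch of the temporary sort-column idea mentioned under "Other considerations" above. Every name in it (the reserved column, the wrapping helper, the use of a max aggregator) is a hypothetical illustration, not a settled design:

```
import java.util.HashMap;
import java.util.Map;

import org.apache.druid.data.input.InputRow;
import org.apache.druid.data.input.MapBasedInputRow;

// Hypothetical sketch: tag each parsed row with an increasing counter in a
// reserved column so the original read order survives the IncrementalIndex.
public class SamplerRowOrdering
{
  static final String SAMPLER_ORDER_COLUMN = "__internal_sampler_order";

  // Wrap each InputRow as it is read from the Firehose, injecting the counter.
  // Assumes map-based rows for simplicity.
  static InputRow withOrder(MapBasedInputRow row, long order)
  {
    Map<String, Object> event = new HashMap<>(row.getEvent());
    event.put(SAMPLER_ORDER_COLUMN, order);
    return new MapBasedInputRow(row.getTimestamp(), row.getDimensions(), event);
  }

  // The sampler's IncrementalIndex would then include an extra aggregator such
  // as LongMaxAggregatorFactory(SAMPLER_ORDER_COLUMN, SAMPLER_ORDER_COLUMN), so
  // even after rollup each result row carries a sort key; the sampler sorts the
  // SamplerResponseRows by this metric and strips the column before responding.
}
```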
