vogievetsky opened a new issue #7502: [Proposal] Data loader GUI
URL: https://github.com/apache/incubator-druid/issues/7502
 
 
   ### Motivation
   
   Druid’s ingestion system is very versatile; however, the [ingestion 
specs]() in Druid are currently complicated and require a meticulous examination of the 
documentation to craft. Some of the complexity in the ingestion specs is due to 
the fact that they evolved naturally over the lifetime of the project, 
supporting more and more features over time. Another difficulty is that there is no 
way to know whether an ingestion spec will work as intended without submitting it to 
Druid and seeing what happens.
   
   A well-made, step-by-step GUI with a helpful wizard would not only make it 
easier to get data loaded into Druid but could also serve as an educational tool 
for all the features of the ingestion system. Users would experience less frustration 
and see fewer failed tasks if they could iteratively get feedback at every step 
of the process. 
   
   ### Proposed changes
   
   #### Web interface
   
   The proposed change is to build on top of the [Druid web 
console](https://github.com/apache/incubator-druid/pull/6923) to create a GUI 
spec editor / data loader wizard.
   
   The specific ‘design direction’ is like that of TurboTax (or similar 
software). The ingestion spec will always be available for viewing / editing, 
but the user is expected to interact with it through a series of steps 
arranged in a logical flow, each building on the last. The web 
console (GUI) change will also be accompanied by a Druid-based “sampler” that 
will be able to accept a partial spec and preview the resulting Druid data 
structure that will be generated. The UI would make repeated calls to the 
sampler module, showing the user a progressively more refined preview as they 
progress through the steps (just like TurboTax gives a refund preview).
   
   At the high level the data loader would guide the user through the following 
steps:
   
   - Connect and parse raw data
     - input config - configure the input location (Kafka, S3, local, etc.)
     - parser - configure the parser (json, csv, tsv, regex, custom)
     - timestamp - configure the timestampSpec
   - Transform and configure schema
     - transform - configure the transforms
     - filter - configure the ingest time filter (if any)
     - dimensions + metrics - configure the dimensions and metrics
   - Parameters
     - partition - partitioning options like segment granularity, max segment 
size, secondary partitioning, and other options
     - tuning - define the tuning config
   - Full spec - see the full spec (like seeing the full tax return in TurboTax)
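   As a rough illustration, the steps above map onto the major sections of a 
native batch ingestion spec: "Connect and parse raw data" onto the `ioConfig`, 
`parser`, and `timestampSpec`; "Transform and configure schema" onto the 
`transformSpec`, `dimensionsSpec`, and `metricsSpec`; and "Parameters" onto the 
`granularitySpec` and `tuningConfig`. A skeleton of the spec the wizard would 
assemble (all field values here are placeholders, not a complete working spec):

    ```
    {
      "type": "index",
      "spec": {
        "ioConfig": {
          "type": "index",
          "firehose": { "type": "local", "baseDir": "...", "filter": "*.json" }
        },
        "dataSchema": {
          "dataSource": "my_datasource",
          "parser": {
            "type": "string",
            "parseSpec": {
              "format": "json",
              "timestampSpec": { "column": "timestamp", "format": "iso" },
              "dimensionsSpec": { "dimensions": ["dim1", "dim2"] }
            }
          },
          "transformSpec": { "transforms": [], "filter": null },
          "metricsSpec": [ { "type": "count", "name": "count" } ],
          "granularitySpec": {
            "segmentGranularity": "DAY",
            "queryGranularity": "HOUR",
            "rollup": true
          }
        },
        "tuningConfig": { "type": "index" }
      }
    }
    ```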
   
   Here are some potential designs for some of the steps:
   
   A user selects the input source (configuring the `ioConfig`) and gets a 
preview of the raw data in that source:
   
   ![localhost_18081_unified-console html(Doc Screenshot) 
(2)](https://user-images.githubusercontent.com/177816/56330024-25ffda00-613b-11e9-8a46-e471c29013f3.png)
   
   A best effort is then made to choose a parser. The user can override the 
parser as needed:
   
   ![localhost_18081_unified-console html(Doc Screenshot) 
(3)](https://user-images.githubusercontent.com/177816/56330028-2b5d2480-613b-11e9-94e2-52ffdf974ae2.png)
   
   
   Once the parser is chosen the timestamp parsing can be configured and 
previewed:
   
   ![localhost_18081_unified-console html(Doc Screenshot) 
(5)](https://user-images.githubusercontent.com/177816/56330057-4af44d00-613b-11e9-8203-b76d15a0522d.png)
   
   Transforms are applied to the parsed data:
   
   ![localhost_18081_unified-console html(Doc Screenshot) 
(4)](https://user-images.githubusercontent.com/177816/56330042-416ae500-613b-11e9-8b86-e6f2ea1b41ec.png)
   
   Then the user can preview the schema and add dimensions and metrics (and 
turn rollup on and off) as needed:
   
   ![localhost_18081_unified-console html(Doc Screenshot) 
(6)](https://user-images.githubusercontent.com/177816/56330104-79722800-613b-11e9-81fa-2823c17b1ac6.png)
   
   
   And so on through all the steps.
   
   At any point the user can jump to see / directly edit the full spec:
   
   ![localhost_18081_unified-console html(Doc Screenshot) 
(7)](https://user-images.githubusercontent.com/177816/56330107-7e36dc00-613b-11e9-8a0a-bd60f753f74e.png)
   
   
   And when they are satisfied, they can click “Submit”, safe in the knowledge 
that everything will work.
   
   #### Data Sampler
   
   The data loader will be powered by a new sampler implementation that will be 
added to Druid’s core codebase. Some of the primary classes/interfaces that 
will be added/modified:
   
   ##### Sampler
   
   ```
   @JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = "type")
   @JsonSubTypes(value = {
       @JsonSubTypes.Type(name = "index", value = IndexTaskSampler.class)
   })
   public interface Sampler
   {
     SamplerResponse sample();
   }
   ```
   
   ##### SamplerResource
   Adds an endpoint on the overlord at `POST /druid/indexer/v1/sampler` and 
receives an object in the request body that implements the `Sampler` interface.
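   For example, a request to the sampler endpoint might look like the 
following (the `samplerConfig` field names shown here, `numRows` and 
`timeoutMs`, are hypothetical names used for illustration):

    ```
    POST /druid/indexer/v1/sampler
    Content-Type: application/json

    {
      "type": "index",
      "spec": { ... },
      "samplerConfig": {
        "numRows": 200,
        "timeoutMs": 10000
      }
    }
    ```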
   
   ##### IndexTaskSampler
   This is an example of a sampler spec that can be used with any input source 
compatible with a native indexing task (i.e. `Firehose`-based). Note that it 
takes a complete `IndexIngestionSpec`, meaning that in addition to supporting 
arbitrary firehoses, it can also apply arbitrary parseSpecs, transformSpecs, 
and aggregators, and apply query granularity (the latter two are for previewing the 
effects of rollup).
   
   ```
   public class IndexTaskSampler implements Sampler
   {
     @JsonCreator
     public IndexTaskSampler(
         @JsonProperty("spec") final IndexTask.IndexIngestionSpec 
ingestionSchema,
         @JsonProperty("samplerConfig") final SamplerConfig samplerConfig
     )
   ```
   
   Non-firehose based ingestion methods (e.g. Kafka and Kinesis indexing) would 
add additional `Sampler` implementations.
   
   ##### SamplerResponseRow
   
   ```
   public class SamplerResponseRow
   {
     private final String raw;
     private final Map<String, Object> parsed;
     private final Boolean unparseable;
     private final String error;
     private final Boolean excludedByFilter;
   }
   ```
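   A successfully parsed row and an unparseable row might be serialized as, 
respectively (values here are illustrative only):

    ```
    {
      "raw": "{\"timestamp\": \"2019-04-15T17:00:00Z\", \"page\": \"Druid\", \"added\": 37}",
      "parsed": { "__time": 1555347600000, "page": "Druid", "added": 37 }
    }

    {
      "raw": "not valid json",
      "unparseable": true,
      "error": "Unable to parse row [not valid json]"
    }
    ```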
   
   Note that in addition to providing the `parsed` row after it has been parsed 
and run through an `IncrementalIndex`, it also includes the `raw` row which is 
helpful for the initial stages of the data loader, to determine that you are 
reading the intended source data and applying the correct parseSpec and 
timestampSpec.
   
   ##### Firehose
   The Firehose interface needs to be modified to allow implementations to 
return the raw row (where applicable) in addition to the InputRow. A default 
method that does not return the raw row will handle implementations that do not 
support this:
   
   ```
     /**
      * Returns an InputRowPlusRaw object containing the InputRow plus the raw, 
unparsed data corresponding to the next row
      * available. Used in the sampler to provide the caller with information 
to assist in configuring a parse spec. If a
      * ParseException is thrown by the parser, it should be caught and 
returned in the InputRowPlusRaw so we will be able
      * to provide information on the raw row which failed to be parsed. Should 
only be called if hasMore returns true.
      *
      * @return an InputRowPlusRaw which may contain any of: an InputRow, the 
raw data, or a ParseException
      */
     default InputRowPlusRaw nextRowWithRaw()
     {
       try {
         return InputRowPlusRaw.of(nextRow(), null);
       } catch (ParseException e) {
         return InputRowPlusRaw.of(null, e);
       }
     }
   ```
   
   ##### FirehoseFactory
   The FirehoseFactory interface may need to be modified to add a connection 
method suitable for a sampler. This means indicating to the sampler that we 
only need a limited amount of data and should not pre-cache an entire directory of 
files (`PrefetchableTextFilesFirehoseFactory`, I'm looking at you!). We may be 
able to skip this and instead document that API consumers should use appropriate 
Firehose configurations for the sampler:
   
   ```
     default Firehose connectForSampler(T parser, @Nullable File 
temporaryDirectory)  throws IOException, ParseException
     {
       return connect(parser, temporaryDirectory);
     }
   ```
   
   ##### Other considerations
   
   A desirable design goal for the sampler is that it should return "processed" 
results on the same set of raw data every time, and the data should maintain a 
consistent ordering whenever possible (ideally the order in which it was read by the 
Firehose). This will make for a better user experience as users go through the 
different pages of the data loader (raw data -> parsed -> timestamp column 
identified -> transformed -> filtered -> column data types applied). The 
proposal is to add a temporary internal column as a 'metric' that can be used to 
sort the results read back out of the `IncrementalIndex`. For 
streaming-based sources, we would also want to cache the raw data so that we 
can continually feed the same raw data into the parser and the user can see the 
effects of their changes. Our proposal is to use one of Druid's caching 
implementations to support this, perhaps the CaffeineCache.
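   A minimal sketch of the ordering idea, assuming a hypothetical internal 
column name `__sampler_order` (all names here are illustrative, not part of 
the proposed API): each row is tagged with its read position before being fed 
into the index, and rows read back out are sorted by that value to restore 
Firehose read order.

    ```java
    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class SamplerOrdering
    {
      // Hypothetical name for the temporary internal 'metric' used only for ordering.
      static final String ORDER_COLUMN = "__sampler_order";

      // Tag each row with its read position before feeding it into the IncrementalIndex.
      static Map<String, Object> tag(Map<String, Object> row, long position)
      {
        Map<String, Object> tagged = new HashMap<>(row);
        tagged.put(ORDER_COLUMN, position);
        return tagged;
      }

      // Restore the original Firehose read order on rows coming back out of the
      // index (which may have reordered them), then drop the internal column.
      static List<Map<String, Object>> restoreOrder(List<Map<String, Object>> rows)
      {
        List<Map<String, Object>> sorted = new ArrayList<>(rows);
        sorted.sort(Comparator.comparingLong(r -> (Long) r.get(ORDER_COLUMN)));
        sorted.forEach(r -> r.remove(ORDER_COLUMN));
        return sorted;
      }
    }
    ```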
   
   ### Rationale
   
   This would greatly simplify on-boarding data into Druid.
   
   A potential alternative to this would be to simply invest effort into 
simplifying the ingestion spec API.
   
   ### Operational impact
   
   None
   
   ### Test plan
   
   The data loader will be tested and improved through user testing.
   
   ### Future work
   
   Assuming the above proposed project is successful, it would be beneficial to 
refocus the existing data-loading documentation on the data 
loading flow. Furthermore, if the logical model of the data loader flow above 
intuitively connects with people, it could foster changes to the ingestion spec 
API itself.
   
