apucher opened a new issue #5117: Synthetic Data Generator
URL: https://github.com/apache/incubator-pinot/issues/5117
 
 
   **Design Doc**: 
https://cwiki.apache.org/confluence/display/INCUBATOR/Synthetic+Data+Generator+for+Pinot
   
   As Pinot becomes easier to set up and explore, we are hitting limits on 
(a) which data sets we can include, (b) how much data we can package with the 
distribution, and (c) how well these data sets showcase Apache Pinot and its 
ecosystem. This is true for both the source distributions and the pre-made 
Docker images. Many public data sets are available for personal or academic 
use only and therefore, strictly speaking, cannot be packaged or otherwise 
redistributed with Apache Pinot. Additionally, we can only package so much 
data before bloating the repository and images. Finally, pre-existing data 
sets may not showcase or stress a specific part of Pinot for testing or 
demonstration purposes.
   
   One way to work around these limitations is to generate synthetic "mock" 
data that looks and feels like real data sets without actually including the 
original data. Instead of shipping pre-made data sets, we can generate time 
series from templates and features that we designed or extracted previously. 
This works around both the licensing and the capacity issues, and allows us to 
generate well-suited testing and demo data on demand.
   
   **Proposed Approach**
   We want to add support for complex data generator "templates" to 
pinot-admin. The existing tool already has rudimentary abilities to generate 
data for benchmarking or testing, but this data is strictly random noise and 
usually unsuited for dimensional breakdowns. We propose to add generator 
templates that produce time series that would appear familiar to developers, 
analysts, and other stakeholders of businesses and intuitively "make sense". 
For example, these templates could produce diurnal (day-night) page view and 
click time series for an imaginary website, or long-tail (spiky) error metrics 
that decompose sensibly into multiple dimensions. This approach is easy to 
extend: new templates can be added as needed.
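
   To make the idea concrete, here is a minimal Python sketch of the two kinds 
of series described above. It is not the proposed implementation (which would 
live in pinot-admin); the function names, parameters, and default values are 
all assumptions made for illustration.

```python
import math
import random


def diurnal_series(hours, base=1000.0, amplitude=0.6, noise=0.05, seed=42):
    """Hourly page views on a 24-hour sine cycle with mild noise (illustrative)."""
    rng = random.Random(seed)
    series = []
    for h in range(hours):
        # Peak at 15:00 and trough at 03:00; an arbitrary choice for this sketch.
        cycle = math.sin(2 * math.pi * ((h % 24) - 9) / 24)
        level = base * (1 + amplitude * cycle)
        series.append(max(0.0, level * (1 + rng.gauss(0, noise))))
    return series


def rare_event_series(hours, rate=0.02, scale=50.0, shape=1.5, seed=7):
    """Mostly-zero error counts with occasional heavy-tailed (Pareto) spikes."""
    rng = random.Random(seed)
    return [
        scale * rng.paretovariate(shape) if rng.random() < rate else 0.0
        for _ in range(hours)
    ]


def split_by_dimension(series, weights):
    """Split a total series across dimension values in fixed proportions."""
    total_weight = sum(weights.values())
    return {
        name: [value * w / total_weight for value in series]
        for name, w in weights.items()
    }
```

   For example, `split_by_dimension(diurnal_series(48), {"US": 5, "DE": 3, "IN": 2})` 
yields three per-country series that add back up to the overall page views, so 
a dimensional breakdown of the generated data stays consistent with the total.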
   
   We would reuse pinot-admin's "GenerateData" command and extend the existing 
schema annotations with a "template" property that enables both Pinot 
contributors and Pinot users to configure arbitrary generator templates in the 
familiar JSON format. We provide several examples in the design doc.
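
   As a rough sketch of how a "template" property might drive generation, the 
Python below reads JSON annotations and dispatches each column to a generator. 
The JSON field names (`column`, `type`, `mean`, `amplitude`, `rate`, `scale`) 
and the template names are hypothetical; the actual annotation format is the 
one defined in the design doc, not this one.

```python
import json
import math
import random

# Hypothetical annotations; every field name here is illustrative only.
ANNOTATIONS_JSON = """
[
  {"column": "views",  "template": {"type": "seasonal", "mean": 1000, "amplitude": 0.5}},
  {"column": "errors", "template": {"type": "spike", "rate": 0.05, "scale": 20}}
]
"""


def make_seasonal(cfg, rng):
    mean, amplitude = cfg["mean"], cfg["amplitude"]

    def generate(hour):
        # Noise-free 24-hour sine cycle around the configured mean.
        return mean * (1 + amplitude * math.sin(2 * math.pi * hour / 24))

    return generate


def make_spike(cfg, rng):
    rate, scale = cfg["rate"], cfg["scale"]

    def generate(hour):
        # Zero most of the time, with an occasional heavy-tailed spike.
        return scale * rng.paretovariate(1.5) if rng.random() < rate else 0.0

    return generate


# Registry of template types; adding a template means adding one entry.
TEMPLATES = {"seasonal": make_seasonal, "spike": make_spike}


def build_generators(annotations_json, seed=0):
    """Map each annotated column to a generator function."""
    rng = random.Random(seed)
    generators = {}
    for annotation in json.loads(annotations_json):
        template = annotation["template"]
        factory = TEMPLATES.get(template["type"])
        if factory is None:
            raise ValueError(f"unknown template type: {template['type']}")
        generators[annotation["column"]] = factory(template, rng)
    return generators


def generate_rows(generators, hours):
    """Produce one row per hour with a value for every configured column."""
    return [
        {"hour": h, **{col: gen(h) for col, gen in generators.items()}}
        for h in range(hours)
    ]
```

   Keeping the template types in a registry is one way to get the 
extensibility the proposal asks for: a new template is a new factory function 
plus one dictionary entry, with no change to the parsing code.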
   
   **Time Series Examples**
   A selection is shown below. See the design doc for more examples.
   
   Seasonal time series
   
![image](https://user-images.githubusercontent.com/25439965/76014202-0fd34780-5ece-11ea-92a0-cab08e03bf76.png)
   
   Rare events time series
   
![image](https://user-images.githubusercontent.com/25439965/76014281-2d081600-5ece-11ea-9e0b-a06ac0241295.png)
   
   

