siddharthteotia opened a new pull request #4747: Data Anonymizer Tool
URL: https://github.com/apache/incubator-pinot/pull/4747
 
 
   It is not always possible to use **production data** for:
   
   - Writing regression test frameworks without being dependent on externally 
available datasets and production data.
   - Performance benchmarking and functional evaluation of other systems like 
Druid, Azure Data Explorer (Kusto), Impala, etc.
   - Estimation: how will we do on 10TB of data? The tool can generate 
any volume of data for an arbitrary schema.
   
   **Why not use public datasets?**
   
   - Some of the public datasets we explored are modest in volume 
(1 million to a few million rows) compared to what Pinot currently runs on in 
production.
   - We want to benchmark on a dataset that closely resembles prod 
data so that we can make more informed decisions from the data points.
   - TPC-H would have been useful, but most of its queries are join-centric.
   
   The tool first understands the characteristics of production data that Pinot 
runs on and uses those characteristics to generate irreversible random data 
(one Avro file per segment).
   
   It preserves characteristics such as cardinality, distribution of values, and length.
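
   As a rough illustration of what preserving these characteristics can mean for a string column, here is a minimal Python sketch. It is not the code in this PR: the function name `anonymize_column`, the alphabet, and the retry limit are all assumptions made for the example. Each distinct original value is replaced by one random string of the same length, so the number of distinct values, the frequency of each value, and the value lengths carry over to the generated column.

```python
import random
import string
from collections import Counter

# Hypothetical alphabet for generated values (an assumption for this sketch).
ALPHABET = string.ascii_letters + string.digits


def anonymize_column(values, seed=0, max_attempts=1000):
    """Replace each distinct string with one random string of the same length."""
    rng = random.Random(seed)
    mapping = {}  # original value -> generated value
    used = set()  # generated values already handed out
    for value in values:
        if value in mapping:
            continue
        # Retry on collision so that distinct originals stay distinct,
        # which keeps the cardinality of the column unchanged.
        for _ in range(max_attempts):
            candidate = "".join(rng.choice(ALPHABET) for _ in value)
            if candidate not in used:
                break
        else:
            raise ValueError(f"no unused random value of length {len(value)}")
        mapping[value] = candidate
        used.add(candidate)
    # Reusing the mapping for repeated values preserves the distribution.
    return [mapping[v] for v in values]


original = ["alice", "bob", "alice", "carol", "bob", "alice"]
generated = anonymize_column(original)

print(Counter(original).most_common())
print(Counter(generated).most_common())
```

   The two printed frequency tables have the same shape: three distinct values occurring 3, 2, and 1 times. This sketch does not preserve ordering between values.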
   
   The data generation approach preserves query patterns. Example:
   
   SELECT * FROM Pinot_Prod_Table WHERE COL < 20000
   
   The tool builds a sorted global dictionary, which ensures that when it maps 
20000 to a randomly generated value V, all the original column values < 20000 are 
also mapped to randomly generated values < V. With this approach, if the original 
query returned 100k rows on the actual data, the generated query should 
return roughly the same number of rows on the generated data.
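
   The order-preserving idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation in this PR: `build_global_dictionary` and `map_value` are hypothetical names, the column is assumed to hold integers, and the predicate literal (20000) is assumed to be added to the dictionary alongside the column values. Sorting both the distinct originals and an equally sized set of distinct random values, then pairing them by position, gives a strictly increasing mapping.

```python
import bisect
import random


def build_global_dictionary(values, seed=0):
    """Pair sorted distinct originals with sorted distinct random values."""
    rng = random.Random(seed)
    originals = sorted(set(values))
    # rng.sample draws without replacement, so the generated values are
    # distinct; sorting them makes the i-th smallest original map to the
    # i-th smallest generated value.
    generated = sorted(rng.sample(range(10**9), len(originals)))
    return originals, generated


def map_value(originals, generated, value):
    """Look up the generated counterpart of an original value."""
    i = bisect.bisect_left(originals, value)
    if i == len(originals) or originals[i] != value:
        raise KeyError(value)
    return generated[i]


column = [5, 19999, 20000, 42, 30000, 20000, 7]
literal = 20000  # from: SELECT * FROM Pinot_Prod_Table WHERE COL < 20000

# Include the predicate literal so that it always has a mapping.
originals, generated = build_global_dictionary(column + [literal])

v = map_value(originals, generated, literal)
anonymized = [map_value(originals, generated, x) for x in column]

original_count = sum(1 for x in column if x < literal)
generated_count = sum(1 for x in anonymized if x < v)
print(original_count, generated_count)
```

   Because the mapping is strictly increasing, the rewritten predicate `COL < V` selects exactly the rows that `COL < 20000` selected in this small example.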
   
   The code is documented in detail, covering the purpose, usage, and 
implementation notes of each component:
   
   (1) Filter column extractor from query file
   (2) Global dictionary builder
   (3) Data generator
   (4) Query generator
