surekhasaharan commented on a change in pull request #6122: New docs intro
URL: https://github.com/apache/incubator-druid/pull/6122#discussion_r208769975
 
 

 ##########
 File path: docs/content/design/index.md
 ##########
 @@ -2,152 +2,193 @@
 layout: doc_page
 ---
 
-# Druid Concepts
+# What is Druid?
+
+Druid is a data store designed for high-performance slice-and-dice analytics
+("[OLAP](http://en.wikipedia.org/wiki/Online_analytical_processing)"-style) on large data sets. Druid is most often
+used as a data store for powering GUI analytical applications, or as a backend for highly-concurrent APIs that need
+fast aggregations. Common application areas for Druid include:
+
+- Clickstream analytics
+- Network flow analytics
+- Server metrics storage
+- Application performance metrics
+- Digital marketing analytics
+- Business intelligence / OLAP
+
+Druid's key features are:
+
+1. **Columnar storage format.** Druid uses column-oriented storage, meaning it only needs to load the exact columns
+needed for a particular query.  This gives a huge speed boost to queries that only hit a few columns. In addition, each
+column is stored optimized for its particular data type, which supports fast scans and aggregations.
+2. **Scalable distributed system.** Druid is typically deployed in clusters of tens to hundreds of servers, and can
+offer ingest rates of millions of records/sec, retention of trillions of records, and query latencies of sub-second to a
+few seconds.
+3. **Massively parallel processing.** Druid can process a query in parallel across the entire cluster.
+4. **Realtime or batch ingestion.** Druid can ingest data either realtime (ingested data is immediately available for
+querying) or in batches.
+5. **Self-healing, self-balancing, easy to operate.** As an operator, to scale the cluster up or down, simply add or
+remove servers and the cluster will rebalance itself automatically, in the background, without any downtime. If any
+Druid servers fail, the system will automatically route around the damage until those servers can be replaced. Druid
+is designed to run 24/7 with no need for planned downtimes for any reason, including configuration changes and software
+updates.
+6. **Cloud-native, fault-tolerant architecture that won't lose data.** Once Druid has ingested your data, a copy is
+stored safely in [deep storage](#deep-storage) (typically cloud storage, HDFS, or a shared filesystem). Your data can be
+recovered from deep storage even if every single Druid server fails. For more limited failures affecting just a few
+Druid servers, replication ensures that queries are still possible while the system recovers.
+7. **Indexes for quick filtering.** Druid uses [CONCISE](https://arxiv.org/pdf/1004.0403) or
+[Roaring](https://roaringbitmap.org/) compressed bitmap indexes to create indexes that power fast filtering and
+searching across multiple columns.
+8. **Approximate algorithms.** Druid includes algorithms for approximate count-distinct, approximate ranking, and
+computation of approximate histograms and quantiles. These algorithms offer bounded memory usage and are often
+substantially faster than exact computations. For situations where exactness is more important than speed, Druid also
 
 Review comment:
   Might be better to write as `where accuracy is more important than speed`

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

With regards,
Apache Git Services
