jon-wei commented on a change in pull request #6122: New docs intro
URL: https://github.com/apache/incubator-druid/pull/6122#discussion_r208768247

##########
File path: docs/content/ingestion/overview.md
##########
@@ -0,0 +1,279 @@
+---
+layout: doc_page
+---
+
+# Ingestion
+
+## Overview
+
+### Datasources and segments
+
+Druid data is stored in "datasources", which are similar to tables in a traditional RDBMS. Each datasource is
+partitioned by time and, optionally, further partitioned by other attributes. Each time range is called a "chunk" (for
+example, a single day, if your datasource is partitioned by day). Within a chunk, data is partitioned into one or more
+"segments". Each segment is a single file, typically comprising up to a few million rows of data. Since segments are
+organized into time chunks, it's sometimes helpful to think of segments as living on a timeline like the following:
+
+<img src="../../img/druid-timeline.png" width="800" />
+
+A datasource may have anywhere from just a few segments, up to hundreds of thousands and even millions of segments. Each
+segment starts life being created on a MiddleManager, and at that point, is mutable and uncommitted. The segment
+building process includes the following steps, designed to produce a data file that is compact and supports fast
+queries:
+
+- Conversion to columnar format
+- Indexing with bitmap indexes
+- Compression using various algorithms
+    - Dictionary encoding with id storage minimization for String columns
+    - Bitmap compression for bitmap indexes
+    - Type-aware compression for all columns
+
+Periodically, segments are published (committed). At this point, they are written to deep storage, become immutable, and
+move from MiddleManagers to the Historical processes. An entry about the segment is also written to the metadata store.
+This entry is a self-describing bit of metadata about the segment, including things like the schema of the segment, its
+size, and its location on deep storage. These entries are what the Coordinator uses to know what data *should* be
+available on the cluster.
+
+For details on the segment file format, please see [segment files](../design/segments.html)

Review comment:
added period
