jon-wei commented on a change in pull request #6122: New docs intro URL: https://github.com/apache/incubator-druid/pull/6122#discussion_r208768392
########## File path: docs/content/ingestion/overview.md ########## @@ -0,0 +1,279 @@ +--- +layout: doc_page +--- + +# Ingestion + +## Overview + +### Datasources and segments + +Druid data is stored in "datasources", which are similar to tables in a traditional RDBMS. Each datasource is +partitioned by time and, optionally, further partitioned by other attributes. Each time range is called a "chunk" (for +example, a single day, if your datasource is partitioned by day). Within a chunk, data is partitioned into one or more +"segments". Each segment is a single file, typically comprising up to a few million rows of data. Since segments are +organized into time chunks, it's sometimes helpful to think of segments as living on a timeline like the following: + +<img src="../../img/druid-timeline.png" width="800" /> + +A datasource may have anywhere from just a few segments, up to hundreds of thousands and even millions of segments. Each +segments starts life off being created on a MiddleManger, and at that point, is mutable and uncommitted. The segment +building process includes the following steps, designed to produce a data file that is compact and supports fast +queries: + +- Conversion to columnar format +- Indexing with bitmap indexes +- Compression using various algorithms + - Dictionary encoding with id storage minimization for String columns + - Bitmap compression for bitmap indexes + - Type-aware compression for all columns + +Periodically, segments are published (committed). At this point, they are written to deep storage, become immutable, and +move from MiddleManagers to the Historical processes. An entry about the segment is also written to the metadata store. +This entry is a self- describing bit of metadata about the segment, including things like the schema of the segment, its +size, and its location on deep storage. These entries are what the Coordinator uses to know what data *should* be +available on the cluster. + +For details on the segment file format, please see [segment files](../design/segments.html) + +#### Segment identifiers + +Segments all have a four-part identifier with the following components: + +- Datasource name. +- Time interval (for the time chunk containing the segment; this corresponds to the `segmentGranularity` specified +at ingestion time). +- Version number (generally an ISO8601 timestamp corresponding to when the segment set was first started). +- Partition number (an integer, unique within a datasource+interval+version; may not necessarily be contiguous). + +For example, this is the identifier for a segment in datasource `clarity-cloud0`, time chunk +`2018-05-21T16:00:00.000Z/2018-05-21T17:00:00.000Z`, version `2018-05-21T15:56:09.909Z`, and partition number 1: + +``` +clarity-cloud0_2018-05-21T16:00:00.000Z_2018-05-21T17:00:00.000Z_2018-05-21T15:56:09.909Z_1 +``` + +Segments with partition number 0 (the first partition in a chunk) omit the partition number, like the following +example, which is a segment in the same time chunk as the previous one, but with partition number 0 instead of 1: + +``` +clarity-cloud0_2018-05-21T16:00:00.000Z_2018-05-21T17:00:00.000Z_2018-05-21T15:56:09.909Z +``` + +#### Segment versioning + +You may be wondering what the "version number" described in the previous section is for. Or, you might not be, in which +case good for you and you can skip this section! + +It's there to support batch-mode overwriting. In Druid, if all you ever do is append data, then there will be just a +single version for each time chunk. But when you overwrite data, what happens behind the scenes is that a new set of +segments is created with the same datasource, same time interval, but a higher version number. This is a signal to the +rest of the Druid system that the older version should be removed from the cluster, and the new version should replace +it. + +The switch appears to happen instantaneously to a user, because Druid handles this by first loading the new data (but +not allowing it to be queried), and then, as soon as the new data is all loaded, switching all new queries to use those +new segments. Then it drops the old segments a few minutes later. + + +#### Segment states + +Segments can be either _available_ or _unavailable_, which refers to whether or not they are currently served by some +Druid server process. They can also be _published_ or _unpublished_, which refers to whether or not they have been +written to deep storage and the metadata store. And published segments can be either _used_ or _unused_, which refers to +whether or not Druid considers them active segments that should be served. + +Putting these together, there are five basic states that a segment can be in: + +- **Published, available, and used:** These segments are published in deep storage and the metadata store, and they are +served by Historical processes. They are the majority of active data in a Druid cluster (they include everything except +in-flight realtime data). +- **Published, available, and unused:** These segments are being served by Historicals, but won't be for very long. They +are may be segments that have recently been overwritten (see [Segment versioning](#segment-versioning)) or dropped for +other reasons (like drop rules, or being dropped manually). +- **Published, unavailable, and used:** These segments are published in deep storage and the metadata store, and +_should_ be served, but are not actually being served. If segments stay in this state for more than a few minutes, it's +usually because something is wrong. Some of the more common causes include: failure of a large number of Historicals; Review comment: Changed to commas ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
