techdocsmith commented on code in PR #12808: URL: https://github.com/apache/druid/pull/12808#discussion_r926021879
##########
docs/design/segments.md:
##########
@@ -143,37 +143,33 @@ the 'column data' is an array of values. Additionally, a row with *n* values in 'column data' will have *n* non-zero valued entries in bitmaps.

-## SQL Compatible Null Handling
+## SQL compatible null handling

 By default, Druid string dimension columns use the values `''` and `null` interchangeably and numeric and metric columns can not represent `null` at all, instead coercing nulls to `0`. However, Druid also provides a SQL compatible null handling mode, which must be enabled at the system level, through `druid.generic.useDefaultValueForNull`. This setting, when set to `false`, will allow Druid to _at ingestion time_ create segments whose string columns can distinguish `''` from `null`, and numeric columns which can represent `null` valued rows instead of `0`. String dimension columns contain no additional column structures in this mode, instead just reserving an additional dictionary entry for the `null` value. Numeric columns however will be stored in the segment with an additional bitmap whose set bits indicate `null` valued rows. In addition to slightly increased segment sizes, SQL compatible null handling can incur a performance cost at query time as well, due to the need to check the null bitmap. This performance cost only occurs for columns that actually contain nulls.

-## Naming Convention
+## Naming convention

 Identifiers for segments are typically constructed using the segment datasource, interval start time (in ISO 8601 format), interval end time (in ISO 8601 format), and a version. If data is additionally sharded beyond a time range, the segment identifier will also contain a partition number. An example segment identifier may be: datasource_intervalStart_intervalEnd_version_partitionNum

-## Segment Components
+## Segment components

 Behind the scenes, a segment is comprised of several files, listed below.
 * `version.bin`
- 4 bytes representing the current segment version as an integer. E.g., for v9 segments, the version is 0x0, 0x0, 0x0, 0x9
+ 4 bytes representing the current segment version as an integer. For example, for v9 segments, the version is 0x0, 0x0, 0x0, 0x9.

 * `meta.smoosh`
- A file with metadata (filenames and offsets) about the contents of the other `smoosh` files
+ A file with metadata (filenames and offsets) about the contents of the other `smoosh` files.

 * `XXXXX.smoosh`
- There are some number of these files, which are concatenated binary data

Review Comment:
We should just fix lines 172 & 174 rather than remove entirely. Smoosh (`.smoosh`) files are concatenated binary data. This file consolidation reduces the number of file descriptors that must be open when accessing data. The files should be 2GB or less in size to remain within the limit of a memory mapped `ByteBuffer` in Java. Smoosh files contain individual files for each column in the data and an `index.drd` file that contains additional segment metadata.

##########
docs/design/segments.md:
##########
@@ -143,37 +143,33 @@ the 'column data' is an array of values. Additionally, a row with *n* values in 'column data' will have *n* non-zero valued entries in bitmaps.

-## SQL Compatible Null Handling
+## SQL compatible null handling

 By default, Druid string dimension columns use the values `''` and `null` interchangeably and numeric and metric columns can not represent `null` at all, instead coercing nulls to `0`. However, Druid also provides a SQL compatible null handling mode, which must be enabled at the system level, through `druid.generic.useDefaultValueForNull`. This setting, when set to `false`, will allow Druid to _at ingestion time_ create segments whose string columns can distinguish `''` from `null`, and numeric columns which can represent `null` valued rows instead of `0`.
 String dimension columns contain no additional column structures in this mode, instead just reserving an additional dictionary entry for the `null` value. Numeric columns however will be stored in the segment with an additional bitmap whose set bits indicate `null` valued rows. In addition to slightly increased segment sizes, SQL compatible null handling can incur a performance cost at query time as well, due to the need to check the null bitmap. This performance cost only occurs for columns that actually contain nulls.

-## Naming Convention
+## Naming convention

 Identifiers for segments are typically constructed using the segment datasource, interval start time (in ISO 8601 format), interval end time (in ISO 8601 format), and a version. If data is additionally sharded beyond a time range, the segment identifier will also contain a partition number. An example segment identifier may be: datasource_intervalStart_intervalEnd_version_partitionNum

-## Segment Components
+## Segment components

 Behind the scenes, a segment is comprised of several files, listed below.

 * `version.bin`
- 4 bytes representing the current segment version as an integer. E.g., for v9 segments, the version is 0x0, 0x0, 0x0, 0x9
+ 4 bytes representing the current segment version as an integer. For example, for v9 segments, the version is 0x0, 0x0, 0x0, 0x9.

 * `meta.smoosh`
- A file with metadata (filenames and offsets) about the contents of the other `smoosh` files
+ A file with metadata (filenames and offsets) about the contents of the other `smoosh` files.

 * `XXXXX.smoosh`
- There are some number of these files, which are concatenated binary data
-
- The `smoosh` files represent multiple files "smooshed" together in order to minimize the number of file descriptors that must be open to house the data. They are files of up to 2GB in size (to match the limit of a memory mapped ByteBuffer in Java).
The `smoosh` files house individual files for each of the columns in the data as well as an `index.drd` file with extra metadata about the segment.
-
- There is also a special column called `__time` that refers to the time column of the segment. This will hopefully become less and less special as the code evolves, but for now it’s as special as my Mommy always told me I am.

Review Comment:
We should leave line 176. Maybe we can clarify: Each segment must contain a column called `__time` as the time (dimension?) of the segment.
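For reference when rewording the `version.bin` bullet and the naming convention, here is a minimal Java sketch of how the four bytes `0x0, 0x0, 0x0, 0x9` read as the integer 9, and how the identifier pattern `datasource_intervalStart_intervalEnd_version_partitionNum` fits together. The class and method names are hypothetical and are not Druid APIs. Big-endian byte order is an assumption inferred from the byte sequence in the doc, and the null-partition handling is only an illustration of "will also contain a partition number".

```java
import java.nio.ByteBuffer;

// Hypothetical helper for illustration only; not part of Druid.
final class SegmentDocSketch {

    // Interprets the 4 bytes of version.bin as a single integer.
    // Assumption: big-endian order, which is ByteBuffer's default.
    static int readVersion(byte[] versionBin) {
        if (versionBin.length != 4) {
            throw new IllegalArgumentException("version.bin must be exactly 4 bytes");
        }
        return ByteBuffer.wrap(versionBin).getInt();
    }

    // Joins the identifier parts with underscores. The partition number is
    // appended only when present (assumed to mirror the sharded case).
    static String segmentId(String dataSource, String intervalStart,
                            String intervalEnd, String version, Integer partitionNum) {
        String id = String.join("_", dataSource, intervalStart, intervalEnd, version);
        return partitionNum == null ? id : id + "_" + partitionNum;
    }

    public static void main(String[] args) {
        // The v9 byte sequence quoted in the doc.
        System.out.println(readVersion(new byte[] {0x0, 0x0, 0x0, 0x9}));
        // Placeholder values, not taken from a real cluster.
        System.out.println(segmentId("datasource", "intervalStart", "intervalEnd", "version", 1));
    }
}
```

A `ByteBuffer` is indexed by `int`, so its capacity cannot exceed `Integer.MAX_VALUE` bytes, which is where the 2GB ceiling mentioned for `smoosh` files comes from.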
