[GitHub] [druid] techdocsmith commented on a diff in pull request #12808: Remove the time bit, fix headings

GitBox Wed, 20 Jul 2022 13:21:47 -0700


techdocsmith commented on code in PR #12808:
URL: https://github.com/apache/druid/pull/12808#discussion_r926021879



##########
docs/design/segments.md:
##########
@@ -143,37 +143,33 @@ the 'column data' is an array of values. Additionally, a 
row with *n*
 values in 'column data' will have *n* non-zero valued entries in
 bitmaps.
 
-## SQL Compatible Null Handling
+## SQL compatible null handling
 By default, Druid string dimension columns use the values `''` and `null` 
interchangeably and numeric and metric columns can not represent `null` at all, 
instead coercing nulls to `0`. However, Druid also provides a SQL compatible 
null handling mode, which must be enabled at the system level, through 
`druid.generic.useDefaultValueForNull`. This setting, when set to `false`, will 
allow Druid to _at ingestion time_ create segments whose string columns can 
distinguish `''` from `null`, and numeric columns which can represent `null` 
valued rows instead of `0`.
 
 String dimension columns contain no additional column structures in this mode, 
instead just reserving an additional dictionary entry for the `null` value. 
Numeric columns however will be stored in the segment with an additional bitmap 
whose set bits indicate `null` valued rows. In addition to slightly increased 
segment sizes, SQL compatible null handling can incur a performance cost at 
query time as well, due to the need to check the null bitmap. This performance 
cost only occurs for columns that actually contain nulls.
 
-## Naming Convention
+## Naming convention
 
 Identifiers for segments are typically constructed using the segment 
datasource, interval start time (in ISO 8601 format), interval end time (in ISO 
8601 format), and a version. If data is additionally sharded beyond a time 
range, the segment identifier will also contain a partition number.
 
 An example segment identifier may be:
 datasource_intervalStart_intervalEnd_version_partitionNum
 
-## Segment Components
+## Segment components
 
 Behind the scenes, a segment is comprised of several files, listed below.
 
 * `version.bin`
 
-    4 bytes representing the current segment version as an integer. E.g., for 
v9 segments, the version is 0x0, 0x0, 0x0, 0x9
+    4 bytes representing the current segment version as an integer. For 
example, for v9 segments, the version is 0x0, 0x0, 0x0, 0x9.
 
 * `meta.smoosh`
 
-    A file with metadata (filenames and offsets) about the contents of the 
other `smoosh` files
+    A file with metadata (filenames and offsets) about the contents of the 
other `smoosh` files.
 
 * `XXXXX.smoosh`
 
-    There are some number of these files, which are concatenated binary data

Review Comment:
   We should just fix lines 172 & 174 rather than remove them entirely.
   
   Smoosh (`.smoosh`) files are concatenated binary data. This file 
consolidation reduces the number of file descriptors that must be open when 
accessing data. The files should be 2GB or less in size to remain within the 
limit of a memory mapped `ByteBuffer` in Java. Smoosh files contain individual 
files for each column in the data and an `index.drd` file that contains 
additional segment metadata.
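
To illustrate the size constraint described above, here is a minimal sketch of how named files could be assigned to numbered smoosh chunks so that no chunk exceeds what a Java memory-mapped `ByteBuffer` can address. This is not Druid's actual smoosh writer: the function name `plan_smoosh_chunks`, the tuple layout, and the simple first-fit packing are all assumptions made for illustration. The only facts relied on are the roughly 2GB limit and the "filenames and offsets" metadata mentioned for `meta.smoosh`.

```python
# Hypothetical sketch, NOT Druid's real implementation.
# A Java ByteBuffer is indexed by int, so its capacity tops out at Integer.MAX_VALUE.
JAVA_MAX_MAPPED_BYTES = 2**31 - 1


def plan_smoosh_chunks(file_sizes, limit=JAVA_MAX_MAPPED_BYTES):
    """Assign (name, size) pairs to chunks so that no chunk exceeds `limit` bytes.

    Returns a list of (name, chunk_index, start_offset, end_offset) tuples,
    loosely resembling the filename-and-offset metadata kept in meta.smoosh.
    """
    entries = []
    chunk, offset = 0, 0
    for name, size in file_sizes:
        if size > limit:
            # A single file larger than the limit cannot fit in any chunk.
            raise ValueError(f"{name} ({size} bytes) exceeds the chunk limit of {limit}")
        if offset + size > limit:
            # Current chunk is full: start a new one at offset 0.
            chunk += 1
            offset = 0
        entries.append((name, chunk, offset, offset + size))
        offset += size
    return entries
```

Calling it with a small artificial limit, for example `plan_smoosh_chunks([("__time", 4), ("page", 6), ("index.drd", 3)], limit=10)`, places the first two files in chunk 0 and rolls `index.drd` over into chunk 1.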



##########
docs/design/segments.md:
##########
@@ -143,37 +143,33 @@ the 'column data' is an array of values. Additionally, a 
row with *n*
 values in 'column data' will have *n* non-zero valued entries in
 bitmaps.
 
-## SQL Compatible Null Handling
+## SQL compatible null handling
 By default, Druid string dimension columns use the values `''` and `null` 
interchangeably and numeric and metric columns can not represent `null` at all, 
instead coercing nulls to `0`. However, Druid also provides a SQL compatible 
null handling mode, which must be enabled at the system level, through 
`druid.generic.useDefaultValueForNull`. This setting, when set to `false`, will 
allow Druid to _at ingestion time_ create segments whose string columns can 
distinguish `''` from `null`, and numeric columns which can represent `null` 
valued rows instead of `0`.
 
 String dimension columns contain no additional column structures in this mode, 
instead just reserving an additional dictionary entry for the `null` value. 
Numeric columns however will be stored in the segment with an additional bitmap 
whose set bits indicate `null` valued rows. In addition to slightly increased 
segment sizes, SQL compatible null handling can incur a performance cost at 
query time as well, due to the need to check the null bitmap. This performance 
cost only occurs for columns that actually contain nulls.
 
-## Naming Convention
+## Naming convention
 
 Identifiers for segments are typically constructed using the segment 
datasource, interval start time (in ISO 8601 format), interval end time (in ISO 
8601 format), and a version. If data is additionally sharded beyond a time 
range, the segment identifier will also contain a partition number.
 
 An example segment identifier may be:
 datasource_intervalStart_intervalEnd_version_partitionNum
 
-## Segment Components
+## Segment components
 
 Behind the scenes, a segment is comprised of several files, listed below.
 
 * `version.bin`
 
-    4 bytes representing the current segment version as an integer. E.g., for 
v9 segments, the version is 0x0, 0x0, 0x0, 0x9
+    4 bytes representing the current segment version as an integer. For 
example, for v9 segments, the version is 0x0, 0x0, 0x0, 0x9.
 
 * `meta.smoosh`
 
-    A file with metadata (filenames and offsets) about the contents of the 
other `smoosh` files
+    A file with metadata (filenames and offsets) about the contents of the 
other `smoosh` files.
 
 * `XXXXX.smoosh`
 
-    There are some number of these files, which are concatenated binary data
-
-    The `smoosh` files represent multiple files "smooshed" together in order 
to minimize the number of file descriptors that must be open to house the data. 
They are files of up to 2GB in size (to match the limit of a memory mapped 
ByteBuffer in Java). The `smoosh` files house individual files for each of the 
columns in the data as well as an `index.drd` file with extra metadata about 
the segment.
-
-    There is also a special column called `__time` that refers to the time 
column of the segment. This will hopefully become less and less special as the 
code evolves, but for now it’s as special as my Mommy always told me I am.

Review Comment:
   We should leave line 176. Maybe we can clarify: Each segment must contain a 
column called `__time` as the time (dimension?) of the segment.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

