writer-jill opened a new issue, #12614: URL: https://github.com/apache/druid/issues/12614
Suggestions to improve the segments.md and compaction.md files raised by Paul Rogers in the PR: https://github.com/apache/druid/pull/12344 but outside the scope of that change.

--- segments.md --------------

Text:
Note that the bitmap is different from the dictionary and list data structures: the dictionary and list grow linearly with the size of the data, but the size of the bitmap section is the product of data size * column cardinality.

Paul Rogers: I believe that the dictionary grows linearly with the cardinality of the data (number of unique values).

---

Text:
A ColumnDescriptor is an object that allows the use of Jackson's polymorphic deserialization to add new and interesting methods of serialization with minimal impact to the code. It consists of some metadata about the column (for example: type, whether it's multi-value) and a list of serialization/deserialization logic that can deserialize the rest of the binary.

Paul Rogers: Not sure this paragraph is of value to most users. It seems of more interest to developers of Druid or extensions.

---

Text:
foo_2015-01-01/2015-01-02_v1_0

Paul Rogers: Since / is the Unix/Linux directory path separator, the actual file name probably uses an underscore. The format shown here may be for S3. The local file naming format (on my Mac) seems much different. Maybe get an update from the Druid folks?

---

Text:
By default, Druid string dimension columns use the values `''` and `null` interchangeably and numeric and metric columns can not represent `null` at all, instead coercing nulls to `0`. However, Druid also provides a SQL compatible null handling mode, which you can enable at the system level, through `druid.generic.useDefaultValueForNull`. This setting, when set to `false`, allows Druid to create segments _at ingestion time_ in which the string columns can distinguish `''` from `null`, and numeric columns which can represent `null` valued rows instead of `0`.

Paul Rogers: This is a tricky area!
As it turns out, Druid does both, and the effect is felt throughout the system, not just in the storage layer. In "SQL-compatible" mode, blanks and NULL values are distinct: a column can be NULL, '', or 'foo'. In "replace nulls with blanks" (legacy) mode, blanks are considered to be NULL, there is no NULL value, and it is impossible to store a blank string. The same story applies to numbers for 0 and NULL.

This option must be set at the time that Druid is first installed. Choose wisely, as behavior will be surprising if the setting is changed once the system contains data. This means that users should decide, when first installing Druid, whether their app requires NULL (unknown) values, or whether the incoming data uses blanks and zeros for missing values. Configure Druid accordingly. After that, the data stored in the system, and the computation engine, will all work consistently with that choice. In non-SQL-compatible mode (i.e. useDefaultValueForNull=true), a NULL constant in SQL will be treated as either a blank string or zero, depending on the data type.

This topic really deserves its own section or page, since it pretty much means that Druid is two different systems and the user must choose which to use at first installation.

---

Text:
String dimension columns contain no additional column structures in this mode, instead they reserve an additional dictionary entry for the `null` value. Numeric columns are stored in the segment with an additional bitmap in which the set bits indicate `null` valued rows.

Paul Rogers: "in this mode": in which mode? To keep things sane, perhaps have one section for SQL-compatible null behavior, another for Druid Native behavior. (I call it "Druid Native" because "replace nulls with blanks or zeros behavior" is too much of a mouthful.)

--- compaction.md ----------------

Text:
Apache Druid supports schema changes. Therefore, dimensions can be different across segments even if they are a part of the same data source.
See [Different schemas among segments](../design/segments.md#segments-with-different-schemas). If the input segments have different dimensions, the resulting compacted segment include all dimensions of the input segments.

Paul Rogers: "include all dimensions of the input segments" --> "includes the union of all columns across all the input segments". What happens if column c occurs in two input segments, but with differing types? It would be good to ask a Druid engineer what happens, then record that here.
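To make the two null-handling modes discussed above concrete, a minimal sketch may help. This is a simplified model written for this issue, not actual Druid code; `store_value` is a hypothetical helper that only illustrates the coercion rule described in the segments.md excerpt:

```python
def store_value(value, col_type, sql_compatible):
    """Sketch of the two null-handling modes described above.

    sql_compatible=False models legacy mode (useDefaultValueForNull=true):
    NULL is coerced to '' for string columns and 0 for numeric columns.
    sql_compatible=True models SQL-compatible mode: NULL is preserved
    as a value distinct from '' and 0.
    """
    if value is None and not sql_compatible:
        return "" if col_type == "string" else 0
    return value

store_value(None, "string", sql_compatible=False)  # returns ''
store_value(None, "long", sql_compatible=False)    # returns 0
store_value(None, "string", sql_compatible=True)   # returns None
store_value("foo", "string", sql_compatible=False) # returns 'foo'
```

A standalone page could walk through exactly this kind of side-by-side comparison for both modes.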
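On the compaction point, the "union of all columns" wording Paul suggests could be illustrated with a short sketch. This is illustrative pseudologic written for this issue, not Druid's actual implementation, and it deliberately leaves the conflicting-types question unanswered:

```python
def compacted_dimensions(input_segment_dims):
    """Illustrative sketch: the compacted segment carries the union of the
    dimensions of all input segments, preserving first-seen order.
    (Not actual Druid code; what happens when the same column appears
    with differing types is the open question raised above.)
    """
    union = []
    for dims in input_segment_dims:
        for dim in dims:
            if dim not in union:
                union.append(dim)
    return union

compacted_dimensions([["country", "device"], ["device", "browser"]])
# returns ['country', 'device', 'browser']
```

If the docs adopted the "union" wording, an example like this (with real segment schemas) would make the behavior unambiguous.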
