clintropolis commented on code in PR #16885:
URL: https://github.com/apache/druid/pull/16885#discussion_r1724568567
##########
processing/src/main/java/org/apache/druid/frame/read/columnar/StringFrameColumnReader.java:
##########
@@ -92,18 +87,18 @@ public Column readRACColumn(Frame frame)
final Memory memory = frame.region(columnNumber);
validate(memory);
- final long positionOfLengths =
getStartOfStringLengthSection(frame.numRows(), false);
- final long positionOfPayloads = getStartOfStringDataSection(memory,
frame.numRows(), false);
-
- StringFrameColumn frameCol =
- new StringFrameColumn(
- frame,
- false,
- memory,
- positionOfLengths,
- positionOfPayloads,
- asArray || isMultiValue(memory) // Read MVDs as String arrays
- );
+ final boolean multiValue = isMultiValue(memory);
+ final long positionOfLengths =
getStartOfStringLengthSection(frame.numRows(), multiValue);
+ final long positionOfPayloads = getStartOfStringDataSection(memory,
frame.numRows(), multiValue);
+
+ StringFrameColumn frameCol = new StringFrameColumn(
+ frame,
+ multiValue,
+ memory,
+ positionOfLengths,
+ positionOfPayloads,
+ multiValue // Read MVDs as String arrays
Review Comment:
there are other subtle differences between arrays and mvds too, mvds for
example will never have an actual `null` value, only `[]` or `[null]`, while
with arrays all 3 values are distinct. At least this is true of normal
segments, i would assume that frames preserve this (didn't confirm yet).
Selectors on MVDs also change single element arrays into scalar string values,
while arrays never do this. Mvds are expected to spit out lists from their
selectors (or scalar values), while arrays spit out arrays. Mvds use dimension
selectors, arrays do not typically ever support dimension selectors, and so on.
I'd personally prefer everything entirely split so that arrays never
accidentally become mvds and mvds never accidentally become arrays. It seems
like it simplifies both string and string array implementations because the
behavior differs quite a bit in a lot of places and we can drop lots of
conditional checking. I guess the part that reads a rows worth of values could
be shared? That said, since there are only 2 implementations the cost of
duplication doesn't seem very high unless I'm missing something.
I guess this doesn't need to be done in this PR, and am happy to have
further discussion, i just would feel a lot safer if _all_ string array stuff
was completely split out of all string stuff, even if it shares formats.
Alternatively, maybe the string reader/writer _only_ ever handles single value
strings, and all the stuff is moved into the string array writer, and if
anything, the MVD reader/writer can subclass the array reader/writer to
override stuff like implementing a dimension selector or whatever.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]