LakshSingla commented on code in PR #16885:
URL: https://github.com/apache/druid/pull/16885#discussion_r1724365955
##########
processing/src/main/java/org/apache/druid/frame/read/columnar/StringFrameColumnReader.java:
##########
@@ -92,18 +87,18 @@ public Column readRACColumn(Frame frame)
final Memory memory = frame.region(columnNumber);
validate(memory);
- final long positionOfLengths =
getStartOfStringLengthSection(frame.numRows(), false);
- final long positionOfPayloads = getStartOfStringDataSection(memory,
frame.numRows(), false);
-
- StringFrameColumn frameCol =
- new StringFrameColumn(
- frame,
- false,
- memory,
- positionOfLengths,
- positionOfPayloads,
- asArray || isMultiValue(memory) // Read MVDs as String arrays
- );
+ final boolean multiValue = isMultiValue(memory);
+ final long positionOfLengths =
getStartOfStringLengthSection(frame.numRows(), multiValue);
+ final long positionOfPayloads = getStartOfStringDataSection(memory,
frame.numRows(), multiValue);
+
+ StringFrameColumn frameCol = new StringFrameColumn(
+ frame,
+ multiValue,
+ memory,
+ positionOfLengths,
+ positionOfPayloads,
+ multiValue // Read MVDs as String arrays
Review Comment:
Strings (mvd) and string arrays are laid out in a pretty similar fashion.
They have 3 sections:
1. Indicates the number of the elements in a single row
2. Indicates the end point of the continuously laid out elements ƒor a
particular row
3. Actual string value
So if the string array is [foo, bar], it would be laid out like:
Section 1: .....2....
Section 2: ....190, 193......
Section 3: ....foo, bar,.....
This is the same for MV strings as well as the string arrays. However,
single-value strings don't need the first section, therefore they omit it. This
leads to 2 different formats a column is represented, and that is handled by
the "multiValue" flag (unfortunate naming I guess, but it doesn't have anything
to do with string-multi valuedness).
From what I know, the duplication b/w the string arrays and mv strings is
only because they are laid out in the same way. It doesn't impact how other
parts of the system refer to it, the frames themselves don't store the column
type. If window functions begin treating string arrays as multi-value strings,
it shouldn't be because of this, but because something upstream is telling it
to. With the arrayIngestMode, it happened because of the incorrect
implementation in the `DimensionHandlerUtils` afaik.
That being said, if you as a reader got confused, there's more to what we
can do to separate it and make it cleaner:
1. Separate out the two format implementations in a static block - strings
can use either one depending on the flag, and arrays must always use the one
where there are three sections (the one mentioned)
2. Use an alternative naming to "multiValue" to refer the layout so that it
doesn't look like we are coercing anything
3. Kill all mentions of "multivalue" in the string array frame column
4. Separate out everything at the cost of a little duplication (though I'd
really like the core logic of reading the values to be kept in a helper method,
if possible, since it's easier to fix bugs that way)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]