LakshSingla commented on code in PR #16885:
URL: https://github.com/apache/druid/pull/16885#discussion_r1724365955


##########
processing/src/main/java/org/apache/druid/frame/read/columnar/StringFrameColumnReader.java:
##########
@@ -92,18 +87,18 @@ public Column readRACColumn(Frame frame)
     final Memory memory = frame.region(columnNumber);
     validate(memory);
 
-    final long positionOfLengths = 
getStartOfStringLengthSection(frame.numRows(), false);
-    final long positionOfPayloads = getStartOfStringDataSection(memory, 
frame.numRows(), false);
-
-    StringFrameColumn frameCol =
-        new StringFrameColumn(
-            frame,
-            false,
-            memory,
-            positionOfLengths,
-            positionOfPayloads,
-            asArray || isMultiValue(memory) // Read MVDs as String arrays
-        );
+    final boolean multiValue = isMultiValue(memory);
+    final long positionOfLengths = 
getStartOfStringLengthSection(frame.numRows(), multiValue);
+    final long positionOfPayloads = getStartOfStringDataSection(memory, 
frame.numRows(), multiValue);
+
+    StringFrameColumn frameCol = new StringFrameColumn(
+        frame,
+        multiValue,
+        memory,
+        positionOfLengths,
+        positionOfPayloads,
+        multiValue // Read MVDs as String arrays

Review Comment:
   Strings (mvd) and string arrays are laid out in a pretty similar fashion. 
They have 3 sections:
   1. Indicates the number of the elements in a single row
   2. Indicates the end point of the continuously laid out elements ƒor a 
particular row
   3. Actual string value
   
   So if the string array is [foo, bar], it would be laid out like:
   Section 1: .....2....
   Section 2: ....190, 193......
   Section 3: ....foo, bar,.....
   
   This is the same for MV strings as well as the string arrays. However, 
single-value strings don't need the first section, therefore they omit it. This 
leads to 2 different formats a column is represented, and that is handled by 
the "multiValue" flag (unfortunate naming I guess, but it doesn't have anything 
to do with string-multi valuedness). 
   
   <hr>
   
   From what I know, the duplication b/w the string arrays and mv strings is 
only because they are laid out in the same way. It doesn't impact how other 
parts of the system refer to it, the frames themselves don't store the column 
type. If window functions begin treating string arrays as multi-value strings, 
it shouldn't be because of this, but because something upstream is telling it 
to. With the arrayIngestMode, it happened because of the incorrect 
implementation in the `DimensionHandlerUtils` afaik. 
   
   That being said, if you as a reader got confused, there's more to what we 
can do to separate it and make it cleaner:
   1. Separate out the two format implementations in a static block - strings 
can use either one depending on the flag, and arrays must always use the one 
where there are three sections (the one mentioned)
   2. Use an alternative naming to "multiValue" to refer the layout so that it 
doesn't look like we are coercing anything
   3. Kill all mentions of "multivalue" in the string array frame column reader 
implementation
   4. Separate out everything at the cost of a little duplication (though I'd 
really like the core logic of reading the values to be kept in a helper method, 
if possible, since it's easier to fix bugs that way)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to