Re: [PR] Fix Hadoop multi-value string null value handling to match native batch (druid)

via GitHub Fri, 23 Jan 2026 14:07:27 -0800


clintropolis commented on code in PR #18944:
URL: https://github.com/apache/druid/pull/18944#discussion_r2722981595



##########
indexing-hadoop/src/main/java/org/apache/druid/indexer/HadoopDruidIndexerConfig.java:
##########
@@ -129,10 +132,17 @@ public class HadoopDruidIndexerConfig
     );
     JSON_MAPPER = INJECTOR.getInstance(ObjectMapper.class);
     INDEX_IO = INJECTOR.getInstance(IndexIO.class);
-    INDEX_MERGER_V9 = INJECTOR.getInstance(IndexMergerV9.class);
     HADOOP_KERBEROS_CONFIG = INJECTOR.getInstance(HadoopKerberosConfig.class);
     DATA_SEGMENT_PUSHER = INJECTOR.getInstance(DataSegmentPusher.class);
     PROPERTIES = INJECTOR.getInstance(Properties.class);
+
+    boolean buildV10 = 
Boolean.parseBoolean(PROPERTIES.getProperty(BUILD_V10_KEY, "false"));

Review Comment:
   neat, :+1: I ignored hadoop for v10 format stuff, thanks for adding



##########
processing/src/main/java/org/apache/druid/data/input/Rows.java:
##########
@@ -74,15 +77,15 @@ public static List<String> objectToStrings(final Object 
inputValue)
       return Collections.emptyList();
     } else if (inputValue instanceof List) {
       // guava's toString function fails on null objects, so please do not use 
it
-      return ((List<?>) 
inputValue).stream().map(String::valueOf).collect(Collectors.toList());
+      return ((List<?>) 
inputValue).stream().map(Evals::asString).collect(Collectors.toList());

Review Comment:
   i think this is a good change because the old behavior was wack, but I'm 
still tracing through to try to determine the actual impacts of this change.
   
   Besides the hadoop impact, which you fix in this PR, this method seems like 
it will mostly impact callers of `Row.getDimension` as well as the `toGroupKey` 
method of this class since it calls `getDimension`.
   
   luckily there are a relatively small number of 'production' callers of these 
methods
   
   `Row.getDimension`:
   <img width="831" height="867" alt="Image" 
src="https://github.com/user-attachments/assets/5d04de7d-31d3-4358-9ef5-625fb0ee0472";
 />
   
   `Rows.toGroupKey`:
   <img width="897" height="697" alt="Image" 
src="https://github.com/user-attachments/assets/bc92592a-4f7f-4806-858b-634f6cc94cfc";
 />
   
   which look mostly related to partitioning. I think we need to determine if 
the `null` -> `'null'` coercion is important for these callers, and if so, do 
the coercion there. I'm uncertain currently but will keep trying to figure it 
out.



##########
extensions-core/avro-extensions/src/test/java/org/apache/druid/data/input/AvroStreamInputRowParserTest.java:
##########
@@ -366,8 +367,9 @@ static void assertInputRowCorrect(InputRow inputRow, 
List<String> expectedDimens
           Lists.transform(SOME_INT_ARRAY_VALUE, String::valueOf),
           inputRow.getDimension("someIntArray")
       );
+      // For string array, nulls are preserved so use ArrayList (ImmutableList 
doesn't support nulls)
       Assert.assertEquals(
-          Lists.transform(SOME_STRING_ARRAY_VALUE, String::valueOf),
+          SOME_STRING_ARRAY_VALUE.stream().map(v -> v == null ? null : 
String.valueOf(v)).collect(Collectors.toList()),

Review Comment:
   nit: this can use `Evals.asString`



##########
indexing-hadoop/src/main/java/org/apache/druid/indexer/InputRowSerde.java:
##########
@@ -389,15 +389,20 @@ private static void writeStringArray(List<String> values, 
ByteArrayDataOutput ou
     }
   }
 
+  @Nullable
   private static String readString(DataInput in) throws IOException
   {
     byte[] result = readBytes(in);
-    return StringUtils.fromUtf8(result);
+    return result == null ? null : StringUtils.fromUtf8(result);

Review Comment:
   i wonder if we should make a `StringUtils.fromUtf8Nullable` that takes 
`byte[]` (we already have one for `ByteBuffer`)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Fix Hadoop multi-value string null value handling to match native batch (druid)

Reply via email to