abhishekrb19 commented on code in PR #19146:
URL: https://github.com/apache/druid/pull/19146#discussion_r2968159699


##########
docs/configuration/index.md:
##########
@@ -1431,6 +1431,7 @@ Additional Peon configs include:
 |`druid.indexer.task.tmpStorageBytesPerTask`|Maximum number of bytes per task 
to be used to store temporary files on disk. This config is generally intended 
for internal usage. Attempts to set it are very likely to be overwritten by the 
TaskRunner that executes the task, so be sure of what you expect to happen 
before directly adjusting this configuration parameter. The config is 
documented here primarily to provide an understanding of what it means if/when 
someone sees that it has been set. A value of -1 disables this limit.  |-1|
 |`druid.indexer.task.allowHadoopTaskExecution`|Conditional dictating if the 
cluster allows `index_hadoop` tasks to be executed. `index_hadoop` is 
deprecated, and defaulting to false will force cluster operators to acknowledge 
the deprecation and consciously opt in to using index_hadoop with the 
understanding that it will be removed in the future.|false|
 |`druid.indexer.server.maxChatRequests`|Maximum number of concurrent requests 
served by a task's chat handler. Set to 0 to disable limiting.|0|
+|`druid.indexing.formats.maxStringLength`|Maximum number of characters to 
store per string dimension value. Longer values are truncated during ingestion. 
Does not apply to multi-value string dimensions. Set to 0 to disable. Can be 
overridden per-dimension using `maxStringLength` in the [dimension 
object](../ingestion/ingestion-spec.md#dimension-objects).|0 (no truncation)|

Review Comment:
   Should the default be -1? 0 seems counterintuitive 
   Also making it -1 will align with `maxBytesInMemory`, `maxParseExceptions` 
etc



##########
processing/src/main/java/org/apache/druid/segment/DefaultColumnFormatConfig.java:
##########
@@ -68,6 +68,21 @@ private static String validateMultiValueHandlingMode(
     return stringMultiValueHandlingMode;
   }
 
+  @Nullable
+  private static Integer validateMaxStringLength(@Nullable Integer 
maxStringLength)
+  {
+    if (maxStringLength != null && maxStringLength <= 0) {

Review Comment:
   The current documentation notes set 0 to disable, but this check would fail 
startup if an operator sets it to 0 right? Should this check be 
`maxStringLength < 0`
   
   If we do make the default be -1, then this check wouldn't be relevant anymore



##########
processing/src/main/java/org/apache/druid/segment/StringDimensionIndexer.java:
##########
@@ -57,18 +58,38 @@ public class StringDimensionIndexer extends 
DictionaryEncodedColumnIndexer<int[]
   private final MultiValueHandling multiValueHandling;
   private final boolean hasBitmapIndexes;
   private final boolean hasSpatialIndexes;
+  private final int maxStringLength;
   private volatile boolean hasMultipleValues = false;
 
   public StringDimensionIndexer(
       @Nullable MultiValueHandling multiValueHandling,
       boolean hasBitmapIndexes,
       boolean hasSpatialIndexes
   )
+  {
+    this(multiValueHandling, hasBitmapIndexes, hasSpatialIndexes, 
StringDimensionSchema.getDefaultMaxStringLength());
+  }
+
+  public StringDimensionIndexer(
+      @Nullable MultiValueHandling multiValueHandling,
+      boolean hasBitmapIndexes,
+      boolean hasSpatialIndexes,
+      int maxStringLength
+  )
   {
     super(new StringDimensionDictionary());
     this.multiValueHandling = multiValueHandling == null ? 
MultiValueHandling.ofDefault() : multiValueHandling;
     this.hasBitmapIndexes = hasBitmapIndexes;
     this.hasSpatialIndexes = hasSpatialIndexes;
+    this.maxStringLength = maxStringLength;
+  }
+
+  private String truncateIfNeeded(String value)

Review Comment:
   Please add a short javadoc here
   ```suggestion
    @Nullable
     private String truncateIfNeeded(@Nullable String value)
   ```
   
   For the future, it would be helpful to have a counter when this truncation 
occurs, so we have visibility into data integrity. The counter can then be 
periodically emitted, similar to thrownAway/dropped event metrics, etc.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to