clintropolis commented on code in PR #17082:
URL: https://github.com/apache/druid/pull/17082#discussion_r1765986843
##########
docs/ingestion/data-formats.md:
##########
@@ -125,6 +125,8 @@ Configure the CSV `inputFormat` to load CSV data as follows:
| columns | JSON array | Specifies the columns of the data. The columns should be in the same order with the columns of your data. | yes if `findColumnsFromHeader` is false or missing |
| findColumnsFromHeader | Boolean | If this is set, the task will find the column names from the header row. Note that `skipHeaderRows` will be applied before finding column names from the header. For example, if you set `skipHeaderRows` to 2 and `findColumnsFromHeader` to true, the task will skip the first two lines and then extract column information from the third line. `columns` will be ignored if this is set to true. | no (default = false if `columns` is set; otherwise null) |
| skipHeaderRows | Integer | If this is set, the task will skip the first `skipHeaderRows` rows. | no (default = 0) |
+| shouldParseNumbers| Boolean| If this is set, the task will attempt to parse numeric strings into long or double data type, in that order. If the value cannot be parsed as a number, it is retained as a string. | no (default = false) |
Review Comment:
I wonder if this should be `tryParseNumbers`? This is fine too, though.
##########
processing/src/main/java/org/apache/druid/java/util/common/parsers/ParserUtils.java:
##########
@@ -52,22 +55,61 @@ public class ParserUtils
}
}
- public static Function<String, Object> getMultiValueFunction(
+ /**
+ * @return a function that processes a given string input by splitting it into multiple values
+ * using the {@code listSplitter} if the {@code listDelimiter} is present in the input. If {@code shouldParseNumbers}
+ * is enabled, the function will also try to parse any numeric values present in the input -- integers as {@code Long}
+ * and floating-point numbers as {@code Double}.
+ */
+ public static Function<String, Object> getMultiValueAndParseNumbersFunction(
Review Comment:
I feel like maybe we should just give this a more generic name (and also the
fields that store it in various places), maybe something like
`getValueParseFunction` or something? It seems like we would just keep
expanding the utility of this function if we need to add more value
transformation stuff rather than adding separate functions. That said, I don't
feel super strongly about it, so it's probably fine to leave as is too.
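
To make the naming suggestion concrete, here is a rough sketch of what a single generic value-parse function could look like. It is JDK-only and not the PR's code: the class name `ValueParseSketch`, the helper `parseNumberOrKeep`, and the `tryParseNumbers` flag are hypothetical, a regex `Pattern` stands in for the Guava `Splitter`, and empty strings are always mapped to null instead of going through `NullHandling.emptyToNullIfNeeded`.

```java
import java.util.Arrays;
import java.util.function.Function;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public final class ValueParseSketch
{
  /**
   * Hypothetical generic entry point: splits multi-value input on the delimiter and
   * optionally converts numeric strings. Further transformations could be added here.
   */
  public static Function<String, Object> getValueParseFunction(
      final String listDelimiter,
      final boolean tryParseNumbers
  )
  {
    // Assumption: a quoted regex replaces the Guava Splitter used in the PR.
    final Pattern splitter = Pattern.compile(Pattern.quote(listDelimiter));
    return input -> {
      // Simplification: empty input always becomes null in this sketch.
      if (input == null || input.isEmpty()) {
        return null;
      }
      if (input.contains(listDelimiter)) {
        return Arrays.stream(splitter.split(input, -1))
                     .map(value -> value.isEmpty() ? null : value)
                     .<Object>map(value -> tryParseNumbers ? parseNumberOrKeep(value) : value)
                     .collect(Collectors.toList());
      }
      return tryParseNumbers ? parseNumberOrKeep(input) : input;
    };
  }

  /** Long first, then Double, otherwise the original string (null stays null). */
  private static Object parseNumberOrKeep(final String value)
  {
    if (value == null) {
      return null;
    }
    try {
      return Long.parseLong(value);
    }
    catch (NumberFormatException notALong) {
      // fall through to the double attempt
    }
    try {
      return Double.parseDouble(value);
    }
    catch (NumberFormatException notADouble) {
      return value;
    }
  }
}
```

With this shape, callers would hold one `Function<String, Object>` field regardless of how many value transformations are enabled, which is the point of the more generic name.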
##########
processing/src/main/java/org/apache/druid/java/util/common/parsers/ParserUtils.java:
##########
@@ -52,22 +55,61 @@ public class ParserUtils
}
}
- public static Function<String, Object> getMultiValueFunction(
+ /**
+ * @return a function that processes a given string input by splitting it into multiple values
+ * using the {@code listSplitter} if the {@code listDelimiter} is present in the input. If {@code shouldParseNumbers}
+ * is enabled, the function will also try to parse any numeric values present in the input -- integers as {@code Long}
+ * and floating-point numbers as {@code Double}.
+ */
+ public static Function<String, Object> getMultiValueAndParseNumbersFunction(
final String listDelimiter,
- final Splitter listSplitter
+ final Splitter listSplitter,
+ final boolean shouldParseNumbers
)
{
return (input) -> {
- if (input != null && input.contains(listDelimiter)) {
+ if (input == null) {
+ return NullHandling.emptyToNullIfNeeded(input);
+ }
+
+ if (input.contains(listDelimiter)) {
return StreamSupport.stream(listSplitter.split(input).spliterator(), false)
- .map(NullHandling::emptyToNullIfNeeded)
- .collect(Collectors.toList());
+ .map(NullHandling::emptyToNullIfNeeded)
+ .map(value -> shouldParseNumbers ? ParserUtils.tryParseStringAsNumber(value) : value)
+ .collect(Collectors.toList());
} else {
- return NullHandling.emptyToNullIfNeeded(input);
+ return shouldParseNumbers ?
+ tryParseStringAsNumber(input) :
+ NullHandling.emptyToNullIfNeeded(input);
+
}
};
}
+ /**
+ * Attempts to parse the input string into a numeric value, if applicable. If the input is a number, the method first
+ * tries to parse the input number as a {@code Long}. If parsing as a {@code Long} fails, it then attempts to parse
+ * the input number as a {@code Double}. For all other scenarios, the input is returned as-is as a {@code String} type.
+ */
+ @Nullable
+ private static Object tryParseStringAsNumber(@Nullable final String input)
+ {
+ if (!NumberUtils.isNumber(input)) {
Review Comment:
I wonder if it is worth looping over the string an extra time before we make
the parse attempts, or if we should just start by trying to parse it as a
long. I guess having this function call saves the double tryParse, which uses a
regex pattern.
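
For comparison, the "just start with a long" alternative could look roughly like the sketch below. It uses only JDK parsing and is not what the PR does: the class name `LongFirstParseSketch` is hypothetical, and `Long.parseLong` / `Double.parseDouble` stand in for whatever tryParse helpers the real code would use.

```java
public final class LongFirstParseSketch
{
  /**
   * Hypothetical variant without an up-front isNumber scan:
   * try Long, then Double, otherwise return the input unchanged.
   */
  public static Object tryParseStringAsNumber(final String input)
  {
    if (input == null || input.isEmpty()) {
      return input;
    }
    try {
      return Long.parseLong(input);
    }
    catch (NumberFormatException notALong) {
      // not a long (or out of long range); try double next
    }
    try {
      return Double.parseDouble(input);
    }
    catch (NumberFormatException notADouble) {
      // neither long nor double; keep the original string
      return input;
    }
  }
}
```

Two caveats with this variant: every non-numeric string pays for two thrown exceptions, which is the cost the up-front check avoids, and `Double.parseDouble` accepts inputs such as `NaN`, `Infinity`, and `1.5f`, so the set of strings treated as numbers would differ from a stricter pre-check.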
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]