Chronos-LYH opened a new issue, #5631:
URL: https://github.com/apache/iceberg/issues/5631
### Feature Request / Improvement
In ML scenarios, we may want Iceberg schemas to include additional
information about a field. For example:
For an integer field representing a feature, we need information indicating
whether the feature is continuous or categorical:
```
{"type": "continuous"}
{"type": "categorical", "categories": ["US", "CA", "CN", ...]}
```
For a list field representing multiple features, we may want information on
some of the features:
```
{
  "features": [
    {
      "index": 0,
      "name": "age",
      "type": "continuous"
    },
    {
      "index": 5,
      "name": "gender",
      "type": "categorical",
      "categories": [
        "male",
        "female"
      ]
    }
  ]
}
```
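The sparse per-position descriptions above could be modeled on the consumer side roughly as follows. This is only a sketch under stated assumptions: `FeatureInfo` and `FeatureCatalog` are hypothetical names that do not exist in Iceberg or Spark, and the fields simply mirror the JSON keys in the example.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical sketch: describes one position in a list-valued feature column.
// The fields mirror the JSON keys "index", "name", "type", and "categories".
final class FeatureInfo {
    final int index;
    final String name;
    final String type;             // "continuous" or "categorical"
    final List<String> categories; // empty for continuous features

    FeatureInfo(int index, String name, String type, List<String> categories) {
        this.index = index;
        this.name = name;
        this.type = type;
        this.categories = List.copyOf(categories);
    }
}

// Hypothetical sketch: sparse lookup keyed by list position. Only positions
// that carry a description are present; all others yield Optional.empty().
final class FeatureCatalog {
    private final Map<Integer, FeatureInfo> byIndex;

    FeatureCatalog(List<FeatureInfo> features) {
        // Collectors.toMap throws IllegalStateException if two entries share an index.
        this.byIndex = features.stream()
            .collect(Collectors.toMap(f -> f.index, Function.identity()));
    }

    Optional<FeatureInfo> describe(int index) {
        return Optional.ofNullable(byIndex.get(index));
    }
}
```

With the two entries from the JSON example, `describe(5)` would return the "gender" description, while `describe(1)` would return an empty `Optional` because that position is undescribed.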
For a binary field representing a custom-encoded feature, we need
information on the encoding:
```
{"encoding": "feature_id_v1"}
{"encoding": "feature_id_v2"}
```
Spark has had a metadata field in its StructField class since Spark 1.2
(https://issues.apache.org/jira/browse/SPARK-3569), so Spark DataFrames
can hold ML-specific information such as the examples above.
Following Spark's implementation, Iceberg could add a metadata field to the
NestedField class. When declaring an Iceberg NestedField, users could provide a
"metadata" argument with additional information about the field:
```
// Proposed API: neither the Metadata class nor the extra "metadata"
// argument to required(...) exists in Iceberg yet.
Schema schema = new Schema(
    required(1, "feature1", Types.IntegerType.get(), null,
        Metadata.fromJson(
            "{\"type\": \"categorical\", \"categories\": [\"US\", \"CA\", \"CN\"]}")),
    required(2, "feature2", Types.IntegerType.get(), null,
        Metadata.fromJson("{\"type\": \"continuous\"}"))
);
```
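To make the idea more concrete, here is a minimal, JDK-only sketch of what a per-field metadata holder might look like. It is an assumption for illustration, not the proposed design: `FieldMetadata`, its builder, and all method names are hypothetical and exist in neither Iceberg nor Spark. It also supports only string and string-list values, which is enough for the examples in this issue.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch: an immutable, string-keyed metadata holder for one field.
final class FieldMetadata {
    private final Map<String, Object> entries;

    private FieldMetadata(Map<String, Object> entries) {
        // Defensive copy; LinkedHashMap keeps insertion order for stable JSON output.
        this.entries = Collections.unmodifiableMap(new LinkedHashMap<>(entries));
    }

    static Builder builder() {
        return new Builder();
    }

    boolean contains(String key) {
        return entries.containsKey(key);
    }

    String getString(String key) {
        return (String) entries.get(key);
    }

    @SuppressWarnings("unchecked")
    List<String> getStringList(String key) {
        return (List<String>) entries.get(key);
    }

    // Serializes to JSON. Only String and List<String> values are handled,
    // and only backslashes and double quotes are escaped.
    String toJson() {
        return entries.entrySet().stream()
            .map(e -> quote(e.getKey()) + ": " + valueToJson(e.getValue()))
            .collect(Collectors.joining(", ", "{", "}"));
    }

    private static String valueToJson(Object value) {
        if (value instanceof List) {
            return ((List<?>) value).stream()
                .map(v -> quote(String.valueOf(v)))
                .collect(Collectors.joining(", ", "[", "]"));
        }
        return quote(String.valueOf(value));
    }

    private static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    static final class Builder {
        private final Map<String, Object> entries = new LinkedHashMap<>();

        Builder putString(String key, String value) {
            entries.put(key, value);
            return this;
        }

        Builder putStringList(String key, List<String> values) {
            entries.put(key, List.copyOf(values));
            return this;
        }

        FieldMetadata build() {
            return new FieldMetadata(entries);
        }
    }
}
```

For instance, `FieldMetadata.builder().putString("type", "categorical").putStringList("categories", List.of("US", "CA", "CN")).build()` would describe `feature1` above. Keeping the holder immutable would let a schema share it safely across threads, which is one reason Spark's `Metadata` is also built through a separate builder.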
Iceberg and Spark metadata should be convertible to each other, so that
field metadata in Iceberg can be passed to Spark DataFrames, and DataFrames
can preserve field metadata when saved in Iceberg format.
See also:
https://docs.google.com/document/d/1RGJgVJhCebnilpL15ODcq0EWBeVjl9ltoHUvosWodPg/edit#
### Query engine
_No response_