zhengruifeng opened a new pull request, #49649:
URL: https://github.com/apache/spark/pull/49649
### What changes were proposed in this pull request?
1, Fix the save/load of `TargetEncoder`
2, Hide `TargetEncoderModel.stats`
### Why are the changes needed?
1, The existing implementation of `save/load` does not work.
2, On the Python side, `TargetEncoderModel.stats` returns a `JavaObject`, which cannot be used.
We should find a better way to expose the model coefficients.
Both problems are reproduced in the session below:
```
In [1]: from pyspark.ml.feature import *
   ...:
   ...: df = spark.createDataFrame(
   ...:     [
   ...:         (0, 3, 5.0, 0.0),
   ...:         (1, 4, 5.0, 1.0),
   ...:         (2, 3, 5.0, 0.0),
   ...:         (0, 4, 6.0, 1.0),
   ...:         (1, 3, 6.0, 0.0),
   ...:         (2, 4, 6.0, 1.0),
   ...:         (0, 3, 7.0, 0.0),
   ...:         (1, 4, 8.0, 1.0),
   ...:         (2, 3, 9.0, 0.0),
   ...:     ],
   ...:     schema="input1 short, input2 int, input3 double, label double",
   ...: )
   ...: encoder = TargetEncoder(
   ...:     inputCols=["input1", "input2", "input3"],
   ...:     outputCols=["output", "output2", "output3"],
   ...:     labelCol="label",
   ...:     targetType="binary",
   ...: )
   ...: model = encoder.fit(df)
In [2]: model.stats
Out[2]: JavaObject id=o92
In [5]: model.write().overwrite().save("/tmp/ta")
In [6]: TargetEncoderModel.load("/tmp/ta")
{"ts": "2025-01-24 19:06:54,598", "level": "ERROR", "logger":
"DataFrameQueryContextLogger", "msg": "[UNRESOLVED_COLUMN.WITH_SUGGESTION] A
column, variable, or function parameter with name `encodings` cannot be
resolved. Did you mean one of the following? [`stats`]. SQLSTATE: 42703",
"context": {"file":
...
AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable,
or function parameter with name `encodings` cannot be resolved. Did you mean
one of the following? [`stats`]. SQLSTATE: 42703;
'Project ['encodings]
+- Relation [stats#37] parquet
```
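The analysis error above indicates that the saved Parquet data has a single column named `stats`, while the loader selects a column named `encodings`. The following is a minimal plain-Python sketch of that kind of writer/reader name mismatch. It is not Spark's implementation: `save_model`, `load_model`, the JSON file format, and the numbers are all hypothetical and only illustrate why a save followed by a load fails until both sides agree on the field name.

```python
import json
import os
import tempfile


def save_model(path, field, values):
    """Write the model data as one JSON record under the given field name."""
    with open(path, "w") as f:
        json.dump({field: values}, f)


def load_model(path, field):
    """Read the model data back, selecting a single named field."""
    with open(path) as f:
        record = json.load(f)
    if field not in record:
        raise KeyError(f"field {field!r} not found; available: {sorted(record)}")
    return record[field]


values = [[0.0, 0.5], [1.0, 0.25]]  # made-up numbers, for illustration only
path = os.path.join(tempfile.mkdtemp(), "model.json")

# The writer stores the data under "stats" ...
save_model(path, "stats", values)

# ... but the reader asks for "encodings", so the load fails.
try:
    load_model(path, "encodings")
    mismatch_error = None
except KeyError as exc:
    mismatch_error = exc

# Once the reader uses the same name as the writer, the round trip works.
round_trip = load_model(path, "stats")
```

A test that only exercises the writer, or only the reader, would not notice this; a test that saves and then loads does.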
### Does this PR introduce _any_ user-facing change?
No, since this algorithm exists only in 4.0.
### How was this patch tested?
Updated the existing test.
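The updated test itself is not shown in this message. As a rough sketch only, a round-trip check along the following lines would catch the failure reproduced above. It assumes PySpark 4.0 and a fitted `TargetEncoderModel`; the helper name `check_target_encoder_round_trip` and its arguments are hypothetical and are not taken from the PR.

```python
def check_target_encoder_round_trip(model, df, path):
    """Save `model`, reload it, and compare its output on `df` with the original's.

    `model` is a fitted TargetEncoderModel, `df` a DataFrame it can transform,
    and `path` a writable location such as a temporary directory.
    """
    # Imported lazily so the helper can be defined without a Spark installation.
    from pyspark.ml.feature import TargetEncoderModel

    model.write().overwrite().save(path)
    # In the session above, this call raised AnalysisException.
    loaded = TargetEncoderModel.load(path)

    # The reloaded model should carry the same column parameters.
    assert loaded.getInputCols() == model.getInputCols()
    assert loaded.getOutputCols() == model.getOutputCols()

    # Both models should encode the same data identically.
    # Rows are sorted because collect() does not guarantee an order.
    expected = sorted(model.transform(df).collect())
    actual = sorted(loaded.transform(df).collect())
    assert actual == expected
    return loaded
```

With the session above, this could be called as `check_target_encoder_round_trip(model, df, "/tmp/ta")`.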
### Was this patch authored or co-authored using generative AI tooling?
no
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]