purijatin commented on a change in pull request #23549: [SPARK-26616][MLlib]
Expose document frequency in IDFModel
URL: https://github.com/apache/spark/pull/23549#discussion_r248901834
##########
File path: mllib/src/main/scala/org/apache/spark/ml/feature/IDF.scala
##########
@@ -178,10 +188,19 @@ object IDFModel extends MLReadable[IDFModel] {
val metadata = DefaultParamsReader.loadMetadata(path, sc, className)
val dataPath = new Path(path, "data").toString
val data = sparkSession.read.parquet(dataPath)
- val Row(idf: Vector) = MLUtils.convertVectorColumnsToML(data, "idf")
- .select("idf")
- .head()
- val model = new IDFModel(metadata.uid, new
feature.IDFModel(OldVectors.fromML(idf)))
+
+ val model = if (majorVersion(metadata.sparkVersion) >= 3) {
+ val Row(idf: Vector, df: Seq[_], numDocs: Long) = data.select("idf",
"docFreq", "numDocs")
+ .head()
+ new IDFModel(metadata.uid, new feature.IDFModel(OldVectors.fromML(idf),
+ df.asInstanceOf[Seq[Long]].toArray, numDocs))
+ } else {
+ val Row(idf: Vector) = MLUtils.convertVectorColumnsToML(data, "idf")
+ .select("idf")
+ .head()
+ new IDFModel(metadata.uid,
Review comment:
Yes.
Another problem with returning option is that, people using the class from
spark 3 onwards, will get an option. And it would be confusing because it may
mean that docFreq may not be present. But it always would be. We would then
need to have doc explaining the scenario of None,complicating the doc.
Secondly it would need to remain consistent with other models. Where all new
fields added should return option.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]