Github user mengxr commented on a diff in the pull request:
https://github.com/apache/spark/pull/21347#discussion_r188820444
--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/Instrumentation.scala ---
@@ -125,6 +125,19 @@ private[spark] class Instrumentation[E <: Estimator[_]] private (
     log(compact(render(name -> value)))
   }
+  def logNamedValue(name: String, value: Array[String]): Unit = {
+    log(compact(render(name -> value.toSeq)))
--- End diff ---
This is different from what we had for clustering.
https://github.com/apache/spark/blob/075d678c8844614910b50abca07282bde31ef7e0/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala#L280.
In that implementation, we logged the entire array as a JSON string, which
is equivalent to
~~~scala
compact(render(name -> compact(render(value.toSeq))))
~~~
instead of JSONifying the raw data structure. It really depends on how users
load the data: do they treat a named value as a map or as sub-fields? If it is
the former, we should have consistent value types, and then we also need to
update `logNamedValue(String, Long)`. @jkbradley What is your preference?
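For concreteness, here is a minimal json4s sketch of the two encodings being contrasted; the `name` and `value` literals are made up for illustration:
~~~scala
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods.{compact, render}

object NamedValueEncodings {
  def main(args: Array[String]): Unit = {
    val name = "clusterSizes"            // hypothetical metric name
    val value = Array("10", "20", "30")  // hypothetical values

    // Encoding proposed in this PR: the value becomes a raw JSON array
    // (the sub-fields view). Prints: {"clusterSizes":["10","20","30"]}
    println(compact(render(name -> value.toSeq)))

    // Existing clustering behavior: the array is itself JSON-encoded into a
    // string (the map view), so every named value stays a primitive.
    // Prints: {"clusterSizes":"[\"10\",\"20\",\"30\"]"}
    println(compact(render(name -> compact(render(value.toSeq)))))
  }
}
~~~
With the second form, a consumer that parses the log as a flat string-to-primitive map keeps working; with the first, a consumer can index into the array directly.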
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]