Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21347#discussion_r188820444
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/Instrumentation.scala ---
    @@ -125,6 +125,19 @@ private[spark] class Instrumentation[E <: Estimator[_]] private (
         log(compact(render(name -> value)))
       }
     
    +  def logNamedValue(name: String, value: Array[String]): Unit = {
    +    log(compact(render(name -> value.toSeq)))
    --- End diff ---
    
    This is different from what we had for clustering. 
https://github.com/apache/spark/blob/075d678c8844614910b50abca07282bde31ef7e0/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala#L280.
    
    In that implementation, we logged the entire array as a JSON string, which is equivalent to
    
    ~~~scala
    compact(render(name -> compact(render(value.toSeq))))
    ~~~
    
    instead of JSONifying the raw data structure. It really depends on how users load the data: do they treat named values as a map or as sub-fields? If it is the former, we should keep the value types consistent, and then we also need to update `logNamedValue(String, Long)`. @jkbradley What is your preference?
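
    For concreteness, here is a minimal sketch of the two encodings, assuming json4s's `JsonDSL` implicits are in scope (as they are in `Instrumentation.scala`); the key `"k"` and the values are placeholders for illustration only:

    ~~~scala
    import org.json4s.JsonDSL._
    import org.json4s.jackson.JsonMethods.{compact, render}

    val name = "k"                // hypothetical key, for illustration only
    val value = Array("a", "b")   // hypothetical values

    // This PR: the array is rendered as a JSON array, i.e. structured sub-fields.
    compact(render(name -> value.toSeq))
    // => {"k":["a","b"]}

    // The BisectingKMeans version: the array is pre-serialized to a JSON string,
    // so the named value stays a flat string and all values share one type (map style).
    compact(render(name -> compact(render(value.toSeq))))
    // => {"k":"[\"a\",\"b\"]"}
    ~~~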

