[GitHub] spark pull request #16630: [SPARK-19270][ML] Add summary table to GLM summar...

imatiach-msft Mon, 13 Feb 2017 16:08:25 -0800

Github user imatiach-msft commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16630#discussion_r100931445
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -915,6 +917,22 @@ class GeneralizedLinearRegressionSummary private[regression] (
       /** Number of instances in DataFrame predictions. */
       private[regression] lazy val numInstances: Long = predictions.count()
     
    +
    +  /**
    +   * Name of features. If the name cannot be retrieved from attributes,
    +   * set default names to "V1", "V2", and so on.
    +   */
    +  @Since("2.2.0")
    +  lazy val featureName: Array[String] = {
    +    val featureAttrs = AttributeGroup.fromStructField(
    +      dataset.schema(model.getFeaturesCol)).attributes
    +    if (featureAttrs == None) {
    --- End diff --
    
    If I run the example below in spark-shell:
    
    import org.apache.spark.ml.feature.HashingTF
    val tf = new HashingTF().setInputCol("x").setOutputCol("hash")
    val df = spark.createDataFrame(Seq(
      Tuple3(0.0, Array("a", "b"), 4),
      Tuple3(1.0, Array("b", "c"), 6),
      Tuple3(1.0, Array("a", "c"), 7),
      Tuple3(0.0, Array("b", "c"), 7))).toDF("y", "x", "z")
    val dfres = tf.transform(df)
    
    Calling show() gives:
    scala> dfres.show
    +---+------+---+--------------------+
    |  y|     x|  z|                hash|
    +---+------+---+--------------------+
    |0.0|[a, b]|  4|(262144,[30913,22...|
    |1.0|[b, c]|  6|(262144,[28698,30...|
    |1.0|[a, c]|  7|(262144,[28698,22...|
    |0.0|[b, c]|  7|(262144,[28698,30...|
    +---+------+---+--------------------+
    
    But when I look at the schema, the hash column has no attributes:
    import org.apache.spark.ml.attribute.AttributeGroup
    scala> AttributeGroup.fromStructField(dfres.schema("hash")).attributes
    res5: Option[Array[org.apache.spark.ml.attribute.Attribute]] = None
    
    scala> AttributeGroup.fromStructField(dfres.schema("hash"))
    res6: org.apache.spark.ml.attribute.AttributeGroup = {"ml_attr":{"num_attrs":262144}}
    
    In this case, though, the names should be of the form hash_{#} rather
    than V{#}. For example, when using VectorAssembler on the above:
    import org.apache.spark.ml.feature.VectorAssembler
    val va = new VectorAssembler().setInputCols(Array("y","z","hash")).setOutputCol("outputs")
    scala> va.transform(dfres).show()
    +---+------+---+--------------------+--------------------+
    |  y|     x|  z|                hash|             outputs|
    +---+------+---+--------------------+--------------------+
    |0.0|[a, b]|  4|(262144,[30913,22...|(262146,[1,30915,...|
    |1.0|[b, c]|  6|(262144,[28698,30...|(262146,[0,1,2870...|
    |1.0|[a, c]|  7|(262144,[28698,22...|(262146,[0,1,2870...|
    |0.0|[b, c]|  7|(262144,[28698,30...|(262146,[1,28700,...|
    +---+------+---+--------------------+--------------------+
    
    scala> print(AttributeGroup.fromStructField(va.transform(dfres).schema("outputs")).attributes.get)
    [Lorg.apache.spark.ml.attribute.Attribute;@4416197b
    scala> AttributeGroup.fromStructField(va.transform(dfres).schema("outputs")).attributes.get
    res22: Array[org.apache.spark.ml.attribute.Attribute] = Array(
      {"type":"numeric","idx":0,"name":"y"},
      {"type":"numeric","idx":1,"name":"z"},
      {"type":"numeric","idx":2,"name":"hash_0"},
      {"type":"numeric","idx":3,"name":"hash_1"},
      {"type":"numeric","idx":4,"name":"hash_2"},
      {"type":"numeric","idx":5,"name":"hash_3"},
      {"type":"numeric","idx":6,"name":"hash_4"},
      {"type":"numeric","idx":7,"name":"hash_5"},
      {"type":"numeric","idx":8,"name":"hash_6"},
      {"type":"numeric","idx":9,"name":"hash_7"},
      {"type":"numeric","idx":10,"name":"hash_8"},
      {"type":"numeric","idx":11,"name":"hash_9"},
      {"type":"numeric","idx":12,"name":"hash_10"},
      {"type":"numeric","idx":13,"name":"hash_11"},
      {"type":"numeric","idx":14,"name":"hash_12"},
      {"type":"numeric","idx":15,"name":"hash_13"},
      {"type":"numeric","idx":16,"nam...
    
    You can see that each attribute is named with the column name followed
    by an underscore and the index. This looks like a bug in VectorAssembler,
    because it makes the schema dense when it should be sparse. Regardless,
    this seems to be the more official way to name the attributes, rather
    than a "V" followed by the index, unless you have seen "V" + index used
    elsewhere?
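
To make the suggestion concrete, here is a minimal sketch of the naming fallback I have in mind. It is plain Scala with no Spark dependency. `featureNames` is a hypothetical helper, not an existing Spark API, and its `attrNames` parameter is an assumption that stands in for the names one would read from `AttributeGroup.attributes` (None when the column carries no attributes, as with the HashingTF output above).

```scala
// Hypothetical helper (not part of Spark): choose a name for every feature slot.
// `attrNames` models the optional per-feature names of a vector column:
//   None        -> the column has no attributes at all
//   Some(names) -> attributes exist, but an individual name may still be missing
def featureNames(
    featuresCol: String,
    numFeatures: Int,
    attrNames: Option[Array[Option[String]]]): Array[String] = {
  attrNames match {
    case Some(names) =>
      // Keep a real attribute name when present; otherwise fall back per slot.
      names.zipWithIndex.map { case (name, i) =>
        name.getOrElse(s"${featuresCol}_$i")
      }
    case None =>
      // No attributes: use "<column>_<index>", as VectorAssembler does above.
      Array.tabulate(numFeatures)(i => s"${featuresCol}_$i")
  }
}
```

With this, a column without attributes gets names such as hash_0, hash_1, and so on, which matches what VectorAssembler produces, instead of V1, V2.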



