Github user imatiach-msft commented on a diff in the pull request:
https://github.com/apache/spark/pull/16630#discussion_r100931445
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
@@ -915,6 +917,22 @@ class GeneralizedLinearRegressionSummary private[regression] (
/** Number of instances in DataFrame predictions. */
private[regression] lazy val numInstances: Long = predictions.count()
+
+ /**
+ * Name of features. If the name cannot be retrieved from attributes,
+ * set default names to "V1", "V2", and so on.
+ */
+ @Since("2.2.0")
+ lazy val featureName: Array[String] = {
+ val featureAttrs = AttributeGroup.fromStructField(
+ dataset.schema(model.getFeaturesCol)).attributes
+ if (featureAttrs == None) {
--- End diff --
If I run the example below in spark-shell:
import org.apache.spark.ml.feature.HashingTF
val tf = new HashingTF().setInputCol("x").setOutputCol("hash")
val df = spark.createDataFrame(Seq(
  Tuple3(0.0, Array("a", "b"), 4),
  Tuple3(1.0, Array("b", "c"), 6),
  Tuple3(1.0, Array("a", "c"), 7),
  Tuple3(0.0, Array("b", "c"), 7))).toDF("y", "x", "z")
val dfres = tf.transform(df)
Calling show() gives:
scala> dfres.show
+---+------+---+--------------------+
|  y|     x|  z|                hash|
+---+------+---+--------------------+
|0.0|[a, b]|  4|(262144,[30913,22...|
|1.0|[b, c]|  6|(262144,[28698,30...|
|1.0|[a, c]|  7|(262144,[28698,22...|
|0.0|[b, c]|  7|(262144,[28698,30...|
+---+------+---+--------------------+
But when I look at the schema, the attributes are missing:
import org.apache.spark.ml.attribute.AttributeGroup
scala> AttributeGroup.fromStructField(dfres.schema("hash")).attributes
res5: Option[Array[org.apache.spark.ml.attribute.Attribute]] = None
scala> AttributeGroup.fromStructField(dfres.schema("hash"))
res6: org.apache.spark.ml.attribute.AttributeGroup =
{"ml_attr":{"num_attrs":262144}}
But in this case the names should be of the form hash_{#} instead of V{#}.
For example, when using VectorAssembler on the above:
import org.apache.spark.ml.feature.VectorAssembler
val va = new VectorAssembler().setInputCols(Array("y","z","hash")).setOutputCol("outputs")
scala> va.transform(dfres).show()
+---+------+---+--------------------+--------------------+
|  y|     x|  z|                hash|             outputs|
+---+------+---+--------------------+--------------------+
|0.0|[a, b]|  4|(262144,[30913,22...|(262146,[1,30915,...|
|1.0|[b, c]|  6|(262144,[28698,30...|(262146,[0,1,2870...|
|1.0|[a, c]|  7|(262144,[28698,22...|(262146,[0,1,2870...|
|0.0|[b, c]|  7|(262144,[28698,30...|(262146,[1,28700,...|
+---+------+---+--------------------+--------------------+
scala> print(AttributeGroup.fromStructField(va.transform(dfres).schema("outputs")).attributes.get)
[Lorg.apache.spark.ml.attribute.Attribute;@4416197b
scala> AttributeGroup.fromStructField(va.transform(dfres).schema("outputs")).attributes.get
res22: Array[org.apache.spark.ml.attribute.Attribute] =
Array({"type":"numeric","idx":0,"name":"y"},
{"type":"numeric","idx":1,"name":"z"},
{"type":"numeric","idx":2,"name":"hash_0"},
{"type":"numeric","idx":3,"name":"hash_1"},
{"type":"numeric","idx":4,"name":"hash_2"},
{"type":"numeric","idx":5,"name":"hash_3"},
{"type":"numeric","idx":6,"name":"hash_4"},
{"type":"numeric","idx":7,"name":"hash_5"},
{"type":"numeric","idx":8,"name":"hash_6"},
{"type":"numeric","idx":9,"name":"hash_7"},
{"type":"numeric","idx":10,"name":"hash_8"},
{"type":"numeric","idx":11,"name":"hash_9"},
{"type":"numeric","idx":12,"name":"hash_10"},
{"type":"numeric","idx":13,"name":"hash_11"},
{"type":"numeric","idx":14,"name":"hash_12"},
{"type":"numeric","idx":15,"name":"hash_13"}, {"type":"numeric","idx":16,"nam...
You can see that each attribute is named with the column name followed by
its index.
This looks like a bug in VectorAssembler, because it makes the schema dense
when it should be sparse. Regardless, this seems to be the more official way
to name the attributes, rather than a "V" followed by the index, unless you
have seen "V" + index used elsewhere?
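
A minimal sketch of that naming fallback, to make the suggestion concrete. It
is written in plain Scala so it runs without Spark. The object FeatureNames
and its resolve method are hypothetical names used only for illustration; they
are not part of Spark or of this pull request. The sketch assumes the caller
has already pulled the optional attribute names and the vector size out of the
column metadata, for example via AttributeGroup.fromStructField(...), passing
attributes.map(_.map(_.name)) and size.

```scala
// Hypothetical helper, not part of Spark or of this PR.
object FeatureNames {

  /**
   * Resolve feature names for a vector column.
   *
   * @param colName   name of the features column, used as the fallback prefix
   * @param attrNames per-attribute names if the column carries attributes;
   *                  None when only the size is known (as with HashingTF above)
   * @param size      number of features; values <= 0 are treated as "unknown"
   */
  def resolve(
      colName: String,
      attrNames: Option[Array[Option[String]]],
      size: Int): Array[String] = attrNames match {
    // Attributes exist: keep real names, fill unnamed slots with colName_index.
    case Some(names) =>
      names.zipWithIndex.map { case (name, i) => name.getOrElse(s"${colName}_$i") }
    // No attributes: generate colName_0, colName_1, ... up to the known size.
    case None =>
      Array.tabulate(math.max(size, 0))(i => s"${colName}_$i")
  }
}
```

With the "hash" column above, resolve("hash", None, 262144) would produce
hash_0, hash_1, and so on, which matches the names VectorAssembler assigns.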