[
https://issues.apache.org/jira/browse/SPARK-31671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean R. Owen resolved SPARK-31671.
----------------------------------
Fix Version/s: 2.4.7, 3.0.0
Assignee: YijieFan
Resolution: Fixed
Resolved by https://github.com/apache/spark/pull/28487
> Wrong error message in VectorAssembler when column lengths can not be
> inferred
> -------------------------------------------------------------------------------
>
> Key: SPARK-31671
> URL: https://issues.apache.org/jira/browse/SPARK-31671
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 2.4.4
> Environment: Mac OS catalina
> Reporter: YijieFan
> Assignee: YijieFan
> Priority: Minor
> Fix For: 3.0.0, 2.4.7
>
> Original Estimate: 72h
> Remaining Estimate: 72h
>
> In VectorAssembler, when input column lengths can not be inferred and
> handleInvalid = "keep", a runtime exception is thrown with a message like
> the one below:
> Can not infer column lengths with handleInvalid = "keep". Consider using
> VectorSizeHint to add metadata for columns: [column1, column2].
> However, even if you set a vector size hint for *column1*, the message stays
> the same and does not change to list only *[column2]*. This is not consistent
> with what the error message describes.
> This makes the exception hard to resolve, because I do not know which
> columns still require a VectorSizeHint. It is especially troublesome when
> there are many columns to deal with.
> Here is a simple example:
>
> {code:java}
> import org.apache.spark.ml.feature.{VectorAssembler, VectorSizeHint}
> import org.apache.spark.ml.linalg.Vectors
> // toDF needs spark.implicits._ (already in scope in spark-shell)
> // create a df without vector size metadata
> val df = Seq(
>   (Vectors.dense(1.0), Vectors.dense(2.0))
> ).toDF("n1", "n2")
> // only set the vector size hint for the n1 column
> val hintedDf = new VectorSizeHint()
>   .setInputCol("n1")
>   .setSize(1)
>   .transform(df)
> // assemble n1 and n2
> val output = new VectorAssembler()
>   .setInputCols(Array("n1", "n2"))
>   .setOutputCol("features")
>   .setHandleInvalid("keep")
>   .transform(hintedDf)
> // because only n1 has a vector size, the error message should tell us to
> // set the vector size for n2 too
> output.show()
> {code}
> Expected error message:
>
> {code:java}
> Can not infer column lengths with handleInvalid = "keep". Consider using
> VectorSizeHint to add metadata for columns: [n2].
> {code}
> Actual error message:
> {code:java}
> Can not infer column lengths with handleInvalid = "keep". Consider using
> VectorSizeHint to add metadata for columns: [n1, n2].
> {code}
> I changed one line in VectorAssembler.scala so that it works as
> expected.
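
The behaviour requested in the report can be illustrated without Spark. The following is only a sketch of the idea, not the actual patch in the pull request: the object name `MissingSizeMessage`, the function `keepErrorMessage`, and the use of `Option[Int]` to stand in for column size metadata are all hypothetical. It builds the message from only those columns whose size is unknown, rather than from every input column.

```scala
object MissingSizeMessage {
  // Hypothetical stand-in for column metadata:
  // Some(n) when the vector size is known, None when it is not.
  def keepErrorMessage(columnSizes: Seq[(String, Option[Int])]): Option[String] = {
    // Keep only the columns that still lack a size; these are the ones
    // that actually need a VectorSizeHint.
    val missing = columnSizes.collect { case (name, None) => name }
    if (missing.isEmpty) {
      None // every size is known, so there is nothing to report
    } else {
      val listed = missing.mkString("[", ", ", "]")
      Some(
        "Can not infer column lengths with handleInvalid = \"keep\". Consider using " +
          s"VectorSizeHint to add metadata for columns: $listed.")
    }
  }
}
```

With `n1` hinted and `n2` not, as in the example in the report, `MissingSizeMessage.keepErrorMessage(Seq("n1" -> Some(1), "n2" -> None))` yields a message that names only `n2`.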