fan31415 opened a new pull request #28487:
URL: https://github.com/apache/spark/pull/28487


   ### What changes were proposed in this pull request?
   When input column lengths cannot be inferred and handleInvalid = "keep",
VectorAssembler throws a runtime exception. However, the error message lists
every input column, including columns whose lengths are already known. This PR
changes the message so that it lists only the columns that are missing size
metadata.
   
   
   ### Why are the changes needed?
   This is a bug. Here is a simple example to reproduce it.
   
   ```
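   // imports needed outside spark-shell (toDF also needs spark.implicits._)
   import org.apache.spark.ml.feature.{VectorAssembler, VectorSizeHint}
   import org.apache.spark.ml.linalg.Vectors
   
   // create a df whose vector columns carry no size metadata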
   val df = Seq(
     (Vectors.dense(1.0), Vectors.dense(2.0))
   ).toDF("n1", "n2")
   
   // only set vector size hint for n1 column
   val hintedDf = new VectorSizeHint()
     .setInputCol("n1")
     .setSize(1)
     .transform(df)
   
   // assemble n1, n2
   val output = new VectorAssembler()
     .setInputCols(Array("n1", "n2"))
     .setOutputCol("features")
     .setHandleInvalid("keep")
     .transform(hintedDf)
   
   // because only n1 has a vector size, the error message should tell us to
   // set the vector size for n2 too
   output.show()
   ```
   
   Expected error message:
   
   ```
   Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint to add metadata for columns: [n2].
   ```
   
   Actual error message:
   
   ```
   Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint to add metadata for columns: [n1, n2].
   ```
   
   This makes the exception hard to resolve, because the message does not show
which columns actually require a VectorSizeHint. This is especially troublesome
when there are many columns to deal with.
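   
   The intended behavior can be sketched in plain Scala, without Spark. This is
a simplified, hypothetical model and not the code changed in this PR: the
object `MissingSizeSketch`, its two helpers, and the `knownSizes` map (column
name to vector size, standing in for the column metadata) are all assumptions
made for illustration.
   
   ```scala
   // Hypothetical model of the fix: only columns with no known size are reported.
   object MissingSizeSketch {
     // knownSizes stands in for size metadata: column name -> vector size.
     // Returns the input columns that have no entry, in their original order.
     def columnsMissingSize(
         inputCols: Seq[String],
         knownSizes: Map[String, Int]): Seq[String] =
       inputCols.filterNot(knownSizes.contains)
   
     // Builds a message in the same shape as the one quoted above.
     def errorMessage(missing: Seq[String]): String =
       "Can not infer column lengths with handleInvalid = \"keep\". Consider using " +
         s"VectorSizeHint to add metadata for columns: ${missing.mkString("[", ", ", "]")}."
   }
   ```
   
   With `inputCols = Seq("n1", "n2")` and `knownSizes = Map("n1" -> 1)`, as in
the example above, `columnsMissingSize` returns only `n2`, so the message names
the single column that still needs a size hint.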
   
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   
   ### How was this patch tested?
   Added a test in VectorAssemblerSuite.
   

