WeichenXu123 commented on PR #43199:
URL: https://github.com/apache/spark/pull/43199#issuecomment-1749926141

   > It seems it is not consistent with existing impl: 1, existing impl accepts 
double type and vector type, this pr accepts double type and array type; 2, 
existing impl outputs vector type, this pr outputs array type; 3, existing impl 
doesn't have parameter `inputFeatureSizeList`
   
   Yes, these differences are by design:
   
   > existing impl accepts double type and vector type; existing impl outputs 
vector type, this pr outputs array type; 
   
   All other Spark Connect ML estimators and transformers only support array-type 
feature input for now. The reason is that Spark Connect ML needs to work with either 
a Spark DataFrame or a pandas DataFrame, and we haven't worked out how to support 
the "ml.Vector" type in a pandas DataFrame. We could design and support that in the future.
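
   As a rough illustration of why plain arrays are the easier common denominator, 
here is a minimal sketch that assembles double and array features into one flat 
list of floats using only built-in Python types, with no vector class involved. 
`assemble_row` is a hypothetical helper written for this comment; it is not part 
of the Spark API and not the code in this PR.

```python
from typing import Dict, List, Sequence, Union

# A feature value is either a single number or an array of numbers.
Feature = Union[float, Sequence[float]]


def assemble_row(row: Dict[str, Feature], input_cols: List[str]) -> List[float]:
    """Concatenate scalar and array features into one flat list of floats.

    Hypothetical helper for illustration only.
    """
    out: List[float] = []
    for col in input_cols:
        value = row[col]
        if isinstance(value, (int, float)):
            # Double-typed column: contributes exactly one element.
            out.append(float(value))
        else:
            # Array-typed column: contributes all of its elements in order.
            out.extend(float(v) for v in value)
    return out


row = {"age": 31.0, "embedding": [0.1, 0.2, 0.3]}
assembled = assemble_row(row, ["age", "embedding"])
```

   The result is an ordinary list, which can be stored in a pandas DataFrame cell 
or a Spark array column without any extra type support.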
   
   > existing impl doesn't have parameter `inputFeatureSizeList`
   
   Yes, I added `inputFeatureSizeList` to simplify the implementation. The legacy 
VectorAssembler uses column metadata, or checks the first row of the dataset, to 
get the feature size of each input column. Setting `inputFeatureSizeList` avoids 
the extra request to read the first row. 
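
   The trade-off can be sketched in plain Python under a few assumptions: rows are 
dicts, and both helpers below (`sizes_from_first_row`, `check_declared_sizes`) are 
hypothetical names for illustration, not functions from Spark or from this PR. 
The first helper has to read data to learn the sizes; the second only validates 
each row against sizes the caller declared up front.

```python
from typing import Dict, List, Sequence, Union

Feature = Union[float, Sequence[float]]


def feature_size(value: Feature) -> int:
    """A scalar counts as size 1; an array counts as its length."""
    return 1 if isinstance(value, (int, float)) else len(value)


def sizes_from_first_row(rows: List[Dict[str, Feature]], input_cols: List[str]) -> List[int]:
    """Infer sizes by reading the first row (the extra read we want to avoid)."""
    first = rows[0]
    return [feature_size(first[col]) for col in input_cols]


def check_declared_sizes(row: Dict[str, Feature], input_cols: List[str], sizes: List[int]) -> None:
    """Validate one row against sizes declared up front; no prior data read needed."""
    for col, expected in zip(input_cols, sizes):
        actual = feature_size(row[col])
        if actual != expected:
            raise ValueError(f"column {col!r}: expected size {expected}, got {actual}")


rows = [
    {"age": 31.0, "embedding": [0.1, 0.2, 0.3]},
    {"age": 45.0, "embedding": [0.4, 0.5, 0.6]},
]
cols = ["age", "embedding"]
declared = [1, 3]  # plays the role of inputFeatureSizeList in this sketch

for r in rows:
    check_declared_sizes(r, cols, declared)  # passes without peeking at the data first
```

   With declared sizes, the output width is known before any data is touched, and a 
mismatched row fails with an explicit error.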
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

