WeichenXu123 commented on PR #43199: URL: https://github.com/apache/spark/pull/43199#issuecomment-1749926141
> It seems it is not consistent with existing impl:
> 1, existing impl accepts double type and vector type, this pr accepts double type and array type;
> 2, existing impl outputs vector type, this pr outputs array type;
> 3, existing impl doesn't have parameter `inputFeatureSizeList`

Yes, these differences are by design:

> existing impl accepts double type and vector type; existing impl outputs vector type, this pr outputs array type;

All other Spark Connect ML estimators / transformers only support array-type feature input for now. The reason is that Spark Connect ML needs to support either a Spark DataFrame or a pandas DataFrame, and we haven't worked out how to support the "ml.Vector" type in a pandas DataFrame. We could design and support it in the future.

> existing impl doesn't have parameter `inputFeatureSizeList`

Yes, I added `inputFeatureSizeList` to simplify the implementation. The legacy VectorAssembler uses column metadata or checks the first row of the dataset to get the feature size of each input column. By setting `inputFeatureSizeList`, we can avoid the extra request that reads the first row.
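To illustrate the idea behind `inputFeatureSizeList`, here is a minimal plain-Python sketch. It is not the code in this PR: `assemble_row` and its `feature_sizes` argument are hypothetical names used only for illustration, and the sketch assumes each input is either a double or an array of doubles. Because the sizes are declared up front, the output width is known without inspecting any data.

```python
from typing import List, Sequence, Union

# Each input column holds either a double or an array of doubles.
Value = Union[float, Sequence[float]]


def assemble_row(values: Sequence[Value], feature_sizes: Sequence[int]) -> List[float]:
    """Concatenate scalar and array inputs into one flat list of floats.

    `feature_sizes` plays the role of a declared per-column size list
    (hypothetical stand-in for `inputFeatureSizeList`).
    """
    if len(values) != len(feature_sizes):
        raise ValueError("need exactly one declared size per input column")
    out: List[float] = []
    for value, size in zip(values, feature_sizes):
        # A scalar (double) column is treated as an array of length 1.
        items = [value] if isinstance(value, (int, float)) else list(value)
        if len(items) != size:
            raise ValueError(f"expected {size} value(s), got {len(items)}")
        out.extend(float(x) for x in items)
    return out


feature_sizes = [1, 2, 1]
# The output width follows from the declared sizes alone, no row is read.
output_width = sum(feature_sizes)
row = assemble_row([1.0, [2.0, 3.0], [4.0]], feature_sizes)
print(output_width, row)
```

The declared sizes also act as a per-row validity check here: a row whose array length differs from the declared size raises an error instead of silently producing a misaligned output.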
