zhengruifeng edited a comment on issue #25983: [SPARK-29327][MLLIB] Support specifying features via multiple columns
URL: https://github.com/apache/spark/pull/25983#issuecomment-545870132

@tgravescs

> VectorAssembler has to make a pass over the data and merge multiple columns.

`VectorAssembler` only triggers a `first()` job to get the sizes of the input vectors.

> Many ML algorithms prefer columnar data and this allows the algorithm to determine what it wants to do with the columns.

Do you mean the column-based parallelism used in distributed tree building? That functionality is not exposed to end users; all you need to do is set params like `(..., updater=distcol)`. If an algorithm would benefit from column-based parallelism, I think it is better to split the features internally. No algorithm in MLlib is currently designed to fit/transform column-based datasets, so I would prefer not to add this feature.

> It is being used with XGBoost.

I cannot find any related docs in [XGBoost Parameters](https://xgboost.readthedocs.io/en/latest/parameter.html#xgboost-parameters). Could you please provide a link for this?
