[GitHub] [spark] zhengruifeng commented on issue #25983: [SPARK-29327][MLLIB]Support specifying features via multiple columns

GitBox Thu, 24 Oct 2019 04:15:03 -0700

zhengruifeng commented on issue #25983: [SPARK-29327][MLLIB]Support specifying 
features via multiple columns
URL: https://github.com/apache/spark/pull/25983#issuecomment-545870132
 
 
   > VectorAssembler has to make a pass over the data and merge multiple 
columns.
   `VectorAssembler` only triggers a `first()` job to get the sizes of the input 
vectors.
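
   As a rough illustration of that point, here is a plain-Python sketch (not Spark code and not how `VectorAssembler` is actually implemented). The helpers `column_sizes` and `assemble_lazily` and the row layout are hypothetical names made up for this example. It only shows the idea: a single row is inspected eagerly to learn the vector sizes, and the merge itself happens lazily, row by row, when a downstream consumer asks for the data.

```python
from typing import Dict, Iterable, Iterator, List, Sequence, Union

Value = Union[float, Sequence[float]]
Row = Dict[str, Value]


def column_sizes(first_row: Row, input_cols: List[str]) -> Dict[str, int]:
    """Infer the width of each input column from a single row."""
    sizes = {}
    for col in input_cols:
        value = first_row[col]
        # A list/tuple stands in for a vector column; anything else is a scalar.
        sizes[col] = len(value) if isinstance(value, (list, tuple)) else 1
    return sizes


def assemble_lazily(rows: Iterable[Row], input_cols: List[str]) -> Iterator[List[float]]:
    """Yield one concatenated feature vector per row, only when requested."""
    for row in rows:
        features: List[float] = []
        for col in input_cols:
            value = row[col]
            if isinstance(value, (list, tuple)):
                features.extend(float(v) for v in value)
            else:
                features.append(float(value))
        yield features


# Hypothetical toy data: one scalar column and one vector column.
rows = [
    {"age": 31, "embedding": [0.1, 0.2, 0.3]},
    {"age": 45, "embedding": [0.4, 0.5, 0.6]},
]
cols = ["age", "embedding"]

sizes = column_sizes(rows[0], cols)      # the only eager step: one row is read
assembled = assemble_lazily(rows, cols)  # a generator; nothing is merged yet
first_vector = next(assembled)           # merging happens here, on demand
```

   In this sketch the eager step touches only `rows[0]`, which plays the role of the `first()` job; the remaining rows are merged only as they are consumed.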
   
   > Many ML algorithms prefer columnar data and this allows the algorithm to 
determine what it wants to do with the columns.
   Do you mean the column-based parallelism used in distributed tree building? That 
functionality is not exposed to end users; all you need to do is set 
params like `(..., updater=distcol)`.
   If some algorithm would benefit from column-based parallelism, I think it is better 
to split the features internally. No algorithm in MLlib is currently designed to 
fit/transform column-based datasets, so I would prefer not to add this feature.
   
   > It is being used with XGBoost.
   I cannot find any related docs in 
[XGBoost Parameters](https://xgboost.readthedocs.io/en/latest/parameter.html#xgboost-parameters). 
Could you please provide a link for this?


