tgravescs commented on issue #25983: [SPARK-29327][MLLIB]Support specifying features via multiple columns
URL: https://github.com/apache/spark/pull/25983#issuecomment-548811783

Right, but if it's not in Spark then multiple algos/libs can't use it and will likely re-invent the same thing. People are not going to see it in XGBoost and then decide to add it to Spark. This is why I encouraged Liangcai to propose it.

To me it's useful to XGBoost, so the question is then whether it could be useful to other algorithms in the future. I thought the answer was yes for the following reasons:

- Simpler, more user friendly API
- more obvious to the user how to send in multiple features
- user doesn't have to create a VectorAssembler
- algorithm could create the Vector behind the scenes if required
- algorithm could skip it if it's not required
- don't lose the information about original column names
- give the algorithm more flexibility
- more performant if it can skip the VectorAssembler
- a GPU algorithm could be a lot more performant if the data is already columnar

On top of that, it's really not much code, so I didn't think it introduced much maintenance on the Spark side. We could mark it as Experimental as well and remove it if it's really not used by anyone else.

I'm by no means an expert in all ML algorithms, so if that doesn't make sense and you don't think it matches the way everyone is doing it, then we can just leave it in XGBoost.
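
To make the points above about creating the Vector only when required a bit more concrete, here is a minimal plain-Python sketch. It is not Spark or XGBoost code: `MultiColumnEstimator`, `assemble_rows`, and the `needs_row_vectors` flag are hypothetical names used purely for illustration, and the dict-of-lists stands in for a columnar DataFrame.

```python
# Hypothetical, Spark-free sketch: none of these names exist in Spark or XGBoost.
from typing import Dict, List, Sequence

Columns = Dict[str, List[float]]  # column name -> column values (columnar layout)


def assemble_rows(data: Columns, feature_cols: Sequence[str]) -> List[List[float]]:
    """Stand-in for a vector-assembly step: zip columns into per-row vectors."""
    return [list(row) for row in zip(*(data[c] for c in feature_cols))]


class MultiColumnEstimator:
    """Toy estimator configured with several feature columns, not one vector column."""

    def __init__(self, feature_cols: Sequence[str], needs_row_vectors: bool):
        self.feature_cols = list(feature_cols)
        self.needs_row_vectors = needs_row_vectors

    def prepare(self, data: Columns):
        if self.needs_row_vectors:
            # Row-oriented algorithm: assemble the vectors behind the scenes.
            return assemble_rows(data, self.feature_cols)
        # Columnar algorithm: skip assembly and keep the original column names.
        return {c: data[c] for c in self.feature_cols}


data = {"age": [31.0, 45.0], "income": [52.0, 88.0], "label": [0.0, 1.0]}
row_based = MultiColumnEstimator(["age", "income"], needs_row_vectors=True).prepare(data)
columnar = MultiColumnEstimator(["age", "income"], needs_row_vectors=False).prepare(data)
```

In this sketch the caller only names the feature columns; whether a per-row vector is ever built is left to the algorithm, and the columnar path keeps the column names intact.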
