[GitHub] [spark] tgravescs commented on issue #25983: [SPARK-29327][MLLIB]Support specifying features via multiple columns

GitBox Fri, 01 Nov 2019 07:38:30 -0700

tgravescs commented on issue #25983: [SPARK-29327][MLLIB]Support specifying 
features via multiple columns
URL: https://github.com/apache/spark/pull/25983#issuecomment-548811783
 
 
   Right, but if it's not in Spark, then multiple algorithms and libraries can't 
use it and will likely reinvent the same thing. People are not going to see it 
in XGBoost and then decide to add it to Spark. This is why I encouraged Liangcai 
to propose it. To me it's useful to XGBoost, so the question is whether it could 
be useful to other algorithms in the future. I thought the answer was yes, for 
the following reasons:
   
   - Simpler, more user-friendly API
     - more obvious to the user how to pass in multiple features
     - the user doesn't have to create a VectorAssembler
     - the algorithm could create the Vector behind the scenes if required
     - the algorithm could skip that step if it's not required
     - doesn't lose the information about the original column names
     - gives the algorithm more flexibility
   - More performant if the VectorAssembler can be skipped
   - A GPU algorithm could be a lot more performant if the data is already columnar
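
   To make the idea concrete, here is a minimal sketch in plain Python. It is 
not Spark code and not the API proposed in this PR: `ToyEstimator`, `assemble`, 
and the `needs_vector` flag are hypothetical names used only for illustration, 
and a dict of lists stands in for a DataFrame. It only shows how an algorithm 
that receives column names could assemble vectors itself when it needs them, or 
skip that step and keep the named columns.

```python
from typing import Dict, List, Sequence

# Toy columnar dataset: column name -> list of values (stand-in for a DataFrame).
Columns = Dict[str, List[float]]


def assemble(data: Columns, feature_cols: Sequence[str]) -> List[List[float]]:
    """Pack the named columns into one feature vector per row,
    roughly the job a vector-assembling step would do."""
    return [list(row) for row in zip(*(data[c] for c in feature_cols))]


class ToyEstimator:
    """Hypothetical estimator that is given feature column names directly."""

    def __init__(self, feature_cols: Sequence[str], needs_vector: bool):
        self.feature_cols = list(feature_cols)
        self.needs_vector = needs_vector

    def prepare(self, data: Columns):
        if self.needs_vector:
            # Row-oriented algorithm: build the vectors behind the scenes.
            return assemble(data, self.feature_cols)
        # Columnar algorithm: skip assembly and keep the original column names.
        return {c: data[c] for c in self.feature_cols}


data = {"age": [31.0, 45.0], "income": [52.0, 87.0], "label": [0.0, 1.0]}
row_input = ToyEstimator(["age", "income"], needs_vector=True).prepare(data)
col_input = ToyEstimator(["age", "income"], needs_vector=False).prepare(data)
print(row_input)  # one vector per row
print(col_input)  # selected columns, names preserved
```

   In both cases the caller only lists the feature columns; whether a vector is 
ever built is left to the algorithm.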
   
   On top of that, it's really not much code, so I didn't think it 
introduced much maintenance burden on the Spark side. We could also mark it as 
Experimental and remove it if nobody else ends up using it.
   
   I'm by no means an expert in all ML algorithms, so if that doesn't make 
sense and you don't think it matches the way everyone is doing it, then we can 
just leave it in XGBoost.


