Hi Minudika,

> Which options should be available for the user to get decisions? For
> example, if the user is going to use the bagging method, the number of
> samples can be pre-defined by the user.


I think that depends on the implementation. As you've mentioned, the number
of samples would definitely have to be a user input. Other than that, the
sample size, the algorithm to be used, its hyper-parameters, the aggregation
criteria (if there are multiple ways of aggregating), etc. might have to be
taken from the user.
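
To make those options concrete, here is a rough sketch in plain Python (not
Spark). The names BaggingConfig, train_bagging and train_threshold are
hypothetical and only for illustration, and majority voting is just one
possible aggregation criterion. In Spark, the sampling step could be done
with rdd.sample(withReplacement, fraction, seed) instead of random.choices.

```python
import random
from collections import Counter
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Sequence, Tuple

Row = Tuple[Sequence[float], int]          # (features, label)
Model = Callable[[Sequence[float]], int]   # features -> predicted label


@dataclass
class BaggingConfig:
    """Hypothetical container for the options taken from the user."""
    num_samples: int                       # number of samples (one model each)
    sample_fraction: float                 # sample size relative to the data
    train: Callable[..., Model]            # algorithm: train(rows, **hyper_params)
    hyper_params: Dict[str, Any] = field(default_factory=dict)
    aggregation: str = "majority_vote"     # only criterion implemented here
    seed: int = 42


def train_bagging(rows: List[Row], cfg: BaggingConfig) -> Model:
    if cfg.aggregation != "majority_vote":
        raise ValueError("unsupported aggregation: " + cfg.aggregation)
    rng = random.Random(cfg.seed)
    size = max(1, int(len(rows) * cfg.sample_fraction))
    models = []
    for _ in range(cfg.num_samples):
        # Draw one sample with replacement and train one model on it.
        sample = rng.choices(rows, k=size)
        models.append(cfg.train(sample, **cfg.hyper_params))

    def predict(features: Sequence[float]) -> int:
        # Aggregate by majority vote over the individual models.
        votes = Counter(model(features) for model in models)
        return votes.most_common(1)[0][0]

    return predict


def train_threshold(rows: List[Row], feature_index: int = 0) -> Model:
    """Toy algorithm: split halfway between the class means of one feature."""
    pos = [x[feature_index] for x, y in rows if y == 1]
    neg = [x[feature_index] for x, y in rows if y == 0]
    if not pos or not neg:
        constant = 1 if pos else 0         # only one class seen in this sample
        return lambda features: constant
    mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    threshold = (mean_pos + mean_neg) / 2
    if mean_pos >= mean_neg:
        return lambda features: int(features[feature_index] > threshold)
    return lambda features: int(features[feature_index] < threshold)
```

A UI form for bagging would then only need to collect the fields of the
config object, e.g. BaggingConfig(num_samples=5, sample_fraction=1.0,
train=train_threshold, hyper_params={"feature_index": 0}).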

Similarly, for stacking, we might have to get from the user: the number of
models, the algorithm for each model, the hyper-parameters for each model,
the algorithm for aggregation, etc.
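
A similar rough sketch for the stacking inputs, again in plain Python rather
than Spark. BaseModelSpec, StackingConfig, train_stacking and the two toy
algorithms are hypothetical names used only for illustration. As a
simplification, the aggregating model is trained on the base models'
predictions over the same training data; a real implementation would more
likely use held-out or cross-validated predictions.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Sequence, Tuple

Row = Tuple[Sequence[float], int]          # (features, label)
Model = Callable[[Sequence[float]], int]   # features -> predicted label
Trainer = Callable[..., Model]             # train(rows, **hyper_params)


@dataclass
class BaseModelSpec:
    train: Trainer                         # algorithm for this model
    hyper_params: Dict[str, Any] = field(default_factory=dict)


@dataclass
class StackingConfig:
    base_models: List[BaseModelSpec]       # number of models = len(base_models)
    meta_train: Trainer                    # algorithm used for aggregation
    meta_hyper_params: Dict[str, Any] = field(default_factory=dict)


def train_stacking(rows: List[Row], cfg: StackingConfig) -> Model:
    # 1. Train every base model on the same data.
    base = [spec.train(rows, **spec.hyper_params) for spec in cfg.base_models]
    # 2. The base predictions become the features of the aggregating model.
    meta_rows = [([float(m(x)) for m in base], y) for x, y in rows]
    meta = cfg.meta_train(meta_rows, **cfg.meta_hyper_params)

    def predict(features: Sequence[float]) -> int:
        return meta([float(m(features)) for m in base])

    return predict


def train_threshold(rows: List[Row], feature_index: int = 0) -> Model:
    """Toy base algorithm: split halfway between the class means."""
    pos = [x[feature_index] for x, y in rows if y == 1]
    neg = [x[feature_index] for x, y in rows if y == 0]
    if not pos or not neg:
        constant = 1 if pos else 0         # only one class present
        return lambda features: constant
    mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    threshold = (mean_pos + mean_neg) / 2
    if mean_pos >= mean_neg:
        return lambda features: int(features[feature_index] > threshold)
    return lambda features: int(features[feature_index] < threshold)


def train_lookup(rows: List[Row]) -> Model:
    """Toy aggregation algorithm: most common label per prediction pattern."""
    by_pattern: Dict[tuple, Counter] = defaultdict(Counter)
    for x, y in rows:
        by_pattern[tuple(x)][y] += 1
    fallback = Counter(y for _, y in rows).most_common(1)[0][0]

    def predict(features: Sequence[float]) -> int:
        counts = by_pattern.get(tuple(features))
        return counts.most_common(1)[0][0] if counts else fallback

    return predict
```

Here the number of models is simply the length of base_models, so the UI
would let the user add one (algorithm, hyper-parameters) entry per model and
then pick the aggregation algorithm.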

Regards,
Supun

On Wed, Mar 2, 2016 at 3:16 AM, Minudika Malshan <[email protected]>
wrote:

> Hi,
>
> Could you please clarify this for me?
> Apart from the implementation of those ensemble methods at the back end,
> we're supposed to develop some UI features.
> Which options should be available for the user to get decisions? For
> example, if the user is going to use the bagging method, the number of
> samples can be pre-defined by the user.
>
> Regards,
> Minudika
>
> Minudika Malshan
> Undergraduate
> Department of Computer Science and Engineering
> University of Moratuwa.
>
>
>
>
> On Mon, Feb 29, 2016 at 11:04 AM, Supun Sethunga <[email protected]> wrote:
>
>> Hi Minudika,
>>
>> Thank you for your interest in the project.
>>
>> GBT and Random Forest are well-known ensemble methods, and are readily
>> available as single algorithms out of the box (OOB) in Spark. So we need
>> not implement them again. You may treat them like any other simpler
>> algorithm for the project.
>>
>> Let me clarify a few things. For ensemble methods, you can consider the
>> following three options:
>>
>>    - Stacking - Training multiple algos on the same data, and combining
>>    them using another algo.
>>    - Bagging - Training a single algo over subsets of data.
>>    - Boosting - Training multiple algos on the same data, and combining
>>    them over a weighted average.
>>
>> Personally I would prefer picking Stacking (since Boosting is a special
>> case of Stacking, the latter would cover both) and Bagging for
>> implementation, but you may pick as appropriate. AFAIK these three methods
>> are not available OOB in Spark (except for Boosting in GBT and Bagging
>> in Random Forest). The expectation of the project is to implement logic
>> where a user can use any algorithm(s), pick the ensemble method, and
>> train a model.
>>
>> For bagging, you can use the sampling techniques available in Spark (e.g.
>> rdd.sample(), df.sample(), etc.) [1].
>>
>> Please do let us know if you need further clarifications.
>>
>> [1] http://spark.apache.org/docs/latest/api/java/
>>
>> Regards,
>> Supun
>>
>> On Mon, Feb 29, 2016 at 12:07 AM, Minudika Malshan <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I found out that Spark MLlib supports two ensemble algorithms, GBT and
>>> Random Forest.
>>> Will it be possible to implement Bagging and Boosting methods using MLlib
>>> features?
>>>
>>> Also, I would be grateful if you could give me some resources for getting
>>> started with implementing the Bagging method using MLlib functionality. If
>>> there is any other library that we are allowed to use for this
>>> implementation, please let me know.
>>>
>>> Thanks and regards.
>>> Minudika
>>>
>>> Minudika Malshan
>>> Undergraduate
>>> Department of Computer Science and Engineering
>>> University of Moratuwa.
>>>
>>>
>>>
>>> _______________________________________________
>>> Dev mailing list
>>> [email protected]
>>> http://wso2.org/cgi-bin/mailman/listinfo/dev
>>>
>>>
>>
>>
>> --
>> *Supun Sethunga*
>> Software Engineer
>> WSO2, Inc.
>> http://wso2.com/
>> lean | enterprise | middleware
>> Mobile : +94 716546324
>>
>
>


-- 
*Supun Sethunga*
Software Engineer
WSO2, Inc.
http://wso2.com/
lean | enterprise | middleware
Mobile : +94 716546324
