Hi all!

I am running some tests of online machine learning on data streams from
social media. I came across Apache SAMOA, and it seems like a very
interesting framework.
However, I could not figure out how to get it to test and train using
sparse tf-idf feature vectors. I provide the data in the standard WEKA
ARFF format, and although it runs, the output is something along the
lines of:

2015-05-12 22:58:58,993 [main] INFO com.yahoo.labs.samoa.evaluation.EvaluatorProcessor (EvaluatorProcessor.java:189) - com.yahoo.labs.samoa.evaluation.EvaluatorProcessorid = 0
> evaluation instances,classified instances,classifications correct (percent),Kappa Statistic (percent),Kappa Temporal Statistic (percent)
> 100.0,100.0,100.0,100.0,?
> 200.0,200.0,100.0,100.0,?
> 300.0,300.0,100.0,100.0,?
> 400.0,400.0,100.0,100.0,?
> 500.0,500.0,100.0,100.0,?
> 600.0,600.0,100.0,100.0,?
> 700.0,700.0,100.0,100.0,?
> 800.0,800.0,100.0,100.0,?
> 900.0,900.0,100.0,100.0,?
> 1000.0,1000.0,100.0,100.0,?
> 1100.0,1100.0,100.0,100.0,?
> 1200.0,1200.0,100.0,100.0,?
> 1300.0,1300.0,100.0,100.0,?
> 1400.0,1400.0,100.0,100.0,?
> 1500.0,1500.0,100.0,100.0,?
> 1600.0,1600.0,100.0,100.0,?
> 1700.0,1700.0,100.0,100.0,?
> 1800.0,1800.0,100.0,100.0,?
> 1900.0,1900.0,100.0,100.0,?



I have read the documentation on the SAMOA project page, but I wasn't able
to figure out how to get classification results per instance.
Could you please point me in the right direction regarding the input
formats SAMOA accepts as a stream? Does the data need to include a
labeled training set?
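
To make the question concrete, here is a minimal sketch of the kind of
sparse ARFF input I mean. It only illustrates the WEKA sparse row layout
("{index value, ...}" with 0-based indices). The function name
to_sparse_arff, the vocabulary, and the class labels are made up for
illustration and are not my real data. Whether SAMOA's ARFF reader handles
this layout is exactly what I am unsure about.

```python
def to_sparse_arff(relation, vocabulary, class_labels, documents):
    """Render tf-idf documents as WEKA sparse ARFF text.

    documents: iterable of (weights, label) pairs, where weights maps a
    term to its tf-idf weight. All names here are hypothetical.
    Assumes terms and labels need no ARFF quoting (no spaces or commas).
    """
    index = {term: i for i, term in enumerate(vocabulary)}
    class_index = len(vocabulary)  # the class attribute is declared last

    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute {term} numeric" for term in vocabulary]
    lines.append("@attribute class {" + ",".join(class_labels) + "}")
    lines += ["", "@data"]

    for weights, label in documents:
        # Sparse rows list only the nonzero entries as "<index> <value>",
        # in ascending index order; omitted numeric entries mean 0.
        pairs = sorted((index[t], w) for t, w in weights.items() if w != 0)
        cells = [f"{i} {w:g}" for i, w in pairs]
        cells.append(f"{class_index} {label}")
        lines.append("{" + ", ".join(cells) + "}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    vocab = ["samoa", "stream", "tweet"]
    docs = [({"samoa": 0.5, "tweet": 0.25}, "pos"), ({"stream": 1.0}, "neg")]
    print(to_sparse_arff("tweets", vocab, ["neg", "pos"], docs))
```

Each data row carries its class label as the last index, so every
instance in a file like this is labeled.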

Any examples you could provide me with that are not already in the
documentation would be most welcome!


Kind Regards,

Ilias Bertsimas.
