Hello,

I am testing the scalability of Vertical Hoeffding Tree (VHT) on SAMOA-Storm 
consuming streams from Kafka. So far, I have tested with up to 32 m4.large VMs 
on Amazon EC2; however, throughput barely improves. Storm consumes streams 
from Kafka at 30 MB/s with 1 VM, and this throughput stays almost the same up 
to 32 VMs.

Here are the experimental settings:
* SAMOA: latest on github as of April 2017
* Storm: version 0.10.1
* Dataset: forest covertype (54 attributes, 
https://archive.ics.uci.edu/ml/datasets/Covertype)
* Kafka connector: implementation proposed for SAMOA-40 
(https://github.com/apache/incubator-samoa/pull/32)
* Scaling policy: assign one core per LocalStatisticsProcessor
* Tested with Prequential Evaluation

I read the Vertical Hoeffding Tree paper from IEEE BigData 2016, but I could 
not find any information on how the throughput of VHT scales as more 
resources are added (it only shows relative performance improvements compared 
to the standard Hoeffding tree).

Has anyone scaled VHT successfully, with or without Kafka? Are there any tips 
for achieving high throughput with VHT?
I believe datasets with more attributes lead to better scalability for VHT, 
so I am planning to try that next, but I would expect 54 attributes to scale 
at least a little.
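
To make the attribute argument concrete, here is a rough back-of-the-envelope 
sketch. It assumes that attributes are spread as evenly as possible across 
LocalStatisticsProcessors and that all work happens there, ignoring 
communication and the model aggregator; neither assumption is confirmed by the 
paper or the code. AttributeParallelismBound and its methods are hypothetical 
names, not part of SAMOA.

```java
/** Hypothetical back-of-the-envelope model; not part of SAMOA. */
final class AttributeParallelismBound {

    private AttributeParallelismBound() {}

    /**
     * Largest number of attributes held by any single processor when the
     * attributes are spread as evenly as possible (an assumption).
     */
    static int maxAttributesPerProcessor(int attributes, int processors) {
        if (attributes <= 0 || processors <= 0) {
            throw new IllegalArgumentException("arguments must be positive");
        }
        return (attributes + processors - 1) / processors; // ceiling division
    }

    /**
     * Idealized speedup over one processor: the busiest processor bounds the
     * time per instance. Ignores communication and aggregation costs.
     */
    static double idealSpeedup(int attributes, int processors) {
        return (double) attributes / maxAttributesPerProcessor(attributes, processors);
    }
}
```

Under these assumptions, idealSpeedup(54, 32) is 27 and the bound can never 
exceed 54 for this dataset, so some improvement over 1 VM should be visible 
if attribute processing were the bottleneck.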

Also, I found the following one-second sleep in 
StormEntranceProcessingItem.java. It looks to me as if this hinders 
high-throughput processing. Can we get rid of this sleep?
    public void nextTuple() {
      if (entranceProcessor.hasNext()) {
        Values value = newValues(entranceProcessor.nextEvent());
        collector.emit(outputStream.getOutputId(), value);
      } else
        Utils.sleep(1000);
      // StormTupleInfo tupleInfo = tupleInfoQueue.poll(50,
      // TimeUnit.MILLISECONDS);
      // if (tupleInfo != null) {
      // Values value = newValues(tupleInfo.getContentEvent());
      // collector.emit(tupleInfo.getStormStream().getOutputId(), value);
      // }
    }
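
One possible alternative to the fixed one-second sleep, as a minimal sketch 
only: sleep briefly when the source is empty and back off exponentially up to 
a cap, so that a short gap in the stream does not stall the spout for a full 
second. IdleBackoff is a hypothetical helper, not an existing SAMOA or Storm 
class, and I have not measured whether it changes throughput.

```java
/** Hypothetical helper (not part of SAMOA or Storm): bounded exponential idle backoff. */
final class IdleBackoff {
    private final long minSleepMs;
    private final long maxSleepMs;
    private long currentSleepMs;

    IdleBackoff(long minSleepMs, long maxSleepMs) {
        if (minSleepMs <= 0 || maxSleepMs < minSleepMs) {
            throw new IllegalArgumentException("need 0 < minSleepMs <= maxSleepMs");
        }
        this.minSleepMs = minSleepMs;
        this.maxSleepMs = maxSleepMs;
        this.currentSleepMs = minSleepMs;
    }

    /** Call after a successful emit: the next idle period starts from the minimum again. */
    void reset() {
        currentSleepMs = minSleepMs;
    }

    /**
     * Call when there is nothing to emit: returns the sleep to use now and
     * doubles the sleep for the next idle call, capped at maxSleepMs.
     */
    long nextSleepMs() {
        long sleep = currentSleepMs;
        // Compare against max/2 first so the doubling cannot overflow.
        currentSleepMs = (currentSleepMs > maxSleepMs / 2) ? maxSleepMs : currentSleepMs * 2;
        return sleep;
    }
}
```

In nextTuple(), this would mean calling backoff.reset() after collector.emit(...) 
and replacing Utils.sleep(1000) with Utils.sleep(backoff.nextSleepMs()), where 
backoff is, for example, new IdleBackoff(1, 64). The bounds are placeholder 
values, not tuned ones.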

Any suggestions would be appreciated.

Thank you,
Shigeru

-- 
Shigeru Imai  <[email protected]>
Ph.D. candidate
Worldwide Computing Laboratory
Department of Computer Science
Rensselaer Polytechnic Institute
110 8th Street, Troy, NY 12180, USA
http://wcl.cs.rpi.edu/
