Hello,

I am testing the scalability of the Vertical Hoeffding Tree (VHT) on SAMOA-Storm, consuming streams from Kafka. So far, I have tested with up to 32 m4.large VMs on Amazon EC2; however, throughput barely improves. Storm consumes streams from Kafka at 30 Mbytes/sec with 1 VM, and this throughput stays almost the same up to 32 VMs.
Here are the experimental settings:

* SAMOA: latest on GitHub as of April 2017
* Storm: version 0.10.1
* Dataset: Forest Covertype (54 attributes, https://archive.ics.uci.edu/ml/datasets/Covertype)
* Kafka connector: implementation proposed for SAMOA-40 (https://github.com/apache/incubator-samoa/pull/32)
* Scaling policy: assign one core per LocalStatisticsProcessor
* Tested with Prequential Evaluation

I read the Vertical Hoeffding Tree paper from IEEE BigData 2016, but I could not find information on how the throughput of VHT scales as more resources are added (it only shows relative performance improvements over the standard Hoeffding tree).

Has anyone scaled VHT successfully, with or without Kafka? Are there any tips for achieving high throughput with VHT?

I believe datasets with more attributes lead to better scalability for VHT, so I plan to try that next, but I would expect 54 attributes to scale at least a little.

Also, I found the following one-second sleep in StormEntranceProcessingItem.java. It looks to me like it hinders high-throughput processing. Can we get rid of this sleep?

    public void nextTuple() {
      if (entranceProcessor.hasNext()) {
        Values value = newValues(entranceProcessor.nextEvent());
        collector.emit(outputStream.getOutputId(), value);
      } else
        Utils.sleep(1000);
      // StormTupleInfo tupleInfo = tupleInfoQueue.poll(50,
      // TimeUnit.MILLISECONDS);
      // if (tupleInfo != null) {
      // Values value = newValues(tupleInfo.getContentEvent());
      // collector.emit(tupleInfo.getStormStream().getOutputId(), value);
      // }
    }

Any suggestions would be appreciated.

Thank you,
Shigeru

--
Shigeru Imai <[email protected]>
Ph.D. candidate
Worldwide Computing Laboratory
Department of Computer Science
Rensselaer Polytechnic Institute
110 8th Street, Troy, NY 12180, USA
http://wcl.cs.rpi.edu/
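P.S. To make the sleep question concrete: if removing the sleep entirely turns out to cause busy-spinning while the source is empty, one possible alternative is a short, growing back-off in place of the fixed one-second pause. Below is a minimal, self-contained sketch of that idea. IdleBackoff and its method names are hypothetical and are not part of SAMOA or Storm, and I have not measured whether this changes throughput.

```java
// Hypothetical helper, NOT part of SAMOA or Storm: computes how long an
// idle spout could sleep, starting small and doubling up to a cap.
final class IdleBackoff {
    private final long minMillis;
    private final long maxMillis;
    private long currentMillis;

    IdleBackoff(long minMillis, long maxMillis) {
        if (minMillis <= 0 || maxMillis < minMillis) {
            throw new IllegalArgumentException("require 0 < minMillis <= maxMillis");
        }
        this.minMillis = minMillis;
        this.maxMillis = maxMillis;
        this.currentMillis = minMillis;
    }

    /** Call when no event is available: returns the wait to use now and doubles the next one, capped at maxMillis. */
    long nextIdleSleepMillis() {
        long sleep = currentMillis;
        // Compare against maxMillis / 2 so the doubling cannot overflow.
        currentMillis = (currentMillis > maxMillis / 2) ? maxMillis : currentMillis * 2;
        return sleep;
    }

    /** Call after an event was emitted: the next idle wait starts from minMillis again. */
    void reset() {
        currentMillis = minMillis;
    }
}
```

In nextTuple(), the else branch would then call Utils.sleep(backoff.nextIdleSleepMillis()) instead of Utils.sleep(1000), and the emitting branch would call backoff.reset(). The bounds (for example 1 ms and 50 ms) are assumptions that would need tuning.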
