Hi Nicolas,

Thank you for your reply.

I will wait for SAMOA-65 to become available. In the meantime, I tried a
dataset with 100 numerical and 100 nominal attributes generated with
RandomTreeGenerator, but that did not scale either: the throughput again
remained at 50 Mbytes/sec all the way up to 32 VMs.

By the way, does the following scaling policy look good to you? Can I
assume that changing the parallelism of LocalStatisticsProcessor is the
only way to scale VHT, or is there another processor whose parallelism I
should change? (A sketch of how I set this up now follows below.)

> * Scaling policy: assign one core per LocalStatisticsProcessor
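For reference, here is a minimal sketch of how I understand the wiring;
this is not verbatim SAMOA code. I am assuming the
TopologyBuilder.addProcessor(Processor, int) overload and
connectInputKeyStream() behave as I remember, and statsParallelism is my
own variable (one core per LocalStatisticsProcessor, i.e., 2 per m4.large
VM):

import org.apache.samoa.core.Processor;
import org.apache.samoa.topology.Stream;
import org.apache.samoa.topology.TopologyBuilder;

// Sketch only: replicate the local-statistics stage and key-group the
// attribute stream so that each replica owns a fixed slice of attributes.
// wireLocalStats and statsParallelism are my names, not SAMOA's.
void wireLocalStats(TopologyBuilder builder, Stream attributeStream,
    Processor localStats, int statsParallelism) {
  builder.addProcessor(localStats, statsParallelism);
  builder.connectInputKeyStream(attributeStream, localStats);
}

On the command line I currently just raise the VHT parallelism option
(-p), which, if I read VerticalHoeffdingTree.java correctly, is the value
that ends up as the parallelism hint of the local statistics processing
items.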
Regards,
Shigeru

On 5/2/2017 10:25 AM, Nicolas Kourtellis wrote:
> Hi Shigeru,
>
> Thank you for your interest in the VHT algorithm and SAMOA. A couple of
> brief comments at first glance:
>
> - The Kafka connector in question was not thoroughly tested, which is
> why it has not yet been merged into the main branch. Some teams we are
> aware of are currently working on a proposed new connector, as you can
> see from this new open issue:
> https://issues.apache.org/jira/browse/SAMOA-65
>
> - Indeed, when we tested VHT with a small set of attributes, the benefit
> of more resources was not obvious, especially in throughput. Only when
> we scaled the problem out to thousands of attributes did adding more
> resources pay off.
>
> Hope this helps,
>
> Nicolas
>
> On Mon, May 1, 2017 at 10:35 PM, Shigeru Imai <[email protected]> wrote:
>
>> Hello,
>>
>> I am testing the scalability of the Vertical Hoeffding Tree (VHT) on
>> SAMOA-Storm, consuming streams from Kafka. So far, I have tested up to
>> 32 m4.large VMs on Amazon EC2; however, throughput hardly improves at
>> all: Storm consumes streams from Kafka at 30 Mbytes/sec with 1 VM, and
>> this throughput stays almost the same up to 32 VMs.
>>
>> Here are the experimental settings:
>> * SAMOA: latest on GitHub as of April 2017
>> * Storm: version 0.10.1
>> * Dataset: forest covertype (54 attributes,
>> https://archive.ics.uci.edu/ml/datasets/Covertype)
>> * Kafka connector: implementation proposed for SAMOA-40
>> (https://github.com/apache/incubator-samoa/pull/32)
>> * Scaling policy: assign one core per LocalStatisticsProcessor
>> * Tested with Prequential Evaluation
>>
>> I read the Vertical Hoeffding Tree paper from IEEE BigData 2016, but I
>> could not find information on how the throughput of VHT scales when
>> more resources are added (the paper only reports performance
>> improvements relative to the standard Hoeffding tree).
>>
>> Has anyone scaled VHT successfully, with or without Kafka? Are there
>> any tips for achieving high throughput with VHT?
>> I believe that datasets with more attributes lead to better scalability
>> for VHT, so I am thinking of trying that next, but 54 attributes should
>> still scale at least a little.
>>
>> Also, I found the following one-second sleep in
>> StormEntranceProcessingItem.java. It looks to me like this hinders
>> high-throughput processing. Can we get rid of this sleep?
>>
>> public void nextTuple() {
>>   if (entranceProcessor.hasNext()) {
>>     Values value = newValues(entranceProcessor.nextEvent());
>>     collector.emit(outputStream.getOutputId(), value);
>>   } else {
>>     Utils.sleep(1000);
>>   }
>>   // StormTupleInfo tupleInfo = tupleInfoQueue.poll(50,
>>   //     TimeUnit.MILLISECONDS);
>>   // if (tupleInfo != null) {
>>   //   Values value = newValues(tupleInfo.getContentEvent());
>>   //   collector.emit(tupleInfo.getStormStream().getOutputId(), value);
>>   // }
>> }
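Following up on my own snippet above: since Storm invokes nextTuple() in
a loop, I would not remove the back-off entirely but only shrink it. Here
is a minimal sketch of what I intend to try, with the 1 ms value being my
guess rather than a tested constant (Utils is the same
backtype.storm.utils.Utils used in the original):

public void nextTuple() {
  if (entranceProcessor.hasNext()) {
    Values value = newValues(entranceProcessor.nextEvent());
    collector.emit(outputStream.getOutputId(), value);
  } else {
    // Back off briefly so an idle spout does not spin a core, but do
    // not stall the pipeline for a full second on every empty poll.
    Utils.sleep(1); // 1 ms back-off; tune as needed
  }
}

Alternatively, restoring the commented-out tupleInfoQueue path with its
bounded 50 ms poll (plus InterruptedException handling) should have a
similar effect.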
>> Any suggestions would be appreciated.
>>
>> Thank you,
>> Shigeru
>>
>> --
>> Shigeru Imai <[email protected]>
>> Ph.D. candidate
>> Worldwide Computing Laboratory
>> Department of Computer Science
>> Rensselaer Polytechnic Institute
>> 110 8th Street, Troy, NY 12180, USA
>> http://wcl.cs.rpi.edu/
