Re: [racket-users] Re: Decision Tree in Racket - Performance

Zelphir Kaltstahl Thu, 27 Jul 2017 13:02:26 -0700

On Wednesday, July 26, 2017 at 3:17:46 PM UTC+2, gustavo wrote:
> I read the solution of Daniel Prager and I have a few minor changes.
> 
> * First I added a test that repeats `build-tree` 20 times, so the run
> time is approximately 1-2 seconds. This is not necessary, but times
> smaller than 1 second are sometimes not reliable. I'm using:
> ;---
> (random-seed 12345)
> (define data2 (shuffle banknote-data))
> (time
>  (void
>   (build-tree (take data2 274) 5 10)))
> (time
>  (for ([i (in-range 20)])
>    (build-tree (take data2 274) 5 10)))
> ;---
> 
> 
> 
> * Second I modified the split function so it is usable in Windows and
> Linux. I'm using:
>     (string-split s #rx"\r?\n")
> I'm almost sure this is not the best method to write portable code.
> Any recommendation?
> 
> 
> 
> * The most important change is to add `in-list` everywhere, i.e. replace
>     (for/sum ([split splits]) <body> ...)
> with
>     (for/sum ([split (in-list splits)]) <body> ...)
> 
> I'm not sure that all of them are necessary. The `for` clauses without
> `in-list` are nice because they are very generic and can iterate over
> a lot of different data types. The problem is that they create a
> auxiliary sequence and use the methods of the sequence to iterate, so
> they are slower than expected. If you use `in-list`, the `for` is
> expanded to code that is specific to lists and is almost as efficient
> as the hand coded version like (let loop ([l my-list]) ...) and
> sometimes better.
> 
> Most of the times you can use the generic version and use the code
> with any data type for free, but in the spots where you need fast code
> remember to use in-range, in-list, in-vector, ... (For the version
> that uses vectors for internal representation, you should probably use
> in-vector. I didn't look at the code.)
> 
> Before this change a typical run time of the test is
>    cpu time: 157 real time: 165 gc time: 0
>    cpu time: 3390 real time: 3393 gc time: 63      (20 times)
> 
> After this change a typical run time of the test is
>    cpu time: 62 real time: 61 gc time: 0
>    cpu time: 1266 real time: 1277 gc time: 62     (20 times)
> 
> Gustavo
> 
> PS: I made a PR with these changes in github.


Wow, are those timings for the "big" data set?!

This is about 20x faster than the vector solution. I wonder where I am loosing 
so much time. Still not finished implementing everything, but I can build a 
tree now with the vector solution. One guess I have is, that I am currently 
(partly for debugging purposes) storing subsets of data in each node of the 
tree, instead of only the split value. Another is, that I am using structs 
(instead of minimalistic lists? tagged data?) to represent nodes and they are 
constructed etc.

But other than that I have no clue what is using up so much time. Maybe I can 
time every procedure in DrRacket somehow and see where the most time is spend.

I am also curious about the Typed Racket version and consider doing this once I 
am done with the vector solution as well.

I don't really understand the (in-list ...) thing. This seems to be internal 
magic to me. Maybe it informs about the type of what is iterated over and that 
makes it faster?

When I tried I could not run @Daniel's solution. There was some kind of error. 
I am running Racket 6.8. Should I update to 6.9, could that be the issue?

-- 
You received this message because you are subscribed to the Google Groups 
"Racket Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to racket-users+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: [racket-users] Re: Decision Tree in Racket - Performance

Reply via email to