On Sun, Jul 23, 2017 at 7:30 PM, Zelphir Kaltstahl
<zelphirkaltst...@gmail.com> wrote:
> Hi Racket Users,
>
> The last few days I've been working on implementing decision trees in Racket 
> and I've been following the following guide: 
> http://machinelearningmastery.com/implement-decision-tree-algorithm-scratch-python/
>
> Now I have the following code: https://github.com/ZelphirKaltstahl/racket-ml
>
> I also wrote some tests, I think for every procedure so far.
>
> However, my implementation seems very very slow. It seems each iteration of 
> `iter-features` takes way too much time.
>
> I've tried to stick to the guide and sometimes "outsourced" some procedure.
>
> I started out with using vectors, as I thought I might gain better 
> performance than from lists. In the code I introduced an abstraction layer, 
> which provides things like `data-length`, so that I could in theory change 
> the representation of data and only change those accessors/getters. In the 
> test cases I sometimes did not use the abstraction though.
>
> So far I am not having much side effects in the code and I'd like to avoid 
> them and unsafe operations.
>
> A small `TEST-DATA` set is in the code and another data set I downloaded from 
> the data set repositories. When running with `TEST-DATA` to calculate the 
> best split, it only takes a few milliseconds, while it takes minutes with the 
> other `data-set`.
>
> How can I make my code more efficient, without changing the basic logic of it?
> Should I not use vectors (what else?)?
> Would I gain anything from using typed Racket or flonums?
>


I haven't taken a close enough look to evaluate the algorithm itself,
but on a micro-scale:

- Using `vector-take-right` to get the tail of a vector in a loop is
expensive. From what I can see, you'll be better off representing your
data as a list of vectors. (I see you have a `data-get-row` function,
suggesting a need for random access, but you don't appear to be using
it.)
- Use a struct instead of a hash to represent a split.
- `split-data` could partition the data list in a single pass, instead
of making two passes. (You can use the `partition` function from
racket/list.)
- If I'm reading this right, for a given data set, you should be able
to memoize calls to `data-get-col`.
- From a quick glance, it looks like you're already using floats,
rather than exact rationals.
- Are you running this code inside DrRacket? If so, have you timed the
difference between running it with debugging enabled and with no
debugging or profiling? (Language -> Choose Language... -> Details)

-- 
You received this message because you are subscribed to the Google Groups 
"Racket Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to racket-users+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Reply via email to