On 20 November 2014 10:05, Greg Lee <[email protected]> wrote:
>
> Is there a faster way to do the following, which builds a dictionary of
> unique tokens and counts?
I share your frustration regarding this. It should be mentioned
though that converting tokens to integers is a fairly standard
performance hack in Natural Language Processing, even for C/C++ code.
I did exactly this for the syntacto-semantic parser I mentioned during
my talk at JuliaCon, and at least in Julia it is fairly easy to
implement nice types that handle the token-to-id mapping and vice versa:
https://github.com/ninjin/allen/blob/master/src/structs.jl
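
To give a rough idea of the technique, here is a minimal sketch. It is
not the code from the repository above: the names `Vocab`, `id!`,
`token` and `count_tokens` are made up for this example, and it assumes
a recent Julia version. Each token is interned to an integer once, and
the counting then indexes a plain array instead of hashing a string
for every update:

```julia
# Hypothetical two-way token <-> integer id mapping (illustration only).
struct Vocab
    ids::Dict{String,Int}     # token -> id
    tokens::Vector{String}    # id -> token
end

Vocab() = Vocab(Dict{String,Int}(), String[])

# Return the id for `tok`, assigning the next free id the first time it is seen.
function id!(v::Vocab, tok::AbstractString)
    get!(v.ids, tok) do
        push!(v.tokens, tok)
        length(v.tokens)
    end
end

# Reverse lookup: id -> token.
token(v::Vocab, id::Integer) = v.tokens[id]

# Count tokens by id: counts[i] is the frequency of token(v, i) in `toks`.
function count_tokens(v::Vocab, toks)
    counts = zeros(Int, length(v.tokens))
    for t in toks
        i = id!(v, t)
        i > length(counts) && push!(counts, 0)   # a new id is always length + 1
        counts[i] += 1
    end
    counts
end

v = Vocab()
counts = count_tokens(v, split("a b a c b a"))
```

Whether this is actually faster than counting directly in a
Dict{String,Int} depends on the data and the Julia version, so it is
worth benchmarking on your own corpus.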
I also agree that we should improve Julia when it comes to the
performance of strings and dictionaries, but for now I am waiting for
the upcoming major string code overhaul.
Pontus