I would be more concerned about style than speed -- using symbols as
strings is an ancient Lisp technique in NLP, but IMO a Dict of strings
would be better style.
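
A minimal sketch of the Dict approach, assuming Julia 1.x syntax. The
helper name `intern!` is hypothetical (it is not part of Base or any
package); it only illustrates keeping one canonical copy per distinct
token:

```julia
# Hypothetical helper: return the canonical copy of a token, storing it
# in `pool` the first time it is seen.
function intern!(pool::Dict{String,String}, s::AbstractString)
    key = String(s)              # turn a SubString etc. into a String
    return get!(pool, key, key)  # existing canonical copy, or store this one
end

pool = Dict{String,String}()
tokens = [intern!(pool, t) for t in split("the cat saw the other cat")]

# Repeated tokens point at the same underlying String data.
@assert pointer(tokens[1]) == pointer(tokens[4])
@assert length(pool) == 4        # the, cat, saw, other
```

Each call still allocates a temporary String for the lookup, but only
one copy per distinct token stays referenced, so the duplicates can be
garbage collected.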

Also see http://juliastats.github.io/DataFrames.jl/stable/man/pooling/ .

Best,

Tamas

On Fri, Apr 22 2016, Lyndon White wrote:

> When tokenizing large files,
> it is normal to end up with many copies of the same string.
>
> Normal Julia strings are not interned,
> which means that if you accumulate a large list of tokens,
> you end up duplicating a lot of strings, which uses unnecessary memory.
>
> When you are tokenizing documents that are multiple gigabytes long,
> this really adds up.
>
>
> `Symbol`s *are* interned.
> Are there any downsides to using them when an interned string is required?
>
> I tried testing them for this a while ago, and got huge improvements in
> memory use, and thus also in speed (allocating memory is expensive).
>
> There are no `convert` methods defined for switching between symbols and
> strings, but `string(::Symbol)` and `symbol(::AbstractString)` work.
