On Friday, May 27, 2016 at 6:30:45 AM UTC-5, Camilo Roca wrote:
>
> If what I guess is right, the number of chars that exist is finite,
>

Well, kind of - see Unicode.
 

> thus Clojure treats them like a "pool of chars".
>

Java has a small number of primitive value types - byte, short, int, long, 
double, float, boolean, and char. In the JVM there is no "pool of 
primitives" - all of these types (plus Objects) describe how values are 
actually stored. If you had a "pool" then something would have to refer to 
it and the pointer would be larger than the value in many cases.

Several of the boxed versions of these primitives do pool the boxed forms 
in a range. Character.valueOf caches 0-127. Integer.valueOf caches -128 to 
127 by default, and the upper bound of that pool can be extended. This is 
defined in the Java language spec and class library, so it is a Java 
feature, not a JVM one.
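
A minimal sketch of how the boxed caches can be observed, assuming a stock 
JDK with default settings. The class and method names (BoxedCacheDemo, 
sameBoxedChar, sameBoxedInt) are made up for illustration; Character.valueOf 
and Integer.valueOf are the standard boxing entry points. Identity 
comparison (==) is used only to make the cache visible; real code should 
compare boxed values with equals().

```java
public class BoxedCacheDemo {
    // True when boxing the same char twice yields the very same Character object.
    static boolean sameBoxedChar(char c) {
        Character a = Character.valueOf(c);
        Character b = Character.valueOf(c);
        return a == b; // identity comparison, on purpose
    }

    // True when boxing the same int twice yields the very same Integer object.
    static boolean sameBoxedInt(int i) {
        Integer a = Integer.valueOf(i);
        Integer b = Integer.valueOf(i);
        return a == b; // identity comparison, on purpose
    }

    public static void main(String[] args) {
        // Inside the documented cache ranges: the same object comes back.
        System.out.println("char 'a' shared: " + sameBoxedChar('a'));
        System.out.println("int 100 shared:  " + sameBoxedInt(100));
        // Outside the default range the result is not guaranteed either way,
        // so it is printed here rather than relied upon.
        System.out.println("int 100000 shared: " + sameBoxedInt(100_000));
    }
}
```

Only the first two results are guaranteed by the library documentation. 
The third depends on the JDK and its settings, which is exactly why code 
should not depend on boxed identity.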
 

> The question is then why strings are not implemented as vectors of chars 
> instead of using the underlying Java String class? 
>

Java strings are stored as arrays of chars (hand-waving over the actual 
tricky details of multi-byte character encoding), which is about as 
efficient as you can get in the JVM. (I once implemented a hack to String 
itself to compress and decompress big char arrays. That was fun, and 
totally not worth it.) Clojure vectors are certainly not as memory 
efficient as a simple array: they are a tree structure of nodes containing 
arrays (a 32-way trie, in the same family as the hash array mapped tries 
used for maps), and a general vector would also hold each char as a boxed 
Character rather than a primitive.
 

> With the Java String, a new allocation has to be performed every time 
> a new string is created, even if it contains exactly the same 
> information as an existing string.
>

True, although these operations (being so common) are highly optimized in 
the JVM and even in the garbage collector. Recent versions of G1 can 
detect duplicate strings on the heap and de-dupe their backing char 
arrays. In short, the JVM is optimizing strings way more than we ever 
would, and the best thing to do is to leverage the common path there and 
assume it will continue to be as fast as possible.

Java strings can also be interned, which forces a string to be effectively 
cached in the JVM and reused. You can do that explicitly via 
String.intern(), and literal strings in your source code are always 
interned. There are some interesting effects with interning though, so 
it's good to tread carefully with it. We used to use it more inside 
Clojure but have backed off, as it makes allocation slower.
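
A short sketch of the interning behavior described above. The class and 
method names (InternDemo and friends) are hypothetical; String.intern() is 
the standard method. Identity comparison (==) is again used only to show 
which String objects are shared.

```java
public class InternDemo {
    // Two identical literals resolve to one interned String object.
    static boolean literalsShareOneObject() {
        String a = "clojure";
        String b = "clojure";
        return a == b;
    }

    // A string built at runtime is a separate object with equal contents.
    static boolean runtimeCopyIsDistinct() {
        String literal = "clojure";
        String copy = new String(literal.toCharArray());
        return copy != literal && copy.equals(literal);
    }

    // intern() maps the runtime copy back to the canonical (literal) object.
    static boolean internReturnsCanonical() {
        String literal = "clojure";
        String copy = new String(literal.toCharArray());
        return copy.intern() == literal;
    }

    public static void main(String[] args) {
        System.out.println(literalsShareOneObject());
        System.out.println(runtimeCopyIsDistinct());
        System.out.println(internReturnsCanonical());
    }
}
```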
 

> This might not be a big deal if the number of strings is small or they 
> are lazily produced, but (at least in my case) when I needed to load a 
> relatively small text file (120 MB) fully into memory* I started having 
> memory problems, as (from what I saw with YourKit) I had lots of 
> repeated strings.
>

Interning might be worth looking at if you're seeing this, in particular 
for cases like reading a file into maps where the map keys are common: 
those keys can be interned.
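
As a rough sketch of that idea, assuming a made-up line format of 
"key=value;key=value" (the format, the class name InternedKeysDemo and its 
methods are all illustrative, not something from the original question): 
each key is interned as it is read, so every row's map points at one 
shared String per distinct key instead of holding its own copy.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InternedKeysDemo {
    // Parses each line into a map, interning the keys so they are shared across rows.
    static List<Map<String, String>> parse(List<String> lines) {
        List<Map<String, String>> rows = new ArrayList<>();
        for (String line : lines) {
            Map<String, String> row = new HashMap<>();
            for (String field : line.split(";")) {
                String[] kv = field.split("=", 2);
                if (kv.length < 2) {
                    continue; // skip malformed fields in this sketch
                }
                row.put(kv[0].intern(), kv[1]); // values are left as-is
            }
            rows.add(row);
        }
        return rows;
    }

    // Returns the actual key object stored in the map, or null if absent.
    static String keyObject(Map<String, String> row, String key) {
        for (String k : row.keySet()) {
            if (k.equals(key)) {
                return k;
            }
        }
        return null;
    }

    // True when every row that has the key holds the very same String object for it.
    static boolean keysAreShared(List<String> lines, String key) {
        String first = null;
        for (Map<String, String> row : parse(lines)) {
            String k = keyObject(row, key);
            if (k == null) {
                continue;
            }
            if (first == null) {
                first = k;
            } else if (first != k) {
                return false;
            }
        }
        return first != null;
    }
}
```

Whether this pays off depends on how many distinct keys there are compared 
to rows. Interning the values as well only makes sense if they repeat 
heavily too.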

You might also try the G1 string deduplication with something 
like -XX:+UseG1GC -XX:+UseStringDeduplication 
-XX:+PrintStringDeduplicationStatistics (the last one just prints the 
effects). Search around for more info on this and make sure you're on the 
latest JDK, as the feature has evolved.
 

>
> I would like to know your thoughts on this idea and implications/problems 
> with it.
>
>
> ---------------------------------------------------------------------- 
> notes
> * yes I know that I could lazily analyze the whole file and thus avoid 
> having memory problems, but in some cases such as using sort-by or 
> group-by, there is no other alternative than holding the whole thing and 
> then process it.
>
>

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en