On Friday, May 27, 2016 at 6:30:45 AM UTC-5, Camilo Roca wrote:
>
> If what I guess is right, the amount of chars that exist are finite,
>
Well, kind of - see Unicode.

> thus Clojure treats them like a "pool of chars".
>

Java has a small number of primitive value types - byte, short, int, long, double, float, boolean, and char. In the JVM there is no "pool of primitives" - all of these types (plus Objects) describe how values are actually stored. If you had a "pool", something would have to refer to it, and the pointer would be larger than the value in many cases.

Several of the boxed versions of these primitives do pool the boxed forms in a range. Character is pooled for 0-127. Integer is pooled for -128 to 127 by default, and that range can be extended. This is all defined in the Java spec, so it is a Java feature, not a JVM one.

> The question is then why are not strings implemented as vectors of chars
> instead of using the underlying Java String class?
>

Java strings are stored as arrays of chars (hand-waving over the actual tricky details of multi-byte character encoding), which is about as efficient as you can get in the JVM. (I once implemented a hack to String itself to compress and decompress big char arrays - that was fun! Totally not worth it.) Clojure vectors are certainly not as memory efficient as a simple array - they are a tree structure of nodes containing arrays (a 32-way bit-partitioned trie, a close relative of the hash array mapped tries used for Clojure's maps).

> As by using the Java String new allocations would have to be performed
> every time that a new string needs to be created, even if it contains
> exactly the same information of an existing string.
>

True, although these operations (being so common) are highly optimized in the JVM and even in the garbage collector. In the latest versions, G1 can detect duplicate strings and de-dupe them when string deduplication is enabled. In short, the JVM is optimizing strings way more than we ever would, and the best thing to do is to leverage the common path there and assume it will continue to be as fast as possible.
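To make the boxed-value pooling mentioned above a bit more concrete, here is a minimal Java sketch. It only uses Character.valueOf and Integer.valueOf, whose documentation guarantees caching for chars 0-127 and ints -128 to 127. The class and method names are made up for illustration, and nothing is claimed about values outside those ranges, since that is implementation dependent.

```java
// Sketch only: BoxedCacheDemo and its methods are hypothetical names.
final class BoxedCacheDemo {

    // True when two independent valueOf calls hand back the very same object.
    static boolean sameBoxedChar(char c) {
        return Character.valueOf(c) == Character.valueOf(c);
    }

    // Same identity check for boxed ints.
    static boolean sameBoxedInt(int i) {
        return Integer.valueOf(i) == Integer.valueOf(i);
    }

    public static void main(String[] args) {
        // Inside the guaranteed cache ranges: identical objects are returned.
        System.out.println("'A' shared: " + sameBoxedChar('A'));
        System.out.println("127 shared: " + sameBoxedInt(127));

        // Outside the guaranteed ranges the answer depends on the JVM and its
        // settings, so these are printed without any expectation.
        System.out.println("char 1000 shared: " + sameBoxedChar((char) 1000));
        System.out.println("100000 shared: " + sameBoxedInt(100_000));
    }
}
```

Note that this pooling only concerns the boxed objects. A plain char inside a char[] or a String is stored directly as a value, so there is nothing to pool.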
Java strings can also be interned, which forces a string to be effectively cached in the JVM and reused - you can explicitly do that via String.intern(), and literal strings in your source code are always interned. There are some interesting effects with interning though, so it's good to tread carefully with it - we used to use it more inside Clojure but have backed off of it as it makes allocation slower.

> This might not be a big deal if the amount of strings is small or lazily
> produced, but (at least in my case) when I needed to load a relatively
> small text file (120 MB) fully into memory* then I started having memory
> problems as (from what I saw with YourKit) I had lots of repeated strings.
>

Interning might be worth looking at if you're seeing this, in particular for cases like reading a file into maps where the map keys are common - those keys can be interned. You might also try the G1 string deduplication with something like -XX:+UseG1GC -XX:+UseStringDeduplication -XX:+PrintStringDeduplicationStatistics (the last flag prints statistics so you can see the effects). Google around for more info on this and make sure you're on the latest JDK, as it has evolved.

>
> I would like to know your thoughts on this idea and implications/problems
> with it.
>
>
> ----------------------------------------------------------------------
> notes
> * yes I know that I could lazily analyze the whole file and thus avoid
> having memory problems, but in some cases such as using sort-by or
> group-by, there is no other alternative than holding the whole thing and
> then process it.
>
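As a rough illustration of the "intern the common map keys" suggestion above, here is a small Java sketch. The record format (key=value pairs separated by ';') and all class and method names are assumptions made up for the example. Only String.intern() and the standard collections are real API.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: InternKeysDemo and the "key=value;key=value" format are hypothetical.
final class InternKeysDemo {

    // Parses one record into a map, interning the keys but not the values.
    static Map<String, String> parseRecord(String line) {
        Map<String, String> record = new HashMap<>();
        for (String pair : line.split(";")) {
            int eq = pair.indexOf('=');
            if (eq < 0) {
                continue; // skip malformed pairs
            }
            // substring allocates a fresh String; intern() swaps it for the
            // single canonical instance shared by every record.
            String key = pair.substring(0, eq).intern();
            record.put(key, pair.substring(eq + 1));
        }
        return record;
    }

    // Returns the actual key object stored in the map for the given name.
    static String keyObject(Map<String, String> record, String name) {
        for (String k : record.keySet()) {
            if (k.equals(name)) {
                return k;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, String> a = parseRecord("name=Ada;city=London");
        Map<String, String> b = parseRecord("name=Alan;city=Wilmslow");
        // Both maps hold the same key object, not two equal copies.
        System.out.println(keyObject(a, "name") == keyObject(b, "name")); // prints true
    }
}
```

The values are deliberately left alone here. Interning everything costs time on every allocation, so it tends to pay off only for strings that repeat heavily, such as keys.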
---
You received this message because you are subscribed to the Google Groups "Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an email to clojure+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.