gortiz commented on issue #12078:
URL: https://github.com/apache/pinot/issues/12078#issuecomment-1882702995

   > Why do we need to store strings? We should probably use byte array right 
and avoid creating string in the first place?
   
   I think that is something we need to explore in the longer term. We would 
reduce GC pressure considerably if we did that. 
   
   > Just to add some context, the reason why this was added in the first place 
was the fact that for certain workloads, byte -> String de-serialization was 
becoming the bottleneck.
   
   Sure, that is something we need to take into account, and we need to be 
careful with the implementation. This is especially problematic when strings 
are not [normalized](https://en.wikipedia.org/wiki/Unicode_equivalence). What I 
did in the past was to use a `Str` class that has two attributes: a ByteBuffer 
and a String. When the `Str` is built from IO buffers, the ByteBuffer is set to 
the slice and the String is set to null. When a materialization is needed (for 
example, because the IO buffer will be released or we need to compare the 
strings), a `materialize()` method is called. That method initializes the 
String, and from that moment on the String is always used. By doing so we can 
skip the String creation (and therefore the heap allocation) in almost all 
cases where the `Str` is not used as an aggregation key.
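
   A minimal sketch of that idea, assuming the bytes are UTF-8 encoded and the 
object is used from a single thread. The names `fromBuffer` and 
`isMaterialized` are illustrative only; this is not an existing Pinot class.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical lazily materialized string (illustrative, not a Pinot API). */
final class Str {
  private ByteBuffer bytes; // slice of the IO buffer; null once materialized
  private String value;     // null until materialize() is called

  private Str(ByteBuffer bytes) {
    this.bytes = bytes;
  }

  /** Wraps a region of an IO buffer without copying or decoding it. */
  static Str fromBuffer(ByteBuffer buffer, int offset, int length) {
    ByteBuffer view = buffer.duplicate(); // independent position and limit
    view.limit(offset + length);
    view.position(offset);
    return new Str(view.slice());
  }

  /** Decodes the bytes once (UTF-8 assumed) and uses the String from then on. */
  String materialize() {
    if (value == null) {
      value = StandardCharsets.UTF_8.decode(bytes).toString();
      bytes = null; // drop the reference so the IO buffer can be released
    }
    return value;
  }

  boolean isMaterialized() {
    return value != null;
  }
}
```

   Callers that only pass the value through never pay for the decode; only the 
code paths that compare values or outlive the IO buffer call `materialize()`.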
   
   > We did try the GC optimizations with -XX:+UseStringDeduplication (and 
others) but noticed elevated CPU usage affecting our query latencies.
   
   I may be wrong, but dictionaries are bound to the query lifetime, right? I 
mean, we create the dictionary when the segment is being queried and do not 
reuse it in subsequent queries. If that is the case, String Deduplication won't 
be useful at all because it only applies to Strings in the old generation, 
which query-scoped Strings would rarely reach.
   
   > I would recommend against using String.intern, see an authoritative source 
[here](https://shipilev.net/jvm/anatomy-quarks/10-string-intern/), which 
recommends manual interning over use of String.intern.
   
   I'm with Richard here. My experience with String.intern is bad. It is just 
better to use our own structure to intern Strings. Something as simple as a 
Guava Cache is usually better than String.intern.
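
   As a rough sketch of what manual interning could look like, here is a 
small pool backed by a plain `ConcurrentHashMap` rather than a Guava Cache, so 
that it only needs the JDK. `StringInterner` is a hypothetical name, and the 
pool is unbounded on the assumption that one instance lives only as long as a 
single query.

```java
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical query-scoped interner (illustrative, not a Pinot API). */
final class StringInterner {
  private final ConcurrentHashMap<String, String> pool = new ConcurrentHashMap<>();

  /** Returns the canonical instance for s, registering s if it is new. */
  String intern(String s) {
    String existing = pool.putIfAbsent(s, s);
    return existing != null ? existing : s;
  }

  int size() {
    return pool.size();
  }
}
```

   Because the pool is an ordinary object, it becomes garbage together with 
whatever owns it, which is not the case for the JVM-wide table behind 
String.intern.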
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
