[
https://issues.apache.org/jira/browse/IMPALA-12373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17755553#comment-17755553
]
Zoltán Borók-Nagy commented on IMPALA-12373:
--------------------------------------------
I think we don't need NULL termination, so with libc++'s technique we can
actually store 11 chars inline.
I uploaded a simple implementation that works on little-endian architectures:
[^small_string.cpp]
It uses the following representations:
{noformat}
static constexpr int SMALL_LIMIT = 11;

struct SmallStringRep {
  char buf[SMALL_LIMIT];
  char len;
};

struct __attribute__((__packed__)) LongStringRep {
  char* ptr;
  unsigned int len;
};

static_assert(sizeof(SmallStringRep) == sizeof(LongStringRep));

union {
  SmallStringRep small_rep;
  LongStringRep long_rep;
} rep;
{noformat}
The small-string indicator bit is stored in the MSB of the last byte
(small_rep.len). This works on little-endian architectures because that bit is
also the MSB of long_rep.len. On big-endian architectures we would still use the
last byte, but there we would have to use the LSB of small_rep.len, which is
also the LSB of long_rep.len.
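To make the byte overlap concrete, here is a minimal sketch that checks it, assuming a 64-bit little-endian target and GCC or Clang (for the packed attribute). The helper names LastByteOfLongRep and CheckLongLengthsLeaveFlagClear are made up for this illustration and are not part of the attached small_string.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

static constexpr int SMALL_LIMIT = 11;

struct SmallStringRep {
  char buf[SMALL_LIMIT];
  char len;
};

struct __attribute__((__packed__)) LongStringRep {
  char* ptr;
  unsigned int len;
};

// Both representations must occupy the same 12 bytes (assumes 8-byte pointers).
static_assert(sizeof(SmallStringRep) == sizeof(LongStringRep),
              "representations must have the same size");
// small_rep.len is the very last byte of the object.
static_assert(offsetof(SmallStringRep, len) == sizeof(SmallStringRep) - 1,
              "small length must be the last byte");
// long_rep.len ends exactly at the end of the object.
static_assert(offsetof(LongStringRep, len) + sizeof(unsigned int) ==
                  sizeof(LongStringRep),
              "long length must end at the last byte");

// Hypothetical helper: returns the last byte of a LongStringRep whose
// length field holds `len`. On little-endian this is the most significant
// byte of `len`.
inline unsigned char LastByteOfLongRep(unsigned int len) {
  LongStringRep rep;
  rep.ptr = nullptr;
  rep.len = len;
  unsigned char bytes[sizeof(rep)];
  std::memcpy(bytes, &rep, sizeof(rep));
  return bytes[sizeof(rep) - 1];
}

// Hypothetical helper: checks that the smallest and the largest length below
// 2^31 leave the indicator bit (MSB of the last byte) clear.
inline void CheckLongLengthsLeaveFlagClear() {
  assert((LastByteOfLongRep(0) & 0x80) == 0);
  assert((LastByteOfLongRep(0x7FFFFFFFu) & 0x80) == 0);
}
```

On a big-endian target the last byte would instead hold the low 8 bits of long_rep.len, which is why the indicator would have to move to the LSB there.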
We can spare one bit of the length because Impala puts a 2GB hard limit on string length:
[https://impala.apache.org/docs/build/html/topics/impala_string.html]
(Otherwise we could swap the order of ptr and len in LongStringRep and use the
highest bit of ptr, which is unused on 64-bit architectures.)
On little-endian we can get the length with masking:
{noformat}
bool is_small() {
  return rep.small_rep.len & 0b10000000;
}

int len() {
  if (is_small()) {
    return rep.small_rep.len & 0b01111111;
  } else {
    return rep.long_rep.len;
  }
}
{noformat}
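The accessors above only show the read side. A minimal sketch of how the write side could look on little-endian follows. The class name SmallStringSketch and the Assign and data members are assumptions made for this example, not necessarily what the attachment does. Like the snippet above, it reads the inactive union member, which GCC and Clang support as an extension.

```cpp
#include <cassert>
#include <cstring>

static constexpr int SMALL_LIMIT = 11;

struct SmallStringRep {
  char buf[SMALL_LIMIT];
  char len;
};

struct __attribute__((__packed__)) LongStringRep {
  char* ptr;
  unsigned int len;
};

// Hypothetical wrapper; assumes a 64-bit little-endian target.
class SmallStringSketch {
 public:
  // Copies short strings inline; long strings only reference `s`
  // (no ownership is taken in this sketch).
  void Assign(char* s, unsigned int n) {
    assert(n < 0x80000000u);  // lengths must stay below 2^31
    if (n <= SMALL_LIMIT) {
      if (n > 0) std::memcpy(rep_.small_rep.buf, s, n);
      // Set the indicator bit (MSB of the last byte) next to the length.
      rep_.small_rep.len = static_cast<char>(n | 0b10000000);
    } else {
      rep_.long_rep.ptr = s;
      // n < 2^31, so the MSB of the last byte stays clear.
      rep_.long_rep.len = n;
    }
  }

  bool is_small() const { return rep_.small_rep.len & 0b10000000; }

  int len() const {
    if (is_small()) return rep_.small_rep.len & 0b01111111;
    return static_cast<int>(rep_.long_rep.len);
  }

  const char* data() const {
    return is_small() ? rep_.small_rep.buf : rep_.long_rep.ptr;
  }

 private:
  union {
    SmallStringRep small_rep;
    LongStringRep long_rep;
  } rep_;
};
```

With this sketch, an 11-char string is stored inline, while a 12-char string falls back to the pointer representation.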
On big-endian we would get the length with bit-shifting.
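One possible reading of the bit-shifting variant, as a sketch only: store the length shifted left by one and keep the indicator in bit 0, so that on big-endian the flag lands in the LSB of the last byte. All helper names below are hypothetical, and the exact encoding is an assumption. The functions work on plain integer values, so they behave the same regardless of the host's endianness.

```cpp
#include <cassert>
#include <cstdint>

// Value stored in long_rep.len: length in bits 1..31, indicator bit 0 clear.
inline std::uint32_t EncodeLongLen(std::uint32_t len) {
  assert(len < 0x80000000u);  // needs one spare bit for the shift
  return len << 1;
}

// Value stored in small_rep.len: length in bits 1..7, indicator bit 0 set.
inline std::uint8_t EncodeSmallLen(std::uint8_t len) {
  assert(len <= 11);
  return static_cast<std::uint8_t>((len << 1) | 1);
}

// On big-endian, the last byte of the object holds the low 8 bits of the
// stored 32-bit length field.
inline std::uint8_t LastByteBigEndian(std::uint32_t stored) {
  return static_cast<std::uint8_t>(stored & 0xFF);
}

inline bool IsSmall(std::uint8_t last_byte) { return last_byte & 1; }

inline int DecodeSmallLen(std::uint8_t last_byte) { return last_byte >> 1; }

inline int DecodeLongLen(std::uint32_t stored) {
  return static_cast<int>(stored >> 1);
}
```

The shift costs one bit of range, which fits the same 2GB limit that the masking variant relies on.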
> Implement Small String Optimization for StringValue
> ---------------------------------------------------
>
> Key: IMPALA-12373
> URL: https://issues.apache.org/jira/browse/IMPALA-12373
> Project: IMPALA
> Issue Type: Improvement
> Reporter: Zoltán Borók-Nagy
> Priority: Major
> Attachments: small_string.cpp
>
>
> Implement Small String Optimization for StringValue.
> Current memory layout of StringValue is:
> {noformat}
> char* ptr; // 8 bytes
> int len; // 4 bytes
> {noformat}
> For small strings of up to 8 bytes we could store the string contents in the
> bytes of 'ptr'. Something like this:
> {noformat}
> union {
>   char* ptr;
>   char small_buf[sizeof(ptr)];
> };
> int len;
> {noformat}
> Many C++ string implementations use the {{Small String Optimization}} to
> speed up work with small strings, for example Microsoft STL, libstdc++,
> libc++, Boost, and Folly.