[ 
https://issues.apache.org/jira/browse/IMPALA-12373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17755553#comment-17755553
 ] 

Zoltán Borók-Nagy commented on IMPALA-12373:
--------------------------------------------

I think we don't need NULL termination, so we can actually store 11 chars with 
libc++'s technique.

I uploaded a simple implementation that works on little-endian architectures: 
[^small_string.cpp]

It uses the following representations:
{noformat}
  static constexpr int SMALL_LIMIT = 11;

  struct SmallStringRep {
    char buf[SMALL_LIMIT];
    char len;
  };
  
  struct __attribute__((__packed__)) LongStringRep {
    char* ptr;
    unsigned int len;
  };

  static_assert(sizeof(SmallStringRep) == sizeof(LongStringRep));

  union {
    SmallStringRep small_rep;
    LongStringRep long_rep;
  } rep;
{noformat}
The small string indicator bit is stored in the MSB of the last byte 
(small_rep.len). This works on little-endian architectures because that bit is 
also the MSB of long_rep.len. On big-endian architectures we would still use the 
last byte, but there we would have to use the LSB of small_rep.len (which is 
also the LSB of long_rep.len).
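
To illustrate why the two fields share the indicator bit, here is a minimal sketch that copies a LongStringRep into a byte array and returns its last byte. The helpers is_little_endian() and last_byte_for_long_len() are hypothetical names used only for this illustration (they are not taken from the attachment), and the sketch assumes a 64-bit little-endian target with GCC or Clang:

```cpp
#include <cassert>
#include <cstring>

static constexpr int SMALL_LIMIT = 11;

struct SmallStringRep {
  char buf[SMALL_LIMIT];
  char len;
};

struct __attribute__((__packed__)) LongStringRep {
  char* ptr;
  unsigned int len;
};

// Assumes 8-byte pointers: 11 + 1 == 8 + 4 == 12 bytes.
static_assert(sizeof(SmallStringRep) == sizeof(LongStringRep),
              "both representations must have the same size");

// Hypothetical helper: true if the lowest-addressed byte of an int holds
// its least significant bits.
inline bool is_little_endian() {
  unsigned int one = 1;
  unsigned char first_byte;
  std::memcpy(&first_byte, &one, 1);
  return first_byte == 1;
}

// Hypothetical helper: returns the last byte of the 12-byte representation
// when long_rep.len holds 'len'. This byte is where small_rep.len lives.
inline unsigned char last_byte_for_long_len(unsigned int len) {
  assert(is_little_endian());  // this sketch models only the little-endian case
  LongStringRep long_rep;
  long_rep.ptr = nullptr;
  long_rep.len = len;
  unsigned char bytes[sizeof(LongStringRep)];
  std::memcpy(bytes, &long_rep, sizeof(bytes));
  return bytes[sizeof(bytes) - 1];
}
```

With these helpers, a long length below 2^31 always leaves the top bit of the last byte clear, so a set top bit can only mean "small string".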

We can use one bit of the length because Impala puts a 2GB hard limit on string length: 
[https://impala.apache.org/docs/build/html/topics/impala_string.html]
(Otherwise we could swap the order of ptr and len in LongStringRep and use the 
highest bit of ptr, which is unused on 64-bit architectures.)

On little-endian we can get the length with masking:
{noformat}
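  // True if the indicator bit (MSB of the last byte) is set.
  bool is_small() const {
    return rep.small_rep.len & 0b10000000;
  }

  int len() const {
    if (is_small()) {
      // Mask out the indicator bit; the low 7 bits hold the length.
      return rep.small_rep.len & 0b01111111;
    } else {
      // The indicator bit is 0 here, so the value is the length itself.
      return rep.long_rep.len;
    }
  }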
{noformat}
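
As a rough end-to-end illustration of the masking approach, the sketch below adds a constructor and a data() accessor around the same representations. The class name SmallStringSketch, its constructor, and data() are hypothetical and are not the attached small_string.cpp. The sketch assumes a 64-bit little-endian target with GCC or Clang, does not own the memory of long strings, and reads the indicator bit through an unsigned char pointer rather than through the union:

```cpp
#include <cassert>
#include <cstring>

static constexpr unsigned int SMALL_LIMIT = 11;

struct SmallStringRep {
  char buf[SMALL_LIMIT];
  char len;
};

struct __attribute__((__packed__)) LongStringRep {
  char* ptr;
  unsigned int len;
};

static_assert(sizeof(SmallStringRep) == sizeof(LongStringRep),
              "both representations must have the same size");

// Hypothetical wrapper class, for illustration only.
class SmallStringSketch {
 public:
  // Copies up to 11 bytes inline; otherwise only stores the pointer
  // (the caller keeps ownership of 's').
  SmallStringSketch(const char* s, unsigned int n) {
    if (n <= SMALL_LIMIT) {
      // Store the length with the indicator bit (MSB) set.
      rep_.small_rep.len = static_cast<char>(n | 0b10000000);
      std::memcpy(rep_.small_rep.buf, s, n);
    } else {
      assert(n < 0x80000000u);  // the MSB must stay free for the indicator
      rep_.long_rep.ptr = const_cast<char*>(s);
      rep_.long_rep.len = n;
    }
  }

  // Inspects the MSB of the last byte of the 12-byte representation.
  bool is_small() const {
    const unsigned char* bytes = reinterpret_cast<const unsigned char*>(&rep_);
    return (bytes[sizeof(rep_) - 1] & 0b10000000) != 0;
  }

  int len() const {
    if (is_small()) {
      return rep_.small_rep.len & 0b01111111;
    } else {
      return static_cast<int>(rep_.long_rep.len);
    }
  }

  const char* data() const {
    return is_small() ? rep_.small_rep.buf : rep_.long_rep.ptr;
  }

 private:
  union Rep {
    SmallStringRep small_rep;
    LongStringRep long_rep;
  };
  Rep rep_{};
};

static_assert(sizeof(SmallStringSketch) == 12, "expected a 12-byte object");
```

For example, constructing it from "hello world" (11 bytes) would keep the bytes inline, while a 12-byte string would be stored as pointer plus length.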
On big-endian we would instead get the length with bit-shifting, because the indicator bit would sit in the LSB.
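
One possible reading of the big-endian variant is sketched below as plain arithmetic on the stored values, so it does not depend on the byte order of the machine it runs on. All function names are hypothetical, and the encoding (length shifted left by one, indicator in the LSB) is an assumption based on the description above rather than code from the attachment:

```cpp
#include <cassert>
#include <cstdint>

static constexpr unsigned int SMALL_LIMIT = 11;

// Value that would be stored in small_rep.len: length in the upper 7 bits,
// indicator bit (1 = small) in the LSB.
inline std::uint8_t encode_small_len(unsigned int n) {
  assert(n <= SMALL_LIMIT);
  return static_cast<std::uint8_t>((n << 1) | 1u);
}

// Value that would be stored in long_rep.len: length shifted left by one,
// leaving the LSB (the indicator bit) at 0. Needs n < 2^31.
inline std::uint32_t encode_long_len(std::uint32_t n) {
  assert(n < 0x80000000u);
  return n << 1;
}

// On big-endian the last byte of the representation is the low byte of
// long_rep.len, so one check on that byte covers both representations.
inline bool is_small(std::uint8_t last_byte) {
  return (last_byte & 1u) != 0;
}

// Shifting right drops the indicator bit and recovers the length.
inline unsigned int decode_small_len(std::uint8_t stored) {
  return static_cast<unsigned int>(stored) >> 1;
}

inline std::uint32_t decode_long_len(std::uint32_t stored) {
  return stored >> 1;
}
```

The 2GB limit matters here as well: a length below 2^31 still fits in 32 bits after the shift.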

> Implement Small String Optimization for StringValue
> ---------------------------------------------------
>
>                 Key: IMPALA-12373
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12373
>             Project: IMPALA
>          Issue Type: Improvement
>            Reporter: Zoltán Borók-Nagy
>            Priority: Major
>         Attachments: small_string.cpp
>
>
> Implement Small String Optimization for StringValue.
> Current memory layout of StringValue is:
> {noformat}
>   char* ptr;  // 8 byte
>   int len;    // 4 byte
> {noformat}
> For small strings with size up to 8 we could store the string contents in the 
> bytes of the 'ptr'. Something like this:
> {noformat}
>   union {
>     char* ptr;
>     char small_buf[sizeof(ptr)];
>   };
>   int len;
> {noformat}
> Many C++ string implementations use the {{Small String Optimization}} to 
> speed up work with small strings. For example:
> {code:java}
> Microsoft STL, libstdc++, libc++, Boost, Folly.{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
