Gabor Kaszab created IMPALA-10901:
-------------------------------------

             Summary: Clean up Datasketches serialization and deserialization
                 Key: IMPALA-10901
                 URL: https://issues.apache.org/jira/browse/IMPALA-10901
             Project: IMPALA
          Issue Type: Improvement
          Components: Backend
    Affects Versions: Impala 4.0.0
            Reporter: Gabor Kaszab


(copy-paste from a mail thread)

Regarding serialization to bytes as opposed to a stream: this has nothing to
do with the BINARY data type in Impala.
Currently the Impala code does something like this (simplified):
std::stringstream tmp;
sketch.serialize(tmp);
std::string str = tmp.str(); // in StringStreamToStringVal
StringVal result(context, str.size());
memcpy(result.ptr, str.c_str(), str.size());

You could do it faster like this:
auto bytes = sketch.serialize();
StringVal result(context, bytes.size());
memcpy(result.ptr, bytes.data(), bytes.size());
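
To make the difference between the two paths concrete, here is a minimal, self-contained sketch. FakeSketch and FakeStringVal are hypothetical stand-ins invented for illustration (they are not the datasketches or Impala types); they only mimic a stream-based serialize(std::ostream&), a byte-based serialize(), and a result buffer of a given length. Both helpers should produce the same bytes; the second one skips the intermediate stringstream and std::string.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a sketch type; NOT the datasketches API.
// It only mimics the two serialization styles discussed above.
struct FakeSketch {
  std::vector<uint8_t> payload;

  // Stream-based serialization: writes the raw bytes to an output stream.
  void serialize(std::ostream& os) const {
    os.write(reinterpret_cast<const char*>(payload.data()),
             static_cast<std::streamsize>(payload.size()));
  }

  // Byte-based serialization: returns the bytes directly.
  std::vector<uint8_t> serialize() const { return payload; }
};

// Hypothetical stand-in for Impala's StringVal: a buffer of `len` bytes.
struct FakeStringVal {
  std::vector<uint8_t> buf;
  explicit FakeStringVal(std::size_t len) : buf(len) {}
  uint8_t* ptr() { return buf.data(); }
  std::size_t len() const { return buf.size(); }
};

// Stream path: sketch -> stringstream -> std::string -> result buffer.
FakeStringVal SerializeViaStream(const FakeSketch& sketch) {
  std::stringstream tmp;
  sketch.serialize(tmp);
  std::string str = tmp.str();
  FakeStringVal result(str.size());
  if (!str.empty()) std::memcpy(result.ptr(), str.data(), str.size());
  return result;
}

// Byte path: sketch -> byte vector -> result buffer.
FakeStringVal SerializeViaBytes(const FakeSketch& sketch) {
  auto bytes = sketch.serialize();
  FakeStringVal result(bytes.size());
  if (!bytes.empty()) std::memcpy(result.ptr(), bytes.data(), bytes.size());
  return result;
}
```

Calling both helpers on the same FakeSketch and comparing the resulting buffers is a cheap way to check that a switch to the byte-based path does not change the serialized output.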

Regarding the unnecessary constructor call during deserialization: I see code
like this (HLL is an example, but the pattern is the same):
// Construct an empty sketch, which is not needed.
datasketches::hll_sketch src_sketch(DS_SKETCH_CONFIG, DS_HLL_TYPE);
// Pass it into a function, which will replace it by an assignment
// (hopefully a move, not a copy).
DeserializeDsSketch(src, &src_sketch);
// In the function:
*sketch = T::deserialize((void*)serialized_sketch.ptr, serialized_sketch.len);

This can be done as follows, avoiding the unnecessary constructor call:
datasketches::hll_sketch src_sketch =
    datasketches::hll_sketch::deserialize((void*)serialized_sketch.ptr,
                                          serialized_sketch.len);
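
The cost of the "construct empty, then assign" pattern can be illustrated with a small self-contained sketch. CountingSketch, DeserializeInto and DeserializeDirect are hypothetical names invented here (they are not the datasketches or Impala APIs); the class simply counts how often its "empty sketch" constructor runs, so the two patterns can be compared.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a sketch class; NOT the datasketches API.
class CountingSketch {
 public:
  // Number of times the "empty sketch" constructor has run.
  static int constructed_empty;

  CountingSketch() { ++constructed_empty; }

  // Builds a sketch straight from serialized bytes.
  static CountingSketch deserialize(const void* bytes, std::size_t len) {
    const auto* p = static_cast<const uint8_t*>(bytes);
    return CountingSketch(p, len);
  }

  const std::vector<uint8_t>& data() const { return data_; }

 private:
  CountingSketch(const uint8_t* p, std::size_t len) : data_(p, p + len) {}
  std::vector<uint8_t> data_;
};

int CountingSketch::constructed_empty = 0;

// Pattern from the existing code: the caller constructs an empty sketch
// first, and this function overwrites it by assignment.
template <typename T>
void DeserializeInto(const std::vector<uint8_t>& src, T* sketch) {
  *sketch = T::deserialize(src.data(), src.size());
}

// Suggested pattern: return the deserialized sketch and let the caller
// initialize its variable from it, so no empty sketch is ever built.
template <typename T>
T DeserializeDirect(const std::vector<uint8_t>& src) {
  return T::deserialize(src.data(), src.size());
}
```

With the first pattern the caller has to write `CountingSketch s; DeserializeInto(src, &s);`, which runs the empty constructor once per call site; with `CountingSketch s = DeserializeDirect<CountingSketch>(src);` the counter stays unchanged.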



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
