Gabor Kaszab created IMPALA-10901:
-------------------------------------
Summary: Clean up Datasketches serialization and deserialization
Key: IMPALA-10901
URL: https://issues.apache.org/jira/browse/IMPALA-10901
Project: IMPALA
Issue Type: Improvement
Components: Backend
Affects Versions: Impala 4.0.0
Reporter: Gabor Kaszab
(copy-paste from a mail thread)
Regarding serialization to bytes as opposed to a stream: this has nothing to
do with the BINARY data type in Impala.
Currently I see something like this in the Impala code (simplified):
std::stringstream tmp;
sketch.serialize(tmp);
std::string str = tmp.str(); // in StringStreamToStringVal
StringVal result(context, str.size());
memcpy(result.ptr, str.c_str(), str.size());
You could do it faster by skipping the stringstream and the intermediate
std::string, like this:
auto bytes = sketch.serialize();
StringVal result(context, bytes.size());
memcpy(result.ptr, bytes.data(), bytes.size());
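To make the difference concrete, here is a minimal, self-contained sketch of both paths. ToySketch and the two helper functions are hypothetical stand-ins invented for illustration; they are not the DataSketches or Impala APIs, and a plain std::vector stands in for the StringVal buffer. The comments count the copies of the serialized image under those assumptions.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a sketch type; not the DataSketches API.
struct ToySketch {
  std::vector<uint8_t> state;

  // Stream-based overload: writes the serialized image to `os`.
  void serialize(std::ostream& os) const {
    os.write(reinterpret_cast<const char*>(state.data()),
             static_cast<std::streamsize>(state.size()));
  }

  // Byte-based overload: returns the serialized image directly.
  std::vector<uint8_t> serialize() const { return state; }
};

// Stream path, mirroring the current pattern. The returned vector stands in
// for the StringVal buffer.
std::vector<uint8_t> SerializeViaStream(const ToySketch& sketch) {
  std::stringstream tmp;
  sketch.serialize(tmp);        // copy 1: sketch -> stream buffer
  assert(tmp.good());           // the write into the stream succeeded
  std::string str = tmp.str();  // copy 2: stream buffer -> std::string
  std::vector<uint8_t> result(str.size());
  if (!str.empty()) {
    std::memcpy(result.data(), str.data(), str.size());  // copy 3
  }
  return result;
}

// Byte path, mirroring the suggested pattern.
std::vector<uint8_t> SerializeViaBytes(const ToySketch& sketch) {
  auto bytes = sketch.serialize();  // copy 1: sketch -> byte vector
  std::vector<uint8_t> result(bytes.size());
  if (!bytes.empty()) {
    std::memcpy(result.data(), bytes.data(), bytes.size());  // copy 2
  }
  return result;
}
```

Both helpers should produce identical bytes; the byte path simply drops one full copy and the stream machinery. Whether that is measurable in Impala would need a benchmark.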
Regarding the unnecessary constructor call during deserialization: I see code
like this (HLL is just an example, but the pattern is the same elsewhere):
// Constructs an empty sketch, which is not needed.
datasketches::hll_sketch src_sketch(DS_SKETCH_CONFIG, DS_HLL_TYPE);
// Passes it into a function, which replaces it by assignment (hopefully a
// move, not a copy).
DeserializeDsSketch(src, &src_sketch);
// in the function
*sketch = T::deserialize((void*)serialized_sketch.ptr, serialized_sketch.len);
The same result can be achieved as follows, avoiding the unnecessary
constructor call:
datasketches::hll_sketch src_sketch =
datasketches::hll_sketch::deserialize((void*)serialized_sketch.ptr,
serialized_sketch.len);
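The effect can be checked with a small, self-contained sketch. Everything here is hypothetical and only for illustration: ToySketch, its one-byte header layout, DeserializeInto and DeserializeDirect are invented names, not the DataSketches or Impala code. The toy type counts how often its "empty" constructor runs, so the two patterns can be compared directly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch type that counts "empty" constructions.
class ToySketch {
 public:
  static int empty_constructions;

  // Stand-in for the (config, type) constructor that builds an empty sketch.
  explicit ToySketch(int lg_k) : lg_k_(lg_k) { ++empty_constructions; }

  // Factory that builds a sketch straight from its serialized image.
  // Assumed toy layout: first byte is lg_k, the rest is payload.
  static ToySketch deserialize(const void* bytes, size_t len) {
    assert(bytes != nullptr && len >= 1);
    const uint8_t* p = static_cast<const uint8_t*>(bytes);
    return ToySketch(FromBytes{}, p, len);
  }

  int lg_k() const { return lg_k_; }
  size_t payload_size() const { return payload_.size(); }

 private:
  struct FromBytes {};
  ToySketch(FromBytes, const uint8_t* p, size_t len)
      : lg_k_(p[0]), payload_(p + 1, p + len) {}

  int lg_k_;
  std::vector<uint8_t> payload_;
};

int ToySketch::empty_constructions = 0;

// Current pattern: the caller builds an empty sketch, which is overwritten.
template <typename T>
void DeserializeInto(const uint8_t* ptr, size_t len, T* sketch) {
  *sketch = T::deserialize(ptr, len);  // move-assigns over the empty sketch
}

// Suggested pattern: initialize directly from the factory's return value.
template <typename T>
T DeserializeDirect(const uint8_t* ptr, size_t len) {
  return T::deserialize(ptr, len);
}
```

With the first helper the caller has to construct an empty sketch before the call, so the counter goes up by one per deserialization; with the second it stays at zero. The direct form also works for types that have no cheap empty state.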