Hi,

I watched a session from Spark Summit 2017, "Apache CarbonData: An Indexed Columnar 
File Format for Interactive Query" by Jacky Li/Jihong Ma. The video is here: 
https://www.youtube.com/watch?v=lhsAg2H_GXc.

Starting at 23:10, the speaker talks about the lazy decoding optimization, and the 
example given in the talk is the following:

"select  c3, sum(c2) from t1 group by c3", and talked about that c3 can be 
aggregated directly by the encoding value (Maybe integer, if let's say a String 
type c3 is encoded as int). I assume this in fact is done even within Spark 
executor engine, as the Speaker described.


But I am not sure I understand how this is possible, especially in Spark. If 
CarbonData were the storage format for a framework running on a single box, I 
could imagine it and understand the value it brings. But for a distributed 
execution engine like Spark, the data will come from multiple hosts, and Spark 
has to deserialize the data for grouping/aggregating (c3 in this case). Even if 
Spark somehow delegates this to the underlying storage engine, how will 
CarbonData make sure that all the values are encoded consistently across the 
whole table? Won't it just encode consistently per file? A global dictionary is 
just too expensive. But without it, I don't know how this lazy decoding can work.
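To make my concern concrete, here is another toy Scala sketch (again with 
made-up names and data, not CarbonData internals) of what I think goes wrong 
when each file builds its own dictionary:

// If files encode c3 independently, the same string can map to different codes,
// so encoded partial aggregates cannot be merged across hosts by code alone.
object PerFileDictionaryProblem {
  def main(args: Array[String]): Unit = {
    // File A and file B encode c3 independently.
    val dictFileA = Map("apple" -> 1, "banana" -> 2)
    val dictFileB = Map("banana" -> 1, "cherry" -> 2)  // "banana" is 1 here, not 2

    // Partial aggregates computed on each host using the local codes.
    val partialFromA = Map(1 -> 10L, 2 -> 5L)  // apple=10, banana=5
    val partialFromB = Map(1 -> 7L, 2 -> 2L)   // banana=7, cherry=2

    // Merging by code wrongly combines apple (code 1 in A) with banana (code 1 in B).
    val wrongMerge = (partialFromA.toSeq ++ partialFromB.toSeq)
      .groupBy(_._1).map { case (code, kvs) => code -> kvs.map(_._2).sum }
    println(wrongMerge)  // Map(1 -> 17, 2 -> 7) -- groups are mixed up

    // Decoding with each file's own dictionary before merging is correct,
    // but then the shuffle carries Strings and the "lazy" benefit seems lost.
    def decode(dict: Map[String, Int], partial: Map[Int, Long]): Map[String, Long] = {
      val reverse = dict.map(_.swap)
      partial.map { case (code, total) => reverse(code) -> total }
    }
    val correctMerge =
      (decode(dictFileA, partialFromA).toSeq ++ decode(dictFileB, partialFromB).toSeq)
        .groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).sum }
    println(correctMerge)  // Map(apple -> 10, banana -> 12, cherry -> 2)
  }
}

So it seems to me the optimization only pays off if the codes are globally 
consistent, which brings me back to the cost question above.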


I am just starting to research this project, so maybe there is something under 
the hood that I don't understand.


Thanks


Yong
