Hi,

I watched a session from Spark Summit 2017, "Apache CarbonData: An Indexed Columnar 
File Format for Interactive Query" by Jacky Li/Jihong Ma. The video is here: 
https://www.youtube.com/watch?v=lhsAg2H_GXc.

Starting at 23:10, the speaker talks about the lazy decoding optimization, and the 
example given in the talk is the following:

"select  c3, sum(c2) from t1 group by c3", and talked about that c3 can be 
aggregated directly by the encoding value (Maybe integer, if let's say a String 
type c3 is encoded as int). I assume this in fact is done even within Spark 
executor engine, as the Speaker described.


But I am not sure I understand how this is possible, especially in Spark. If 
CarbonData were the storage format for a framework running on a single box, I 
could imagine it and understand the value it brings. But for a distributed 
execution engine like Spark, the data will come from multiple hosts, and Spark 
has to deserialize the data for grouping/aggregating (c3 in this case). Even if 
Spark somehow delegates this to the underlying storage engine, how will 
CarbonData make sure that all the values are encoded consistently across the 
whole table? Won't it just encode consistently per file? A global dictionary is 
just too expensive. But without it, I don't know how this lazy decoding can work.
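To make my concern concrete, here is another toy Scala sketch (again with 
made-up names and data, not CarbonData internals) of what I think goes wrong 
when each file builds its own dictionary:

// If files encode c3 independently, the same string can map to different codes,
// so encoded partial aggregates cannot be merged across hosts by code alone.
object PerFileDictionaryProblem {
  def main(args: Array[String]): Unit = {
    // File A and file B encode c3 independently.
    val dictFileA = Map("apple" -> 1, "banana" -> 2)
    val dictFileB = Map("banana" -> 1, "cherry" -> 2)  // "banana" is 1 here, not 2

    // Partial aggregates computed on each host using the local codes.
    val partialFromA = Map(1 -> 10L, 2 -> 5L)  // apple=10, banana=5
    val partialFromB = Map(1 -> 7L, 2 -> 2L)   // banana=7, cherry=2

    // Merging by code wrongly combines apple (code 1 in A) with banana (code 1 in B).
    val wrongMerge = (partialFromA.toSeq ++ partialFromB.toSeq)
      .groupBy(_._1).map { case (code, kvs) => code -> kvs.map(_._2).sum }
    println(wrongMerge)  // Map(1 -> 17, 2 -> 7) -- groups are mixed up

    // Decoding with each file's own dictionary before merging is correct,
    // but then the shuffle carries Strings and the "lazy" benefit seems lost.
    def decode(dict: Map[String, Int], partial: Map[Int, Long]): Map[String, Long] = {
      val reverse = dict.map(_.swap)
      partial.map { case (code, total) => reverse(code) -> total }
    }
    val correctMerge =
      (decode(dictFileA, partialFromA).toSeq ++ decode(dictFileB, partialFromB).toSeq)
        .groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).sum }
    println(correctMerge)  // Map(apple -> 10, banana -> 12, cherry -> 2)
  }
}

So it seems to me the optimization only pays off if the codes are globally 
consistent, which brings me back to the cost question above.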


I am just starting to research this project, so maybe there is something under 
the hood that I don't understand.


Thanks


Yong
