Hi,
In CarbonData, the LOG4J level "STATISTICS" is currently available for logging. However, the information it gives is incomplete for debugging performance problems, and it is not easy to see the statistics and profiling information of one query in one place. So we need to revisit and improve statistics and profiling. I have put down some pointers below, and we can discuss them.

What to collect
---------------
1) Statistics of tables/columns, like number of files, number of blocks, number of blocklets.
2) Profiling information required to debug performance issues and resource utilization:
   Scan statistics, like row size, number of blocks or blocklets scanned, distribution info, scan buffer size.
   I/O and CPU/compute cost.
   Driver index effectiveness: number of blocks hit.
   Executor index effectiveness: number of blocklets hit.
   Decoding and decompression cost and memory required.
   Cache statistics: hits, misses, memory occupied.
   Dictionary statistics: number of entries, dictionary load time, memory occupied.
   Btree statistics: number of entries, Btree load time, lookup cost, memory occupied.
3) Data load: load time, memory required, encode and compress cost.
4) Spark time and shuffle cost.

How to collect:
---------------
Check if it can be plugged in to the Spark metrics/counters system.
Have a decorator statistics RDD in between to get each RDD and collect statistics, or use any other method to get them from Spark.
Make it pluggable to integrate with other processing frameworks, so that we can get end-to-end statistics.
Something like log4j, with clean interfaces to put and retrieve information.

Where to store:
---------------
In a separate table.
In logs.
History information, like it is stored in Spark (maybe JSON). Is Spark history statistics logging separate enough to use across frameworks?
The collector can collect statistics and decide where to store them.

How to see:
-----------
A command to retrieve various statistics and profiling info.
Connecting to other metrics displays like the Spark UI or Ganglia.

Links:
------
Profiling support in Impala:
http://www.cloudera.com/documentation/enterprise/5-7-x/topics/impala_explain_plan.html#perf_profile
Table and column statistics in Impala:
http://www.cloudera.com/documentation/enterprise/5-8-x/topics/impala_perf_stats.html#perf_table_stats
Spark metrics collection:
http://spark.apache.org/docs/latest/monitoring.html#metrics

Regards,
Venkata Ramana Gollamudi
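
P.S. To make the "clean interfaces to put and retrieve information" idea a bit more concrete, here is a rough sketch of what a pluggable collector could look like. All names in it (StatisticsCollector, InMemoryCollector, StatisticsSketch, the metric names) are hypothetical and only for discussion; nothing like this exists in CarbonData or Spark today. It only covers the put/get side with an in-memory store; a backend that writes to a table, to logs, or to a JSON history would implement the same interface.

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;

// Small demo of the hypothetical interface defined below.
final class StatisticsSketch {
    public static void main(String[] args) {
        StatisticsCollector stats = new InMemoryCollector();
        stats.put("query-1", "blocksHit", 12);
        stats.put("query-1", "blockletsScanned", 40);
        stats.put("query-1", "blockletsScanned", 25); // a second task of the same query
        // All statistics of one query are available in one place.
        System.out.println(stats.get("query-1"));
    }
}

// Hypothetical interface: the framework-independent part of the proposal.
interface StatisticsCollector {
    // Add a value to a named metric of one query (values accumulate).
    void put(String queryId, String metric, long value);

    // Return a snapshot of all metrics recorded for one query.
    Map<String, Long> get(String queryId);
}

// Simplest possible backend (an assumption for illustration): keeps everything in memory.
final class InMemoryCollector implements StatisticsCollector {
    private final Map<String, Map<String, Long>> byQuery = new ConcurrentHashMap<>();

    @Override
    public void put(String queryId, String metric, long value) {
        byQuery.computeIfAbsent(queryId, k -> new ConcurrentHashMap<>())
               .merge(metric, value, Long::sum);
    }

    @Override
    public Map<String, Long> get(String queryId) {
        Map<String, Long> metrics = byQuery.get(queryId);
        if (metrics == null) {
            return Collections.emptyMap();
        }
        // Sorted copy, so callers get a stable view that later puts do not change.
        return new TreeMap<>(metrics);
    }
}
```

The point of keeping the interface this small is that the scan, cache, dictionary and Btree code would only depend on put(), while the choice of where to store and how to display stays behind the collector implementation.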
