Hi folks,

I generated a bunch of Parquet files using Spark and
ParquetThriftOutputFormat. The Thrift model has a string column called
"deviceId" and an int64 column called "timestamp". After the files were
generated, I inspected the file footers and noticed that only the
"timestamp" column has min/max statistics. My primary filter will be on
deviceId, and the data is partitioned and sorted by deviceId, but because
the statistics for that column are missing, the reader cannot prune blocks
from being read. Am I missing some configuration setting that makes the
writer generate the statistics? The following code shows how an
RDD[Thrift] is saved to Parquet. The configuration is the default
configuration.

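// Imports assume the org.apache.parquet package names; older Parquet
// releases use the parquet.* prefix instead.
import scala.reflect.{ClassTag, classTag}

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job
import org.apache.parquet.hadoop.thrift.ParquetThriftOutputFormat
import org.apache.spark.rdd.RDD
import org.apache.thrift.{TBase, TFieldIdEnum}

implicit class ThriftRDD[T <: TBase[T, _ <: TFieldIdEnum] : ClassTag](rdd: RDD[T]) {
  def saveAsParquet(output: String,
                    conf: Configuration = rdd.context.hadoopConfiguration): Unit = {
    val job = Job.getInstance(conf)
    val clazz: Class[T] = classTag[T].runtimeClass.asInstanceOf[Class[T]]
    // Tell the output format which Thrift class defines the schema.
    ParquetThriftOutputFormat.setThriftClass(job, clazz)
    // The output format ignores the key, so pair every record with a null key.
    rdd.map[(Void, T)](x => (null, x))
      .saveAsNewAPIHadoopFile(
        output,
        classOf[Void],
        clazz,
        classOf[ParquetThriftOutputFormat[T]],
        job.getConfiguration)
  }
}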


Thanks,
Pradeep
