Re: Missing min/max statistics in file footer

Lars Volker Fri, 10 Feb 2017 12:40:05 -0800

In that case I don't see why reading the stats shouldn't work, assuming
they are in the file in the first place. I don't know why writing them
would fail, so unless someone else can help you, you may have to debug the
code that writes them.
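
For reference, here is a rough, self-contained sketch of the kind of version check that the created_by value quoted below would be subject to. It is only an illustration: the class name, the helper writtenBefore, the regular expression, and the 1.8.0 threshold are assumptions made for this example, and the real logic in CorruptStatistics.java may differ.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; this is NOT the parquet-mr implementation.
class CreatedByCheck {
    // Assumed shape of the string: "<app> version <major>.<minor>.<patch>[suffix] (build <hash>)"
    private static final Pattern CREATED_BY = Pattern.compile(
        "^(\\S+) version (\\d+)\\.(\\d+)\\.(\\d+)\\S* \\(build (\\S+)\\)$");

    // Returns true if the writer version in createdBy is strictly older than major.minor.patch.
    static boolean writtenBefore(String createdBy, int major, int minor, int patch) {
        Matcher m = CREATED_BY.matcher(createdBy);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognized created_by: " + createdBy);
        }
        int[] actual = {
            Integer.parseInt(m.group(2)),
            Integer.parseInt(m.group(3)),
            Integer.parseInt(m.group(4))
        };
        int[] threshold = {major, minor, patch};
        // Compare component by component; the first difference decides.
        for (int i = 0; i < 3; i++) {
            if (actual[i] != threshold[i]) {
                return actual[i] < threshold[i];
            }
        }
        return false; // identical versions are not "before"
    }

    public static void main(String[] args) {
        String createdBy = "parquet-mr version 1.9.1-SNAPSHOT "
            + "(build 2fd62ee4d524c270764e9b91dca72e5cf1a005b7)";
        // Assumed threshold for this sketch: 1.8.0.
        System.out.println(writtenBefore(createdBy, 1, 8, 0));
    }
}
```

With the created_by string from this thread, the sketch reports that the writer is not older than the assumed threshold, which matches the expectation that a version-based filter of this kind would not apply here.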


On Fri, Feb 10, 2017 at 8:31 PM, Pradeep Gollakota <[email protected]>
wrote:

> metadata.getFileMetadata().createdBy() shows this "parquet-mr version
> 1.9.1-SNAPSHOT (build 2fd62ee4d524c270764e9b91dca72e5cf1a005b7)"
>
> Ignore the 1.9.1-SNAPSHOT... that's my local build as I'm trying to work on
> PARQUET-869 <https://issues.apache.org/jira/browse/PARQUET-869>
>
> On Fri, Feb 10, 2017 at 10:17 AM, Lars Volker <[email protected]> wrote:
>
> > Can you check the value of ParquetMetaData.created_by? Once you have that,
> > you should see if it gets filtered by the code in CorruptStatistics.java.
> >
> > On Fri, Feb 10, 2017 at 7:11 PM, Pradeep Gollakota <[email protected]>
> > wrote:
> >
> > > Data was written with Spark but I'm using the parquet APIs directly for
> > > reads. I checked the stats in the footer with the following code.
> > >
> > > ParquetMetadata metadata = ParquetFileReader.readFooter(conf, path,
> > > ParquetMetadataConverter.NO_FILTER);
> > > ColumnPath deviceId = ColumnPath.get("deviceId");
> > > metadata.getBlocks().forEach(b -> {
> > >     if (b.getTotalByteSize() > 4 * 1024 * 1024L) {
> > >         System.out.println("\nBlockSize = " + b.getTotalByteSize());
> > >         System.out.println("ComprSize = " + b.getCompressedSize());
> > >         System.out.println("Num Rows  = " + b.getRowCount());
> > >         b.getColumns().forEach(c -> {
> > >             if (c.getPath().equals(deviceId)) {
> > >                 Comparable max = c.getStatistics().genericGetMax();
> > >                 Comparable min = c.getStatistics().genericGetMin();
> > >                 System.out.println("\t" + c.getPath() + " [" + min +
> > > ", " + max + "]");
> > >             }
> > >         });
> > >     }
> > > });
> > >
> > >
> > > Thanks,
> > > Pradeep
> > >
> > > On Fri, Feb 10, 2017 at 9:08 AM, Lars Volker <[email protected]> wrote:
> > >
> > > > Hi Pradeep,
> > > >
> > > > I don't have any experience with using Parquet APIs through Spark. That
> > > > being said, there are currently several issues around column statistics,
> > > > both in the format and in the parquet-mr implementation (PARQUET-686,
> > > > PARQUET-839, PARQUET-840).
> > > >
> > > > However, in your case and depending on the versions involved, you might
> > > > also hit PARQUET-251, which can cause statistics for some files to be
> > > > ignored. In this context it may be worth having a look at this file:
> > > > https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/CorruptStatistics.java
> > > >
> > > > How did you check that the statistics are not written to the footer? If you
> > > > used parquet-mr, they may be there but be ignored.
> > > >
> > > > Cheers, Lars
> > > >
> > > > On Fri, Feb 10, 2017 at 5:31 PM, Pradeep Gollakota <[email protected]>
> > > > wrote:
> > > >
> > > > > Bumping the thread to see if I get any responses.
> > > > >
> > > > > On Wed, Feb 8, 2017 at 6:49 PM, Pradeep Gollakota <[email protected]>
> > > > > wrote:
> > > > >
> > > > > > Hi folks,
> > > > > >
> > > > > > I generated a bunch of parquet files using spark and
> > > > > > ParquetThriftOutputFormat. The thrift model has a column called
> > > > > > "deviceId", which is a string column. It also has a "timestamp" column
> > > > > > of int64. After the files were generated, I inspected the file footers
> > > > > > and noticed that only the "timestamp" field has min/max statistics. My
> > > > > > primary filter will be deviceId, and the data is partitioned and sorted
> > > > > > by deviceId, but since the statistics data is missing, it's not able to
> > > > > > prune blocks from being read. Am I missing some configuration setting
> > > > > > that allows it to generate the stats data? The following code is how an
> > > > > > RDD[Thrift] is being saved to parquet. The configuration is the default
> > > > > > configuration.
> > > > > >
> > > > > > implicit class ThriftRDD[T <: TBase[T, _ <: TFieldIdEnum] : ClassTag](rdd: RDD[T]) {
> > > > > >   def saveAsParquet(output: String,
> > > > > >                     conf: Configuration = rdd.context.hadoopConfiguration): Unit = {
> > > > > >     val job = Job.getInstance(conf)
> > > > > >     val clazz: Class[T] = classTag[T].runtimeClass.asInstanceOf[Class[T]]
> > > > > >     ParquetThriftOutputFormat.setThriftClass(job, clazz)
> > > > > >     val r = rdd.map[(Void, T)](x => (null, x))
> > > > > >       .saveAsNewAPIHadoopFile(
> > > > > >         output,
> > > > > >         classOf[Void],
> > > > > >         clazz,
> > > > > >         classOf[ParquetThriftOutputFormat[T]],
> > > > > >         job.getConfiguration)
> > > > > >   }
> > > > > > }
> > > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > Pradeep
> > > > > >
> > > > >
> > > >
> > >
> >
>
