When the reader ignores the stats, you should see a warning in the logs. If you have a local build you can easily modify the logic to verify: https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L347 <https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L347>
> On Feb 10, 2017, at 12:39 PM, Lars Volker <[email protected]> wrote: > > In that case I don't see why reading the stats shouldn't work, assuming > they are in the file in the first place. I don't know why writing them > would fail, so unless someone else can help you, you may have to debug the > code that writes them. > > On Fri, Feb 10, 2017 at 8:31 PM, Pradeep Gollakota <[email protected]> > wrote: > >> metadata.getFileMetadata().createdBy() shows this "parquet-mr version >> 1.9.1-SNAPSHOT (build 2fd62ee4d524c270764e9b91dca72e5cf1a005b7)" >> >> Ignore the 1.9.1-SNAPSHOT... that's my local build as I'm trying to work on >> PARQUET-869 <https://issues.apache.org/jira/browse/PARQUET-869> >> >> On Fri, Feb 10, 2017 at 10:17 AM, Lars Volker <[email protected]> wrote: >> >>> Can you check the value of ParquetMetaData.created_by? Once you have >> that, >>> you should see if it gets filtered by the code in CorruptStatistics.java. >>> >>> On Fri, Feb 10, 2017 at 7:11 PM, Pradeep Gollakota <[email protected] >>> >>> wrote: >>> >>>> Data was written with Spark but I'm using the parquet APIs directly for >>>> reads. I checked the stats in the footer with the following code. >>>> >>>> ParquetMetadata metadata = ParquetFileReader.readFooter(conf, path, >>>> ParquetMetadataConverter.NO_FILTER); >>>> ColumnPath deviceId = ColumnPath.get("deviceId"); >>>> metadata.getBlocks().forEach(b -> { >>>> if (b.getTotalByteSize() > 4 * 1024 * 1024L) { >>>> System.out.println("\nBlockSize = " + b.getTotalByteSize()); >>>> System.out.println("ComprSize = " + b.getCompressedSize()); >>>> System.out.println("Num Rows = " + b.getRowCount()); >>>> b.getColumns().forEach(c -> { >>>> if (c.getPath().equals(deviceId)) { >>>> Comparable max = c.getStatistics().genericGetMax(); >>>> Comparable min = c.getStatistics().genericGetMin(); >>>> System.out.println("\t" + c.getPath() + " [" + min + >>>> ", " + max + "]"); >>>> } >>>> }); >>>> } >>>> }); >>>> >>>> >>>> Thanks, >>>> Pradeep >>>> >>>> On Fri, Feb 10, 2017 at 9:08 AM, Lars Volker <[email protected]> wrote: >>>> >>>>> Hi Pradeep, >>>>> >>>>> I don't have any experience with using Parquet APIs through Spark. >> That >>>>> being said, there are currently several issues around column >>> statistics, >>>>> both in the format and in the parquet-mr implementation (PARQUET-686, >>>>> PARQUET-839, PARQUET-840). >>>>> >>>>> However, in your case and depending on the versions involved, you >> might >>>>> also hit PARQUET-251, which can cause statistics for some files to be >>>>> ignored. In this context it may be worth to have a look at this file: >>>>> https://github.com/apache/parquet-mr/blob/master/ >>>>> parquet-column/src/main/java/org/apache/parquet/ >> CorruptStatistics.java >>>>> >>>>> How did you check that the statistics are not written to the footer? >> If >>>> you >>>>> used parquet-mr, they may be there but be ignored. >>>>> >>>>> Cheers, Lars >>>>> >>>>> On Fri, Feb 10, 2017 at 5:31 PM, Pradeep Gollakota < >>> [email protected] >>>>> >>>>> wrote: >>>>> >>>>>> Bumping the thread to see if I get any responses. >>>>>> >>>>>> On Wed, Feb 8, 2017 at 6:49 PM, Pradeep Gollakota < >>>> [email protected]> >>>>>> wrote: >>>>>> >>>>>>> Hi folks, >>>>>>> >>>>>>> I generated a bunch of parquet files using spark and >>>>>>> ParquetThriftOutputFormat. The thirft model has a column called >>>>>> "deviceId" >>>>>>> which is a string column. It also has a "timestamp" column of >>> int64. >>>>>> After >>>>>>> the files have been generated, I inspected the file footers and >>>> noticed >>>>>>> that only the "timestamp" field has min/max statistics. My >> primary >>>>> filter >>>>>>> will be deviceId, the data is partitioned and sorted by deviceId, >>> but >>>>>> since >>>>>>> the statistics data is missing, it's not able to prune blocks >> from >>>>> being >>>>>>> read. Am I missing some configuration setting that allows it to >>>>> generate >>>>>>> the stats data? The following is code is how an RDD[Thrift] is >>> being >>>>>> saved >>>>>>> to parquet. The configuration is default configuration. >>>>>>> >>>>>>> implicit class ThriftRDD[T <: TBase[T, _ <: TFieldIdEnum] : >>>>>> ClassTag](rdd: RDD[T]) { >>>>>>> def saveAsParquet(output: String, >>>>>>> conf: Configuration = rdd.context. >>>>> hadoopConfiguration): >>>>>> Unit = { >>>>>>> val job = Job.getInstance(conf) >>>>>>> val clazz: Class[T] = classTag[T].runtimeClass. >>>>>> asInstanceOf[Class[T]] >>>>>>> ParquetThriftOutputFormat.setThriftClass(job, clazz) >>>>>>> val r = rdd.map[(Void, T)](x => (null, x)) >>>>>>> .saveAsNewAPIHadoopFile( >>>>>>> output, >>>>>>> classOf[Void], >>>>>>> clazz, >>>>>>> classOf[ParquetThriftOutputFormat[T]], >>>>>>> job.getConfiguration) >>>>>>> } >>>>>>> } >>>>>>> >>>>>>> >>>>>>> Thanks, >>>>>>> Pradeep >>>>>>> >>>>>> >>>>> >>>> >>> >>
signature.asc
Description: Message signed with OpenPGP using GPGMail
