When the reader ignores the stats, you should see a warning in the logs.
If you have a local build you can easily modify the logic to verify:
https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L347
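
Before touching the build, it may also help to check whether your created_by string would trip the PARQUET-251 filter at all. Below is a rough standalone sketch, not the actual parquet-mr code: the class name CreatedByCheck, the regex, and the 1.8.0 cutoff are my assumptions for illustration, and the real rules live in CorruptStatistics.java and cover more cases.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, NOT part of parquet-mr. It approximates the idea of
// "ignore binary stats written by old parquet-mr builds" for quick checks.
public class CreatedByCheck {
    // Assumption: created_by looks like
    // "parquet-mr version 1.9.1-SNAPSHOT (build <hash>)".
    private static final Pattern CREATED_BY =
        Pattern.compile("^(.*?)\\s+version\\s+(\\d+)\\.(\\d+)\\.(\\d+).*$");

    // Assumption: PARQUET-251 is treated as fixed from 1.8.0 onwards.
    private static final int[] FIXED_VERSION = {1, 8, 0};

    public static boolean mightIgnoreBinaryStats(String createdBy) {
        if (createdBy == null || createdBy.trim().isEmpty()) {
            return true; // unknown writer: be conservative
        }
        Matcher m = CREATED_BY.matcher(createdBy.trim());
        if (!m.matches()) {
            return true; // unparseable version: be conservative
        }
        if (!"parquet-mr".equals(m.group(1))) {
            return false; // assumed: only parquet-mr writers are affected
        }
        int[] version = {
            Integer.parseInt(m.group(2)),
            Integer.parseInt(m.group(3)),
            Integer.parseInt(m.group(4))
        };
        // Compare major, minor, patch in order against the assumed fix version.
        for (int i = 0; i < version.length; i++) {
            if (version[i] != FIXED_VERSION[i]) {
                return version[i] < FIXED_VERSION[i];
            }
        }
        return false; // exactly the fix version
    }

    public static void main(String[] args) {
        String createdBy = "parquet-mr version 1.9.1-SNAPSHOT "
            + "(build 2fd62ee4d524c270764e9b91dca72e5cf1a005b7)";
        System.out.println(createdBy + " -> " + mightIgnoreBinaryStats(createdBy));
    }
}
```

Under those assumptions the 1.9.1-SNAPSHOT string from this thread comes back false, so if the stats are still dropped on read, the cause is probably somewhere other than the PARQUET-251 check.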
 

> On Feb 10, 2017, at 12:39 PM, Lars Volker <[email protected]> wrote:
> 
> In that case I don't see why reading the stats shouldn't work, assuming
> they are in the file in the first place. I don't know why writing them
> would fail, so unless someone else can help you, you may have to debug the
> code that writes them.
> 
> On Fri, Feb 10, 2017 at 8:31 PM, Pradeep Gollakota <[email protected]>
> wrote:
> 
>> metadata.getFileMetadata().createdBy() shows this "parquet-mr version
>> 1.9.1-SNAPSHOT (build 2fd62ee4d524c270764e9b91dca72e5cf1a005b7)"
>> 
>> Ignore the 1.9.1-SNAPSHOT... that's my local build as I'm trying to work on
>> PARQUET-869 <https://issues.apache.org/jira/browse/PARQUET-869>
>> 
>> On Fri, Feb 10, 2017 at 10:17 AM, Lars Volker <[email protected]> wrote:
>> 
>>> Can you check the value of ParquetMetaData.created_by? Once you have that,
>>> you should see if it gets filtered by the code in CorruptStatistics.java.
>>> 
>>> On Fri, Feb 10, 2017 at 7:11 PM, Pradeep Gollakota <[email protected]>
>>> wrote:
>>> 
>>>> Data was written with Spark but I'm using the parquet APIs directly for
>>>> reads. I checked the stats in the footer with the following code.
>>>> 
>>>> ParquetMetadata metadata = ParquetFileReader.readFooter(conf, path,
>>>> ParquetMetadataConverter.NO_FILTER);
>>>> ColumnPath deviceId = ColumnPath.get("deviceId");
>>>> metadata.getBlocks().forEach(b -> {
>>>>    if (b.getTotalByteSize() > 4 * 1024 * 1024L) {
>>>>        System.out.println("\nBlockSize = " + b.getTotalByteSize());
>>>>        System.out.println("ComprSize = " + b.getCompressedSize());
>>>>        System.out.println("Num Rows  = " + b.getRowCount());
>>>>        b.getColumns().forEach(c -> {
>>>>            if (c.getPath().equals(deviceId)) {
>>>>                Comparable max = c.getStatistics().genericGetMax();
>>>>                Comparable min = c.getStatistics().genericGetMin();
>>>>                System.out.println("\t" + c.getPath() + " [" + min + ", " + max + "]");
>>>>            }
>>>>        });
>>>>    }
>>>> });
>>>> 
>>>> 
>>>> Thanks,
>>>> Pradeep
>>>> 
>>>> On Fri, Feb 10, 2017 at 9:08 AM, Lars Volker <[email protected]> wrote:
>>>> 
>>>>> Hi Pradeep,
>>>>> 
>>>>> I don't have any experience with using Parquet APIs through Spark. That
>>>>> being said, there are currently several issues around column statistics,
>>>>> both in the format and in the parquet-mr implementation (PARQUET-686,
>>>>> PARQUET-839, PARQUET-840).
>>>>> 
>>>>> However, in your case and depending on the versions involved, you might
>>>>> also hit PARQUET-251, which can cause statistics for some files to be
>>>>> ignored. In this context it may be worth having a look at this file:
>>>>> https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/CorruptStatistics.java
>>>>> 
>>>>> How did you check that the statistics are not written to the footer? If
>>>>> you used parquet-mr, they may be there but be ignored.
>>>>> 
>>>>> Cheers, Lars
>>>>> 
>>>>> On Fri, Feb 10, 2017 at 5:31 PM, Pradeep Gollakota <[email protected]>
>>>>> wrote:
>>>>> 
>>>>>> Bumping the thread to see if I get any responses.
>>>>>> 
>>>>>> On Wed, Feb 8, 2017 at 6:49 PM, Pradeep Gollakota <[email protected]>
>>>>>> wrote:
>>>>>> 
>>>>>>> Hi folks,
>>>>>>> 
>>>>>>> I generated a bunch of parquet files using spark and
>>>>>>> ParquetThriftOutputFormat. The thrift model has a column called "deviceId"
>>>>>>> which is a string column. It also has a "timestamp" column of int64. After
>>>>>>> the files have been generated, I inspected the file footers and noticed
>>>>>>> that only the "timestamp" field has min/max statistics. My primary filter
>>>>>>> will be deviceId, the data is partitioned and sorted by deviceId, but since
>>>>>>> the statistics data is missing, it's not able to prune blocks from being
>>>>>>> read. Am I missing some configuration setting that allows it to generate
>>>>>>> the stats data? The following code is how an RDD[Thrift] is being saved
>>>>>>> to parquet. The configuration is the default configuration.
>>>>>>> 
>>>>>>> implicit class ThriftRDD[T <: TBase[T, _ <: TFieldIdEnum] : ClassTag](rdd: RDD[T]) {
>>>>>>>  def saveAsParquet(output: String,
>>>>>>>                    conf: Configuration = rdd.context.hadoopConfiguration): Unit = {
>>>>>>>    val job = Job.getInstance(conf)
>>>>>>>    val clazz: Class[T] = classTag[T].runtimeClass.asInstanceOf[Class[T]]
>>>>>>>    ParquetThriftOutputFormat.setThriftClass(job, clazz)
>>>>>>>    val r = rdd.map[(Void, T)](x => (null, x))
>>>>>>>      .saveAsNewAPIHadoopFile(
>>>>>>>        output,
>>>>>>>        classOf[Void],
>>>>>>>        clazz,
>>>>>>>        classOf[ParquetThriftOutputFormat[T]],
>>>>>>>        job.getConfiguration)
>>>>>>>  }
>>>>>>> }
>>>>>>> 
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Pradeep
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
