GitHub user mallman opened a pull request:

    https://github.com/apache/spark/pull/15538

    [SPARK-17993][SQL] Fix Parquet log output redirection

    (Link to Jira issue: https://issues.apache.org/jira/browse/SPARK-17993)
    
    ## What changes were proposed in this pull request?
    
    PR #14690 broke Parquet log output redirection for converted partitioned 
Hive tables. For example, when querying Parquet files written by Parquet-mr 
1.6.0, Spark prints a torrent of (harmless) warning messages from the Parquet 
reader:
    
    ```
    Oct 18, 2016 7:42:18 PM WARNING: org.apache.parquet.CorruptStatistics: Ignoring statistics because created_by could not be parsed (see PARQUET-251): parquet-mr version 1.6.0
    org.apache.parquet.VersionParser$VersionParseException: Could not parse created_by: parquet-mr version 1.6.0 using format: (.+) version ((.*) )?\(build ?(.*)\)
        at org.apache.parquet.VersionParser.parse(VersionParser.java:112)
        at org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:60)
        at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:263)
        at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:583)
        at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:513)
        at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:270)
        at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:225)
        at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137)
        at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:162)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:372)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:99)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
    ```
    
    This only happens during execution, not planning, and it doesn't matter 
what log level the `SparkContext` is set to.
    
    This is a regression I noted as something we needed to fix in a follow-up.
    
    It appears that the problem arose because we removed the call to 
`inferSchema` during Hive table conversion. That call is what triggered the 
output redirection.
    
    Rather than rely on a call to `inferSchema` (or a similar method, such as 
`prepareWrite`), this PR initializes log redirection the first time an instance 
of `ParquetFileFormat` is constructed.
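    The idea can be sketched as a one-time, thread-safe initialization tied to 
JVM class loading. The class and method names below are hypothetical, not 
Spark's actual code, and this sketch only silences the raw `java.util.logging` 
output; the real patch would also bridge those records into Spark's own logging:
    
    ```java
    import java.util.logging.Handler;
    import java.util.logging.Logger;
    
    public class ParquetLogRedirector {
        // JVM class initialization is thread-safe and runs at most once, so
        // the redirection in the constructor happens exactly once per JVM.
        private static final ParquetLogRedirector INSTANCE = new ParquetLogRedirector();
    
        // Hold a strong reference so the JUL logger (and our settings on it)
        // cannot be garbage-collected.
        private final Logger parquetLogger;
    
        private ParquetLogRedirector() {
            parquetLogger = Logger.getLogger("org.apache.parquet");
            // Remove any handlers installed directly on the parquet logger
            // and stop delegation to the root logger's console handler.
            for (Handler handler : parquetLogger.getHandlers()) {
                parquetLogger.removeHandler(handler);
            }
            parquetLogger.setUseParentHandlers(false);
        }
    
        // Call from any constructor that may touch Parquet; this is a no-op
        // after the first call, since it merely forces class initialization.
        public static void ensureRedirected() {
        }
    }
    ```
    
    Tying the work to a static initializer means every code path that 
constructs the class gets redirection for free, without needing to route 
through `inferSchema` or `prepareWrite`.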
    
    ## How was this patch tested?
    
    I tested this manually in four ways:
    
    1. Executing `spark.sqlContext.range(10).selectExpr("id as 
a").write.mode("overwrite").parquet("test")`.
    2. Executing `spark.read.format("parquet").load(legacyParquetFile).show` 
for a Parquet file `legacyParquetFile` written using Parquet-mr 1.6.0.
    3. Executing `select * from legacy_parquet_table limit 1` for some 
unpartitioned Parquet-based Hive table written using Parquet-mr 1.6.0.
    4. Executing `select * from legacy_partitioned_parquet_table where 
partcol=x limit 1` for some partitioned Parquet-based Hive table written using 
Parquet-mr 1.6.0.
    
    I ran each test with a new instance of `spark-shell` or `spark-sql`.
    
    Incidentally, I found that test case 3 was not a regression: redirection 
was not occurring in the master codebase prior to #14690.
    
    I'm thinking about how to unit test this behavior. If anyone has a 
suggestion for additional test scenarios I'm all ears.
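    One possible approach, assuming the warnings travel through 
`java.util.logging` (JUL): publish a record on the `org.apache.parquet` logger 
and count whether it reaches the root logger's handlers. Before redirection it 
should; afterwards it should not. The class and method names here are 
hypothetical:
    
    ```java
    import java.util.logging.Handler;
    import java.util.logging.LogRecord;
    import java.util.logging.Logger;
    
    public class RedirectionCheck {
        // Returns true if a WARNING published on the parquet JUL logger
        // propagates up to the root logger's handlers.
        public static boolean warningReachesRoot() {
            Logger root = Logger.getLogger("");
            Logger parquet = Logger.getLogger("org.apache.parquet");
            final int[] seen = {0};
            Handler counter = new Handler() {
                @Override public void publish(LogRecord record) { seen[0]++; }
                @Override public void flush() {}
                @Override public void close() {}
            };
            root.addHandler(counter);
            try {
                parquet.warning("Ignoring statistics (simulated)");
            } finally {
                root.removeHandler(counter);
            }
            return seen[0] > 0;
        }
    }
    ```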
    
    cc @ericl @dongjoon-hyun

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/VideoAmp/spark-public spark-17993-fix_parquet_log_redirection

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15538.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #15538
    
----
commit 6101b837eec731ad83904593da123c4b30a25f15
Author: Michael Allman <[email protected]>
Date:   2016-10-18T22:39:36Z

    [SPARK-17993][SQL] Perform Parquet log output redirection when
    constructing an instance of `ParquetFileFormat`. Before, it was occurring
    as part of the call to `inferSchema` and in `prepareWrite`, however not
    all Parquet access occurs through one of those methods. We add this
    redirection to the constructor to ensure that any instantiation of
    `ParquetFileFormat` triggers log redirection if it hasn't already
    happened

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
