[ 
https://issues.apache.org/jira/browse/SPARK-23371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16373862#comment-16373862
 ] 

pin_zhang commented on SPARK-23371:
-----------------------------------

 # Spark bundles two versions of the Parquet jars (1.6 and 1.8) on its classpath.
 # With the steps in the description, the data is written with Parquet 1.6 and read with Parquet 1.8.
 # Parquet 1.6 writes a wrong footer in Spark because it cannot load its version info on Windows.
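
To illustrate the third point, here is a minimal sketch of why a created_by value of just "parquet-mr" is rejected by the reader. It is not the parquet-mr code itself. The class name CreatedByCheck and the build hash abc123 are made up for the example. The pattern follows the format printed in the exception message, with the assumption that the parentheses around the build part are literal (they appear unescaped in the quoted trace).

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of parquet-mr.
class CreatedByCheck {
    // Assumed form of the format from the exception message:
    // "<app> version <version> (build <hash>)", with literal parentheses.
    static final Pattern FORMAT =
        Pattern.compile("(.+) version ((.*) )?\\(build ?(.*)\\)");

    // True when the whole created_by string fits the expected format.
    static boolean parses(String createdBy) {
        return FORMAT.matcher(createdBy).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "parquet-mr",                              // value seen in the trace
            "parquet-mr version 1.8.1 (build abc123)"  // made-up well-formed value
        };
        for (String s : samples) {
            System.out.println(s + " -> parses: " + parses(s));
        }
    }
}
```

A footer written without the " version ..." suffix fails this check, which matches the WARN message in the description where the reader ignores the statistics.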


> Parquet Footer data is wrong on window in parquet format partition table 
> -------------------------------------------------------------------------
>
>                 Key: SPARK-23371
>                 URL: https://issues.apache.org/jira/browse/SPARK-23371
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.1, 2.1.2
>            Reporter: pin_zhang
>            Priority: Major
>
> On Windows, run the following SQL in spark-shell:
>  spark.sql("create table part_test (id string )partitioned by( index int) stored as parquet")
>  spark.sql("insert into part_test partition (index =1) values ('1')")
> An exception is raised when querying with spark.sql("select * from part_test ").show()
> The parquet.Version class in parquet-hadoop-bundle-1.6.0.jar cannot load its
> version info in Spark on Windows. The classloader tries to get the version from
> parquet-format-2.3.0-incubating.jar.
> 18/02/09 16:58:48 WARN CorruptStatistics: Ignoring statistics because created_by could not be parsed (see PARQUET-251): parquet-mr
>  org.apache.parquet.VersionParser$VersionParseException: Could not parse created_by: parquet-mr using format: (.+) version ((.*) )?(build ?(.*))
>  at org.apache.parquet.VersionParser.parse(VersionParser.java:112)
>  at org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:60)
>  at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:263)
>  at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:583)
>  at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:513)
>  at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:270)
>  at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:225)
>  at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137)
>  at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
>  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:184)
>  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
>  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown Source)
>  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>  at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>  at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
>  at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
>  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>  at org.apache.spark.scheduler.Task.run(Task.scala:99)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325)
>  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
