[ 
https://issues.apache.org/jira/browse/PARQUET-241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978112#comment-14978112
 ] 

Stavros Kontopoulos commented on PARQUET-241:
---------------------------------------------

I suspected that. Is there a spec that covers these semantics for Parquet?

By the way, I built Parquet 1.7.0 with the patch and it didn't seem to have 
any effect at the Spark level (I also built Spark 1.5.1 against the new 
library). Does listStatus return the correct order?
I am checking with this code snippet:
// Run in spark-shell, where sc and sqlContext are already defined.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._
import sqlContext.implicits._

sc.makeRDD(1 to 100).sum
case class Record(key: Int, value: String)

// Build a 100-row DataFrame with keys in ascending order.
val df = sc.parallelize((1 to 100).map(i => Record(i, s"val_$i"))).toDF()
df.registerTempTable("records")

// Print the rows before writing, to see the original order.
sqlContext.sql("SELECT * FROM records").collect().foreach(println)
df.write.parquet("hdfs://localhost:8020/p2.parquet")

// Read the data back from Parquet and print it, to compare the order.
val parquetFile = sqlContext.read.parquet("hdfs://localhost:8020/p2.parquet")
parquetFile.collect().foreach(println)

If you do:

...saveAsTextFile("hdfs://localhost:8020/result3.txt")

...sc.textFile("hdfs://localhost:8020/result3.txt")

and print the results, then the order is preserved. I was looking at the Spark 
code too, and I was not expecting much difference.
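
Rather than comparing the printed rows by eye, the order could be checked 
programmatically. This is only a minimal sketch: firstOutOfOrder is a 
hypothetical helper, not part of Spark or Parquet, and it assumes the keys 
were written in ascending order, as in the snippet above.

```scala
// Hypothetical helper (not a Spark or Parquet API): returns the index of the
// first element that is smaller than its predecessor, or None if the
// sequence is in ascending order.
def firstOutOfOrder(keys: Seq[Int]): Option[Int] =
  keys.zip(keys.drop(1)).indexWhere { case (a, b) => a > b } match {
    case -1 => None
    case i  => Some(i + 1)
  }

// In spark-shell, assuming column 0 of parquetFile is the Int `key` column:
//   firstOutOfOrder(parquetFile.collect().map(_.getInt(0)))
```

A result of None would mean the rows came back in the order they were 
written; Some(i) would point at the first row that is out of place.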

Thanks!

> ParquetInputFormat.getFooters() should return in the same order as what 
> listStatus() returns
> --------------------------------------------------------------------------------------------
>
>                 Key: PARQUET-241
>                 URL: https://issues.apache.org/jira/browse/PARQUET-241
>             Project: Parquet
>          Issue Type: Bug
>    Affects Versions: 1.6.0
>            Reporter: Mingyu Kim
>
> Because of how the footer cache is implemented, getFooters() returns the 
> footers in a different order than what listStatus() returns.
> When I provided the URL 
> "hdfs://.../part-00001.parquet,hdfs://.../part-00002.parquet,hdfs://.../part-00003.parquet",
>  ParquetInputFormat.getSplits(), which internally calls getFooters(), 
> returned the splits in the wrong order.
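
The behaviour the issue asks for can be illustrated with a small sketch. This 
is not the actual patch: FooterStub and inListingOrder are hypothetical names, 
and the bare file names stand in for the full paths. It only shows one way to 
put footers that come back in arbitrary order (for example, from a cache) 
into the order of the original listing.

```scala
// Hypothetical stand-in for a Parquet footer; only the file path matters here.
case class FooterStub(path: String)

// Rearranges `footers` to follow the order of `listed`.
// Listed paths with no matching footer are skipped.
def inListingOrder(listed: Seq[String], footers: Seq[FooterStub]): Seq[FooterStub] = {
  val byPath = footers.map(f => f.path -> f).toMap
  listed.flatMap(byPath.get)
}

// Illustrative input: the listing order, and footers returned out of order.
val listed = Seq("part-00001.parquet", "part-00002.parquet", "part-00003.parquet")
val fromCache = Seq(
  FooterStub("part-00003.parquet"),
  FooterStub("part-00001.parquet"),
  FooterStub("part-00002.parquet"))

val ordered = inListingOrder(listed, fromCache)
```

Looking footers up by path keeps the reordering independent of whatever 
order the cache happens to return.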



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
