[
https://issues.apache.org/jira/browse/SPARK-6921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509423#comment-14509423
]
zhangxiongfei commented on SPARK-6921:
--------------------------------------
I think the root cause may be the following:
1) When "SQLContext.parquetFile()" is invoked, an instance of the case class "ParquetRelation2" is created:
def parquetFile(paths: String*): DataFrame =
  baseRelationToDataFrame(parquet.ParquetRelation2(paths, Map.empty)(this))
From this point on, the "val sqlContext: SQLContext" parameter of the case class "ParquetRelation2" no longer references the original SQLContext instance that was created in the Spark Shell.
2) Trying to set the Hadoop configuration afterwards:
sc.hadoopConfiguration.setLong("fs.local.block.size", 268435456)
does NOT change the "val sqlContext: SQLContext" field of the "ParquetRelation2" instance. It only changes the original sqlContext.
3) When the current DataFrame is saved as Parquet files, "saveAsParquetFile()" uses the cloned "sqlContext" held as a parameter of "ParquetRelation2" to write the DataFrame, so "fs.local.block.size" is still the default 32M.
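The capture behavior described in steps 1) to 3) can be sketched in plain Scala. Note that HadoopConf and Relation below are hypothetical stand-ins for Hadoop's Configuration and Spark's ParquetRelation2 (not the real classes), used only to illustrate what happens when a configuration is snapshotted at construction time:

```scala
import scala.collection.mutable

// Hypothetical stand-in for Hadoop's Configuration (NOT the real class).
class HadoopConf(init: Map[String, Long]) {
  private val settings = mutable.Map[String, Long]() ++ init
  def setLong(key: String, value: Long): Unit = settings(key) = value
  def getLong(key: String): Long = settings(key)
  // A relation copies the configuration it sees at creation time.
  def snapshot: HadoopConf = new HadoopConf(settings.toMap)
}

// Hypothetical stand-in for ParquetRelation2: it holds the conf captured at load time.
case class Relation(conf: HadoopConf) {
  def writeBlockSize: Long = conf.getLong("fs.local.block.size")
}

val driverConf = new HadoopConf(Map("fs.local.block.size" -> 33554432L)) // 32 MB default
val relation = Relation(driverConf.snapshot) // step 1: relation created when files are loaded

driverConf.setLong("fs.local.block.size", 268435456L) // step 2: set to 256 MB AFTER loading

// step 3: the relation still sees the 32 MB value it captured at load time.
println(relation.writeBlockSize)
```

Under this sketch, taking the snapshot after the setLong call (i.e. setting the configuration before loading) would make the relation see 268435456, which matches the reporter's observation that setting the value before calling parquetFile avoids the problem.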
> Spark SQL API "saveAsParquetFile" will output tachyon file with different block size
> ------------------------------------------------------------------------------------
>
> Key: SPARK-6921
> URL: https://issues.apache.org/jira/browse/SPARK-6921
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.3.0
> Reporter: zhangxiongfei
> Priority: Blocker
>
> I ran the code below in the Spark Shell to access Parquet files in Tachyon.
> 1. First, created a DataFrame by loading a bunch of Parquet files in Tachyon:
> val ta3 = sqlContext.parquetFile("tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m");
> 2. Second, set "fs.local.block.size" to 256M to make sure that the block size of the output files in Tachyon is 256M:
> sc.hadoopConfiguration.setLong("fs.local.block.size", 268435456)
> 3. Third, saved the above DataFrame into Parquet files stored in Tachyon:
> ta3.saveAsParquetFile("tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m-test");
> After the above code ran successfully, the output Parquet files were stored in Tachyon, but the files have different block sizes. Below is the information for those files in the path "tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m-test":
> File Name             Size       Block Size  In-Memory  Pin  Creation Time
> _SUCCESS              0.00 B     256.00 MB   100%       NO   04-13-2015 17:48:23:519
> _common_metadata      1088.00 B  256.00 MB   100%       NO   04-13-2015 17:48:23:741
> _metadata             22.71 KB   256.00 MB   100%       NO   04-13-2015 17:48:23:646
> part-r-00001.parquet  177.19 MB  32.00 MB    100%       NO   04-13-2015 17:46:44:626
> part-r-00002.parquet  177.21 MB  32.00 MB    100%       NO   04-13-2015 17:46:44:636
> part-r-00003.parquet  177.02 MB  32.00 MB    100%       NO   04-13-2015 17:46:45:439
> part-r-00004.parquet  177.21 MB  32.00 MB    100%       NO   04-13-2015 17:46:44:845
> part-r-00005.parquet  177.40 MB  32.00 MB    100%       NO   04-13-2015 17:46:44:638
> part-r-00006.parquet  177.33 MB  32.00 MB    100%       NO   04-13-2015 17:46:44:648
> It seems that the API saveAsParquetFile does not distribute/broadcast the Hadoop configuration to the executors the way other APIs such as saveAsTextFile do. The configuration "fs.local.block.size" only takes effect on the driver.
> If I set that configuration before loading the Parquet files, the problem is gone.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)