[
https://issues.apache.org/jira/browse/SPARK-6921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509423#comment-14509423
]
zhangxiongfei commented on SPARK-6921:
--------------------------------------
I think the root cause may be the following:
1) When "SQLContext.parquetFile()" is invoked, an instance of the case class "ParquetRelation2" is created:
def parquetFile(paths: String*): DataFrame =
  baseRelationToDataFrame(parquet.ParquetRelation2(paths, Map.empty)(this))
From this point on, the "val sqlContext: SQLContext" parameter of the case class "ParquetRelation2" no longer references the original SQLContext instance that was created in the Spark Shell.
2) Trying to set the Hadoop configuration afterwards:
sc.hadoopConfiguration.setLong("fs.local.block.size", 268435456)
does NOT change the "val sqlContext: SQLContext" field of the "ParquetRelation2" instance. It only changes the original sqlContext.
3) When the current DataFrame is saved as Parquet files, "saveAsParquetFile()" uses the cloned "sqlContext" held as a parameter of "ParquetRelation2" to write the DataFrame, so "fs.local.block.size" is still the default 32M.
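The capture behavior described in steps 1) to 3) can be sketched in plain Scala. Note that HadoopConf and Relation below are hypothetical stand-ins for Hadoop's Configuration and Spark's ParquetRelation2 (not the real classes), used only to illustrate what happens when a configuration is snapshotted at construction time:

```scala
import scala.collection.mutable

// Hypothetical stand-in for Hadoop's Configuration (NOT the real class).
class HadoopConf(init: Map[String, Long]) {
  private val settings = mutable.Map[String, Long]() ++ init
  def setLong(key: String, value: Long): Unit = settings(key) = value
  def getLong(key: String): Long = settings(key)
  // A relation copies the configuration it sees at creation time.
  def snapshot: HadoopConf = new HadoopConf(settings.toMap)
}

// Hypothetical stand-in for ParquetRelation2: it holds the conf captured at load time.
case class Relation(conf: HadoopConf) {
  def writeBlockSize: Long = conf.getLong("fs.local.block.size")
}

val driverConf = new HadoopConf(Map("fs.local.block.size" -> 33554432L)) // 32 MB default
val relation = Relation(driverConf.snapshot) // step 1: relation created when files are loaded

driverConf.setLong("fs.local.block.size", 268435456L) // step 2: set to 256 MB AFTER loading

// step 3: the relation still sees the 32 MB value it captured at load time.
println(relation.writeBlockSize)
```

Under this sketch, taking the snapshot after the setLong call (i.e. setting the configuration before loading) would make the relation see 268435456, which matches the reporter's observation that setting the value before calling parquetFile avoids the problem.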
> Spark SQL API "saveAsParquetFile" will output tachyon file with different block size
> ------------------------------------------------------------------------------------
>
> Key: SPARK-6921
> URL: https://issues.apache.org/jira/browse/SPARK-6921
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.3.0
> Reporter: zhangxiongfei
> Priority: Blocker
>
> I ran the code below in the Spark Shell to access Parquet files in Tachyon.
> 1. First, created a DataFrame by loading a bunch of Parquet files in Tachyon:
> val ta3 = sqlContext.parquetFile("tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m");
> 2. Second, set "fs.local.block.size" to 256M to make sure that the block size of the output files in Tachyon is 256M:
> sc.hadoopConfiguration.setLong("fs.local.block.size", 268435456)
> 3. Third, saved the above DataFrame into Parquet files stored in Tachyon:
> ta3.saveAsParquetFile("tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m-test");
> After the above code ran successfully, the output Parquet files were stored in Tachyon, but the files have different block sizes. Below is the information for those files in the path "tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m-test":
> File Name             Size       Block Size  In-Memory  Pin  Creation Time
> _SUCCESS              0.00 B     256.00 MB   100%       NO   04-13-2015 17:48:23:519
> _common_metadata      1088.00 B  256.00 MB   100%       NO   04-13-2015 17:48:23:741
> _metadata             22.71 KB   256.00 MB   100%       NO   04-13-2015 17:48:23:646
> part-r-00001.parquet  177.19 MB  32.00 MB    100%       NO   04-13-2015 17:46:44:626
> part-r-00002.parquet  177.21 MB  32.00 MB    100%       NO   04-13-2015 17:46:44:636
> part-r-00003.parquet  177.02 MB  32.00 MB    100%       NO   04-13-2015 17:46:45:439
> part-r-00004.parquet  177.21 MB  32.00 MB    100%       NO   04-13-2015 17:46:44:845
> part-r-00005.parquet  177.40 MB  32.00 MB    100%       NO   04-13-2015 17:46:44:638
> part-r-00006.parquet  177.33 MB  32.00 MB    100%       NO   04-13-2015 17:46:44:648
> It seems that the API saveAsParquetFile does not distribute/broadcast the Hadoop configuration to the executors the way other APIs such as saveAsTextFile do. The configuration "fs.local.block.size" only takes effect on the driver.
> If I set that configuration before loading the Parquet files, the problem is gone.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)