zhangxiongfei created SPARK-6921:
------------------------------------

             Summary: Spark SQL API "saveAsParquetFile" outputs Tachyon files with a different block size
                 Key: SPARK-6921
                 URL: https://issues.apache.org/jira/browse/SPARK-6921
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.3.0
            Reporter: zhangxiongfei
            Priority: Blocker


I ran the code below in the Spark shell to access Parquet files in Tachyon.
  1. First, created a DataFrame by loading a bunch of Parquet files from Tachyon:
     val ta3 = sqlContext.parquetFile("tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m")
  2. Second, set "fs.local.block.size" to 256 MB (268435456 bytes) to make sure the block size of the output files in Tachyon is 256 MB:
     sc.hadoopConfiguration.setLong("fs.local.block.size", 268435456)
  3. Third, saved the above DataFrame as Parquet files stored in Tachyon:
     ta3.saveAsParquetFile("tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m-test")
 After the above code ran successfully, the output Parquet files were stored in Tachyon, but the files have different block sizes. Below is the information for those files in the path "tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m-test":
      File Name                 Size         Block Size   In-Memory   Pin   Creation Time
      _SUCCESS                  0.00 B       256.00 MB    100%        NO    04-13-2015 17:48:23:519
      _common_metadata          1088.00 B    256.00 MB    100%        NO    04-13-2015 17:48:23:741
      _metadata                 22.71 KB     256.00 MB    100%        NO    04-13-2015 17:48:23:646
      part-r-00001.parquet      177.19 MB    32.00 MB     100%        NO    04-13-2015 17:46:44:626
      part-r-00002.parquet      177.21 MB    32.00 MB     100%        NO    04-13-2015 17:46:44:636
      part-r-00003.parquet      177.02 MB    32.00 MB     100%        NO    04-13-2015 17:46:45:439
      part-r-00004.parquet      177.21 MB    32.00 MB     100%        NO    04-13-2015 17:46:44:845
      part-r-00005.parquet      177.40 MB    32.00 MB     100%        NO    04-13-2015 17:46:44:638
      part-r-00006.parquet      177.33 MB    32.00 MB     100%        NO    04-13-2015 17:46:44:648

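 For reference, the block sizes above can also be read back from the Spark shell through the Hadoop FileSystem API. A minimal sketch (not part of the original report; it assumes the Tachyon client is on the classpath so the tachyon:// scheme resolves):

     // Sketch only: list the output files and print the block size the filesystem
     // reports for each one, to cross-check the table above.
     import org.apache.hadoop.fs.Path

     val outPath = new Path("tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m-test")
     val fs = outPath.getFileSystem(sc.hadoopConfiguration)
     fs.listStatus(outPath).foreach { st =>
       // getBlockSize is the per-file block size in bytes; getLen is the file length
       println(s"${st.getPath.getName}  blockSize=${st.getBlockSize / (1024 * 1024)} MB  length=${st.getLen} B")
     }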
 It seems that the API saveAsParquetFile does not distribute/broadcast the Hadoop configuration to the executors the way other APIs such as saveAsTextFile do. The configuration "fs.local.block.size" only takes effect on the driver.
 If I set that configuration before loading the Parquet files, the problem is gone.
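 In other words, reordering the same three steps so that the configuration is set first works as expected (same paths and value as above; a sketch of the workaround, not a fix):

     // Workaround sketch: set the block size on the driver's Hadoop configuration
     // *before* the Parquet input is first loaded, then write the output as before.
     sc.hadoopConfiguration.setLong("fs.local.block.size", 268435456)  // 256 MB

     val ta3 = sqlContext.parquetFile("tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m")
     ta3.saveAsParquetFile("tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m-test")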


