[ 
https://issues.apache.org/jira/browse/KYLIN-3462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shaofeng SHI updated KYLIN-3462:
--------------------------------
    Description: 
In a comparison between Spark and MR cubing, I noticed that the cuboid files 
generated by the Spark engine are 3x larger than those generated by MR, and 
take 4x more disk space on HDFS than MR.

The reason is that "dfs.replication=2" does not take effect when Spark saves 
to HDFS, and Spark applies no compression by default.

The converted HFiles are the same size and the query results are the same, so 
this difference may easily be overlooked.

  was:
In a comparison between Spark and MR cubing, I noticed that the cuboid files 
generated by the Spark engine are 3x larger than those generated by MR, and 
take 4x more disk space on HDFS than MR.

The reason is that "dfs.replication=2" does not take effect when Spark saves 
to HDFS, and no compression is applied by default.


> "dfs.replication=2" and compression not work in Spark cube engine
> -----------------------------------------------------------------
>
>                 Key: KYLIN-3462
>                 URL: https://issues.apache.org/jira/browse/KYLIN-3462
>             Project: Kylin
>          Issue Type: Bug
>          Components: Spark Engine
>    Affects Versions: v2.3.0, v2.3.1, v2.4.0
>            Reporter: Shaofeng SHI
>            Priority: Major
>         Attachments: cuboid_generated_by_mr.png, cuboid_generated_by_spark.png
>
>
> In a comparison between Spark and MR cubing, I noticed that the cuboid files 
> generated by the Spark engine are 3x larger than those generated by MR, and 
> take 4x more disk space on HDFS than MR.
>
> The reason is that "dfs.replication=2" does not take effect when Spark saves 
> to HDFS, and Spark applies no compression by default.
>
> The converted HFiles are the same size and the query results are the same, so 
> this difference may easily be overlooked.
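
Both symptoms described above can probably be addressed through configuration. 
The fragment below is a minimal sketch, not the fix confirmed by this issue. 
It assumes Kylin's "kylin.engine.spark-conf." override prefix in 
kylin.properties and Spark's "spark.hadoop." prefix, which copies a property 
into the Hadoop configuration used by the job. The codec is an illustrative 
choice only.

```properties
# Assumption: forward the HDFS replication factor to Spark's Hadoop configuration
kylin.engine.spark-conf.spark.hadoop.dfs.replication=2

# Assumption: enable compression for files written through Hadoop output formats
kylin.engine.spark-conf.spark.hadoop.mapreduce.output.fileoutputformat.compress=true
kylin.engine.spark-conf.spark.hadoop.mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.DefaultCodec
```

If settings like these are applied, the replication factor and size of the 
cuboid files on HDFS could be compared again with the MR output to check 
whether the gap closes.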



