[
https://issues.apache.org/jira/browse/SPARK-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Cheng Lian updated SPARK-7447:
------------------------------
Assignee: Liang-Chi Hsieh
> Large Job submission lag when using Parquet w/ Schema Merging
> -------------------------------------------------------------
>
> Key: SPARK-7447
> URL: https://issues.apache.org/jira/browse/SPARK-7447
> Project: Spark
> Issue Type: Bug
> Components: PySpark, Spark Core, Spark Submit
> Affects Versions: 1.3.0, 1.3.1
> Environment: Spark 1.3.1, aws, persistent hdfs version 2 with ebs
> storage, pyspark, 8 x c3.8xlarge nodes.
> spark-conf
> spark.executor.memory 50g
> spark.driver.cores 32
> spark.driver.memory 50g
> spark.default.parallelism 512
> spark.sql.shuffle.partitions 512
> spark.task.maxFailures 30
> spark.executor.logs.rolling.maxRetainedFiles 2
> spark.executor.logs.rolling.size.maxBytes 102400
> spark.executor.logs.rolling.strategy size
> spark.shuffle.spill false
> spark.sql.parquet.cacheMetadata true
> spark.sql.parquet.filterPushdown true
> spark.sql.codegen true
> spark.akka.threads = 64
> Reporter: Brad Willard
> Assignee: Liang-Chi Hsieh
> Fix For: 1.4.0
>
>
> I have 2.6 billion rows in parquet format and I'm trying to use the new
> schema merging feature (I was enforcing a consistent schema manually before
> in 0.8-1.2 which was annoying).
> I have approximately 200 parquet files with key=<date>. When I load the
> dataframe with the sqlcontext, that process is understandably slow because I
> assume it's reading all the metadata from the parquet files and doing the
> initial schema merging. So that's ok.
> However, the problem I have is that once I have the dataframe, any
> operation on it seems to have a 10-30 second lag before it actually starts
> processing the job and shows up as an Active Job in the Spark Manager. This
> was an instant operation in all previous versions of Spark.
> Once the job is actually running the performance is fantastic; however, this
> job submission lag is horrible.
> I'm wondering if there is a bug with recomputing the schema merging. Running
> top on the master node shows a single thread maxing out one CPU during the
> lag, which makes me think it's not network I/O but some pre-processing
> happening before job submission.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)