Hi Kazuaki, It would be really difficult to produce a small stand-alone program that reproduces this problem, because I'm running a big feature-engineering pipeline in which I derive a lot of variables from the existing ones, growing the table many fold. Then, whenever I do any kind of join, this error shows up.
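For reference, here is how the whole-stage-codegen fallback can be requested explicitly. This is a minimal sketch, assuming a PySpark session; the app name is illustrative, and note the exact config key is spark.sql.codegen.wholeStage:

```python
from pyspark.sql import SparkSession

# Build (or fetch) a session with whole-stage codegen disabled, so Spark
# falls back to the non-codegen execution path instead of generating one
# large Java method per stage. App name is illustrative.
spark = (
    SparkSession.builder
    .appName("feature-pipeline")
    .config("spark.sql.codegen.wholeStage", "false")
    .getOrCreate()
)

# The same key can also be flipped at runtime, e.g. right before a big join:
spark.conf.set("spark.sql.codegen.wholeStage", "false")
```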
I tried with spark.sql.codegen.wholeStage=false, but that makes the entire program fail rather than running it with less optimized code. Any suggestion on how I can proceed towards a JIRA entry for this? Thanks, Aakash. On Wed, Jun 20, 2018 at 9:41 PM, Kazuaki Ishizaki <ishiz...@jp.ibm.com> wrote: > Spark 2.3 tries to split large generated Java methods into smaller methods > where possible. However, this report suggests there may remain places that > generate a large method. > > Would it be possible to create a JIRA entry with a small stand-alone > program that can reproduce this problem? It would be very helpful to the > community in addressing it. > > Best regards, > Kazuaki Ishizaki > > > > From: vaquar khan <vaquar.k...@gmail.com> > To: Eyal Zituny <eyal.zit...@equalum.io> > Cc: Aakash Basu <aakash.spark....@gmail.com>, user <user@spark.apache.org> > Date: 2018/06/18 01:57 > Subject: Re: [Help] Codegen Stage grows beyond 64 KB > ------------------------------ > > > > Totally agreed with Eyal. > > The problem is that when the Java programs Catalyst generates for DataFrame and Dataset > queries are compiled into Java bytecode, the bytecode of any single method must stay > under 64 KB. Exceeding that limit of the Java class-file format is what raises the > exception. > > To avoid this exception, Spark's approach is to have Catalyst split methods that would > otherwise compile to 64 KB or more of bytecode into multiple smaller methods when it > generates the Java code. > > You can also use persist or any other logical separation in the pipeline. > > Regards, > Vaquar khan > > On Sun, Jun 17, 2018 at 5:25 AM, Eyal Zituny <eyal.zit...@equalum.io> wrote: > Hi Aakash, > such errors might appear in large Spark pipelines; the root cause is the > 64 KB JVM limit on the bytecode size of a single method.
> The reason your job isn't failing in the end is Spark's fallback: > if codegen fails, the Spark compiler will build the flow without codegen > (less optimized). > If you do not want to see this error, you can either disable codegen > using the flag spark.sql.codegen.wholeStage=false, > or you can try to split your complex pipeline into several Spark flows if > possible. > > Hope that helps, > > Eyal > > On Sun, Jun 17, 2018 at 8:16 AM, Aakash Basu <aakash.spark....@gmail.com> wrote: > Hi, > > I already went through it; that covers one use case. I have a complex and very > big pipeline of multiple jobs under one Spark session, and I can't see how > to solve this, as it is happening in the Logistic Regression and Random > Forest models, which I'm simply using from the Spark ML package rather than > implementing anything myself. > > Thanks, > Aakash. > > On Sun 17 Jun, 2018, 8:21 AM vaquar khan, <vaquar.k...@gmail.com> wrote: > Hi Aakash, > > Please check Stack Overflow: > > https://stackoverflow.com/questions/41098953/codegen-grows-beyond-64-kb-error-when-normalizing-large-pyspark-dataframe > > Regards, > Vaquar khan > > On Sat, Jun 16, 2018 at 3:27 PM, Aakash Basu <aakash.spark....@gmail.com> wrote: > Hi guys, > > I'm getting an error while feature engineering on 30+ columns to create > about 200+ columns. It is not failing the job, but the ERROR shows. I want > to know how I can avoid this. > > Spark - 2.3.1 > Python - 3.6 > > Cluster config - > 1 master - 32 GB RAM, 16 cores > 4 slaves - 16 GB RAM, 8 cores > > Input data - 8 partitions of a Parquet file with Snappy compression.
> > My spark-submit ->
>
> spark-submit --master spark://192.168.60.20:7077 --num-executors 4 --executor-cores 5 --executor-memory 10G --driver-cores 5 --driver-memory 25G --conf spark.sql.shuffle.partitions=60 --conf spark.driver.maxResultSize=2G --conf "spark.executor.extraJavaOptions=-XX:+UseParallelGC" --conf spark.scheduler.listenerbus.eventqueue.capacity=20000 --conf spark.sql.codegen=true /appdata/bblite-codebase/pipeline_data_test_run.py > /appdata/bblite-data/logs/log_10_iter_pipeline_8_partitions_33_col.txt
>
> Stack trace below -
>
> ERROR CodeGenerator:91 - failed to compile: org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": Code of method "processNext()V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3426" grows beyond 64 KB
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": Code of method "processNext()V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3426" grows beyond 64 KB
> at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:361)
> at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:234)
> at org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:446)
> at org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313)
> at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235)
> at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:204)
> at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80)
> at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1417)
> at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1493)
> at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1490)
> at org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
> at org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
> at org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
> at org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
> at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000)
> at org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
> at org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
> at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1365)
> at org.apache.spark.sql.execution.WholeStageCodegenExec.liftedTree1$1(WholeStageCodegenExec.scala:579)
> at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:578)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
> at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.prepareShuffleDependency(ShuffleExchangeExec.scala:92)
> at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$$anonfun$doExecute$1.apply(ShuffleExchangeExec.scala:128)
> at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$$anonfun$doExecute$1.apply(ShuffleExchangeExec.scala:119)
> at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
> at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.doExecute(ShuffleExchangeExec.scala:119)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
> at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:371)
> at org.apache.spark.sql.execution.SortExec.inputRDDs(SortExec.scala:121)
> at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:605)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
> at org.apache.spark.sql.execution.joins.SortMergeJoinExec.doExecute(SortMergeJoinExec.scala:150)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
> at org.apache.spark.sql.execution.ProjectExec.doExecute(basicPhysicalOperators.scala:70)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
> at org.apache.spark.sql.execution.joins.SortMergeJoinExec.doExecute(SortMergeJoinExec.scala:150)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
> at org.apache.spark.sql.execution.ProjectExec.doExecute(basicPhysicalOperators.scala:70)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
> at org.apache.spark.sql.execution.columnar.InMemoryRelation.buildBuffers(InMemoryRelation.scala:107)
> at org.apache.spark.sql.execution.columnar.InMemoryRelation.<init>(InMemoryRelation.scala:102)
> at org.apache.spark.sql.execution.columnar.InMemoryRelation$.apply(InMemoryRelation.scala:43)
> at org.apache.spark.sql.execution.CacheManager$$anonfun$cacheQuery$1.apply(CacheManager.scala:97)
> at org.apache.spark.sql.execution.CacheManager.writeLock(CacheManager.scala:67)
> at org.apache.spark.sql.execution.CacheManager.cacheQuery(CacheManager.scala:91)
> at org.apache.spark.sql.Dataset.persist(Dataset.scala:2924)
> at sun.reflect.GeneratedMethodAccessor78.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:282)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:238)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: org.codehaus.janino.InternalCompilerException: Code of method "processNext()V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3426" grows beyond 64 KB
>
> Thanks,
> Aakash.
>
> --
> Regards,
> Vaquar Khan
> +1 -224-436-0783
> Greater Chicago
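One way to act on Eyal's suggestion of splitting the pipeline: instead of adding all 200+ derived columns in a single plan (which can end up in one oversized generated method), add them in batches and persist between batches so each batch compiles as its own, smaller codegen stage. A minimal sketch; the batched helper is plain Python, and the PySpark usage in the comments is an assumption about the pipeline, not code from this thread:

```python
def batched(items, size):
    """Yield successive fixed-size batches from a list, e.g. a list of
    (column_name, column_expression) pairs for derived features."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Hypothetical usage inside a PySpark pipeline (names are illustrative):
#
#   for batch in batched(derived_columns, 40):
#       for name, expr in batch:
#           df = df.withColumn(name, expr)
#       # persist() cuts the lineage at this point, so each batch of
#       # derived columns is compiled as its own, smaller generated method
#       df = df.persist()
```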