Hi Shuai,

Disclaimer: I'm not a spark guru, and what's written below are some notes I took when reading spark source code, so I could be wrong, in which case I'd appreciate a lot if someone could correct me.
(Yes, I did copy your disclaimer since it applies to me too. Sorry for the duplication :))

I'd say that the description is very well-written and clear. I'd only add that:

1. CodegenSupport allows custom implementations to optionally disable codegen using the supportCodegen predicate (which is enabled by default, i.e. true).

2. CollapseCodegenStages is a Rule[SparkPlan], i.e. a transformation of a SparkPlan into another SparkPlan, that searches for sub-plans (aka stages) that support codegen and collapses them together as a WholeStageCodegen for which supportCodegen is enabled.

3. It is assumed that all Expression instances except CodegenFallback support codegen.

4. CollapseCodegenStages uses the internal setting spark.sql.codegen.maxFields (default: 200) to control the number of fields in the input and output schemas before deactivating whole-stage codegen. See https://issues.apache.org/jira/browse/SPARK-14554. NOTE: The magic number 200 (!) again. I asked about it a few days ago in http://stackoverflow.com/questions/41359344/why-is-the-number-of-partitions-after-groupby-200

5. There are side-effecting logical commands, executed only for their side effects, that are translated to ExecutedCommandExec in the BasicOperators strategy and won't take part in codegen.

Thanks for sharing your notes! Gonna merge yours with mine! Thanks.

Pozdrawiam,
Jacek Laskowski
----
https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Mon, Jan 2, 2017 at 6:30 PM, Shuai Lin <linshuai2...@gmail.com> wrote:
> Disclaimer: I'm not a spark guru, and what's written below are some notes I
> took when reading spark source code, so I could be wrong, in which case I'd
> appreciate a lot if someone could correct me.
>
>> Let me rephrase this. How does the SparkSQL engine call the codegen APIs
>> to do the job of producing RDDs?
>
> IIUC, physical operators like `ProjectExec` implement doProduce/doConsume
> to support codegen. When whole-stage codegen is enabled, a subtree is
> collapsed into a WholeStageCodegenExec wrapper tree, and the root node of
> the wrapper tree calls the doProduce/doConsume methods of each operator to
> generate the Java source code, which is then compiled into Java bytecode
> by janino.
>
> In contrast, when whole-stage codegen is disabled (e.g. by passing "--conf
> spark.sql.codegen.wholeStage=false" to spark-submit), the doExecute method
> of each physical operator is called instead, so no code generation happens.
>
> Producing the RDDs is a post-order evaluation of the SparkPlan tree. The
> leaf nodes are data sources: a file-based HadoopFsRelation, an external
> data source like JdbcRelation, or an in-memory LocalRelation created by
> "spark.range(100)". Either way, the leaf nodes can produce rows on their
> own. The evaluation then proceeds bottom-up, applying filter/limit/project
> etc. along the way, calling either the generated code or the various
> doExecute methods, depending on whether codegen is enabled (the default)
> or not.
>
>> What are those eval methods in Expressions for, given there's already a
>> doGenCode next to it?
>
> AFAIK the `eval` method of Expression is used to do static evaluation when
> the expression is foldable, e.g.:
>
>     select map('a', 1, 'b', 2, 'a', 3) as m
>
> Regards,
> Shuai
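On that last point about eval: here's a tiny illustration of the static-evaluation path, straight from my notes, so the usual disclaimer applies (it assumes the Spark 2.x Catalyst classes in org.apache.spark.sql.catalyst.expressions):

    import org.apache.spark.sql.catalyst.expressions.{Add, Literal}

    // Both children are literals, so the whole expression is foldable...
    val e = Add(Literal(1), Literal(2))
    assert(e.foldable)

    // ...which is what lets the optimizer evaluate it statically via eval(),
    // without generating or compiling any code.
    println(e.eval())   // 3

I'd guess the same foldable/eval machinery is what constant-folds the map('a', 1, 'b', 2, 'a', 3) example above during optimization, long before any physical planning or codegen happens.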
> On Wed, Dec 28, 2016 at 1:05 PM, dragonly <liyilon...@gmail.com> wrote:
>>
>> Thanks for your reply!
>>
>> Here's my *understanding*: basic types that ScalaReflection understands
>> are encoded into the Tungsten binary format, while UDTs are encoded into
>> GenericInternalRow, which stores the JVM objects in an Array[Any] under
>> the hood, and thus lose the memory-footprint and CPU-cache efficiency
>> provided by the Tungsten encoding.
>>
>> If the above is correct, then here are my *further questions*: Are
>> SparkPlan nodes (those ending with Exec) all code-generated before
>> actually running the toRdd logic? I know there are some non-codegenable
>> nodes which implement the trait CodegenFallback, but there's also a
>> doGenCode method in that trait, so the actual calling convention really
>> puzzles me. I've tried to trace the calling flow for a few days but found
>> it scattered everywhere; I cannot make a big graph of the method calling
>> order even with the help of IntelliJ.
>>
>> Let me rephrase this. How does the SparkSQL engine call the codegen APIs
>> to do the job of producing RDDs? What are those eval methods in
>> Expressions for, given there's already a doGenCode next to it?
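PS. dragonly, since the calling flow is hard to trace statically, below is the snippet I use to see whether whole-stage codegen actually kicked in for a query and what janino ends up compiling. It's only a sketch from my notes (it assumes Spark 2.x and a SparkSession named spark), so please double-check it against your version:

    import org.apache.spark.sql.execution.WholeStageCodegenExec
    import org.apache.spark.sql.execution.debug._   // adds debugCodegen to Dataset

    val q = spark.range(100).filter("id > 10").selectExpr("id * 2 as doubled")

    // With spark.sql.codegen.wholeStage=true (the default), the physical plan
    // gets wrapped in WholeStageCodegenExec nodes, i.e. the collapsed "stages":
    q.queryExecution.executedPlan.collect { case w: WholeStageCodegenExec => w }
      .foreach(println)

    // debugCodegen prints the Java source that janino compiles for each stage:
    q.debugCodegen()

Re-run it with spark.sql.codegen.wholeStage=false and the WholeStageCodegenExec wrappers disappear, i.e. the plan falls back to each operator's doExecute, which matches Shuai's description above.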