@Jacek The maximum of 200 fields for whole-stage code generation was chosen to prevent the generated method from exceeding the JVM's 64KB bytecode limit per method. There is absolutely no relation between this value and the number of partitions after a shuffle (if there were, they would have used the same configuration).
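
For anyone who wants to see the cutoff in action, here is a minimal sketch, assuming a Spark 2.x spark-shell. The threshold of 5 and the 10-column schema are arbitrary values chosen only to trigger the fallback, not recommended settings:

    // Lower the whole-stage codegen field threshold so that even a
    // modestly wide projection falls back to the non-codegen path.
    spark.conf.set("spark.sql.codegen.maxFields", "5")

    // 10 output fields > 5, so this plan should no longer be wrapped
    // in a WholeStageCodegen node; compare explain() output before and
    // after setting the conf.
    val wide = spark.range(10).selectExpr((0 until 10).map(i => s"id + $i AS c$i"): _*)
    wide.explain()

With the default of 200, the same query would of course stay on the codegen path.
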
On Tue, Jan 3, 2017 at 1:55 PM, Jacek Laskowski <ja...@japila.pl> wrote:

> Hi Shuai,
>
> Disclaimer: I'm not a spark guru, and what's written below are some
> notes I took when reading spark source code, so I could be wrong, in
> which case I'd appreciate it a lot if someone could correct me.
>
> (Yes, I did copy your disclaimer since it applies to me too. Sorry for
> the duplication :))
>
> I'd say that the description is very well written and clear. I'd only
> add that:
>
> 1. CodegenSupport allows custom implementations to optionally disable
> codegen using the supportCodegen predicate (which is enabled, i.e.
> true, by default).
> 2. CollapseCodegenStages is a Rule[SparkPlan], i.e. a transformation
> of a SparkPlan into another SparkPlan, that searches for sub-plans
> (aka stages) that support codegen and collapses them together into a
> WholeStageCodegen for which supportCodegen is enabled.
> 3. It is assumed that all Expression instances except CodegenFallback
> support codegen.
> 4. CollapseCodegenStages uses the internal setting
> spark.sql.codegen.maxFields (default: 200) to control the number of
> fields in input and output schemas before deactivating whole-stage
> codegen. See https://issues.apache.org/jira/browse/SPARK-14554.
>
> NOTE: The magic number 200 (!) again. I asked about it a few days ago
> in http://stackoverflow.com/questions/41359344/why-is-the-number-of-partitions-after-groupby-200
>
> 5. There are side-effecting logical commands, executed for their side
> effects, that are translated to ExecutedCommandExec in the
> BasicOperators strategy and won't take part in codegen.
>
> Thanks for sharing your notes! Gonna merge yours with mine! Thanks.
>
> Pozdrawiam,
> Jacek Laskowski
> ----
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Mon, Jan 2, 2017 at 6:30 PM, Shuai Lin <linshuai2...@gmail.com> wrote:
> > Disclaimer: I'm not a spark guru, and what's written below are some
> > notes I took when reading spark source code, so I could be wrong, in
> > which case I'd appreciate it a lot if someone could correct me.
> >
> > > Let me rephrase this. How does the SparkSQL engine call the
> > > codegen APIs to do the job of producing RDDs?
> >
> > IIUC, physical operators like `ProjectExec` implement
> > doProduce/doConsume to support codegen, and when whole-stage codegen
> > is enabled, a subtree would be collapsed into a WholeStageCodegenExec
> > wrapper tree, and the root node of the wrapper tree would call the
> > doProduce/doConsume methods of each operator to generate the java
> > source code, which is then compiled into java byte code by janino.
> >
> > In contrast, when whole-stage codegen is disabled (e.g. by passing
> > "--conf spark.sql.codegen.wholeStage=false" to spark-submit), the
> > doExecute methods of the physical operators are called, so no code
> > generation happens.
> >
> > The producing of the RDDs is some post-order SparkPlan tree
> > evaluation. The leaf node would be some data source: either some
> > file-based HadoopFsRelation, or some external data source like
> > JdbcRelation, or an in-memory LocalRelation created by
> > "spark.range(100)". In short, the leaf nodes can produce rows on
> > their own. Then the evaluation goes in a bottom-up manner, applying
> > filter/limit/project etc. along the way. The generated code or the
> > various doExecute methods would be called, depending on whether
> > codegen is enabled (the default) or not.
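
A minimal sketch of the two paths just described, assuming a Spark 2.x spark-shell; the query itself is an arbitrary example, and debugCodegen comes from the org.apache.spark.sql.execution.debug helpers:

    import org.apache.spark.sql.execution.debug._

    val q = spark.range(100).filter("id > 10").selectExpr("id * 2 AS doubled")
    q.explain()       // codegen'd operators show up with a '*' prefix
    q.debugCodegen()  // prints the Java source generated per codegen subtree

    // Disable whole-stage codegen and build a fresh Dataset (the plan
    // is cached per Dataset); the new plan goes through the operators'
    // doExecute methods instead, and no Java source is generated.
    spark.conf.set("spark.sql.codegen.wholeStage", "false")
    spark.range(100).filter("id > 10").selectExpr("id * 2 AS doubled").explain()
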
> >
> > > What are those eval methods in Expressions for, given there's
> > > already a doGenCode next to it?
> >
> > AFAIK the `eval` method of Expression is used to do static
> > evaluation when the expression is foldable, e.g.:
> >
> >     select map('a', 1, 'b', 2, 'a', 3) as m
> >
> > Regards,
> > Shuai
> >
> > On Wed, Dec 28, 2016 at 1:05 PM, dragonly <liyilon...@gmail.com> wrote:
> > >
> > > Thanks for your reply!
> > >
> > > Here's my *understanding*:
> > > basic types that ScalaReflection understands are encoded into the
> > > tungsten binary format, while UDTs are encoded into
> > > GenericInternalRow, which stores the JVM objects in an Array[Any]
> > > under the hood, and thus lose the memory footprint efficiency and
> > > cpu cache efficiency provided by tungsten encoding.
> > >
> > > If the above is correct, then here are my *further questions*:
> > > Are SparkPlan nodes (those ending with Exec) all code-generated
> > > before actually running the toRdd logic? I know there are some
> > > non-codegenable nodes which implement trait CodegenFallback, but
> > > there's also a doGenCode method in the trait, so the actual
> > > calling convention really puzzles me. And I've tried to trace the
> > > calling flow for a few days but found it scattered everywhere. I
> > > cannot make a big graph of the method calling order even with the
> > > help of IntelliJ.
> > >
> > > Let me rephrase this. How does the SparkSQL engine call the
> > > codegen APIs to do the job of producing RDDs? What are those eval
> > > methods in Expressions for, given there's already a doGenCode next
> > > to it?
> > >
> > > --
> > > View this message in context:
> > > http://apache-spark-developers-list.1001551.n3.nabble.com/What-is-mainly-different-from-a-UDT-and-a-spark-internal-type-that-ExpressionEncoder-recognized-tp20370p20376.html

--
Herman van Hövell
Software Engineer
Databricks Inc.
hvanhov...@databricks.com
+31 6 420 590 27
databricks.com