Hi Shuai,

Disclaimer: I'm not a spark guru, and what's written below are some notes I took when reading spark source code, so I could be wrong, in which case I'd appreciate a lot if someone could correct me.
(Yes, I did copy your disclaimer since it applies to me too. Sorry for the duplication :))

I'd say that the description is very well-written and clear. I'd only add that:

1. CodegenSupport allows custom implementations to optionally disable codegen using the supportCodegen predicate (which is enabled by default, i.e. true).

2. CollapseCodegenStages is a Rule[SparkPlan], i.e. a transformation of a SparkPlan into another SparkPlan, that searches for sub-plans (aka stages) that support codegen and collapses them together as a WholeStageCodegen for which supportCodegen is enabled.

3. It is assumed that all Expression instances except CodegenFallback support codegen.

4. CollapseCodegenStages uses the internal setting spark.sql.codegen.maxFields (default: 200) to control the number of fields in the input and output schemas before deactivating whole-stage codegen. See https://issues.apache.org/jira/browse/SPARK-14554. NOTE: The magic number 200 (!) again. I asked about it a few days ago in http://stackoverflow.com/questions/41359344/why-is-the-number-of-partitions-after-groupby-200

5. There are side-effecting logical commands, executed only for their side effects, that are translated to ExecutedCommandExec in the BasicOperators strategy and won't take part in codegen.

Thanks for sharing your notes! Gonna merge yours with mine! Thanks.

Pozdrawiam,
Jacek Laskowski
----
https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Mon, Jan 2, 2017 at 6:30 PM, Shuai Lin <linshuai2...@gmail.com> wrote:
> Disclaimer: I'm not a spark guru, and what's written below are some notes I
> took when reading spark source code, so I could be wrong, in which case I'd
> appreciate a lot if someone could correct me.
>
>> Let me rephrase this. How does the SparkSQL engine call the codegen APIs
>> to do the job of producing RDDs?
>
> IIUC, physical operators like `ProjectExec` implement doProduce/doConsume
> to support codegen. When whole-stage codegen is enabled, a subtree is
> collapsed into a WholeStageCodegenExec wrapper tree, and the root node of
> the wrapper tree calls the doProduce/doConsume methods of each operator to
> generate the Java source code, which is then compiled into Java bytecode
> by janino.
>
> In contrast, when whole-stage codegen is disabled (e.g. by passing "--conf
> spark.sql.codegen.wholeStage=false" to spark-submit), the doExecute method
> of each physical operator is called instead, so no code generation happens.
>
> Producing the RDDs is a post-order evaluation of the SparkPlan tree. The
> leaf nodes are data sources: a file-based HadoopFsRelation, an external
> data source like JdbcRelation, or an in-memory LocalRelation created by
> "spark.range(100)". Either way, the leaf nodes can produce rows on their
> own. The evaluation then proceeds bottom-up, applying filter/limit/project
> etc. along the way, calling either the generated code or the various
> doExecute methods, depending on whether codegen is enabled (the default)
> or not.
>
>> What are those eval methods in Expressions for, given there's already a
>> doGenCode next to it?
>
> AFAIK the `eval` method of Expression is used to do static evaluation when
> the expression is foldable, e.g.:
>
>     select map('a', 1, 'b', 2, 'a', 3) as m
>
> Regards,
> Shuai
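On that last point about eval: here's a tiny illustration of the static-evaluation path, straight from my notes, so the usual disclaimer applies (it assumes the Spark 2.x Catalyst classes in org.apache.spark.sql.catalyst.expressions):

    import org.apache.spark.sql.catalyst.expressions.{Add, Literal}

    // Both children are literals, so the whole expression is foldable...
    val e = Add(Literal(1), Literal(2))
    assert(e.foldable)

    // ...which is what lets the optimizer evaluate it statically via eval(),
    // without generating or compiling any code.
    println(e.eval())   // 3

I'd guess the same foldable/eval machinery is what constant-folds the map('a', 1, 'b', 2, 'a', 3) example above during optimization, long before any physical planning or codegen happens.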
> On Wed, Dec 28, 2016 at 1:05 PM, dragonly <liyilon...@gmail.com> wrote:
>>
>> Thanks for your reply!
>>
>> Here's my *understanding*: basic types that ScalaReflection understands
>> are encoded into the Tungsten binary format, while UDTs are encoded into
>> GenericInternalRow, which stores the JVM objects in an Array[Any] under
>> the hood, and thus lose the memory-footprint and CPU-cache efficiency
>> provided by the Tungsten encoding.
>>
>> If the above is correct, then here are my *further questions*: Are
>> SparkPlan nodes (those ending with Exec) all code-generated before
>> actually running the toRdd logic? I know there are some non-codegenable
>> nodes which implement the trait CodegenFallback, but there's also a
>> doGenCode method in that trait, so the actual calling convention really
>> puzzles me. I've tried to trace the calling flow for a few days but found
>> it scattered everywhere; I cannot make a big graph of the method calling
>> order even with the help of IntelliJ.
>>
>> Let me rephrase this. How does the SparkSQL engine call the codegen APIs
>> to do the job of producing RDDs? What are those eval methods in
>> Expressions for, given there's already a doGenCode next to it?
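PS. dragonly, since the calling flow is hard to trace statically, below is the snippet I use to see whether whole-stage codegen actually kicked in for a query and what janino ends up compiling. It's only a sketch from my notes (it assumes Spark 2.x and a SparkSession named spark), so please double-check it against your version:

    import org.apache.spark.sql.execution.WholeStageCodegenExec
    import org.apache.spark.sql.execution.debug._   // adds debugCodegen to Dataset

    val q = spark.range(100).filter("id > 10").selectExpr("id * 2 as doubled")

    // With spark.sql.codegen.wholeStage=true (the default), the physical plan
    // gets wrapped in WholeStageCodegenExec nodes, i.e. the collapsed "stages":
    q.queryExecution.executedPlan.collect { case w: WholeStageCodegenExec => w }
      .foreach(println)

    // debugCodegen prints the Java source that janino compiles for each stage:
    q.debugCodegen()

Re-run it with spark.sql.codegen.wholeStage=false and the WholeStageCodegenExec wrappers disappear, i.e. the plan falls back to each operator's doExecute, which matches Shuai's description above.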