Re: Example of creating expressions for SchemaRDD methods

2014-04-04 Thread Michael Armbrust
 In such a construct, each operator builds on the previous one, including any
 materialized results etc. If I use a SQL statement for each of them, I suspect
 the later SQL statements will not leverage the earlier ones in any way - hence
 this approach will be less efficient than the first. Let me know if this is not
 correct.


This is not correct.  When you run a SQL statement and register it as a
table, the logical plan for that query - not its results - is used when the
virtual table is referenced in later queries.  SQL queries are lazy, just
like RDDs and DSL queries.  This is illustrated below.


scala> sql("SELECT * FROM selectQuery")
res3: org.apache.spark.sql.SchemaRDD =
SchemaRDD[12] at RDD at SchemaRDD.scala:93
== Query Plan ==
HiveTableScan [key#4,value#5], (MetastoreRelation default, src, None), None

scala> sql("SELECT * FROM src").registerAsTable("selectQuery")

scala> sql("SELECT key FROM selectQuery")
res5: org.apache.spark.sql.SchemaRDD =
SchemaRDD[24] at RDD at SchemaRDD.scala:93
== Query Plan ==
HiveTableScan [key#8], (MetastoreRelation default, src, None), None

Even though the second query runs over the result of the first query (which
requested all columns using *), the optimizer is still able to come up with
an efficient plan that avoids reading `value` from the table, as can be seen
in the arguments of the HiveTableScan.

Note that if you call sqlContext.cacheTable("selectQuery") then you are
correct.  The results will be materialized in an in-memory columnar format,
and subsequent queries will run over these materialized results.
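To make the contrast concrete, here is a minimal sketch of the caching variant. This assumes a running HiveContext bound to `sqlContext` (with its `sql` method imported) and the `src` table from the example above, so it is illustrative rather than standalone-runnable:

```scala
// Sketch only (Spark 1.0-era API); assumes an existing HiveContext bound
// to `sqlContext` and a Hive table `src` with schema (key INT, value STRING).
sql("SELECT * FROM src").registerAsTable("selectQuery")

// Materialize the registered table in the in-memory columnar format.
// From this point on, the stored logical plan is no longer re-executed.
sqlContext.cacheTable("selectQuery")

// This query now scans the cached columnar data instead of the Hive table.
sql("SELECT key FROM selectQuery").collect()
```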


 The reason for building expressions is that the use case needs these to be
 created on the fly, based on some case class, at runtime.

 I.e., I can't type these in the REPL. The Scala code will define some case
 class A (a: ..., b: ..., c: ...) where the class name, member names, and types
 will be known beforehand, and the RDD will be defined on this. Then, based
 on user action, the above pipeline needs to be constructed on the fly. Thus the
 expressions have to be constructed on the fly from class members and other
 predicates etc., most probably using expression constructors.

 Could you please share how expressions can be constructed using the APIs
 on Expression (and not in the REPL)?


I'm not sure I completely understand the use case here, but you should be
able to construct symbols and use the DSL to create expressions at runtime,
just like in the REPL.

val attrName: String = "name"
val addExpression: Expression = Symbol(attrName) + Symbol(attrName)
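If the attribute names are only known at runtime - say, derived from a case class - they can be turned into Symbols programmatically. A small sketch (the commented-out `select`/`where` lines show hypothetical usage against the Spark 1.0 DSL and require its implicits in scope; only the name extraction below is plain Scala):

```scala
// Derive attribute names from a case class at runtime and turn them into
// Symbols that the SchemaRDD DSL can consume.
case class Record(key: Int, value: String)

// Field names via Java reflection; for a simple case class the declared
// fields correspond to the constructor parameters. Synthetic fields
// (names containing '$') are filtered out defensively.
val fieldNames: List[String] =
  classOf[Record].getDeclaredFields.map(_.getName).filterNot(_.contains("$")).toList

// Build Symbols dynamically -- equivalent to writing 'key and 'value by hand.
val attrs: List[Symbol] = fieldNames.map(Symbol(_))

// With the DSL implicits in scope these could then drive a pipeline, e.g.:
//   schemaRDD.where(Symbol("key") > 10)     // hypothetical usage
println(attrs)
```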

There is currently no public API for constructing expressions manually
other than SQL or the DSL.  While you could dig into
org.apache.spark.sql.catalyst.expressions._, these APIs are considered
internal, and *will not be stable between versions*.
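For completeness, manual construction against those internal classes would look roughly like the following. This is an unsupported sketch of Spark 1.0-era internals - the class names and signatures shown here may change between releases, exactly as the warning above says:

```scala
// Unsupported, internal Catalyst API -- for illustration only.
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
import org.apache.spark.sql.catalyst.expressions.{Add, Expression, Literal}

val attrName: String = "key"

// 'key + 'key built directly from expression constructors instead of the DSL.
val addExpression: Expression =
  Add(UnresolvedAttribute(attrName), UnresolvedAttribute(attrName))

// A comparison predicate such as key > 10 would look like:
//   GreaterThan(UnresolvedAttribute(attrName), Literal(10))
```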

Michael


Re: Example of creating expressions for SchemaRDD methods

2014-04-04 Thread Michael Armbrust
Minor typo in the example.  The first SELECT statement should actually be:

sql("SELECT * FROM src")

Where `src` is a Hive table with schema (key INT, value STRING).


On Fri, Apr 4, 2014 at 11:35 AM, Michael Armbrust mich...@databricks.comwrote:

Example of creating expressions for SchemaRDD methods

2014-04-02 Thread All In A Days Work
For various SchemaRDD functions like select, where, orderBy, groupBy, etc., I
would like to create expression objects and pass these to the methods for
execution.

Can someone show some examples of how to create expressions for a case class
and execute them? E.g., how to create expressions for select, order by, group
by, etc., and execute the methods using the expressions?

Regards,