See responses inline.

On Thu, Sep 3, 2015 at 1:58 AM, kiran lonikar <loni...@gmail.com> wrote:

> Hi,
>
>    1. I found where code generation happens in the Spark code
>    <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala>
>    from the blogs
>    
> https://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html,
>
>    
> https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
>    and
>    
> https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html.
>    However, I could not find where the generated code is executed. A major
>    part of my changes will be there, since this executor will now have to send
>    vectors of columns to GPU RAM, invoke execution, and get the results back
>    to CPU RAM. Thus, the existing executor will change significantly.
>
Code generation produces Java classes that have an apply method, and the
operators call that apply method.

E.g. GenerateUnsafeProjection returns a Projection class (which is just a
class with an apply method), and TungstenProject calls that class.
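
To make that concrete, here is a rough sketch against branch-1.5 internals
(simplified; the real operators also handle wiring and fallback to
interpreted mode):

    import org.apache.spark.sql.catalyst.expressions.{BoundReference, UnsafeProjection}
    import org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection
    import org.apache.spark.sql.types.IntegerType

    // Compile a projection over a single int column. generate() emits Java
    // source for a class with an apply(InternalRow) method and compiles it
    // with Janino.
    val exprs = Seq(BoundReference(0, IntegerType, nullable = false))
    val projection: UnsafeProjection = GenerateUnsafeProjection.generate(exprs)

    // Operators such as TungstenProject then call the compiled class per row:
    //   iter.map(row => projection(row))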



>
>    2. On the Project Tungsten blog
>    
> <https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html>,
>    in the third section (Code Generation), it is mentioned that you plan
>    to increase the level of code generation from record-at-a-time expression
>    evaluation to vectorized expression evaluation. Has this been implemented?
>    If not, how do I implement it? I will need access to the columnar ByteBuffer
>    objects in DataFrame to do this. Having only row-by-row access to the data
>    would defeat the purpose of this exercise. In particular, I need access to
>    
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnType.scala
>    in the executor of the generated code.
>
>
This is future work. You'd need to create batches of rows or columns, which
is a pretty major refactoring.
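
A purely hypothetical sketch of what such a batch could look like; none of
these types exist in Spark today:

    import java.nio.ByteBuffer

    // Hypothetical: a fixed-capacity batch for one non-nullable int column,
    // backed by a direct ByteBuffer so the contents can be handed to
    // OpenCL/GPU APIs without an extra copy through the JVM heap.
    class IntColumnBatch(capacity: Int) {
      val buffer: ByteBuffer = ByteBuffer.allocateDirect(capacity * 4)
      var numRows: Int = 0

      def isFull: Boolean = numRows == capacity

      def append(value: Int): Unit = {
        buffer.putInt(value)
        numRows += 1
      }
    }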


>
>    3. One thing that confuses me is the change from 1.4 to 1.5, possibly
>    due to JIRA https://issues.apache.org/jira/browse/SPARK-7956 and pull
>    request https://github.com/apache/spark/pull/6479/files. This changed
>    the code generation from quasiquotes (q) to the string interpolation (s)
>    operator, which makes it simpler for me to generate OpenCL code, since
>    OpenCL code is string based. The question is: is this branch stable now?
>    Should I make my changes on Spark 1.4, Spark 1.5, or the master branch?
>
In general, Spark development velocity is pretty high, as we make a lot of
changes to the internals every release. If I were you, I'd use either master
or branch-1.5 for your prototyping.
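
To illustrate the difference that PR made (a simplified sketch, not an exact
excerpt from CodeGenerator.scala):

    // 1.4-style: quasiquotes built a Scala syntax tree that was compiled at
    // runtime with the scala-reflect toolbox, e.g.
    //   val code = q"val $isNull: Boolean = false"
    // 1.5-style: the generated (now Java) source is assembled as a plain
    // string and compiled with Janino. A plain string is also exactly what
    // an OpenCL kernel source is:
    val isNull = "isNull1" // hypothetical generated variable name
    val code = s"boolean $isNull = false;"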


>
>    4. How do I tune the batch size (the number of rows in the ByteBuffer)? Is
>    it through the property spark.sql.inMemoryColumnarStorage.batchSize?
>
>
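
Yes, for the in-memory columnar cache that property controls the rows per
batch. A minimal sketch of setting it, assuming an existing SQLContext named
sqlContext and a DataFrame df:

    // Sketch: set the rows-per-batch used when a DataFrame is cached in the
    // in-memory columnar store, then cache it. `sqlContext` and `df` are
    // assumed to exist already.
    sqlContext.setConf("spark.sql.inMemoryColumnarStorage.batchSize", "10000")
    df.cache()
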
> Thanks in anticipation,
>
> Kiran
> PS:
>
> Other things I found useful were:
>
> *Spark DataFrames*: https://www.brighttalk.com/webcast/12891/166495
> *Apache Spark 1.5*: https://www.brighttalk.com/webcast/12891/168177
>
> The links to JavaCL/ScalaCL:
>
> *Library to execute OpenCL code through Java*:
> https://github.com/nativelibs4java/JavaCL
> *Library to convert Scala code to OpenCL and execute on GPUs*:
> https://github.com/nativelibs4java/ScalaCL
>
