See responses inline. On Thu, Sep 3, 2015 at 1:58 AM, kiran lonikar <loni...@gmail.com> wrote:
> Hi,
>
> 1. I found where the code generation happens in the Spark code
> (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala)
> from the blog posts
> https://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html,
> https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html and
> https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html.
> However, I could not find where the generated code is executed. A major
> part of my changes will be there, since this executor will now have to send
> vectors of columns to GPU RAM, invoke execution, and get the results back
> to CPU RAM. Thus, the existing executor will change significantly.

The code generation produces Java classes that have an apply method, and that apply method is called in the operators. E.g. GenerateUnsafeProjection returns a Projection class (which is just a class with an apply method), and TungstenProject calls that class.

> 2. On the Project Tungsten blog
> (https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html),
> in the third section, Code Generation, it is mentioned that you plan to
> increase the level of code generation from record-at-a-time expression
> evaluation to vectorized expression evaluation. Has this been implemented?
> If not, how do I implement it? I will need access to the columnar ByteBuffer
> objects in a DataFrame to do this; having row-by-row access to the data
> would defeat the purpose of this exercise. In particular, I need access to
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnType.scala
> in the executor of the generated code.

This is future work. You'd need to create batches of rows or columns, which is a pretty major refactoring.
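To make the record-at-a-time contract concrete, here is a hedged, self-contained sketch (these are stand-in types, not Spark's actual classes): codegen emits a class with an apply method, and the operator just calls apply once per row. That per-row call site is what a vectorized/GPU path would have to replace with a per-batch call.

```scala
// Hedged sketch, not Spark internals: illustrative stand-ins for the shape
// of the contract between generated code and the calling operator.
trait Row { def get(i: Int): Any }
final case class SimpleRow(values: Array[Any]) extends Row {
  def get(i: Int): Any = values(i)
}

// Conceptually what GenerateUnsafeProjection returns: a Projection is just
// a compiled Row => Row function, i.e. a class with an apply method.
trait Projection extends (Row => Row)

// A hand-written stand-in for a generated projection: keep only column 0.
object KeepFirstColumn extends Projection {
  def apply(row: Row): Row = SimpleRow(Array(row.get(0)))
}

// The operator side (in the spirit of TungstenProject): one apply per row.
def runProject(input: Iterator[Row], p: Projection): Iterator[Row] =
  input.map(p)
```

A GPU-backed variant would instead take an `Iterator` of column batches, ship each batch to device memory, and call one kernel per batch rather than one apply per row.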
> 3. One thing that confuses me is the changes from 1.4 to 1.5, possibly
> due to JIRA https://issues.apache.org/jira/browse/SPARK-7956 and pull
> request https://github.com/apache/spark/pull/6479/files. This changed the
> code generation from quasiquotes (q) to the string s interpolator, which
> makes it simpler for me to generate OpenCL code, since OpenCL is string
> based. The question is: is this branch stable now? Should I make my
> changes on Spark 1.4, Spark 1.5, or the master branch?

In general, Spark's development velocity is pretty high, as we make a lot of changes to internals every release. If I were you, I'd use either master or branch-1.5 for your prototyping.
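Since the post-SPARK-7956 codegen builds source text with plain string interpolation, emitting OpenCL C instead of Java is essentially just a different template string. A hedged sketch of the idea (the function name and kernel shape here are illustrative, not Spark's codegen API):

```scala
// Hedged sketch: building OpenCL C source with Scala's s-interpolator,
// in the same string-template style Spark's codegen uses for Java source.
// genAddColumnsKernel is a hypothetical helper, not a Spark function.
def genAddColumnsKernel(kernelName: String, colType: String): String =
  s"""__kernel void $kernelName(__global const $colType* a,
     |                          __global const $colType* b,
     |                          __global $colType* out) {
     |  int i = get_global_id(0);
     |  out[i] = a[i] + b[i];
     |}""".stripMargin

// genAddColumnsKernel("add_int_cols", "int") yields OpenCL C source that
// could then be handed to JavaCL for compilation and execution.
```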