Re: Adding support for a new object store

2014-08-27 Thread Reynold Xin
Hi Rajendran,

I'm assuming you have some concept of a schema and intend to integrate with
SchemaRDD rather than with normal RDDs.

More responses inline below.


On Fri, Aug 22, 2014 at 2:21 AM, Rajendran Appavu appra...@in.ibm.com
wrote:


  I am new to the Spark source code and looking to see if I can add push-down
  support for Spark filters to the storage layer (in my case an object store).
  I would like to consider how this can be done generically for any store that
  we might want to integrate with Spark. I am looking to know which areas I
  should look into to provide support for a new data store in this context.
  Below are some of the questions I have to start with:

  1. Do we need to create a new RDD class for the new store that we want to
  support? From the SparkContext we create an RDD, and the operations on the
  data, including the filter, are performed through the RDD methods.


You can create a new RDD type for a new storage system, and you can create a
new table scan operator in Spark SQL to read from it.
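
For the RDD route, here is a minimal sketch of what a custom RDD over an object
store could look like. The ObjectStoreClient, its methods, and the bucket/shard
naming are hypothetical placeholders for whatever client library your store
provides; the Spark contract is just the getPartitions and compute overrides.

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical client for the object store, not a real library.
class ObjectStoreClient(endpoint: String) extends Serializable {
  def listShards(bucket: String): Seq[String] = Seq("shard-0", "shard-1")
  def read(bucket: String, shard: String): Iterator[String] = Iterator.empty
}

// One partition per shard (or object) listed in the store.
case class ObjectStorePartition(index: Int, shard: String) extends Partition

class ObjectStoreRDD(sc: SparkContext, client: ObjectStoreClient, bucket: String)
  extends RDD[String](sc, Nil) {

  // Runs on the driver: enumerate the store's shards as Spark partitions.
  override protected def getPartitions: Array[Partition] =
    client.listShards(bucket).zipWithIndex
      .map { case (shard, i) => ObjectStorePartition(i, shard) }
      .toArray

  // Runs on the executor: stream the records of this partition's shard.
  override def compute(split: Partition, context: TaskContext): Iterator[String] = {
    val p = split.asInstanceOf[ObjectStorePartition]
    client.read(bucket, p.shard)
  }
}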


  2. When we specify the code for the filter task in the RDD.filter() method,
  how does it get communicated to the Executor on the data node? Does the
  Executor need to compile this code on the fly and execute it, or how does it
  work? (I have looked at the code for some time but have not yet figured this
  out, so I am looking for some pointers that can help me come up to speed on
  this part of the code.)


Right now the best way to do this is to hack the SQL strategies, which do some
predicate pushdown into the table scan:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala

We are in the process of proposing an API that allows external data stores
to hook into the planner. Expect a design proposal in early/mid Sept.

Once that is in place, you wouldn't need to hack the planner anymore. It is
a good idea to start prototyping by hacking the planner, and migrate to the
planner hook API once that is ready.
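
To make the pushdown idea concrete, here is a small, self-contained sketch (not
the actual SparkStrategies code) of the pattern that hack follows: partition
the predicates into those the object store can evaluate on its side and those
Spark must still apply after the scan. The StoreFilter types and the capability
check are hypothetical stand-ins for Catalyst expressions.

// Hypothetical predicate representation; real code would use Catalyst expressions.
sealed trait StoreFilter
case class EqualTo(column: String, value: Any) extends StoreFilter
case class GreaterThan(column: String, value: Any) extends StoreFilter

object FilterPushdown {
  // Hypothetical capability check: which predicates this store understands.
  def storeCanEvaluate(f: StoreFilter): Boolean = f match {
    case _: EqualTo     => true   // e.g. the store supports equality lookups
    case _: GreaterThan => false  // ...but not range predicates
  }

  // Returns (filters pushed into the scan request, filters left for Spark).
  def split(filters: Seq[StoreFilter]): (Seq[StoreFilter], Seq[StoreFilter]) =
    filters.partition(storeCanEvaluate)
}

The pushed-down set becomes part of the scan request sent to the store, and the
residual set is applied as an ordinary filter over the RDD the scan returns.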



  3. How long does the Executor hold the memory? And how does it decide when to
  release the memory/cache?


Executors by default actually don't hold any data in memory. Spark requires
explicit caching: it is only when rdd.cache() is called that Spark executors
put the contents of that RDD in memory. The executor has a component called
the BlockManager that evicts cached blocks on an LRU basis.
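
A small illustration of that behavior (the path is only a placeholder):

val lines  = sc.textFile("hdfs:///some/path")   // nothing is cached yet
val errors = lines.filter(_.contains("ERROR"))
errors.cache()    // mark the RDD for in-memory storage in the BlockManager
errors.count()    // first action computes the RDD and caches its partitions
errors.count()    // later actions read the cached blocks on the executors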




  Thank you in advance.





 Regards,
 Rajendran.


 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: Adding support for a new object store

2014-08-27 Thread Reynold Xin
Linking to the JIRA that tracks the APIs for hooking into the planner:
https://issues.apache.org/jira/browse/SPARK-3248
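
For a rough feel of the shape such a hook could take (purely hypothetical names,
not the API being proposed under that JIRA): the planner tells an external
relation which columns and predicates it needs, and the relation returns an RDD
after pushing down whatever its store supports.

import org.apache.spark.rdd.RDD

// Purely hypothetical sketch of a planner hook, for illustration only.
// Filters are represented as simple (column, operator, value) triples.
trait PushdownScan {
  def buildScan(requiredColumns: Seq[String],
                filters: Seq[(String, String, Any)]): RDD[Seq[Any]]
}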





Adding support for a new object store

2014-08-22 Thread Rajendran Appavu

   
I am new to the Spark source code and looking to see if I can add push-down
support for Spark filters to the storage layer (in my case an object store). I
would like to consider how this can be done generically for any store that we
might want to integrate with Spark. I am looking to know which areas I should
look into to provide support for a new data store in this context. Below are
some of the questions I have to start with:

1. Do we need to create a new RDD class for the new store that we want to
support? From the SparkContext we create an RDD, and the operations on the
data, including the filter, are performed through the RDD methods.

2. When we specify the code for the filter task in the RDD.filter() method, how
does it get communicated to the Executor on the data node? Does the Executor
need to compile this code on the fly and execute it, or how does it work? (I
have looked at the code for some time but have not yet figured this out, so I
am looking for some pointers that can help me come up to speed on this part of
the code.)

3. How long does the Executor hold the memory? And how does it decide when to
release the memory/cache?

Thank you in advance.


Regards,
Rajendran.


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org