Re: Adding support for a new object store

2014-08-27 Thread Reynold Xin
Hi Rajendran,

I'm assuming you have some concept of a schema and intend to integrate with
SchemaRDD rather than with normal RDDs.

More responses inline below.


On Fri, Aug 22, 2014 at 2:21 AM, Rajendran Appavu appra...@in.ibm.com
wrote:


  I am new to the Spark source code and looking to see if I can add push-down
  support for Spark filters to the storage layer (in my case an object store).
  I would like to consider how this can be done generically for any store that
  we might want to integrate with Spark. I am looking to know which areas I
  should look into to provide support for a new data store in this context.
  Below are some of the questions I have to start with:

  1. Do we need to create a new RDD class for the new store that we want to
  support? From the SparkContext we create an RDD, and the operations on the
  data, including the filter, are performed through the RDD methods.


You can create a new RDD type for a new storage system, and you can create a
new table scan operator in Spark SQL to read from it.
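
For the RDD route, here is a minimal sketch of what a custom RDD over an object
store could look like. The ObjectStoreClient, its methods, and the bucket/shard
naming are hypothetical placeholders for whatever client library your store
provides; the Spark contract is just the getPartitions and compute overrides.

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical client for the object store, not a real library.
class ObjectStoreClient(endpoint: String) extends Serializable {
  def listShards(bucket: String): Seq[String] = Seq("shard-0", "shard-1")
  def read(bucket: String, shard: String): Iterator[String] = Iterator.empty
}

// One partition per shard (or object) listed in the store.
case class ObjectStorePartition(index: Int, shard: String) extends Partition

class ObjectStoreRDD(sc: SparkContext, client: ObjectStoreClient, bucket: String)
  extends RDD[String](sc, Nil) {

  // Runs on the driver: enumerate the store's shards as Spark partitions.
  override protected def getPartitions: Array[Partition] =
    client.listShards(bucket).zipWithIndex
      .map { case (shard, i) => ObjectStorePartition(i, shard) }
      .toArray

  // Runs on the executor: stream the records of this partition's shard.
  override def compute(split: Partition, context: TaskContext): Iterator[String] = {
    val p = split.asInstanceOf[ObjectStorePartition]
    client.read(bucket, p.shard)
  }
}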


  2. When we specify the code for the filter task in the RDD.filter() method,
  how does it get communicated to the Executor on the data node? Does the
  Executor need to compile this code on the fly and execute it, or how does it
  work? (I have looked at the code for some time but have not yet figured this
  out, so I am looking for some pointers that can help me come up to speed on
  this part of the code.)


Right now the best way to do this is to hack the SQL strategies, which do some
predicate pushdown into the table scan:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala

We are in the process of proposing an API that allows external data stores
to hook into the planner. Expect a design proposal in early/mid Sept.

Once that is in place, you wouldn't need to hack the planner anymore. It is
a good idea to start prototyping by hacking the planner, and migrate to the
planner hook API once that is ready.
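
To make the pushdown idea concrete, here is a small, self-contained sketch (not
the actual SparkStrategies code) of the pattern that hack follows: partition
the predicates into those the object store can evaluate on its side and those
Spark must still apply after the scan. The StoreFilter types and the capability
check are hypothetical stand-ins for Catalyst expressions.

// Hypothetical predicate representation; real code would use Catalyst expressions.
sealed trait StoreFilter
case class EqualTo(column: String, value: Any) extends StoreFilter
case class GreaterThan(column: String, value: Any) extends StoreFilter

object FilterPushdown {
  // Hypothetical capability check: which predicates this store understands.
  def storeCanEvaluate(f: StoreFilter): Boolean = f match {
    case _: EqualTo     => true   // e.g. the store supports equality lookups
    case _: GreaterThan => false  // ...but not range predicates
  }

  // Returns (filters pushed into the scan request, filters left for Spark).
  def split(filters: Seq[StoreFilter]): (Seq[StoreFilter], Seq[StoreFilter]) =
    filters.partition(storeCanEvaluate)
}

The pushed-down set becomes part of the scan request sent to the store, and the
residual set is applied as an ordinary filter over the RDD the scan returns.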



  3. How long does the Executor hold the memory? And how does it decide when to
  release the memory/cache?


Executors by default actually don't hold any data in memory. Spark requires
explicit caching: it is only when rdd.cache() is called that Spark executors
put the contents of that RDD in memory. The executor has a component called
the BlockManager that evicts cached blocks on an LRU basis.
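
A small illustration of that behavior (the path is only a placeholder):

val lines  = sc.textFile("hdfs:///some/path")   // nothing is cached yet
val errors = lines.filter(_.contains("ERROR"))
errors.cache()    // mark the RDD for in-memory storage in the BlockManager
errors.count()    // first action computes the RDD and caches its partitions
errors.count()    // later actions read the cached blocks on the executors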




  Thank you in advance.





 Regards,
 Rajendran.


 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: Adding support for a new object store

2014-08-27 Thread Reynold Xin
Linking to the JIRA that tracks the APIs for hooking into the planner:
https://issues.apache.org/jira/browse/SPARK-3248
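
For a rough feel of the shape such a hook could take (purely hypothetical names,
not the API being proposed under that JIRA): the planner tells an external
relation which columns and predicates it needs, and the relation returns an RDD
after pushing down whatever its store supports.

import org.apache.spark.rdd.RDD

// Purely hypothetical sketch of a planner hook, for illustration only.
// Filters are represented as simple (column, operator, value) triples.
trait PushdownScan {
  def buildScan(requiredColumns: Seq[String],
                filters: Seq[(String, String, Any)]): RDD[Seq[Any]]
}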





Adding support for a new object store

2014-08-22 Thread Rajendran Appavu

   
I am new to the Spark source code and looking to see if I can add push-down
support for Spark filters to the storage layer (in my case an object store). I
would like to consider how this can be done generically for any store that we
might want to integrate with Spark. I am looking to know which areas I should
look into to provide support for a new data store in this context. Below are
some of the questions I have to start with:

1. Do we need to create a new RDD class for the new store that we want to
support? From the SparkContext we create an RDD, and the operations on the
data, including the filter, are performed through the RDD methods.

2. When we specify the code for the filter task in the RDD.filter() method, how
does it get communicated to the Executor on the data node? Does the Executor
need to compile this code on the fly and execute it, or how does it work? (I
have looked at the code for some time but have not yet figured this out, so I
am looking for some pointers that can help me come up to speed on this part of
the code.)

3. How long does the Executor hold the memory? And how does it decide when to
release the memory/cache?

Thank you in advance.


Regards,
Rajendran.


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org