Re: Tachyon in Spark
Thanks for the response. I get the point: it sounds like today's Spark lineage is not pushed into Tachyon lineage. It would be good to see how it works.

Jun Feng Liu

Haoyuan Li haoyuan.li@gmail.com wrote on 2014-12-13 00:17 (to Jun Feng Liu; cc Reynold Xin r...@databricks.com, Andrew Ash and...@andrewash.com, dev@spark.apache.org):

Junfeng, by the off-heap solution, did you mean rdd.persist(OFF_HEAP)? That feature is different from the lineage feature. You can use rdd.persist(OFF_HEAP) with Tachyon now, without a problem, in any Spark version later than 1.0.0.

Regarding Reynold's last email, those are good points. Tachyon provided this a while ago. We are working on enhancing this feature and its integration with Spark.

Thanks,
Haoyuan

--
Haoyuan Li
AMPLab, EECS, UC Berkeley
http://www.cs.berkeley.edu/~haoyuan/
Re: Tachyon in Spark
I think lineage is the key feature of Tachyon for reproducing an RDD when an error happens. Otherwise, there would have to be data replicas among Tachyon nodes to provide redundancy for fault tolerance, and I think Tachyon is trying to avoid going down that path. Does that mean the off-heap solution is not ready yet, if Tachyon lineage does not work right now?

Best Regards
Jun Feng Liu
IBM China Systems Technology Laboratory in Beijing
Phone: 86-10-82452683
E-mail: liuj...@cn.ibm.com
BLD 28,ZGC Software Park No.8 Rd.Dong Bei Wang West, Dist.Haidian Beijing 100193 China

Reynold Xin r...@databricks.com wrote on 2014/12/12 10:22 (to Andrew Ash and...@andrewash.com; cc Jun Feng Liu, dev@spark.apache.org):

Actually, HY emailed me offline about this, and lineage is supported in the latest version of Tachyon. It is a hard problem to push this into storage; we need to think about how to handle isolation, resource allocation, etc.

https://github.com/amplab/tachyon/blob/master/core/src/main/java/tachyon/master/Dependency.java
Re: Tachyon in Spark
I'm interested in understanding this as well. One of the main ways Tachyon is supposed to realize performance gains without sacrificing durability is by storing the lineage of data rather than full copies of it (similar to Spark). But if Spark isn't sending lineage information into Tachyon, then I'm not sure how this isn't a durability concern.
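
The lineage-versus-copies trade-off raised in this message can be sketched with a toy model. This is only an illustration under loud assumptions: `LineageStore` and all of its methods are hypothetical names invented here, not Tachyon or Spark code. The store keeps a single in-memory copy of each dataset plus the recipe that produced it, and rebuilds a lost dataset from its parent instead of reading a replica.

```python
from typing import Callable, Dict, List, Tuple

Transform = Callable[[List[int]], List[int]]


class LineageStore:
    """Toy in-memory store that remembers how each dataset was derived.

    Hypothetical illustration only; not the Tachyon or Spark API.
    """

    def __init__(self) -> None:
        self._data: Dict[str, List[int]] = {}
        # dataset name -> (parent name, transformation applied to the parent)
        self._lineage: Dict[str, Tuple[str, Transform]] = {}

    def put_source(self, name: str, values: List[int]) -> None:
        """Store a source dataset that has no parent and no recipe."""
        self._data[name] = list(values)

    def derive(self, name: str, parent: str, fn: Transform) -> None:
        """Compute a dataset from its parent and record the recipe."""
        self._lineage[name] = (parent, fn)
        self._data[name] = fn(self.get(parent))

    def lose(self, name: str) -> None:
        """Simulate a failure that drops the only copy of a dataset."""
        self._data.pop(name, None)

    def get(self, name: str) -> List[int]:
        """Return a dataset, recomputing it from lineage if it was lost."""
        if name in self._data:
            return self._data[name]
        if name not in self._lineage:
            raise KeyError(f"{name} is lost and has no recorded lineage")
        parent, fn = self._lineage[name]
        self._data[name] = fn(self.get(parent))
        return self._data[name]


store = LineageStore()
store.put_source("raw", [1, 2, 3, 4])
store.derive("doubled", "raw", lambda xs: [2 * x for x in xs])
store.derive("over_4", "doubled", lambda xs: [x for x in xs if x > 4])

# Drop both derived datasets, then ask for the last one again.
store.lose("doubled")
store.lose("over_4")
recovered = store.get("over_4")
```

In this sketch the derived datasets survive the loss only because their recipes were recorded. A dataset with no recorded lineage (here, `raw`) cannot be rebuilt once its single copy is gone, which is the durability concern in miniature.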
Re: Tachyon in Spark
I don't think the lineage thing is even turned on in Tachyon - it was mostly a research prototype, so I don't think it'd make sense for us to use that.
Tachyon in Spark
Does Spark today really leverage Tachyon lineage to process data? It seems that the application should call the createDependency function in TachyonFS to create a new lineage node, but I did not find any place in the Spark code that calls it. Did I miss anything?

Best Regards
Jun Feng Liu
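
To make the question concrete, here is a minimal sketch of why the registration call matters. Everything in it is an assumption made for illustration: `ToyFileSystem`, `DependencyNode`, `create_dependency`, and `recovery_plan` are hypothetical names and do not reflect the real TachyonFS createDependency signature. The point is only that a storage layer can plan a recomputation for a file if, and only if, some client registered how that file was produced.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class DependencyNode:
    """Hypothetical lineage node: inputs plus a command yield outputs."""
    node_id: int
    inputs: Tuple[str, ...]
    outputs: Tuple[str, ...]
    command: str


@dataclass
class ToyFileSystem:
    """Toy stand-in for a storage client; not the real TachyonFS API."""
    files: Dict[str, bytes] = field(default_factory=dict)
    producer: Dict[str, DependencyNode] = field(default_factory=dict)
    next_id: int = 0

    def write(self, path: str, data: bytes) -> None:
        """Store file contents without recording any lineage."""
        self.files[path] = data

    def create_dependency(self, inputs: List[str], outputs: List[str],
                          command: str) -> int:
        """Register which command turns the inputs into the outputs."""
        node = DependencyNode(self.next_id, tuple(inputs), tuple(outputs), command)
        self.next_id += 1
        for path in outputs:
            self.producer[path] = node
        return node.node_id

    def recovery_plan(self, path: str) -> Optional[str]:
        """Return the command that would rebuild the file, if one is known."""
        node = self.producer.get(path)
        return node.command if node is not None else None


fs = ToyFileSystem()
fs.write("/in/a", b"raw")
fs.write("/out/registered", b"result")
fs.write("/out/unregistered", b"result")

# Only one of the two outputs has its lineage registered by the client.
fs.create_dependency(["/in/a"], ["/out/registered"], "job --in /in/a")

plan_registered = fs.recovery_plan("/out/registered")
plan_unregistered = fs.recovery_plan("/out/unregistered")
```

In this toy model, a client that only ever calls `write` leaves the store with no recovery plan at all, which mirrors the situation described in the message if Spark never makes the registration call.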