[ https://issues.apache.org/jira/browse/CRUNCH-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389140#comment-14389140 ]
Ioannis Kerkinos commented on CRUNCH-505:
-----------------------------------------

Hi Micah,

It does provide an implementation of the Hadoop FileSystem interface; you can find it here [1]. I also think it is quite straightforward to use. Changing the URI scheme to "tachyon" is enough, as you can see in the example below.

Would it be OK if I started working on it? If so, do you have any tips on where to start? I've been working a bit with Tachyon for my master's thesis, and I think this would be a useful performance improvement for Crunch.

==EXAMPLE==

Spark/MapReduce without Tachyon
• Spark
  – val file = sc.textFile("hdfs://ip:port/path")
• Hadoop MapReduce
  – hadoop jar hadoop-examples-1.0.4.jar wordcount hdfs://localhost:19998/input hdfs://localhost:19998/output

Spark/MapReduce with Tachyon
• Spark
  – val file = sc.textFile("tachyon://ip:port/path")
• Hadoop MapReduce
  – hadoop jar hadoop-examples-1.0.4.jar wordcount tachyon://localhost:19998/input tachyon://localhost:19998/output

[1] https://github.com/amplab/tachyon/blob/master/core/src/main/java/tachyon/hadoop/AbstractTFS.java

> Store intermediate data in memory only using Tachyon
> ----------------------------------------------------
>
>                 Key: CRUNCH-505
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-505
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.12.0
>            Reporter: Ioannis Kerkinos
>            Assignee: Josh Wills
>
> Tachyon is a memory-centric distributed storage system that enables reliable
> data sharing at memory-speed. If used as the storage for intermediate data
> (between MR jobs) it should improve performance as you won't have to go to
> HDFS. In order to do so, the MUST_CACHE write type of Tachyon can be used.
> This will enable data to be persisted in memory only without going to HDFS.
> So the intermediate data will be read/written at memory-speed and only the
> final result will be written to HDFS.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
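
The scheme change described in the comment above could be sketched in plain Java roughly as follows. This is only an illustration using java.net.URI: the class SchemeSwap and the helper withScheme are hypothetical names, not part of Crunch, Hadoop, or Tachyon, and the sketch does not touch any FileSystem API. It only shows that a path can keep its host, port, and path while the scheme changes from "hdfs" to "tachyon".

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical illustration; not part of Crunch, Hadoop, or Tachyon.
final class SchemeSwap {

    // Returns a copy of the given URI with only its scheme replaced.
    public static URI withScheme(URI original, String scheme) {
        try {
            return new URI(scheme,
                           original.getUserInfo(),
                           original.getHost(),
                           original.getPort(),
                           original.getPath(),
                           original.getQuery(),
                           original.getFragment());
        } catch (URISyntaxException e) {
            // Surface a malformed result as an unchecked exception.
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Same host, port, and path as in the example above.
        URI hdfsInput = URI.create("hdfs://localhost:19998/input");
        System.out.println(withScheme(hdfsInput, "tachyon"));
        // prints "tachyon://localhost:19998/input"
    }
}
```

Whether a rewritten path actually resolves depends on the cluster having the Tachyon Hadoop client on its classpath and configured, which this sketch assumes and does not check.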