[ 
https://issues.apache.org/jira/browse/CRUNCH-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389140#comment-14389140
 ] 

Ioannis Kerkinos commented on CRUNCH-505:
-----------------------------------------

Hi Micah,
It does provide an implementation of the Hadoop FileSystem; you can find it 
here [1].
I also think it is quite straightforward to use. Changing the URI scheme to 
"tachyon" is enough, as you can see in the example below.

Would it be ok if I were to start working on it? If so, do you maybe have some 
tips on where to start? I've been working a bit with Tachyon for my master's 
thesis and I think this would be a useful performance improvement for Crunch.

==EXAMPLE==

Spark/MapReduce without Tachyon
• Spark
  – val file = sc.textFile("hdfs://ip:port/path")
• Hadoop MapReduce
  – hadoop jar hadoop-examples-1.0.4.jar wordcount hdfs://localhost:19998/input hdfs://localhost:19998/output

Spark/MapReduce with Tachyon
• Spark
  – val file = sc.textFile("tachyon://ip:port/path")
• Hadoop MapReduce
  – hadoop jar hadoop-examples-1.0.4.jar wordcount tachyon://localhost:19998/input tachyon://localhost:19998/output
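
For the Crunch side, the same idea could be sketched as a small path rewrite. 
This is only an illustration using java.net.URI: the class SchemeSwap and the 
method toTachyon are hypothetical names, not part of Crunch or Tachyon, and 
it assumes the Tachyon master is reachable at the same host:port as in the 
example above (in a real deployment the address would likely differ).

```java
import java.net.URI;
import java.net.URISyntaxException;

public class SchemeSwap {
    // Hypothetical helper: rewrites an hdfs:// path to a tachyon:// path,
    // keeping authority (host:port) and path unchanged. Paths with any
    // other scheme are returned as they are.
    static String toTachyon(String path) {
        URI u = URI.create(path);
        if (!"hdfs".equals(u.getScheme())) {
            return path;
        }
        try {
            return new URI("tachyon", u.getAuthority(), u.getPath(),
                           u.getQuery(), u.getFragment()).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot rewrite path: " + path, e);
        }
    }

    public static void main(String[] args) {
        // Same input path as in the MapReduce example above.
        System.out.println(toTachyon("hdfs://localhost:19998/input"));
    }
}
```

Whether a rewrite like this is enough for Crunch's intermediate outputs is 
something I would still need to check; it only shows that the path itself 
needs nothing beyond the scheme change.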

[1]-https://github.com/amplab/tachyon/blob/master/core/src/main/java/tachyon/hadoop/AbstractTFS.java

> Store intermediate data in memory only using Tachyon
> ----------------------------------------------------
>
>                 Key: CRUNCH-505
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-505
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.12.0
>            Reporter: Ioannis Kerkinos
>            Assignee: Josh Wills
>
> Tachyon is a memory-centric distributed storage system that enables reliable 
> data sharing at memory-speed. If used as the storage for intermediate data 
> (between MR jobs) it should improve performance as you won't have to go to 
> HDFS. In order to do so, the MUST_CACHE write type of Tachyon can be used. 
> This will enable data to be persisted in memory only without going to HDFS. 
> So the intermediate data will be read/written at memory-speed and only the 
> final result will be written in HDFS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
