Dear All,

I have been playing with Spark Streaming on Tachyon as the OFF_HEAP block store. The primary reason for evaluating Tachyon is to find out whether it can solve the Spark BlockNotFoundException.
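For context, the setup looks roughly like this. It is only a minimal sketch: the application name, the Tachyon master URL, and the MyKafkaReceiver class are placeholders (the real receiver is the kafka-spark-consumer linked below), and spark.tachyonStore.url is the property name as I understand it for the 1.x line.

  import org.apache.spark.SparkConf
  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  // Point the Tachyon-backed OFF_HEAP store at the fault-tolerant Tachyon master.
  // Hostname and port are placeholders for my cluster.
  val conf = new SparkConf()
    .setAppName("TachyonOffHeapStreaming")
    .set("spark.tachyonStore.url", "tachyon-ft://tachyon-master:19998")

  val ssc = new StreamingContext(conf, Seconds(10))

  // MyKafkaReceiver stands in for the kafka-spark-consumer receiver; the point
  // is only that received blocks are stored with StorageLevel.OFF_HEAP, i.e. in Tachyon.
  val stream = ssc.receiverStream(new MyKafkaReceiver(StorageLevel.OFF_HEAP))

  stream.foreachRDD { rdd => println("Received " + rdd.count() + " records") }
  ssc.start()
  ssc.awaitTermination()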
In the traditional MEMORY_ONLY StorageLevel, when blocks are evicted, jobs fail with a block-not-found exception, and storing blocks with MEMORY_AND_DISK is not a good option either, as it hurts throughput a lot.

To test how Tachyon behaves, I took the latest Spark 1.4 from master, used Tachyon 0.6.4, and configured Tachyon in fault-tolerant mode. Tachyon is running on a 3-node AWS x-large cluster and Spark is running on another 3-node AWS x-large cluster. I used the low-level receiver-based Kafka consumer (https://github.com/dibbhatt/kafka-spark-consumer), which I have written, to pull from Kafka and write blocks to Tachyon.

I found a similar improvement in throughput (as in the MEMORY_ONLY case) but very good overall memory utilization (as it is an off-heap store). However, I found one issue on which I need clarification.

In the Tachyon case I also see BlockNotFoundException, but for a different reason. What I see is that TachyonBlockManager.scala puts the blocks with the WriteType.TRY_CACHE configuration. Because of this, blocks are evicted from the Tachyon cache, and when Spark tries to find a block it throws BlockNotFoundException. I see a pull request which discusses the same issue:

https://github.com/apache/spark/pull/158#discussion_r11195271

When I modified the WriteType to CACHE_THROUGH (a paraphrased sketch of the change is in the P.S. below), the BlockDropException is gone, but it again hurts throughput.

Just curious to know: does Tachyon have any setting which can handle the block eviction from the cache to disk, other than explicitly setting CACHE_THROUGH?

Regards,
Dibyendu
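P.S. For reference, the modification I made is roughly along the following lines. This is a paraphrased sketch rather than the exact Spark source, using the Tachyon 0.6 client API as I understand it:

  import java.nio.ByteBuffer
  import tachyon.client.{TachyonFile, WriteType}

  // TachyonBlockManager writes the block bytes through the Tachyon client.
  // Upstream uses WriteType.TRY_CACHE (cache only, so the block can be evicted);
  // CACHE_THROUGH also persists the block to the under-filesystem, so it
  // survives cache eviction, at the cost of a synchronous write.
  def writeBlock(file: TachyonFile, bytes: ByteBuffer): Unit = {
    val os = file.getOutStream(WriteType.CACHE_THROUGH) // was WriteType.TRY_CACHE
    try {
      os.write(bytes.array())
    } finally {
      os.close()
    }
  }

The synchronous write to the under-filesystem would explain the throughput drop I am seeing with CACHE_THROUGH.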