Dear All,

I have been playing with Spark Streaming on Tachyon as the OFF_HEAP block store. The primary reason for evaluating Tachyon is to find out whether it can solve the Spark BlockNotFoundException.
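For context, the setup looks roughly like this. It is only a minimal sketch: the application name, the Tachyon master URL, and the MyKafkaReceiver class are placeholders (the real receiver is the kafka-spark-consumer linked below), and spark.tachyonStore.url is the property name as I understand it for the 1.x line.

  import org.apache.spark.SparkConf
  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  // Point the Tachyon-backed OFF_HEAP store at the fault-tolerant Tachyon master.
  // Hostname and port are placeholders for my cluster.
  val conf = new SparkConf()
    .setAppName("TachyonOffHeapStreaming")
    .set("spark.tachyonStore.url", "tachyon-ft://tachyon-master:19998")

  val ssc = new StreamingContext(conf, Seconds(10))

  // MyKafkaReceiver stands in for the kafka-spark-consumer receiver; the point
  // is only that received blocks are stored with StorageLevel.OFF_HEAP, i.e. in Tachyon.
  val stream = ssc.receiverStream(new MyKafkaReceiver(StorageLevel.OFF_HEAP))

  stream.foreachRDD { rdd => println("Received " + rdd.count() + " records") }
  ssc.start()
  ssc.awaitTermination()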
In the traditional MEMORY_ONLY StorageLevel, when blocks are evicted, jobs fail with a block-not-found exception, and storing blocks with MEMORY_AND_DISK is not a good option either, as it hurts throughput a lot.

To test how Tachyon behaves, I took the latest Spark 1.4 from master, used Tachyon 0.6.4, and configured Tachyon in fault-tolerant mode. Tachyon is running on a 3-node AWS x-large cluster and Spark is running on another 3-node AWS x-large cluster. I used the low-level receiver-based Kafka consumer (https://github.com/dibbhatt/kafka-spark-consumer), which I have written, to pull from Kafka and write blocks to Tachyon.

I found a similar improvement in throughput (as in the MEMORY_ONLY case) but very good overall memory utilization (as it is an off-heap store). However, I found one issue on which I need clarification.

In the Tachyon case I also see BlockNotFoundException, but for a different reason. What I see is that TachyonBlockManager.scala puts the blocks with the WriteType.TRY_CACHE configuration. Because of this, blocks are evicted from the Tachyon cache, and when Spark tries to find a block it throws BlockNotFoundException. I see a pull request which discusses the same issue:

https://github.com/apache/spark/pull/158#discussion_r11195271

When I modified the WriteType to CACHE_THROUGH (a paraphrased sketch of the change is in the P.S. below), the BlockDropException is gone, but it again hurts throughput.

Just curious to know: does Tachyon have any setting which can handle the block eviction from the cache to disk, other than explicitly setting CACHE_THROUGH?

Regards,
Dibyendu
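P.S. For reference, the modification I made is roughly along the following lines. This is a paraphrased sketch rather than the exact Spark source, using the Tachyon 0.6 client API as I understand it:

  import java.nio.ByteBuffer
  import tachyon.client.{TachyonFile, WriteType}

  // TachyonBlockManager writes the block bytes through the Tachyon client.
  // Upstream uses WriteType.TRY_CACHE (cache only, so the block can be evicted);
  // CACHE_THROUGH also persists the block to the under-filesystem, so it
  // survives cache eviction, at the cost of a synchronous write.
  def writeBlock(file: TachyonFile, bytes: ByteBuffer): Unit = {
    val os = file.getOutStream(WriteType.CACHE_THROUGH) // was WriteType.TRY_CACHE
    try {
      os.write(bytes.array())
    } finally {
      os.close()
    }
  }

The synchronous write to the under-filesystem would explain the throughput drop I am seeing with CACHE_THROUGH.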