Hi Aaron, here are some performance numbers comparing enqueueMutate and RDD saveAsHadoopFile, both used from Spark Streaming.
The setup I used is not well optimized, but it should give an idea of how the two methods of indexing via Spark Streaming compare. I used a 4-node EMR cluster of m1.xlarge instances and installed Blur as 1 controller and 3 shard servers. My Blur table has 9 partitions. On the same cluster I was running Spark with 1 master and 3 workers. This is not a good setup, but anyway, here are the numbers:

The enqueueMutate index rate is around 800 messages/second.
The RDD saveAsHadoopFile index rate is around 12,000 messages/second.

That is roughly 15 times faster, about one order of magnitude.

I am not sure whether this is an issue with the saveAsHadoopFile approach, but I can see that lots of small Lucene *.lnk files are getting created in the shard folders in HDFS (probably one set for each saveAsHadoopFile call), and there are just as many "inuse" folders, as you can see in the screenshots. The number of these entries keeps growing to a huge count if the Spark Streaming job keeps running for some time. Could this have any impact on indexing and search performance? In the enqueueMutate case this kind of folder structure is not seen, which is expected.

The code for both the enqueueMutate and saveAsHadoopFile approaches is here: https://github.com/dibbhatt/spark-blur-connector

I will attach the latest version to the JIRA.

[image: Inline image 3]
[image: Inline image 2]
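
For reference, here is a quick back-of-the-envelope check of the two index rates quoted above. It is only a sketch of the arithmetic; the variable names are made up for illustration and are not taken from the connector code.

```python
import math

# Index rates reported in this mail, in messages per second.
enqueue_mutate_rate = 800
save_as_hadoop_file_rate = 12_000

# How many times faster the saveAsHadoopFile path was in this test.
speedup = save_as_hadoop_file_rate / enqueue_mutate_rate

# The same ratio expressed in orders of magnitude (powers of ten).
orders_of_magnitude = math.log10(speedup)

print(f"speedup: {speedup:.0f}x")
print(f"orders of magnitude: {orders_of_magnitude:.2f}")
```

On these numbers the ratio works out to 15x, which is a little over one order of magnitude.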
