Some Performance number of Spark Blur Connector

Dibyendu Bhattacharya Fri, 24 Oct 2014 11:21:01 -0700

Hi Aaron,

Here are some performance numbers comparing enqueueMutate and RDD
saveAsHadoopFile, both using Spark Streaming.


The setup I used is not a very optimized one, but it can give an idea about
both methods of indexing via Spark Streaming.

I used a 4-node EMR m1.xlarge cluster and installed Blur as 1 controller and
3 shard servers. My Blur table has 9 partitions.

On the same cluster, I was running Spark with 1 master and 3 workers. This
is not a good setup, but anyway, here are the numbers.

The enqueueMutate index rate is around 800 messages/second.

The RDD saveAsHadoopFile index rate is around 12,000 messages/second.

That is about 15 times faster, roughly one order of magnitude.
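
As a quick sanity check on the two rates, here is a minimal Python sketch of the arithmetic. Only the two rates come from the measurements in this mail; the helper seconds_to_index and the one-million-message backlog are hypothetical and are there only to illustrate the difference.

```python
import math

# Measured index rates from this mail, in messages per second.
enqueue_rate = 800          # enqueueMutate
hadoop_file_rate = 12_000   # RDD saveAsHadoopFile

# Ratio between the two approaches and its size in orders of magnitude.
speedup = hadoop_file_rate / enqueue_rate
orders_of_magnitude = math.log10(speedup)


def seconds_to_index(messages, rate):
    """Time to index `messages` at a steady `rate` (hypothetical helper)."""
    return messages / rate


# Hypothetical backlog of one million messages, for illustration only.
backlog = 1_000_000
print(f"speedup: {speedup:.1f}x ({orders_of_magnitude:.2f} orders of magnitude)")
print(f"enqueueMutate:    {seconds_to_index(backlog, enqueue_rate):.0f} s")
print(f"saveAsHadoopFile: {seconds_to_index(backlog, hadoop_file_rate):.0f} s")
```

Both rates were measured on the same small, shared cluster, so the ratio is more meaningful than either absolute number.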


I am not sure if this is an issue with the saveAsHadoopFile approach, but I
can see that in the shard folders in HDFS lots of small Lucene *.lnk files
are getting created (probably one set for each saveAsHadoopFile call), and
there are that many "inuse" folders, as you can see in the screenshot.

These entries keep growing to a huge number if the Spark Streaming job
keeps running for some time. I am not sure whether this has any impact on
indexing and search performance.
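
To get a feel for how fast these entries could pile up, here is a rough Python sketch. It assumes, as guessed above, that every saveAsHadoopFile call leaves one "inuse" folder per shard folder and that there is one call per streaming batch. The function name, the 10-second batch interval and the 24-hour runtime are hypothetical values for illustration, not measurements.

```python
def estimated_inuse_folders(runtime_seconds, batch_interval_seconds, shard_folders=1):
    """Estimate leftover "inuse" folders after a streaming job has run a while.

    Assumption (not confirmed): one saveAsHadoopFile call per batch, and each
    call leaves one folder behind in every shard folder.
    """
    calls = runtime_seconds // batch_interval_seconds
    return calls * shard_folders


# Hypothetical job: 10-second batches, running for 24 hours.
one_day = 24 * 60 * 60
per_shard = estimated_inuse_folders(one_day, 10)
whole_table = estimated_inuse_folders(one_day, 10, shard_folders=9)  # 9 partitions
print(per_shard, whole_table)
```

If the assumption holds, the count grows linearly with runtime, so a longer batch interval would slow the growth but not stop it.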

In the enqueueMutate case, this type of folder structure is not seen, which
is expected.

The code for both the enqueueMutate and saveAsHadoopFile approaches is here:
https://github.com/dibbhatt/spark-blur-connector
I will attach the latest version to JIRA.


[image: Inline image 3]


[image: Inline image 2]
