Some more observations.

As I said, when I index using the Spark RDD saveAsHadoop* API, a bunch
of .lnk files and inuse folders get created and are never merged/deleted.
But when I stopped the Spark job that uses the Hadoop API and started
another Spark job for the same table that uses the Blur Thrift enqueue
mutate call, I can see that all the previous .lnk files and inuse folders
are eventually merged and deleted. The index count is fine and new
documents also keep getting added to the index.

So I do not think there is any issue with permissions. Probably the merge
logic is not getting started when indexing happens using BlurOutputFormat.
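For reference, the two indexing paths I am comparing look roughly like this. This is only a sketch, not the actual connector code: BlurOutputFormat, BlurMutate, BlurClient, and enqueueMutate are the Blur APIs as I understand them, while helpers like toBlurMutate, toRowMutation, outputPath, and connectionString are illustrative placeholders.

```scala
// Sketch only -- assumes Spark Streaming and Apache Blur on the classpath.

// Path 1: bulk indexing via the Hadoop API with BlurOutputFormat.
// Each saveAsNewAPIHadoopFile call writes new index data into the shard
// directory, which appears to be where the .lnk files and inuse folders
// come from.
stream.foreachRDD { rdd =>
  rdd.map(doc => (new Text(doc.id), toBlurMutate(doc))) // toBlurMutate: placeholder
     .saveAsNewAPIHadoopFile(outputPath, classOf[Text],
       classOf[BlurMutate], classOf[BlurOutputFormat])
}

// Path 2: per-record indexing via the Blur Thrift client.
// Here the shard server applies the mutations itself, so its normal merge
// logic also ends up cleaning the leftover .lnk files and inuse folders.
stream.foreachRDD { rdd =>
  rdd.foreachPartition { docs =>
    val client = BlurClient.getClient(connectionString) // placeholder connection string
    docs.foreach(doc => client.enqueueMutate(toRowMutation(doc))) // toRowMutation: placeholder
  }
}
```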

Regards,
Dibyendu

On Sat, Oct 25, 2014 at 4:28 AM, Tim Williams <[email protected]> wrote:

> On Fri, Oct 24, 2014 at 3:48 PM, Aaron McCurry <[email protected]> wrote:
>
> > Didn't reply to all.
> >
> > ---------- Forwarded message ----------
> > From: Aaron McCurry <[email protected]>
> > Date: Fri, Oct 24, 2014 at 3:47 PM
> > Subject: Re: Some Performance number of Spark Blur Connector
> > To: Dibyendu Bhattacharya <[email protected]>
> >
> >
> >
> >
> > On Fri, Oct 24, 2014 at 2:19 PM, Dibyendu Bhattacharya <
> > [email protected]> wrote:
> >
> >> Hi Aaron,
> >>
> >> here are some performance numbers comparing enqueue mutate and RDD
> >> saveAsHadoopFile, both using Spark Streaming.
> >>
> >> The setup I used is not a very optimized one, but it can give an idea
> >> about both methods of indexing via Spark Streaming.
> >>
> >> I used a 4-node EMR m1.xlarge cluster, and installed Blur as 1
> >> controller and 3 shard servers. My Blur table has 9 partitions.
> >>
> >> On the same cluster, I was running Spark with 1 master and 3 workers.
> >> This is not a good setup, but anyway, here are the numbers.
> >>
> >> The enqueueMutate index rate is around 800 messages/second.
> >>
> >> The RDD saveAsHadoopFile index rate is around 12,000 messages/second.
> >>
> >> That is more than an order of magnitude faster.
> >>
> >
> > That's awesome, thank you for sharing!
> >
> >
> >>
> >>
> >> Not sure if this is an issue with the saveAsHadoopFile approach, but I
> >> can see that the shard folder in HDFS has lots of small Lucene *.lnk
> >> files getting created (probably one per saveAsHadoopFile call), and
> >> there are that many "inuse" folders, as you can see in the screenshot.
> >>
> >> And these entries keep increasing to a huge number if this Spark
> >> streaming job keeps running for some time. Not sure if this has any
> >> impact on indexing and search performance?
> >>
> >
> > They should be merged and removed over time; however, if there is a
> > permission problem, Blur might not be able to remove the inuse folders.
> >
>
> In a situation where permissions are the problem, the .lnk files are
> properly cleaned while the .inuse dirs hang around.  If both are hanging
> around, I suspect it's not permissions.  A lesson freshly learned over
> here :)
>
> --tim
>