I am using the following setting when indexing via Spark:

BlurOutputFormat.setIndexLocally(conf, false);

Is this why the merge is not happening? I can see that the
GenericBlurRecordWriter flush method calls maybeMerge only when it is using a
local temp index.
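
The behavior described above can be modeled with a minimal sketch. This is
not the actual GenericBlurRecordWriter source: the class name, field, and
counter are hypothetical, and only the conditional (a merge check that is
reached only on the local-index path) comes from the observation itself.

```java
// Hypothetical model of the observation above; this is NOT the actual
// GenericBlurRecordWriter source, and every name except maybeMerge is made up.
class FlushMergeSketch {
    // Stands in for the value passed to BlurOutputFormat.setIndexLocally(conf, ...).
    private final boolean indexLocally;
    // Counts how often the merge check was reached.
    private int mergeChecks = 0;

    FlushMergeSketch(boolean indexLocally) {
        this.indexLocally = indexLocally;
    }

    // Models the suspected behavior: the merge check runs only on the local path.
    void flush() {
        if (indexLocally) {
            maybeMerge();
        }
    }

    private void maybeMerge() {
        mergeChecks++;
    }

    int mergeChecks() {
        return mergeChecks;
    }

    public static void main(String[] args) {
        FlushMergeSketch local = new FlushMergeSketch(true);
        FlushMergeSketch direct = new FlushMergeSketch(false);
        local.flush();
        direct.flush();
        System.out.println("local merge checks: " + local.mergeChecks());
        System.out.println("direct merge checks: " + direct.mergeChecks());
    }
}
```

If the real writer behaves like this model, a job running with
setIndexLocally(conf, false) would never reach the merge check during flush.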

Dibyendu





On Sun, Oct 26, 2014 at 9:47 PM, Aaron McCurry <[email protected]> wrote:

> On Sun, Oct 26, 2014 at 1:21 AM, Dibyendu Bhattacharya <
> [email protected]> wrote:
>
> > Some more observations.
> >
> > As I said, when I index using the Spark RDD saveAsHadoop* API, a bunch of
> > .lnk files and inuse folders get created that are never merged or
> > deleted. But when I stopped the Spark job that uses the Hadoop API and
> > started another Spark job for the same table that uses the Blur Thrift
> > enqueue mutate call, I could see that all the previous .lnk files and
> > inuse folders were eventually merged and deleted. The index count is
> > fine, and new documents keep being added to the index.
> >
>
> Ok, we are in the process of testing this issue.  We will let you know what
> we find.
>
>
> >
> > So I do not think there is any issue with permissions. Probably the merge
> > logic is not getting started when indexing happens through
> > BlurOutputFormat.
> >
>
> Not sure when the maybeMerge call was integrated into the external index
> loading, but it should be merging the segments.
>
> Aaron
>
>
> >
> > Regards,
> > Dibyendu
> >
> > On Sat, Oct 25, 2014 at 4:28 AM, Tim Williams <[email protected]>
> > wrote:
> >
> >> On Fri, Oct 24, 2014 at 3:48 PM, Aaron McCurry <[email protected]>
> >> wrote:
> >>
> >> > Didn't reply to all.
> >> >
> >> > ---------- Forwarded message ----------
> >> > From: Aaron McCurry <[email protected]>
> >> > Date: Fri, Oct 24, 2014 at 3:47 PM
> >> > Subject: Re: Some Performance number of Spark Blur Connector
> >> > To: Dibyendu Bhattacharya <[email protected]>
> >> >
> >> >
> >> >
> >> >
> >> > On Fri, Oct 24, 2014 at 2:19 PM, Dibyendu Bhattacharya <
> >> > [email protected]> wrote:
> >> >
> >> >> Hi Aaron,
> >> >>
> >> >> Here are some performance numbers comparing enqueue mutate and RDD
> >> >> saveAsHadoopFile, both using Spark Streaming.
> >> >>
> >> >> The setup I used is not very optimized, but it can give an idea of how
> >> >> both methods of indexing via Spark Streaming perform.
> >> >>
> >> >> I used a 4-node EMR m1.xlarge cluster and installed Blur as 1
> >> >> controller and 3 shard servers. My Blur table has 9 partitions.
> >> >>
> >> >> On the same cluster, I was running Spark with 1 master and 3 workers.
> >> >> This is not a good setup, but anyway, here are the numbers.
> >> >>
> >> >> The enqueue mutate index rate is around 800 messages/second.
> >> >>
> >> >> The RDD saveAsHadoopFile index rate is around 12,000 messages/second.
> >> >>
> >> >> That is about 15 times faster, roughly one order of magnitude.
> >> >>
> >> >
> >> > That's awesome, thank you for sharing!
> >> >
> >> >
> >> >>
> >> >>
> >> >> I am not sure whether this is an issue with the saveAsHadoopFile
> >> >> approach, but I can see that lots of small Lucene *.lnk files are
> >> >> being created in the shard folder in HDFS (probably one for each
> >> >> saveAsHadoopFile call), along with as many "inuse" folders, as you
> >> >> can see in the screenshot.
> >> >>
> >> >> These entries keep growing to a huge number if the Spark Streaming
> >> >> job keeps running for some time. Could this have any impact on
> >> >> indexing and search performance?
> >> >>
> >> >
> >> > They should be merged and removed over time; however, if there is a
> >> > permission problem, Blur might not be able to remove the inuse folders.
> >> >
> >>
> >> In a situation where permissions are the problem, the .lnk files are
> >> properly cleaned up while the .inuse dirs hang around. If both are
> >> hanging around, I suspect it's not permissions. A lesson freshly learned
> >> over here :)
> >>
> >> --tim
> >>
> >
> >
>
