Re: Few Questions on Blur Architecture...

Ravikumar Govindarajan Tue, 17 Sep 2013 22:55:39 -0700

Thanks a bunch for the concise and quick reply. A few more questions:

1. Any pointers/links on how you plan to tackle the availability problem?


Let's say we store and forward hints to the failed shard server. Won't the
HDFS index files then differ between shard replicas?

2. I did not phrase my question on cross-join correctly. Let me clarify:

RowKey = 123

   RecId = 1000
   Family = "ACCOUNTS"
     Col-Name = "NAME"
     Col-Value = "ABC"
     ......

   RecId = 1001
   Family = "CONTACTS"
     Col-Name = "NAME"
     Col-Value = "XYZ"
     Col-Name = "ACCOUNTS-NAME" [FK to RecId=1000]
     Col-Value = "1000"
     .......

Let's say the user specifies the search query as:
key=123 AND name:(ABC OR XYZ)

Initially, I will apply this query to each of the Family types, namely
"ACCOUNTS", "CONTACTS", etc., and get their RecIds.

After this, I will have to filter the "CONTACTS" family results based on the
RecIds received from "ACCOUNTS" [a join between records of different
families, based on the FK].

Is something like this achievable? Can I design it differently to satisfy
my requirements?
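
To make the two steps concrete, here is a minimal sketch in plain Python. It
does not use any Blur API; the in-memory ROW structure, the helper names
match_by_family and join_on_fk, and the extra record 1002 (a contact pointing
at a non-matching account) are all hypothetical and only illustrate the
behaviour I am after.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for one Row (RowKey = 123); not a Blur type.
ROW = {
    "row_key": "123",
    "records": [
        {"rec_id": "1000", "family": "ACCOUNTS", "columns": {"NAME": "ABC"}},
        {"rec_id": "1001", "family": "CONTACTS",
         "columns": {"NAME": "XYZ", "ACCOUNTS-NAME": "1000"}},
        # Extra illustrative record: its FK points at an account that is absent.
        {"rec_id": "1002", "family": "CONTACTS",
         "columns": {"NAME": "XYZ", "ACCOUNTS-NAME": "9999"}},
    ],
}


def match_by_family(row, column, wanted):
    """Step 1: run the term match and group the matching records by family."""
    hits = defaultdict(list)
    for rec in row["records"]:
        if rec["columns"].get(column) in wanted:
            hits[rec["family"]].append(rec)
    return hits


def join_on_fk(hits, child_family, parent_family, fk_column):
    """Step 2: keep child records whose FK is among the matched parent RecIds."""
    parent_ids = {rec["rec_id"] for rec in hits.get(parent_family, [])}
    return [
        rec["rec_id"]
        for rec in hits.get(child_family, [])
        if rec["columns"].get(fk_column) in parent_ids
    ]


hits = match_by_family(ROW, "NAME", {"ABC", "XYZ"})
joined = join_on_fk(hits, "CONTACTS", "ACCOUNTS", "ACCOUNTS-NAME")
print(joined)
```

Here only RecId 1001 should survive the second step, since 1002 refers to an
account that did not match.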

--
Ravi



On Tue, Sep 17, 2013 at 7:01 PM, Aaron McCurry <[email protected]> wrote:

> First off let me say welcome!  Hopefully I can answer your questions inline
> below.
>
>
> On Tue, Sep 17, 2013 at 6:52 AM, Ravikumar Govindarajan <
> [email protected]> wrote:
>
> > I am quite new to Blur and need some help with the following questions
> >
> > 1. Let's say I have a replication_factor=3 for all HDFS indexes. In case
> > one of the servers hosting HDFS indexes goes down [temporary or
> > take-down], what will happen to writes? Is some kind of HintedHandoff
> > [as in Cassandra] supported?
> >
>
> When there is a Blur Shard Server failure, the state in ZooKeeper will
> change and the other shard servers will take action to bring the down
> shard(s) online.  This is similar to the HBase region model.  While the
> shard(s) are being relocated (which really means being reopened from HDFS),
> writes to the shard(s) being moved are not available.  However, the bulk
> load capability is always available as long as HDFS is available; it can be
> used through Hadoop MapReduce.
>
>
> >
> > To re-phrase, what is the Consistency Vs Availability trade-off in Blur,
> > with replication_factor>1 for HDFS indexes?
> >
>
> Of the two, Consistency is favored over Availability; however, we are
> starting development (in 0.3.0) to increase availability during failures.
>
>
> >
> > 2. Since HDFSInputStream is used underneath, will this result in too much
> > data transfer back and forth? A multi-segment merge or even a wildcard
> > search could trigger it.
> >
>
> Blur uses an in-process file system cache (Block Cache is the term used in
> the code) to reduce the IO from HDFS.  During index merges, data that is not
> in the Block Cache is read from HDFS and the output is written back to
> HDFS.  Overall, once an index is hot (has been online for some time), the IO
> for any given search is fairly small, assuming that the cluster has enough
> memory configured in the Block Cache.
>
>
> >
> > 3. Does Blur also support foreign-key-like semantics to search across
> > column families, as well as deletes by row_id?
> >
>
> Blur supports something called Row Queries, which allow searches across
> column families within a single Row.  Take a look at this page for a better
> explanation:
>
> http://incubator.apache.org/blur/docs/0.2.0/data-model.html#querying
>
> And yes, Blur supports deletes by Row. Check out:
>
> http://incubator.apache.org/blur/docs/0.2.0/Blur.html#Fn_Blur_mutate
> and
> http://incubator.apache.org/blur/docs/0.2.0/Blur.html#Struct_RowMutation
>
> Hopefully this answers some of your questions.  Let us know if you have
> any more.
>
> Thanks,
> Aaron
>
>
>
>
> >
> > --
> > Ravi
> >
>
