Thanks a bunch for the concise and quick reply. A few more questions:
1. Any pointers/links on how you plan to tackle the availability problem?
Let's say we store and forward hints to the failed shard server. Won't the
HDFS index files then differ across shard replicas?
2. I did not phrase my question on cross-join correctly. Let me clarify:
RowKey = 123
RecId = 1000
Family = "ACCOUNTS"
Col-Name = "NAME"
Col-Value = "ABC"
......
RecId = 1001
Family = "CONTACTS"
Col-Name = "NAME"
Col-Value = "XYZ"
Col-Name = "ACCOUNTS-NAME" [FK to RecId=1000]
Col-Value = "1000"
.......
Let's say the user specifies the search query as
key=123 AND name:(ABC OR XYZ)
Initially, I will apply this query to each of the Family types, namely
"ACCOUNTS", "CONTACTS", etc., and get their RecIds.
After this, I will have to filter the "CONTACTS" family results based on the
RecIds received from "ACCOUNTS" [a join between records of different
families, based on the FK].
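
To make the intent concrete, here is a rough sketch of the two steps in plain
Python. This is only an in-memory illustration, not Blur's API: the Record
class, search_family, and fk_join are names I made up, and RecId 1002 is an
extra hypothetical record added just to show a contact being filtered out.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """Hypothetical stand-in for one record inside a row (not a Blur class)."""
    rec_id: str
    family: str
    columns: dict = field(default_factory=dict)


# Records of RowKey = 123. RecId 1002 is an invented extra contact whose FK
# points at an account that does not match the query.
ROW_123 = [
    Record("1000", "ACCOUNTS", {"NAME": "ABC"}),
    Record("1001", "CONTACTS", {"NAME": "XYZ", "ACCOUNTS-NAME": "1000"}),
    Record("1002", "CONTACTS", {"NAME": "XYZ", "ACCOUNTS-NAME": "2000"}),
]


def search_family(records, family, names):
    """Step 1: apply name:(... OR ...) to the records of a single family."""
    return [
        r for r in records
        if r.family == family and r.columns.get("NAME") in names
    ]


def fk_join(records, names, fk_column="ACCOUNTS-NAME"):
    """Step 2: keep only CONTACTS whose FK refers to a matching ACCOUNTS record."""
    accounts = search_family(records, "ACCOUNTS", names)
    contacts = search_family(records, "CONTACTS", names)
    account_ids = {r.rec_id for r in accounts}
    joined = [c for c in contacts if c.columns.get(fk_column) in account_ids]
    return accounts, joined


accounts, contacts = fk_join(ROW_123, {"ABC", "XYZ"})
print([r.rec_id for r in accounts])  # ['1000']
print([r.rec_id for r in contacts])  # ['1001']
```

Today I would have to do the second step on the client side, after fetching
the RecIds of both families.
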
Is something like this achievable? Can I design it differently to satisfy
my requirements?
--
Ravi
On Tue, Sep 17, 2013 at 7:01 PM, Aaron McCurry <[email protected]> wrote:
> First off let me say welcome! Hopefully I can answer your questions inline
> below.
>
>
> On Tue, Sep 17, 2013 at 6:52 AM, Ravikumar Govindarajan <
> [email protected]> wrote:
>
> > I am quite new to Blur and need some help with the following questions
> >
> > 1. Let's say I have a replication_factor=3 for all HDFS indexes. In case
> > one of the servers hosting HDFS indexes goes down [temporary or
> > take-down], what will happen to writes? Is some kind of HintedHandoff
> > [as in Cassandra] supported?
> >
>
> When there is a Blur Shard Server failure, state in ZooKeeper will change
> and the other shard servers will take action to bring the down shard(s)
> online. This is similar to the HBase region model. While the shard(s) are
> being relocated (which really means being reopened from HDFS), writes to
> the shard(s) being moved are not available. However, the bulk load
> capability is always available as long as HDFS is available; it can be
> used through Hadoop MapReduce.
>
>
> >
> > To re-phrase, what is the Consistency Vs Availability trade-off in Blur,
> > with replication_factor>1 for HDFS indexes?
> >
>
> Of the two, Consistency is favored over Availability; however, we are
> starting development (in 0.3.0) to increase availability during failures.
>
>
> >
> > 2. Since HDFSInputStream is used underneath, will this result in too much
> > data transfer back and forth? A multi-segment merge or even a wildcard
> > search could trigger it.
> >
>
> Blur uses an in-process file system cache (Block Cache is the term used in
> the code) to reduce the IO from HDFS. During index merges, data that is not
> in the Block Cache is read from HDFS and the output is written back to
> HDFS. Overall, once an index is hot (has been online for some time), the IO
> for any given search is fairly small, assuming that the cluster has enough
> memory configured in the Block Cache.
>
>
> >
> > 3. Does Blur also support foreign-key like semantics to search across
> > column-families as well as delete using row_id?
> >
>
> Blur supports something called Row Queries that allow for searches across
> column families within single Rows. Take a look at this page for a better
> explanation:
>
> http://incubator.apache.org/blur/docs/0.2.0/data-model.html#querying
>
> And yes, Blur supports deletes by Row. Check out:
>
> http://incubator.apache.org/blur/docs/0.2.0/Blur.html#Fn_Blur_mutate
> and
> http://incubator.apache.org/blur/docs/0.2.0/Blur.html#Struct_RowMutation
>
> Hopefully this answers some of your questions. Let us know if you have
> any more.
>
> Thanks,
> Aaron
>
>
>
>
> >
> > --
> > Ravi
> >
>