Re: Few Questions on Blur Architecture...

Aaron McCurry Wed, 18 Sep 2013 14:08:25 -0700

I will attempt to answer below:

On Wed, Sep 18, 2013 at 1:54 AM, Ravikumar Govindarajan <
[email protected]> wrote:


> Thanks a bunch for a concise and quick reply. Few more questions
>
> 1. Any pointers/links on how you plan to tackle the availability problem?
>
> Lets say we store-forward hints to the failed shard-server. Won't the HDFS
> index-files differ in shard replicas?
>

I am in the process of documenting the strategy and will be adding it to
JIRA soon.  The way I am planning on solving this problem doesn't involve
storing the indexes in more than once in HDFS (which of course is
replicated).


>
> 2. I did not phrase my question on cross-join correctly. Let me clarify
>
> RowKey = 123
>
>    RecId = 1000
>    Family = "ACCOUNTS"
>      Col-Name = "NAME"
>      Col-Value = "ABC"
>      ......
>
>    RecId = 1001
>    Family = "CONTACTS"
>      Col-Name = "NAME"
>      Col-Value = "XYZ"
>      Col-Name = "ACCOUNTS-NAME" [FK to RecId=1000]
>      Col-Value = "1000"
>      .......
>
> Lets say the user specifies the search query as
> key=123 AND name:(ABC OR XYZ)
>
> Initially I will apply this query to each of the Family types, namely
> "ACCOUNTS", "CONTACTS" etc.... and get their RecIds..
>
> After this, I will have to filter "CONTACTS" family results, based on
> RecIds received from "ACCOUNTS" [Join within records of different family,
> based on FK]
>
> Is something like this achievable? Can I design it differently to satisfy
> my requirements?
>

I may not fully understand your scenario.

As I understand your example above:

Row {
  "id" => "123",
  "records" => [
    Record {
      "recordId" => "1000", "family" => "accounts",
      "columns" => [Column {"name" => "abc"}]
    },
    Record {
      "recordId" => "1001", "family" => "contacts",
      "columns" => [Column {"name" => "abc"}]
    }
  ]
}

Let me go through some example queries that we support:

+<accounts.name:abc> +<contacts.name:abc>

Another way of writing it would be:

<accounts.name:abc> AND <contacts.name:abc>

Would yield a hit on the Row, there aren't any FKs in Blur.

However if there are some interesting queries that be done with more
examples:

Row {
  "id" => "123",
  "records" => [
    Record {
      "recordId" => "1000", "family" => "accounts",
      "columns" => [Column {"name" => "abc"}]
    },
    Record {
      "recordId" => "1001", "family" => "contacts",
      "columns" => [Column {"name" => "abc"}]
    }
  ]
}

Row {
  "id" => "456",
  "records" => [
    Record {
      "recordId" => "1000", "family" => "accounts",
      "columns" => [Column {"name" => "abc"}]
    },
    Record {
      "recordId" => "1001", "family" => "contacts",
      "columns" => [Column {"name" => "abc"}]
    },
    Record {
      "recordId" => "1002", "family" => "contacts",
      "columns" => [Column {"name" => "def"}]
    }
  ]
}


Row {
  "id" => "789",
  "records" => [
    Record {
      "recordId" => "1000", "family" => "accounts",
      "columns" => [Column {"name" => "abc"}]
    },
    Record {
      "recordId" => "1002", "family" => "contacts",
      "columns" => [Column {"name" => "def"}]
    }
  ]
}

For the given query: "<accounts.name:abc> AND <contacts.name:abc>" would
yield 2 Row hits. 123 and 456
For the given query: "<accounts.name:abc> AND <contacts.name:def>" would
yield 2 Row hits. 456 and 789
For the given query: "<contacts.name:abc> AND <contacts.name:def>" would
yield 1 Row hit of 456.  NOTICE that the family is the same here "contacts".

Also in Blur you can turn off the Row query and just query the records.
This would be your typical Document like access.

I fear that this has not answered your question, so if it hasn't please let
me know.

Thanks!

Aaron



>
> --
> Ravi
>
>
>
> On Tue, Sep 17, 2013 at 7:01 PM, Aaron McCurry <[email protected]> wrote:
>
> > First off let me say welcome!  Hopefully I can answer your questions
> inline
> > below.
> >
> >
> > On Tue, Sep 17, 2013 at 6:52 AM, Ravikumar Govindarajan <
> > [email protected]> wrote:
> >
> > > I am quite new to Blur and need some help with the following questions
> > >
> > > 1. Lets say I have a replication_factor=3 for all HDFS indexes. In case
> > one
> > > of the server hosting HDFS indexes goes down [temporary or take-down],
> > what
> > > will happen to writes? Some kind-of HintedHandoff [as in Cassandra] is
> > > supported?
> > >
> >
> > When there is a Blur Shard Server failure state in ZooKeeper will change
> > and the other shard servers will take action to bring the down shard(s)
> > online.  This is similar to the HBase region model.  While the shard(s)
> are
> > being relocated (which really means being reopened from HDFS) writes to
> the
> > shard(s) being moved are not available.  However the bulk load capability
> > is always available as long as HDFS is available, this can be used
> through
> > Hadoop MapReduce.
> >
> >
> > >
> > > To re-phrase, what is the Consistency Vs Availability trade-off in
> Blur,
> > > with replication_factor>1 for HDFS indexes?
> > >
> >
> > Of the two Consistency is favored over Availability, however we are
> > starting development (in 0.3.0) to increase availability during failures.
> >
> >
> > >
> > > 2. Since HDFSInputStream is used underneath, will this result in too
> much
> > > of data-transfer back-and-forth? A case of multi-segment-merge or even
> > > wild-card search could trigger it.
> > >
> >
> > Blur uses an in process file system cache (Block Cache is the term used
> in
> > the code) to reduce the IO from HDFS.  During index merges data that is
> not
> > in the Block Cache is read from HDFS and the output is written back to
> > HDFS.  Overall once an index is hot (been online for some time) the IO
> for
> > any given search is fairly small assuming that the cluster has enough
> > memory configured in the Block Cache.
> >
> >
> > >
> > > 3. Does Blur also support foreign-key like semantics to search across
> > > column-families as well as delete using row_id?
> > >
> >
> > Blur supports something called Row Queries that allow for searches across
> > column families within single Rows.  Take a look at this page for a
> better
> > explanation:
> >
> > http://incubator.apache.org/blur/docs/0.2.0/data-model.html#querying
> >
> > And yes Blur supports deletes by Row check out:
> >
> > http://incubator.apache.org/blur/docs/0.2.0/Blur.html#Fn_Blur_mutate
> > and
> > http://incubator.apache.org/blur/docs/0.2.0/Blur.html#Struct_RowMutation
> >
> > Hopefully this can answer so of your questions.  Let us know if you have
> > any more.
> >
> > Thanks,
> > Aaron
> >
> >
> >
> >
> > >
> > > --
> > > Ravi
> > >
> >
>

Re: Few Questions on Blur Architecture...

Reply via email to