Re: Few Questions on Blur Architecture...

rahul challapalli Thu, 19 Sep 2013 14:30:52 -0700

Hi James,

Check out the below discussion from the past.


http://www.mail-archive.com/[email protected]/msg01476.html

- Rahul


On Thu, Sep 19, 2013 at 12:21 PM, James Kebinger <[email protected]> wrote:

> Does Blur take advantage of HDFS short-circuit reads if enabled? Not sure
> if clients have to do anything, but that's a big win with HBase, and a
> great reason to co-locate the work with the data.
>
>
> On Thu, Sep 19, 2013 at 1:46 PM, rahul challapalli <
> [email protected]> wrote:
>
> > I will attempt to answer some of your questions below. Aaron or someone
> > else can correct me if I am wrong
> >
> >
> > On Thu, Sep 19, 2013 at 6:15 AM, Ravikumar Govindarajan <
> > [email protected]> wrote:
> >
> > > Thanks Aaron. I think it has answered my question. I have a few more, and
> > > it would be great if you could clarify them.
> > >
> > > 1. Is the number of shards per table fixed at table creation, or can we
> > > dynamically allocate shards?
> > >
> >
> >    I believe we cannot dynamically allocate shards. The only thing we can
> > dynamically add to an existing table is new columns
> >
> > >
> > > 2. Assuming I have 10k shards with each shard-size=2GB, for a total of 20
> > > TB table size. I typically use RowId = UserId and there are approx. 3
> > > million users in our system.
> > >
> > >     How do I ensure that when a user issues a query, I do not end up
> > > searching all these 10k shards, but rather search only a very small set of
> > > shards?
> > >
> >
> >     When a search request is issued to the Blur Controller, it searches
> > through all the shard servers in parallel, and each shard server searches
> > through all of its shards.
> >     Unlike database partitioning, I believe we cannot direct a search to a
> > particular shard.
> >     However:
> >         1. Upon shard server start, all the shards are warmed up, i.e. the
> > Index Reader for each shard is loaded into memory.
> >         2. Blur uses a block-level cache. With sufficient memory allocated
> > to the cache, performance will be greatly enhanced.
> >
> >
> > >
> > > 3. Are there any advantages of running shard-server and data-nodes {HDFS}
> > > in the same machine?
> > >
> >    Someone else can provide a better answer here.
> >    In a typical Hadoop installation, Task Trackers and Data Nodes run
> > alongside each other on the same machine. Since data nodes store the first
> > block replica on the same machine, shard servers might see an advantage in
> > terms of network latency. However, I think it is not a good idea to run Blur
> > alongside Task Trackers, for performance reasons.
> >
> >
> >
> > > --
> > > Ravi
> > >
> > >
> > > On Thu, Sep 19, 2013 at 2:36 AM, Aaron McCurry <[email protected]> wrote:
> > >
> > > > I will attempt to answer below:
> > > >
> > > > On Wed, Sep 18, 2013 at 1:54 AM, Ravikumar Govindarajan <
> > > > [email protected]> wrote:
> > > >
> > > > > Thanks a bunch for a concise and quick reply. A few more questions:
> > > > >
> > > > > 1. Any pointers/links on how you plan to tackle the availability problem?
> > > > >
> > > > > Let's say we store-forward hints to the failed shard-server. Won't the
> > > > > HDFS index-files differ in shard replicas?
> > > > >
> > > >
> > > > I am in the process of documenting the strategy and will be adding it to
> > > > JIRA soon.  The way I am planning on solving this problem doesn't involve
> > > > storing the indexes more than once in HDFS (which of course is
> > > > replicated).
> > > >
> > > >
> > > > >
> > > > > 2. I did not phrase my question on cross-join correctly. Let me clarify
> > > > >
> > > > > RowKey = 123
> > > > >
> > > > >    RecId = 1000
> > > > >    Family = "ACCOUNTS"
> > > > >      Col-Name = "NAME"
> > > > >      Col-Value = "ABC"
> > > > >      ......
> > > > >
> > > > >    RecId = 1001
> > > > >    Family = "CONTACTS"
> > > > >      Col-Name = "NAME"
> > > > >      Col-Value = "XYZ"
> > > > >      Col-Name = "ACCOUNTS-NAME" [FK to RecId=1000]
> > > > >      Col-Value = "1000"
> > > > >      .......
> > > > >
> > > > > Let's say the user specifies the search query as
> > > > > key=123 AND name:(ABC OR XYZ)
> > > > >
> > > > > Initially I will apply this query to each of the Family types, namely
> > > > > "ACCOUNTS", "CONTACTS", etc., and get their RecIds.
> > > > >
> > > > > After this, I will have to filter "CONTACTS" family results based on
> > > > > RecIds received from "ACCOUNTS" [Join within records of different
> > > > > families, based on FK]
> > > > >
> > > > > Is something like this achievable? Can I design it differently to
> > > > > satisfy my requirements?
> > > > >
> > > >
> > > > I may not fully understand your scenario.
> > > >
> > > > As I understand your example above:
> > > >
> > > > Row {
> > > >   "id" => "123",
> > > >   "records" => [
> > > >     Record {
> > > >       "recordId" => "1000", "family" => "accounts",
> > > >       "columns" => [Column {"name" => "abc"}]
> > > >     },
> > > >     Record {
> > > >       "recordId" => "1001", "family" => "contacts",
> > > >       "columns" => [Column {"name" => "abc"}]
> > > >     }
> > > >   ]
> > > > }
> > > >
> > > > Let me go through some example queries that we support:
> > > >
> > > > +<accounts.name:abc> +<contacts.name:abc>
> > > >
> > > > Another way of writing it would be:
> > > >
> > > > <accounts.name:abc> AND <contacts.name:abc>
> > > >
> > > > Either form would yield a hit on the Row. There aren't any FKs in Blur.
> > > >
> > > > However, there are some interesting queries that can be done. Here are
> > > > more examples:
> > > >
> > > > Row {
> > > >   "id" => "123",
> > > >   "records" => [
> > > >     Record {
> > > >       "recordId" => "1000", "family" => "accounts",
> > > >       "columns" => [Column {"name" => "abc"}]
> > > >     },
> > > >     Record {
> > > >       "recordId" => "1001", "family" => "contacts",
> > > >       "columns" => [Column {"name" => "abc"}]
> > > >     }
> > > >   ]
> > > > }
> > > >
> > > > Row {
> > > >   "id" => "456",
> > > >   "records" => [
> > > >     Record {
> > > >       "recordId" => "1000", "family" => "accounts",
> > > >       "columns" => [Column {"name" => "abc"}]
> > > >     },
> > > >     Record {
> > > >       "recordId" => "1001", "family" => "contacts",
> > > >       "columns" => [Column {"name" => "abc"}]
> > > >     },
> > > >     Record {
> > > >       "recordId" => "1002", "family" => "contacts",
> > > >       "columns" => [Column {"name" => "def"}]
> > > >     }
> > > >   ]
> > > > }
> > > >
> > > >
> > > > Row {
> > > >   "id" => "789",
> > > >   "records" => [
> > > >     Record {
> > > >       "recordId" => "1000", "family" => "accounts",
> > > >       "columns" => [Column {"name" => "abc"}]
> > > >     },
> > > >     Record {
> > > >       "recordId" => "1002", "family" => "contacts",
> > > >       "columns" => [Column {"name" => "def"}]
> > > >     }
> > > >   ]
> > > > }
> > > >
> > > > For the given query: "<accounts.name:abc> AND <contacts.name:abc>" would
> > > > yield 2 Row hits: 123 and 456.
> > > > For the given query: "<accounts.name:abc> AND <contacts.name:def>" would
> > > > yield 2 Row hits: 456 and 789.
> > > > For the given query: "<contacts.name:abc> AND <contacts.name:def>" would
> > > > yield 1 Row hit: 456.  NOTICE that the family is the same here,
> > > > "contacts".
> > > >
> > > > Also, in Blur you can turn off the Row query and just query the records.
> > > > This would be your typical Document-like access.
> > > >
> > > > I fear that this has not answered your question, so if it hasn't, please
> > > > let me know.
> > > >
> > > > Thanks!
> > > >
> > > > Aaron
> > > >
> > > >
> > > >
> > > > >
> > > > > --
> > > > > Ravi
> > > > >
> > > > >
> > > > >
> > > > > On Tue, Sep 17, 2013 at 7:01 PM, Aaron McCurry <[email protected]> wrote:
> > > > >
> > > > > > First off, let me say welcome!  Hopefully I can answer your questions
> > > > > > inline below.
> > > > > >
> > > > > >
> > > > > > On Tue, Sep 17, 2013 at 6:52 AM, Ravikumar Govindarajan <
> > > > > > [email protected]> wrote:
> > > > > >
> > > > > > > I am quite new to Blur and need some help with the following
> > > > > > > questions
> > > > > > >
> > > > > > > 1. Let's say I have a replication_factor=3 for all HDFS indexes. In
> > > > > > > case one of the servers hosting HDFS indexes goes down [temporary or
> > > > > > > take-down], what will happen to writes? Is some kind of HintedHandoff
> > > > > > > [as in Cassandra] supported?
> > > > > > >
> > > > > >
> > > > > > When there is a Blur Shard Server failure, state in ZooKeeper will
> > > > > > change and the other shard servers will take action to bring the down
> > > > > > shard(s) online.  This is similar to the HBase region model.  While the
> > > > > > shard(s) are being relocated (which really means being reopened from
> > > > > > HDFS), writes to the shard(s) being moved are not available.  However,
> > > > > > the bulk load capability is always available as long as HDFS is
> > > > > > available; this can be used through Hadoop MapReduce.
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > To re-phrase, what is the Consistency vs. Availability trade-off in
> > > > > > > Blur, with replication_factor>1 for HDFS indexes?
> > > > > > >
> > > > > >
> > > > > > Of the two, Consistency is favored over Availability; however, we are
> > > > > > starting development (in 0.3.0) to increase availability during
> > > > > > failures.
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > 2. Since HDFSInputStream is used underneath, will this result in too
> > > > > > > much data transfer back and forth? A multi-segment merge or even a
> > > > > > > wild-card search could trigger it.
> > > > > > >
> > > > > >
> > > > > > Blur uses an in-process file system cache (Block Cache is the term used
> > > > > > in the code) to reduce the IO from HDFS.  During index merges, data that
> > > > > > is not in the Block Cache is read from HDFS and the output is written
> > > > > > back to HDFS.  Overall, once an index is hot (has been online for some
> > > > > > time), the IO for any given search is fairly small, assuming that the
> > > > > > cluster has enough memory configured in the Block Cache.
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > 3. Does Blur also support foreign-key like semantics to search across
> > > > > > > column-families, as well as delete using row_id?
> > > > > > >
> > > > > >
> > > > > > Blur supports something called Row Queries that allow for searches
> > > > > > across column families within single Rows.  Take a look at this page for
> > > > > > a better explanation:
> > > > > >
> > > > > > http://incubator.apache.org/blur/docs/0.2.0/data-model.html#querying
> > > > > >
> > > > > > And yes, Blur supports deletes by Row; check out:
> > > > > >
> > > > > > http://incubator.apache.org/blur/docs/0.2.0/Blur.html#Fn_Blur_mutate
> > > > > > and
> > > > > > http://incubator.apache.org/blur/docs/0.2.0/Blur.html#Struct_RowMutation
> > > > > >
> > > > > > Hopefully this can answer some of your questions.  Let us know if you
> > > > > > have any more.
> > > > > >
> > > > > > Thanks,
> > > > > > Aaron
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Ravi
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
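
For reference, the Row query results Aaron lists in the quoted thread can be reproduced with a small Python model. This is only a sketch of the matching behaviour as described in his reply, not Blur code: ROWS, clause_matches and row_query are made-up names, and the assumption is that each <family.column:value> clause is satisfied when at least one record in the Row matches it.

```python
# Minimal model of the Row query examples from the thread.
# ROWS, clause_matches and row_query are hypothetical names, not Blur API.

# Row id -> list of (recordId, family, columns), copied from the examples.
ROWS = {
    "123": [("1000", "accounts", {"name": "abc"}),
            ("1001", "contacts", {"name": "abc"})],
    "456": [("1000", "accounts", {"name": "abc"}),
            ("1001", "contacts", {"name": "abc"}),
            ("1002", "contacts", {"name": "def"})],
    "789": [("1000", "accounts", {"name": "abc"}),
            ("1002", "contacts", {"name": "def"})],
}


def clause_matches(records, family, column, value):
    """True if at least one record in the Row satisfies family.column:value."""
    return any(fam == family and cols.get(column) == value
               for _record_id, fam, cols in records)


def row_query(rows, clauses):
    """Ids of Rows where every clause is satisfied by some record.

    Different clauses may be satisfied by different records of the same Row,
    which is why two clauses on the same family can both hold.
    """
    return sorted(row_id for row_id, records in rows.items()
                  if all(clause_matches(records, *clause) for clause in clauses))


print(row_query(ROWS, [("accounts", "name", "abc"), ("contacts", "name", "abc")]))
print(row_query(ROWS, [("accounts", "name", "abc"), ("contacts", "name", "def")]))
print(row_query(ROWS, [("contacts", "name", "abc"), ("contacts", "name", "def")]))
```

The three calls correspond to the three queries in the thread and return the Row ids 123 and 456, then 456 and 789, then 456 alone. There is no join on recordId anywhere in the model, which matches the statement that Blur has no FKs.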

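The Block Cache idea from the thread (keep recently read index blocks in process memory so that a hot index rarely goes back to HDFS) can be illustrated with a toy cache. This is a sketch under assumptions: the thread does not say how Blur evicts blocks, so LRU eviction is used here purely for illustration, and BlockCache, fake_hdfs_read and the file name are hypothetical.

```python
from collections import OrderedDict


class BlockCache:
    """Toy in-process cache keyed by (file name, block index).

    LRU eviction is an assumption made for this sketch; it is not taken
    from Blur's implementation.
    """

    def __init__(self, read_block, capacity_blocks):
        self._read_block = read_block      # fallback reader, stands in for HDFS
        self._capacity = capacity_blocks
        self._blocks = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, file_name, block_index):
        key = (file_name, block_index)
        if key in self._blocks:
            self._blocks.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._blocks[key]
        self.misses += 1
        data = self._read_block(file_name, block_index)
        self._blocks[key] = data
        if len(self._blocks) > self._capacity:
            self._blocks.popitem(last=False)  # evict least recently used block
        return data


def fake_hdfs_read(file_name, block_index):
    """Stand-in for a remote read; returns deterministic bytes."""
    return f"{file_name}:{block_index}".encode()


cache = BlockCache(fake_hdfs_read, capacity_blocks=2)
cache.get("segment_1", 0)   # miss: first read goes to the backing store
cache.get("segment_1", 0)   # hit: served from memory
cache.get("segment_1", 1)   # miss
cache.get("segment_1", 2)   # miss: cache is full, block 0 is evicted
cache.get("segment_1", 0)   # miss again, because block 0 was evicted
print(cache.hits, cache.misses)
```

With room for only two blocks, the run ends with 1 hit and 4 misses. Raising capacity_blocks to 3 turns the last read into a hit, which is the point made in the thread about giving the cache enough memory.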