...region, you could write a little method to split the first and last key of the region into several disjoint lexicographic buckets and create a scan for each bucket, then execute those scans in parallel. Your data probably doesn't distribute uniformly over lexicographic buckets though, so the scans are unlikely to execute at a constant rate and you'll get results in time proportional to the lexicographic bucket with the highest cardinality in the region. I'd be interested to know if anyone on the list has ever tried this and what the results were?

Using the much simpler approach of parallelising over regions by creating multiple disjoint scans client side, as suggested, your performance now depends on your regions, which you have some control over. You can achieve the same effect by pre-splitting your table such that you empirically optimise read performance for the dataset you store.
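A minimal, self-contained sketch of the bucket-splitting idea above, assuming fixed-width row keys compared as unsigned big-endian integers. The class and helper names are illustrative only, not part of any HBase API:

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class KeyBuckets {
    /** Split [startKey, endKey) into n disjoint, lexicographically ordered
     *  sub-ranges by interpolating between the keys interpreted as unsigned
     *  big-endian integers. Returns a list of {start, stop} byte[] pairs. */
    static List<byte[][]> split(byte[] startKey, byte[] endKey, int n) {
        int width = Math.max(startKey.length, endKey.length);
        BigInteger lo = new BigInteger(1, pad(startKey, width));
        BigInteger hi = new BigInteger(1, pad(endKey, width));
        BigInteger span = hi.subtract(lo);
        List<byte[][]> buckets = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            BigInteger a = lo.add(span.multiply(BigInteger.valueOf(i)).divide(BigInteger.valueOf(n)));
            BigInteger b = lo.add(span.multiply(BigInteger.valueOf(i + 1)).divide(BigInteger.valueOf(n)));
            buckets.add(new byte[][] { toBytes(a, width), toBytes(b, width) });
        }
        return buckets;
    }

    /** Right-pad shorter keys with 0x00 so both bounds share a width. */
    static byte[] pad(byte[] key, int width) {
        return Arrays.copyOf(key, width);
    }

    /** Render the interpolated value back into a fixed-width key. */
    static byte[] toBytes(BigInteger v, int width) {
        byte[] raw = v.toByteArray();
        byte[] out = new byte[width];
        int src = Math.max(0, raw.length - width);
        int dst = Math.max(0, width - raw.length);
        System.arraycopy(raw, src, out, dst, raw.length - src);
        return out;
    }

    public static void main(String[] args) {
        for (byte[][] r : split(new byte[] {0x00}, new byte[] {(byte) 0x80}, 4))
            System.out.printf("[%02x, %02x)%n", r[0][0], r[1][0]);
    }
}
```

Each returned [start, stop) pair would seed one client-side Scan. As noted above, skewed data means the buckets finish at very different times.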
...region start key and end key before initiating scan operations for every initial load.

Thanks.

On 20 February 2017 at 10:59, ramkrishna vasudevan <ramkrishna.s.vasu
Thanks,

Richard
...issue parallel scans from your app.

One thing to watch out for is that if there is a split in the region then this start and end row may change, so in that case it is better you try to get the regions every time before you issue a scan. Does that make sense to you?
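The per-region pattern Ram describes might look like the sketch below. `fetchRegions` and `scanRange` are hypothetical stand-ins for `RegionLocator#getStartEndKeys()` and a Scan bounded by each region's start/stop row; the point is that boundaries are re-fetched on every load, so region splits are picked up:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;
import java.util.function.Supplier;

public class PerRegionScans {
    /** One [startRow, stopRow) pair per region. */
    record Range(byte[] start, byte[] stop) {}

    /** Re-fetch the current region boundaries, then run one scan task per
     *  region in parallel and collect the results. In real code fetchRegions
     *  would wrap RegionLocator#getStartEndKeys() and scanRange would issue
     *  a Table#getScanner(Scan) bounded by the range. */
    static <T> List<T> scanAllRegions(Supplier<List<Range>> fetchRegions,
                                      Function<Range, T> scanRange) {
        List<Range> regions = fetchRegions.get(); // fresh list each load
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, regions.size()));
        try {
            List<Future<T>> futures = new ArrayList<>();
            for (Range r : regions)
                futures.add(pool.submit(() -> scanRange.apply(r)));
            List<T> results = new ArrayList<>();
            for (Future<T> f : futures)
                results.add(f.get()); // preserve region order in the output
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Range> demo = List.of(new Range(new byte[] {0}, new byte[] {5}),
                                   new Range(new byte[] {5}, new byte[] {10}));
        System.out.println(scanAllRegions(() -> demo, r -> r.stop()[0] - r.start()[0]));
    }
}
```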
On Sat, Feb 18, 2017 at 1:44 PM, Anil <anilk...@gmail.com> wrote:

Hi,

I am building a use case where I have to load the HBase data into an in-memory database (IMDB). I am scanning each region and loading the data into IMDB.

I am looking at the parallel scanner ( https://issues.apache.org/jira/browse/HBASE-8504, HBASE-1935 ) to reduce the load time, and at HTable#getRegionsInRange(byte[] startKey, byte[] endKey)
Regards,
Ram
Great thread for a real-world problem.

Michael, it sounds like the initial design was more of a traditional db solution, whereas with hbase (and nosql in general) the design is to denormalize and build your row/cf structure to fit the use case. Disks are cheap, writes are fast, so build your
I have an issue with my HBase cluster. We have a 4 node HBase/Hadoop (4*32
GB RAM and 4*6 TB disk space) cluster. We are using Cloudera distribution
for maintaining our cluster. I have a single tweets table in which we store
the tweets, one tweet per row (it has millions of rows currently).
Now I
So in your step 2 you have the following:

FOREACH row IN TABLE alpha:
    SELECT something
    FROM TABLE alpha
    WHERE alpha.url = row.url

Right?
...
...
And how long does it take to do a full table scan? ;-)
(there's more, but that's the
Hi Michael,

Yes, that is exactly what I do in step 2. I am aware of the reason for the scanner timeout exceptions: it is the time between two consecutive invocations of the next call on a specific scanner object. I increased the scanner timeout to 10 min on the region server and still I keep seeing
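For reference, the 10-minute timeout Narendra mentions would be set in hbase-site.xml. On HBase releases contemporary with this thread the property was `hbase.regionserver.lease.period`; later client versions read `hbase.client.scanner.timeout.period` instead (both in milliseconds):

```xml
<!-- hbase-site.xml: raise the scanner lease/timeout to 10 minutes. -->
<property>
  <name>hbase.regionserver.lease.period</name>
  <value>600000</value>
</property>
<property>
  <name>hbase.client.scanner.timeout.period</name>
  <value>600000</value>
</property>
```

Raising the lease only hides the symptom, though. Shrinking the gap between two next() calls attacks the cause: a smaller Scan#setCaching makes the client go back to the region server more often, renewing the lease each time.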
Narendra,

Are you trying to solve a real problem, or is this a class project?

Your solution doesn't scale. It's a non-starter. 130 seconds for each iteration times 1 million iterations is how long? 130 million seconds, which is ~36,000 hours, or over 4 years to complete.

(the numbers are rough but
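Michael's rough arithmetic holds up; spelled out:

```java
public class ScanMath {
    public static void main(String[] args) {
        long perIterationSec = 130;      // one full-table scan per outer-loop row
        long iterations = 1_000_000;     // outer-loop rows
        long totalSec = perIterationSec * iterations;  // 130,000,000 s
        double hours = totalSec / 3600.0;              // ~36,111 h
        double years = hours / (24 * 365);             // ~4.1 years
        System.out.printf("%,d s = %.0f h = %.1f years%n", totalSec, hours, years);
    }
}
```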
-----Original Message-----
From: Narendra yadala [mailto:narendra.yad...@gmail.com]
Sent: Thursday, April 19, 2012 8:04 PM
To: user@hbase.apache.org
Subject: Re: HBase parallel scanner performance

Hi Michael,

Yes, that is exactly what I do in step 2. I am aware of the reason for the scanner timeout exceptions
Michael,

Thanks for the response. This is a real problem and not a class project. The boxes themselves cost 9k ;)

I think there is some difference in understanding of the problem. The table has 2m rows but I am looking at the latest 10k rows only in the outer for loop. Only in the inner for loop I am
...did you keep an eye on the GC logs?
Thank you.
Regards,
Jieshan
Narendra,

I think you are still missing the point. 130 seconds to scan the table per iteration. Even if you have only 10K rows, that's 130 × 10^4 = 1.3 × 10^6 seconds, ~361 hours.

Compare that to 10K rows where you then select a single row in your sub-select that has a list of all of the associated rows.
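The redesign Michael is pointing at — pay the cost at write time so the inner loop collapses to a point lookup — can be sketched with a plain map standing in for a second, url-keyed table. All names here are illustrative, not an actual schema from this thread:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class UrlIndex {
    // Stand-in for a second HBase table keyed by url: the row key would be
    // the url and each associated tweet id a column qualifier in one family.
    private final Map<String, Set<String>> index = new HashMap<>();

    /** Index maintenance happens at write time, not query time. */
    void put(String tweetId, String url) {
        index.computeIfAbsent(url, k -> new TreeSet<>()).add(tweetId);
    }

    /** The inner loop becomes one point lookup instead of a table scan. */
    Set<String> tweetsFor(String url) {
        return index.getOrDefault(url, Set.of());
    }
}
```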
...the stop-the-world pause time. I don't think parallel scanners are the problem.

Jieshan
Michael,
I will do the redesign and build the index. Thanks a lot for the insights.
Narendra
On Thu, Apr 19, 2012 at 9:56 PM, Michael Segel <michael_se...@hotmail.com> wrote:
No problem.
One of the hardest things to do is to try to be open to other design ideas and
not become wedded to one.
I think once you get that working you can start to look at your cluster.
On Apr 19, 2012, at 1:26 PM, Narendra yadala wrote: