On Tue, Dec 8, 2009 at 11:43 PM, stack <[email protected]> wrote:
> Try using this filter instead:
>
> scan.setFilter(FirstKeyOnlyFilter.new())
>
> Will only return row keys, if that's the effect you are looking for.
>
> St.Ack
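
A Java-client version of that shell one-liner might look like the sketch below (the `table` handle is assumed to exist already; API names are from the 0.20 client, and this needs a live cluster to actually run):

```java
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
import org.apache.hadoop.hbase.util.Bytes;

// FirstKeyOnlyFilter returns only the first KeyValue of each row,
// so little more than the row key crosses the wire.
Scan scan = new Scan();
scan.setFilter(new FirstKeyOnlyFilter());
ResultScanner scanner = table.getScanner(scan);
try {
    for (Result r : scanner) {
        System.out.println(Bytes.toString(r.getRow()));
    }
} finally {
    scanner.close();
}
```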
>
>
> On Tue, Dec 8, 2009 at 3:30 PM, Edward Capriolo <[email protected]>wrote:
>
>> On Tue, Dec 8, 2009 at 6:00 PM, Andrew Purtell <[email protected]>
>> wrote:
>> > I added an entry to the troubleshooting page up on the wiki:
>> >
>> > http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A16
>> >
>> > - Andy
>> >
>> >
>> >
>> >
>> >
>> > ________________________________
>> > From: Ryan Rawson <[email protected]>
>> > To: [email protected]
>> > Sent: Tue, December 8, 2009 5:21:25 PM
>> > Subject: Re: PrefixFilter performance question.
>> >
>> > You want:
>> >
>> >
>> http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/client/HTable.html#scannerCaching
>> >
>> > The default is low because if a job takes too long processing, a
>> > scanner can time out, which causes unhappy jobs/people/emails.
>> >
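
In code, that knob can be set table-wide or per-scan; a sketch (the value 100 is an arbitrary choice here, trading client memory for fewer round trips, and the table name is taken from the schema shown below in this thread):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Scan;

HBaseConfiguration conf = new HBaseConfiguration();
HTable table = new HTable(conf, "webdata");
table.setScannerCaching(100);  // table-wide: rows fetched per scanner RPC

Scan scan = new Scan();
scan.setCaching(100);          // or override it for a single scan
```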
>> > BTW I can read small rows out of a 19 node cluster at 7 million
>> > rows/sec using a map-reduce program. Any individual process is doing
>> > 40k+ rows/sec or so
>> >
>> > -ryan
>> >
>> > On Tue, Dec 8, 2009 at 12:25 PM, Edward Capriolo <[email protected]>
>> wrote:
>> >> Hey all,
>> >>
>> >> I have been doing some performance evaluation with MySQL vs. HBase.
>> >>
>> >> I have a table webtable
>> >> {NAME => 'webdata', FAMILIES => [{NAME => 'anchor', COMPRESSION =>
>> >> 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536',
>> >> IN_MEMORY => 'false', BLOCKCACHE => 'true'}, {NAME => 'image',
>> >> COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE
>> >> => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}, {NAME =>
>> >> 'raw_data', COMPRESSION => 'NONE', VERSIONS => '3', TTL =>
>> >> '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE
>> >> => 'true'}]}
>> >>
>> >> I have a normalized version in MySQL. I currently have loaded:
>> >>
>> >> nyhadoopdev6:60030 1260289750689 requests=4, regions=3, usedHeap=99, maxHeap=997
>> >> nyhadoopdev7:60030 1260289862481 requests=0, regions=2, usedHeap=181, maxHeap=997
>> >> nyhadoopdev8:60030 1260289909059 requests=0, regions=2, usedHeap=395, maxHeap=997
>> >>
>> >> This is a snippet here.
>> >>
>> >> if (mysql) {
>> >>   try {
>> >>     PreparedStatement ps = conn.prepareStatement(
>> >>         "SELECT * FROM page WHERE page LIKE (?)");
>> >>     ps.setString(1, "http://www.s%");
>> >>     ResultSet rs = ps.executeQuery();
>> >>     while (rs.next()) {
>> >>       sPageCount++;
>> >>     }
>> >>     rs.close();
>> >>     ps.close();
>> >>   } catch (SQLException ex) { System.out.println(ex); System.exit(1); }
>> >> }
>> >>
>> >> if (hbase) {
>> >>   Scan s = new Scan();
>> >>   //s.setCacheBlocks(true);
>> >>   s.setFilter(new PrefixFilter(Bytes.toBytes("http://www.s")));
>> >>   ResultScanner scanner = table.getScanner(s);
>> >>   try {
>> >>     for (Result rr : scanner) {
>> >>       sPageCount++;
>> >>     }
>> >>   } finally {
>> >>     scanner.close();
>> >>   }
>> >> }
>> >>
>> >> I am seeing about 0.3 ms from MySQL and 20-second performance from
>> >> HBase. I have read some tuning docs but most seem geared toward insertion
>> >> speed, not search speed. I would think this would be a
>> >> bread-and-butter search for HBase since the row keys are naturally
>> >> sorted lexicographically. I am not running a giant setup here, 3
>> >> nodes, 2x replication, but I would think that is almost a non-factor
>> >> here since this data is fairly small. Hints?
>> >>
>> >
>> >
>> >
>> >
>>
>> I raised this from 1 to 30 -> 18 sec
>> I raised this to 100 -> 17 sec
>> I raised this to 1000 -> OOM
>>
>> The OOM pointed me in the direction that this comparison is not apples
>> to apples. In MySQL the page table is normalized, but in HBase it is
>> not. I see lots of data moving across the wire.
>>
>> I tried a filter to move just the row key across the wire, but I do
>> not think I have it right...
>>
>> List<Filter> filters = new ArrayList<Filter>();
>> filters.add(new PrefixFilter(Bytes.toBytes("http://www.s")));
>> filters.add(new QualifierFilter(CompareOp.EQUAL,
>>     new BinaryComparator(Bytes.toBytes("ROW"))));
>> Filter f = new FilterList(Operator.MUST_PASS_ALL, filters);
>> s.setFilter(f);
>> ResultScanner scanner = table.getScanner(s);
>>
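
For what it's worth, a scan that ships only row keys would pair the prefix filter with FirstKeyOnlyFilter (the filter St.Ack suggests elsewhere in this thread) rather than a QualifierFilter. A sketch, assuming that filter is available in your build and that `table` already exists; it also seeds the scan's start row, since a PrefixFilter alone still scans from the beginning of the table:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.Filter;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.FilterList.Operator;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

byte[] prefix = Bytes.toBytes("http://www.s");

List<Filter> filters = new ArrayList<Filter>();
filters.add(new PrefixFilter(prefix));
// Keep only the first KeyValue per row: effectively row keys only.
filters.add(new FirstKeyOnlyFilter());

// Starting at the prefix skips every row that sorts before it.
Scan s = new Scan(prefix);
s.setFilter(new FilterList(Operator.MUST_PASS_ALL, filters));
ResultScanner scanner = table.getScanner(s);
```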
>
I have added the smallest family I have:
s.addFamily( Bytes.toBytes("anchor") )
This drops the search to
spage_time: 2266 ms
and a second consecutive search takes
~1000 ms
This is more reasonable; the remaining time discrepancy could be explained
by each entry having 5-10 random anchors associated with it.
I have used the CE HBase 0.20.0 RPM, and guess what I do not have?
FirstKeyOnlyFilter :) I really like the HBase layout/init
scripts this RPM provides. I can't seem to find the src.rpm for it
anywhere. If I do not find it in a few days, I might just go to latest or
trunk. (Side note: does anyone have the source RPM?)