[ https://issues.apache.org/jira/browse/HBASE-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720856#action_12720856 ]

Jonathan Gray commented on HBASE-867:
-------------------------------------

I am running tests for this issue on a 5+1 node cluster; each node has 2 cores 
and 2 GB of RAM and hosts two HDFS and two HBase instances (the 0.19 cluster is 
still up but idle).

Using a newer version of the HBench tool I posted in HBASE-1501, I'm able to 
run a number of different tests with high numbers of columns.

My test inserts 10 rows, each with 2M columns, in 200 rounds; each round 
inserts 10k columns into each of the 10 rows.

Qualifiers are incremented binary longs (1 -> 2M), so 8 bytes each.  Values are 
randomized binary data of fixed length.  By varying the value size (I have 
tried between 8 and 32 bytes per value), I can get different behavior.
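For reference, here is a minimal sketch of the key/value generation described above.  It assumes qualifiers are encoded big-endian like Bytes.toBytes(long); the helper names are mine, not from HBench:

```java
import java.nio.ByteBuffer;
import java.util.Random;

public class QualifierEncoding {
    // Encode a qualifier as an 8-byte big-endian long, mirroring
    // HBase's Bytes.toBytes(long) (assumption about the tool's encoding).
    static byte[] qualifier(long i) {
        return ByteBuffer.allocate(8).putLong(i).array();
    }

    // Fixed-length randomized value payload.
    static byte[] value(Random rnd, int len) {
        byte[] v = new byte[len];
        rnd.nextBytes(v);
        return v;
    }

    public static void main(String[] args) {
        // Qualifiers are always 8 bytes regardless of the numeric value.
        System.out.println(qualifier(1888888L).length);        // 8
        // Value size is the knob varied in the tests (8-32 bytes).
        System.out.println(value(new Random(), 32).length);    // 32
    }
}
```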

With not much memory to give the RS, I run into OOME problems when serializing 
the Result.  I'm going to rerun the tests at larger value sizes and collect 
clean logs to look at, making sure block caching is disabled so it doesn't hog 
heap.

However, with 8-byte values I'm able to import without a problem (this causes 
several splits; in the end we have 5 regions for the 10 rows).  In addition to 
the import test, I'm also scanning these 10 rows in two ways: a full scan (all 
columns in the family) and a skip scan (asking for two specific columns, 
qualifier=1 and qualifier=1888888, i.e. the beginning and end of each row).
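To illustrate the difference between the two access patterns, here is a toy simulation over an in-memory sorted row (not the RegionServer internals): a full scan touches every column, while a skip scan only looks up the requested qualifiers.  Method names and data shapes are mine, for illustration only:

```java
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

public class SkipScanSketch {
    // Full scan: iterate every column stored in the family.
    static int fullScan(NavigableMap<Long, byte[]> row) {
        int n = 0;
        for (Long q : row.keySet()) n++;
        return n;
    }

    // Skip scan: look up only the requested qualifiers instead of
    // nexting through all cells (a simplified model of what this
    // issue's scanner behavior is about).
    static int skipScan(NavigableMap<Long, byte[]> row, List<Long> wanted) {
        int n = 0;
        for (Long q : wanted) {
            if (row.containsKey(q)) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // A row with 10k columns, qualifiers 1..10000 (scaled down from 2M).
        NavigableMap<Long, byte[]> row = new TreeMap<>();
        for (long q = 1; q <= 10_000; q++) row.put(q, new byte[8]);

        System.out.println(fullScan(row));                      // 10000
        System.out.println(skipScan(row, List.of(1L, 8888L)));  // 2
    }
}
```

The skip scan returns 2 columns per row no matter how wide the row is, which is why its per-row times in the output below are dominated by seeking rather than by result size.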

{noformat}
Inserted 10 rows each with 2000000 total columns in 344566ms (34456.6ms/row)

Skip Scanner open
Row [row0] Scanned, Contains 2 Columns (10155 ms)
Row [row1] Scanned, Contains 2 Columns (9978 ms)
Row [row2] Scanned, Contains 2 Columns (10675 ms)
Row [row3] Scanned, Contains 2 Columns (9608 ms)
Row [row4] Scanned, Contains 2 Columns (11703 ms)
Row [row5] Scanned, Contains 2 Columns (12103 ms)
Row [row6] Scanned, Contains 2 Columns (6828 ms)
Row [row7] Scanned, Contains 2 Columns (6603 ms)
Row [row8] Scanned, Contains 2 Columns (6331 ms)
Row [row9] Scanned, Contains 2 Columns (6553 ms)
Scanned 10 rows in 90551ms (9055.1ms/row)

Full Scanner open
Row [row0] Scanned, Contains 2000000 Columns (14374 ms)
Row [row1] Scanned, Contains 2000000 Columns (14879 ms)
Row [row2] Scanned, Contains 2000000 Columns (14053 ms)
Row [row3] Scanned, Contains 2000000 Columns (14263 ms)
Row [row4] Scanned, Contains 2000000 Columns (8811 ms)
Row [row5] Scanned, Contains 2000000 Columns (10327 ms)
Row [row6] Scanned, Contains 2000000 Columns (9757 ms)
Row [row7] Scanned, Contains 2000000 Columns (9343 ms)
Row [row8] Scanned, Contains 2000000 Columns (9526 ms)
Row [row9] Scanned, Contains 2000000 Columns (10004 ms)
Scanned 10 rows in 115342ms (11534.2ms/row)
{noformat}

Repeated runs improve performance, and the ordering of the two scan types makes 
a difference.  The block cache is off, so we're seeing the effect of the Linux 
file cache.

> If millions of columns in a column family, hbase scanner won't come up
> ----------------------------------------------------------------------
>
>                 Key: HBASE-867
>                 URL: https://issues.apache.org/jira/browse/HBASE-867
>             Project: Hadoop HBase
>          Issue Type: Bug
>            Reporter: stack
>            Assignee: Jonathan Gray
>            Priority: Critical
>             Fix For: 0.20.0
>
>
> Our Daniel has uploaded a table that has a column family with millions of 
> columns in it.  He can get items from the table promptly specifying row and 
> column.  Scanning is another matter.  Thread dumping I see we're stuck in the 
> scanner constructor nexting through cells.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
