[ https://issues.apache.org/jira/browse/MAHOUT-202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13468517#comment-13468517 ]

David Arthur commented on MAHOUT-202:
-------------------------------------

I just put together a quick test of ~1M ratings for 10k items and 1000 users 

[Insert standard disclaimer about local benchmarks]

{code}
12/10/03 08:00:00 INFO hbase.HBaseDataModel: Finished refreshing caches in 2845 ms
12/10/03 08:00:00 INFO hbase.TestHBaseDataModel: Iterate through all users
12/10/03 08:00:02 INFO hbase.TestHBaseDataModel: Iterated through all users in: 1669 ms
12/10/03 08:00:02 INFO hbase.TestHBaseDataModel: Counted 1000799 ratings
12/10/03 08:00:02 INFO hbase.TestHBaseDataModel: Total number of users: 1000
12/10/03 08:00:02 INFO hbase.TestHBaseDataModel: Iterate through all items
12/10/03 08:00:06 INFO hbase.TestHBaseDataModel: Iterated through all items in: 3790 ms
12/10/03 08:00:06 INFO hbase.TestHBaseDataModel: Counted 1000799 ratings
12/10/03 08:00:06 INFO hbase.TestHBaseDataModel: Total number of items: 10000
{code}

The results are pretty consistent between runs. I am running HBase 0.94.1 and 
Hadoop 1.0.2 in pseudo-distributed mode on a 2011 MBP (i7) with 16 GB of memory.

That works out to about 1669 µs to get one user's prefs (1669 ms / 1000 users) 
and 379 µs for one item (3790 ms / 10000 items) - seems fast enough to me :). Of 
course, in a real distributed setup, you will also incur (non-trivial) network 
costs.
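A quick back-of-the-envelope check of those per-lookup averages (a throwaway snippet; the class name is arbitrary and the numbers are just hard-coded from the log above):

```java
public class LatencyCheck {
    public static void main(String[] args) {
        // Totals taken from the test run's log
        long userScanMs = 1669;
        long itemScanMs = 3790;
        long numUsers = 1000;
        long numItems = 10000;

        // Average per-entity cost, converted to microseconds
        long perUserMicros = userScanMs * 1000 / numUsers; // 1669
        long perItemMicros = itemScanMs * 1000 / numItems; // 379

        System.out.println(perUserMicros + " us/user, " + perItemMicros + " us/item");
    }
}
```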

The cache refresh does a full table scan to discover every user and item ID and 
stores them in FastIDSets. This is the only thing that seems reasonable to 
cache, imo. If you're caching the preferences themselves, why bother with a 
distributed database like HBase? An LRU cache might make sense, but (afaik) if 
you're hitting the same user/item repeatedly then you should see some caching 
from HBase or HDFS (through the disk cache) anyway.
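The refresh step boils down to one pass that collects both ID sets. A minimal stand-alone sketch of that shape, using plain HashSets in place of Mahout's FastIDSet and an in-memory list of (userID, itemID) pairs in place of the actual HBase table scan (the class and method names here are illustrative, not the patch's API):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CacheRefreshSketch {

    // Plain HashSets stand in for FastIDSets here; the real refresh
    // would populate them from a full HBase table scan.
    private final Set<Long> userIDs = new HashSet<>();
    private final Set<Long> itemIDs = new HashSet<>();

    /** Rebuild the ID caches from (userID, itemID) rating pairs. */
    public void refresh(List<long[]> ratings) {
        userIDs.clear();
        itemIDs.clear();
        for (long[] pair : ratings) {
            userIDs.add(pair[0]); // user ID
            itemIDs.add(pair[1]); // item ID
        }
    }

    public int numUsers() { return userIDs.size(); }
    public int numItems() { return itemIDs.size(); }
}
```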

Here's a listing of the HDFS directory where this table lives, to give an idea 
of storage size:

{code}
drwxr-xr-x   - mumrah supergroup          0 2012-10-03 07:45 /hbase/taste/49a6dbaa1fef6274435c9b9551fb348a/items
-rw-r--r--   3 mumrah supergroup   47110005 2012-10-03 07:45 /hbase/taste/49a6dbaa1fef6274435c9b9551fb348a/items/cdfd7942a3d34472a8622425531dd4d1
drwxr-xr-x   - mumrah supergroup          0 2012-10-03 07:45 /hbase/taste/49a6dbaa1fef6274435c9b9551fb348a/users
-rw-r--r--   3 mumrah supergroup   47110005 2012-10-03 07:45 /hbase/taste/49a6dbaa1fef6274435c9b9551fb348a/users/85fadbcf3eb74dfd9b530d86cca5ed64
{code}

So each column family takes up about 45 MB to store 1M ratings, which works out 
to only about 47 bytes per rating (47110005 bytes / 1000799 ratings).
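A quick check of that arithmetic (again a throwaway snippet, with the sizes hard-coded from the listing and log above):

```java
public class StorageCheck {
    public static void main(String[] args) {
        long hfileBytes = 47110005L; // size of each column family's HFile, from the listing
        long numRatings = 1000799L;  // ratings counted in the test run

        double megabytes = hfileBytes / (1024.0 * 1024.0); // ~44.9 MB
        long bytesPerRating = hfileBytes / numRatings;     // 47

        System.out.printf("%.1f MB, %d bytes/rating%n", megabytes, bytesPerRating);
    }
}
```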

                
> Make Taste support HBase as data store
> --------------------------------------
>
>                 Key: MAHOUT-202
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-202
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>    Affects Versions: 0.3
>            Reporter: Jeff Zhang
>            Priority: Minor
>         Attachments: MAHOUT-202.patch
>
>
> I'd like to add hbase as another data store option for taste.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
