> Practically speaking, it probably isn't feasible to have an hbase column per 
> matrix column

In case that is predicated on old information: if we distinguish between 
column families and column qualifiers, it is architecturally feasible to have 
a single column family holding millions of values in one row, each under a 
distinct qualifier. Someone with more depth in this space than I could say 
whether on the order of millions is sufficient for handling very large sparse 
matrices, and how compelling that might be. In practical terms, HBASE-1537 
(https://issues.apache.org/jira/browse/HBASE-1537) makes retrieval of large 
rows with scanners possible with current svn trunk or the upcoming 0.21.0 
release. (Chunked get of a single row is not currently under consideration.) 
Previously, due to an implementation limitation, trying to retrieve on the 
order of a million values stored in a column would have blown up either the 
region server or the client, because all the data had to be packed into a 
single RPC buffer. 
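
To make the buffer argument concrete, here is a minimal sketch in plain 
Python. It does not use the HBase client API at all; the function name 
scan_row_in_batches, the "col:" qualifier scheme and the batch size are 
assumptions made up for illustration. It only shows how returning one wide 
row in fixed-size batches bounds what any single response has to hold.

```python
from itertools import islice
from typing import Dict, Iterator, List, Tuple

Cell = Tuple[str, float]  # (qualifier, value); hypothetical representation


def scan_row_in_batches(row: Dict[str, float], batch_size: int) -> Iterator[List[Cell]]:
    """Yield the cells of one wide row in qualifier order, batch_size at a time.

    This is an in-memory stand-in for a scanner, not HBase code.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    cells = iter(sorted(row.items()))
    while True:
        batch = list(islice(cells, batch_size))
        if not batch:
            return
        yield batch


# One sparse matrix row stored as a single wide row:
# the qualifier encodes the matrix column index (assumed scheme).
wide_row = {f"col:{j:07d}": 1.0 for j in range(0, 1_000_000, 4)}

# No single batch ever holds more than batch_size cells.
largest = max(len(b) for b in scan_row_in_batches(wide_row, 10_000))
```

The caller still sees every cell, but the peak size of one response depends 
on the batch size rather than on the width of the row. 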

> Mahout is trying to stay pretty agnostic relative to data storage methods.
> [...]
> We need to support all of those options.

Good to hear. If you have any problems with HBase, please come over to 
hbase-u...@. 

Best regards,

   - Andy
     Committer, HBase
     Lurker, Mahout




________________________________
From: Ted Dunning <[email protected]>
To: [email protected]
Cc: [email protected]
Sent: Mon, November 16, 2009 9:35:14 AM
Subject: Re: Have a idea of leveraging hbase for machine learning

Jeff,

Glad to hear you are looking at Mahout.

Practically speaking, it probably isn't feasible to have an hbase column per
matrix column.  That makes storage of matrix data in hbase somewhat less
compelling, although clearly still very useful for some applications.

As Grant pointed out, Mahout is trying to stay pretty agnostic relative to
data storage methods.  Some people need to read matrices from Lucene
indexes, others from files, still others from hbase.  We need to support all
of those options.

Your suggestion about making sure that Taste supports hbase is a good one.

On Mon, Nov 16, 2009 at 12:54 AM, Jeff Zhang <[email protected]> wrote:

> Then we can store them as one hbase row:
> A: {title:love=>1,
> content:I=>1,content:love=>1,content:this=>1,content:game=>1}
>
>
> Using hbase, it will be very easy for us to compute the similarity between
> documents.
> Another advantage of hbase over raw text data is that it is
> semi-structured, so I think programming against hbase will be easier than
> working with the raw data.
>
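
As a rough illustration of the similarity computation mentioned in the quoted 
text, here is a minimal sketch that treats each document row as a sparse map 
from "family:qualifier" to a term count and computes cosine similarity. 
Cosine similarity is my choice of measure, not something the quoted message 
specifies, and document B and the function name are hypothetical. Nothing 
here talks to HBase; the dicts stand in for rows already fetched.

```python
import math
from typing import Dict


def cosine_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity of two sparse rows keyed by 'family:qualifier'."""
    # Iterate over the smaller row so the dot product costs O(min(|a|, |b|)).
    if len(a) > len(b):
        a, b = b, a
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty row is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Row A from the quoted example.
doc_a = {"title:love": 1, "content:I": 1, "content:love": 1,
         "content:this": 1, "content:game": 1}
# Row B is a made-up second document, for illustration only.
doc_b = {"title:game": 1, "content:I": 1, "content:love": 1,
         "content:this": 1, "content:movie": 1}

similarity = cosine_similarity(doc_a, doc_b)
```

Only qualifiers present in both rows contribute to the dot product, which is 
why a sparse row layout suits this kind of comparison.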



-- 
Ted Dunning, CTO
DeepDyve



      
