> Practically speaking, it probably isn't feasible to have an hbase column per
> matrix column

Just in case that is predicated on old information:

Distinguishing between columns and column qualifiers, it is architecturally feasible to have a single column family with millions of values in a row, each with a distinct qualifier. Someone with more depth in this space than I could say whether order of millions is sufficient for handling very large sparse matrices, and how compelling that might be.

In practical terms, HBASE-1537 (https://issues.apache.org/jira/browse/HBASE-1537) makes retrieval of large rows with scanners possible with current svn trunk or the upcoming release 0.21.0. (Chunked get of a single row is not currently under consideration.) Previously, due to a limitation of the implementation, trying to retrieve on the order of a million values stored in a column would indeed have blown up either the region server or the client, because all the data had to be packed into a single RPC buffer.

> Mahout is trying to stay pretty agnostic relative to data storage methods.
> [...]
> We need to support all of those options.

Good to hear. If you have any problems with HBase, please come over to hbase-u...@.

Best regards,

   - Andy

Committer, HBase
Lurker, Mahout

________________________________
From: Ted Dunning <[email protected]>
To: [email protected]
Cc: [email protected]
Sent: Mon, November 16, 2009 9:35:14 AM
Subject: Re: Have a idea of leveraging hbase for machine learning

Jeff,

Glad to hear you are looking at Mahout.

Practically speaking, it probably isn't feasible to have an hbase column per matrix column. That makes storage of matrix data in hbase somewhat less compelling, although clearly still very useful for some applications.

As Grant pointed out, Mahout is trying to stay pretty agnostic relative to data storage methods. Some people need to read matrices from Lucene indexes, others from files, still others from hbase. We need to support all of those options.

Your suggestion about making sure that Taste supports hbase is a good one.
On Mon, Nov 16, 2009 at 12:54 AM, Jeff Zhang <[email protected]> wrote:

> Then we can store them as one hbase row:
> A: {title:love=>1,
> content:I=>1,content:love=>1,content:this=>1,content:game=>1}
>
>
> Using hbase, it will be very easy for us to compute the similarity between
> documents.
> And another advantage of hbase compared to raw text data is that it's
> semi-structured. And I think it will be easy for programming if we use
> hbase rather than the raw data.
>

--
Ted Dunning, CTO
DeepDyve
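For illustration, the row layout in Jeff's message (one row per document, one "family:qualifier" cell per term) can be sketched in plain Python without touching HBase at all. This is only a rough sketch under stated assumptions: the dictionaries stand in for rows fetched from a table, row B and the function name cosine_similarity are hypothetical, and cosine similarity is just one possible choice of document similarity; the thread does not say which measure was intended.

```python
import math

# In-memory stand-ins for HBase rows (assumption: no real HBase client is used).
# Keys mimic "family:qualifier" cells; values are term counts.
rows = {
    # Row A follows the example in the quoted message.
    "A": {"title:love": 1, "content:I": 1, "content:love": 1,
          "content:this": 1, "content:game": 1},
    # Row B is a hypothetical second document, added only for comparison.
    "B": {"title:game": 1, "content:I": 1, "content:love": 1,
          "content:chess": 1},
}


def cosine_similarity(row_a, row_b):
    """Cosine similarity of two sparse rows stored as qualifier -> count dicts."""
    # Only qualifiers present in both rows contribute to the dot product.
    shared = row_a.keys() & row_b.keys()
    dot = sum(row_a[q] * row_b[q] for q in shared)
    norm_a = math.sqrt(sum(v * v for v in row_a.values()))
    norm_b = math.sqrt(sum(v * v for v in row_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty row is treated as similar to nothing
    return dot / (norm_a * norm_b)


similarity_ab = cosine_similarity(rows["A"], rows["B"])
```

Because each row holds only the qualifiers that actually occur in the document, the work per pair is proportional to the number of non-empty cells, not to the size of the vocabulary, which is the property that makes a sparse one-qualifier-per-term layout attractive.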
