Re: [jira] Commented: (MAHOUT-322) DistributedRowMatrix should live in SequenceFile instead of SequenceFile

Ted Dunning Wed, 09 Feb 2011 17:48:18 -0800

DRM's of booleans would be a useful special case for a number of things,
especially if you backed it with a sparse boolean matrix and vector
implementation.


One caveat is that sparse boolean implementations need a bit more cleverness
than you might expect.  Typically you need to change implementation
depending on the statistics of the data:

- for very sparse vectors, run-length encoding, possibly with small-integer
compression on the run-lengths is an excellent way to go for a serialized
form.  In memory, a sorted array of integer indexes of non-zeros is more
useful because it allows more efficient binary search.  Each non-zero bit
costs 4 bits in this representation which can hurt if you have many bits
near each other.t

- for clumpy, but overall sparse vectors, composite run-length and bitmap
representations are useful.  With good compression of the run-lengths, this
reduces to the first case for storage, but you often need a different
in-memory storage format for these.  One candidate is an array of 32 bit
combined bitmap offset and int count that refers to an array of 32 bit
bitmap segments.  In this format, a single bit can cost 8 bytes, but 5
clumped bits could have an amortized cost of just over a bytes.

- when sparsity varies, composite formats might be best.  This is
essentially an overlay of a traditional bitset and representation 1.  This
is best for long-tail Zipfian sorts of data where indexes are roughly
ordered by descending probability.


On Wed, Feb 9, 2011 at 5:06 PM, Lance Norskog (JIRA) <[email protected]>wrote:

>
>    [
> https://issues.apache.org/jira/browse/MAHOUT-322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992818#comment-12992818]
>
> Lance Norskog commented on MAHOUT-322:
> --------------------------------------
>
> Would I use this to do graph logic on boolean matrices? If so, I would want
> a separate Writable for a "block of bits", that is part of a row of bits.
>
> > DistributedRowMatrix should live in SequenceFile<Writable,VectorWritable>
> instead of SequenceFile<IntWritable,VectorWritable>
> >
> -----------------------------------------------------------------------------------------------------------------------------
> >
> >                 Key: MAHOUT-322
> >                 URL: https://issues.apache.org/jira/browse/MAHOUT-322
> >             Project: Mahout
> >          Issue Type: Improvement
> >          Components: Math
> >    Affects Versions: 0.3
> >            Reporter: Danny Leshem
> >            Assignee: Jake Mannix
> >            Priority: Minor
> >
> > Class documentation for
> org.apache.mahout.math.hadoop.DistributedRowMatrix states that the matrix
> lives in SequenceFile<WritableComparable, VectorWritable>. Implementation,
> however, assumes SequenceFile<IntWritable, VectorWritable> is passed.
> > Currently, usage of this class inside Mahout is limited to Jake Mannix's
> SVD package, mainly to perform PCA on a massive document corpus. Given such
> corpus, it makes sense to not limit the user by forcing the document "key"
> to be integer. Instead, users should be able to use Text keys (document name
> or id) or keys made of any other arbitrary class. One may even argue that
> forcing a WritableComparable key is too limiting, and a simple Writable key
> should be assumed.
> > In fact, it would be best if DistributedRowMatrix did not read the
> SequenceFile key at all, to allow user-specific classes (unknown to Mahout)
> to be used as opaque keys even when their libraries are not available in
> runtime. Currently DistributedRowMatrix calls "reader.next(i, v)"... but
> reader has methods to query just the value, avoiding key deserialization
> altogether.
>
> --
> This message is automatically generated by JIRA.
> -
> For more information on JIRA, see: http://www.atlassian.com/software/jira
>
>
>

Re: [jira] Commented: (MAHOUT-322) DistributedRowMatrix should live in SequenceFile instead of SequenceFile

Reply via email to