Some sparse matrices can be sent around in full form (the small ones!).
Some cannot.
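
Shipping one around is easy enough either way; roughly like this, with only
the non-zero entries going over the wire (the class is a sketch I just made
up, not something already in Hadoop):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Sketch of one sparse row as a Hadoop Writable.  The class is
// hypothetical; the point is that only the non-zero entries are
// serialized.
public class SparseRowWritable implements Writable {
  private int[] indices;    // column indices of the non-zeros
  private double[] values;  // the non-zero values

  public SparseRowWritable() {}  // Hadoop needs a no-arg constructor

  public SparseRowWritable(int[] indices, double[] values) {
    this.indices = indices;
    this.values = values;
  }

  public void write(DataOutput out) throws IOException {
    out.writeInt(indices.length);
    for (int k = 0; k < indices.length; k++) {
      out.writeInt(indices[k]);
      out.writeDouble(values[k]);
    }
  }

  public void readFields(DataInput in) throws IOException {
    int n = in.readInt();
    indices = new int[n];
    values = new double[n];
    for (int k = 0; k < n; k++) {
      indices[k] = in.readInt();
      values[k] = in.readDouble();
    }
  }
}

The problem is purely one of size: once the non-zeros alone run to
gigabytes, sending the whole matrix to every node stops being an option.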

Regardless, block decomposition is at the heart of lots of algorithms.  That
means that the blocks would need to be distributed, but the operations on the
individual blocks are themselves just ordinary matrix operations.  In
principle, then, a non-parallel matrix package that provides good "view"
operations would be all that is needed.  Unfortunately, most of the simpler
APIs like JAMA don't do views worth a dang.
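
To be concrete about what I mean by a view: something along these lines,
where a block shares the backing storage of its parent instead of copying
it.  (All the names below are made up; this is not JAMA or any other
existing package.)

// Sketch of a dense matrix with cheap block views.
public class DenseMatrix {
  private final double[] data;  // row-major backing storage, shared by views
  private final int rowOffset, colOffset, rows, cols, stride;

  public DenseMatrix(int rows, int cols) {
    this(new double[rows * cols], 0, 0, rows, cols, cols);
  }

  private DenseMatrix(double[] data, int rowOffset, int colOffset,
                      int rows, int cols, int stride) {
    this.data = data;
    this.rowOffset = rowOffset;
    this.colOffset = colOffset;
    this.rows = rows;
    this.cols = cols;
    this.stride = stride;
  }

  public double get(int i, int j) {
    return data[(rowOffset + i) * stride + (colOffset + j)];
  }

  public void set(int i, int j, double v) {
    data[(rowOffset + i) * stride + (colOffset + j)] = v;
  }

  // A block view: no copy, so writes through the view show up in the
  // parent and block operations run against the original storage.
  public DenseMatrix viewBlock(int i0, int j0, int blockRows, int blockCols) {
    return new DenseMatrix(data, rowOffset + i0, colOffset + j0,
                           blockRows, blockCols, stride);
  }
}

As I recall, JAMA's getMatrix(i0, i1, j0, j1) copies the block instead, so
every block operation pays for an allocation and a copy.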




On 2/19/08 2:30 PM, "Paul Elschot" <[EMAIL PROTECTED]> wrote:

> One advantage of sparse matrices is that they can easily be sent
> around a Hadoop cluster in their complete form.
> 
> Would that still leave a need to distribute sparse matrix operations
> for a single matrix?
> 
> I mean, I just ran into svmlin: it uses a sparse matrix, is meant
> for text classification problems, is less than 2000 lines of C++ code,
> and works pretty fast in a few first attempts:
> http://people.cs.uchicago.edu/~vikass/svmlin.html
> svmlin has a GPL licence, but it is easy to use in binary form;
> one only needs to wrap a class around a process executing
> the svmlin program.
> 
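It really is about that easy.  The wrapper amounts to a few lines of
ProcessBuilder; a sketch (the argument list below is made up, so check
svmlin's real command line before using it):

import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStreamReader;

// Sketch of wrapping an external binary in a Java class.
public class SvmlinRunner {
  public static void run(File trainingFile)
      throws IOException, InterruptedException {
    ProcessBuilder pb = new ProcessBuilder("svmlin", trainingFile.getPath());
    pb.redirectErrorStream(true);  // fold stderr into stdout
    Process p = pb.start();

    // Drain the output so the child can't block on a full pipe.
    BufferedReader out =
        new BufferedReader(new InputStreamReader(p.getInputStream()));
    for (String line = out.readLine(); line != null; line = out.readLine()) {
      System.out.println(line);
    }

    if (p.waitFor() != 0) {
      throw new IOException("svmlin did not exit cleanly");
    }
  }
}

The ugly parts are shuttling data through temp files and parsing whatever
the program prints, but that is plumbing, not math.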
> I'm probably missing something; this sounds too easy.
> 
> Regards,
> Paul Elschot
> 
> 
> On Tuesday 19 February 2008 21:42:25, Grant Ingersoll wrote:
>> My gut feeling is that we are going to have to build our own, but I
>> don't know for sure yet.  Just seems like it would be a lot more work
>> to try to bring someone else's library into Hadoop than to just build
>> what we need in Hadoop, but I am open to suggestions.   Plus, I am
>> biased towards fewer dependencies.  Makes it easier for people to
>> adopt us and easier to manage, at the cost of some extra development
>> work.  Besides, no one sounds particularly enthusiastic about what is
>> available.
>> 
>> -Grant
>> 
>> On Feb 19, 2008, at 3:16 PM, Ted Dunning wrote:
>>> I have been unable to determine whether the Hadoop matrix is real
>>> or not.  From discussions, it definitely isn't sparse.
>>> 
>>> Sparsity is absolutely a must, and not just for text.  Really huge
>>> machine learning tends toward sparsity, regardless of area.
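To put rough numbers on that: a 10^6 x 10^5 matrix stored densely as
doubles is 800 GB, while the same matrix with an average of 100 non-zeros
per row costs about 1.2 GB in sparse form (an int index plus a double per
entry).  That is the difference between hopeless and comfortable.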
>>> 
>>> On 2/19/08 12:13 PM, "Jason Rennie" <[EMAIL PROTECTED]> wrote:
>>>> On Mon, Feb 18, 2008 at 8:43 PM, Grant Ingersoll
>>>> <[EMAIL PROTECTED]> wrote:
>>>>> yeah, we have had a few discussions on this.  There is some
>>>>> support in Hadoop already for matrix calculations via a donation,
>>>>> but I don't know that anyone has dug in too deep with it yet.  It
>>>>> may be the case that we start with something, and then decide to
>>>>> go with something else as we get more running time together on
>>>>> this stuff.
>>>> 
>>>> Is the Hadoop matrix lib sparse?  I think I took a quick look and
>>>> didn't find any indication of such.  If a significant application
>>>> area of Mahout is text, sparsity is a must.  Even non-text domains,
>>>> such as collaborative filtering, often require sparse representation
>>>> in order to scale to medium-sized data sets.  But, yeah, understood
>>>> that it's good to hit the ground running, see how far we can get,
>>>> and make changes as necessary/useful :)
>>>>> 
>>>>> Read-only access is available via:
>>>>> svn co http://svn.apache.org/repos/asf/lucene/mahout/trunk
>>>> 
>>>> Thanks.  I was trying to check out one directory too high.
>>>> 
>>>> Jason
> 
> 
