Re: Discussion Of ML environment/MR, Mahout

Sebastian Schelter Tue, 19 Mar 2013 09:12:14 -0700

Only partially. There are tools to split the interaction data into a
training and a test set, and to measure the RMSE on the test set.

However, there is no tooling for cross-validation, and nothing searches
for the regularization parameter automatically.
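
Until such tooling exists, the search has to be done by hand. Here is a
minimal sketch in plain Python of what that loop could look like. It is
not Mahout's API: split_interactions, rmse, pick_lambda and the fit
callback are hypothetical names made up for illustration, and it uses a
single held-out test set rather than real cross-validation.

```python
import math
import random


def split_interactions(interactions, test_fraction=0.1, seed=42):
    """Randomly assign each (user, item, rating) triple to training or test."""
    rng = random.Random(seed)
    training, test = [], []
    for triple in interactions:
        (test if rng.random() < test_fraction else training).append(triple)
    return training, test


def rmse(predicted, actual):
    """Root mean squared error between two equally long, non-empty sequences."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


def pick_lambda(training, test, candidates, fit):
    """Manual grid search over regularization values.

    `fit(training, lam)` is a hypothetical callback that trains a model and
    returns a function predict(user, item). The candidate with the lowest
    RMSE on the test set is returned as (lambda, rmse).
    """
    best = None
    for lam in candidates:
        predict = fit(training, lam)
        error = rmse([predict(u, i) for u, i, _ in test],
                     [r for _, _, r in test])
        if best is None or error < best[1]:
            best = (lam, error)
    return best
```

In practice fit would wrap a full ALS run with the given lambda, so each
candidate costs as much as training the model once.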

On 19.03.2013 16:58, Dmitriy Lyubimov wrote:
> Does it provide a search for the optimum fit, aka regularization rate?
> On Mar 19, 2013 8:10 AM, "Sebastian Schelter" <[email protected]> wrote:
> 
>> Played a little more with the code, it works astonishingly well. I was
>> totally off in my expectations.
>>
>> I was able to run an iteration of ALS (two map-only jobs) on the Yahoo
>> Songs dataset (700M interactions) in less than 2 minutes.
>>
>>
>> On 14.03.2013 17:02, Sean Owen wrote:
>>> On Wed, Mar 13, 2013 at 7:41 PM, Sebastian Schelter <[email protected]> wrote:
>>>
>>>> Hadoop has to reschedule every iteration as a separate job, reread the
>>>> input data from disk and write the iteration's result to HDFS. In fact, an
>>>> ALS iteration does each of these things twice, as it needs two M/R
>>>> jobs. GraphLab/Giraph/Stratosphere, on the other hand, have to do none
>>>> of these three things (GraphLab doesn't even do synchronous iterations),
>>>> and I highly doubt that a Hadoop implementation can reach comparable
>>>> performance.
>>>>
>>>
>>> That's all true but would you imagine I/O is 97.5% of the run-time? A
>>> 100-feature vector is 400 bytes, but to compute an update you need to
>>> invert a 100x100 matrix. I can't see the former taking 40x longer than
>>> the latter. That's why I bet you'll find the current implementation is
>>> nothing like 40x slower.
>>>
>>> 2x? maybe. And 2x is nothing to sneeze at!
>>>
>>
>>
>
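
The back-of-envelope numbers in Sean's quoted message can be reproduced
in a few lines of Python. This is only a sketch: the 4-byte float size
and the cubic cost estimate for a dense inversion are assumptions, not
measurements taken from the Mahout code.

```python
def vector_bytes(features, bytes_per_float=4):
    """Size of one feature vector, assuming single-precision floats."""
    return features * bytes_per_float


def io_share(slowdown):
    """Fraction of run-time that overhead must account for if the job
    is `slowdown` times slower than the pure computation."""
    return 1.0 - 1.0 / slowdown


def inversion_cost(features):
    """Order-of-magnitude operation count for inverting a dense
    features x features matrix (cubic in the number of features)."""
    return features ** 3


print(vector_bytes(100))    # bytes moved per feature vector
print(io_share(40))         # share of run-time implied by a 40x slowdown
print(inversion_cost(100))  # rough operation count per update
```

A 40x gap would mean 39 parts overhead to 1 part computation, which is
where the 97.5% figure comes from.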
