[ https://issues.apache.org/jira/browse/MAHOUT-308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danny Leshem updated MAHOUT-308:
--------------------------------

    Status: Patch Available  (was: Open)

Following email correspondence with Jake, attached is a suggested patch to 
solve this issue.

The general idea was to define a new VectorIterableWriter that allows 
sequentially writing vectors to some underlying storage and then constructing 
a VectorIterable over them when done. Two implementations are provided: 
RowMatrixWriter, which uses a given in-memory matrix as storage, and 
DistributedRowMatrixWriter, which uses a DistributedRowMatrix. The algorithm 
was then modified to use a VectorIterableWriter for its temporary storage and 
its output, instead of a huge in-memory matrix.
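
For reference, here is a minimal sketch of the kind of interface I mean; the 
actual method names and signatures in the attached patch may differ, so treat 
this as an illustration rather than the patched API:

import java.io.IOException;

import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorIterable;

/**
 * Illustrative sketch only: sequentially writes row vectors to some backing
 * storage (an in-memory matrix, a DistributedRowMatrix, ...) and exposes the
 * written rows as a VectorIterable once writing is finished.
 */
public interface VectorIterableWriter {

  /** Append the next row vector to the underlying storage. */
  void write(Vector vector) throws IOException;

  /** Finish writing and return a VectorIterable over all written rows. */
  VectorIterable getVectorIterable() throws IOException;
}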

The patch also partially fixes MAHOUT-369: the returned eigenvalues should now 
correspond to the eigenvectors. Still, one fewer of each is returned (see the 
TODO in the code; removing the "-1" fails the unit tests, and I haven't looked 
into why).

Two issues:
1) Existing unit tests pass. However, as noted in MAHOUT-369, the unit tests 
for this package are far from complete. Unfortunately, my usual datasets were 
rendered unusable by recent changes to Mahout's vector serialization, and I 
haven't had the time to generate synthetic ones.

2) With this patch, the memory issue should be a thing of the past. However, 
with extremely large datasets a new computational issue may surface: the code 
iterates over a large disk-based dataset 'desiredRank' times (see the loop 
right below the TODO). This could be worked around by rewriting that loop as a 
MapReduce job, but that is outside the scope of this patch.

Jake, I'd appreciate any input you may have, and it would be very reassuring 
if you could find the time to run some tests on real data. And of course, your 
"seal of approval" if you think it does the trick. I might have some more time 
to work on it this Sunday (GMT+2), so any input before then would be greatly 
appreciated.

> Improve Lanczos to handle extremely large feature sets (without hashing)
> ------------------------------------------------------------------------
>
>                 Key: MAHOUT-308
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-308
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.3
>         Environment: all
>            Reporter: Jake Mannix
>            Assignee: Jake Mannix
>             Fix For: 0.4
>
>
> DistributedLanczosSolver currently keeps all Lanczos vectors in memory on the 
> driver (client) computer while Hadoop is iterating.  The memory requirements 
> for this are (desiredRank) * (numColumnsOfInput) * 8 bytes, which for 
> desiredRank = a few hundred starts to cap out usefulness at 
> some-small-number * millions of columns for most commodity hardware.
> The solution (without doing stochastic decomposition) is to persist the 
> Lanczos basis to disk, except for the most recent two vectors.  Some care 
> must be taken in the "orthogonalizeAgainstBasis()" method call, which uses 
> the entire basis.  This part would be slower this way.
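
For a sense of scale, a quick back-of-the-envelope check of the estimate quoted 
above; the rank and column counts below are illustrative, not taken from the 
issue:

// Memory held on the driver: (desiredRank) * (numColumnsOfInput) * 8 bytes.
// The concrete numbers are only an example of the regime described above.
public class LanczosBasisMemory {
  public static void main(String[] args) {
    int desiredRank = 300;        // "a few hundred"
    long numColumns = 5000000L;   // millions of columns
    long bytes = desiredRank * numColumns * 8L;
    System.out.printf("Lanczos basis on the driver: ~%.1f GB%n", bytes / 1e9);
    // prints ~12.0 GB, well beyond a typical commodity driver heap
  }
}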

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
