[
https://issues.apache.org/jira/browse/MAHOUT-308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987036#action_12987036
]
Sean Owen commented on MAHOUT-308:
----------------------------------
This one's also going stale. Jake, do you have thoughts on this? I imagine the
patch needs to be updated again, but it's worth discussing whether this is
something you'd like to commit before making that effort.
> Improve Lanczos to handle extremely large feature sets (without hashing)
> ------------------------------------------------------------------------
>
> Key: MAHOUT-308
> URL: https://issues.apache.org/jira/browse/MAHOUT-308
> Project: Mahout
> Issue Type: Improvement
> Components: Math
> Affects Versions: 0.3
> Environment: all
> Reporter: Jake Mannix
> Assignee: Jake Mannix
> Fix For: 0.5
>
> Attachments: MAHOUT-308.patch
>
>
> DistributedLanczosSolver currently keeps all Lanczos vectors in memory on the
> driver (client) computer while Hadoop is iterating. The memory requirement
> for this is (desiredRank) * (numColumnsOfInput) * 8 bytes, which, for
> desiredRank = a few hundred, starts to cap out usefulness at
> some-small-number * millions of columns on most commodity hardware.
> The solution (without doing stochastic decomposition) is to persist the
> Lanczos basis to disk, except for the most recent two vectors. Some care
> must be taken in the "orthogonalizeAgainstBasis()" method call, which uses
> the entire basis. That step would be slower, since each persisted basis
> vector has to be read back from disk.
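
For illustration only, here is a minimal sketch of the idea in the description,
assuming dense double[] vectors, one file per basis vector, and unit-length
basis vectors. The class name, file naming, and helper methods are all
hypothetical; this is not the DistributedLanczosSolver code or the attached
patch. With desiredRank = 300 and 10 million columns, the in-memory basis needs
300 * 10,000,000 * 8 bytes, roughly 24 GB, whereas the disk-backed loop below
holds only the vector being orthogonalized plus one basis vector at a time.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch; names and on-disk layout are assumptions, not Mahout APIs.
final class DiskBackedBasisSketch {

    // Bytes needed to hold every basis vector densely in memory:
    // desiredRank * numColumns * 8 bytes per double.
    static long inMemoryBytes(long desiredRank, long numColumns) {
        return desiredRank * numColumns * Double.BYTES;
    }

    // Scratch directory for the persisted basis (assumption: local disk).
    static Path newTempDir() {
        try {
            return Files.createTempDirectory("lanczos-basis");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Write basis vector number `index` to its own file.
    static void persist(Path dir, int index, double[] v) {
        try (DataOutputStream out = new DataOutputStream(new BufferedOutputStream(
                Files.newOutputStream(dir.resolve("basis-" + index))))) {
            out.writeInt(v.length);
            for (double x : v) {
                out.writeDouble(x);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read basis vector number `index` back from disk.
    static double[] load(Path dir, int index) {
        try (DataInputStream in = new DataInputStream(new BufferedInputStream(
                Files.newInputStream(dir.resolve("basis-" + index))))) {
            double[] v = new double[in.readInt()];
            for (int i = 0; i < v.length; i++) {
                v[i] = in.readDouble();
            }
            return v;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Gram-Schmidt pass: subtract from v its projection onto each persisted
    // basis vector (assumed unit length), loading one vector at a time.
    static void orthogonalizeAgainstBasis(double[] v, Path dir, int basisSize) {
        for (int k = 0; k < basisSize; k++) {
            double[] b = load(dir, k);
            double dot = 0.0;
            for (int i = 0; i < v.length; i++) {
                dot += v[i] * b[i];
            }
            for (int i = 0; i < v.length; i++) {
                v[i] -= dot * b[i];
            }
        }
    }
}
```

The trade-off matches the one noted in the description: each orthogonalization
pass costs one sequential read of the whole basis from disk, in exchange for
driver memory that no longer grows with desiredRank.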