[ 
https://issues.apache.org/jira/browse/MAHOUT-1884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15543437#comment-15543437
 ] 

Dmitriy Lyubimov commented on MAHOUT-1884:
------------------------------------------



Which API is this about specifically?

Wrapping an existing RDD (the drmWrap() API) already supports this.
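As a rough illustration of the point above, a minimal sketch of wrapping an RDD with known dimensions might look like this (assuming the Mahout Spark bindings; the exact drmWrap() signature and defaults may differ by version, and `rdd` here is a hypothetical pre-built DrmRdd):

```scala
import org.apache.mahout.math._
import org.apache.mahout.sparkbindings._

// rdd: a DrmRdd[Int] built elsewhere, e.g. synthetically generated,
// whose dimensions the caller already knows.
val drm = drmWrap(rdd, nrow = 1000L, ncol = 50)

// nrow/ncol now report the supplied values, so the optimizer does not
// need a counting pass over (or implicit caching of) the underlying RDD.
```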

Also note that for DRMs read off disk, these are one-pass computations that cost
no more than RDD.count(). Since the obvious intent of calling dfsRead() on a
dataset is to use it, the loading and caching do no harm, as that is what would
happen anyway.

Also, matrix dimensions are the most obvious properties, but not everything the
optimizer may need to analyze (lazily) about the dataset. There are more
heuristics about datasets that drmWrap() accepts (and even more that it
doesn't).

If we are talking about cases where drmWrap() cannot be used for some reason,
we should probably request metadata equivalent to what drmWrap() accepts, not
just ncol and nrow.

> Allow specification of dimensions of a DRM
> ------------------------------------------
>
>                 Key: MAHOUT-1884
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1884
>             Project: Mahout
>          Issue Type: Improvement
>    Affects Versions: 0.12.2
>            Reporter: Sebastian Schelter
>            Assignee: Sebastian Schelter
>            Priority: Minor
>
> Currently, in many cases, a DRM must be read to compute its dimensions when a 
> user calls nrow or ncol. This also implicitly caches the corresponding DRM.
> In some cases, the user actually knows the matrix dimensions (e.g., when the 
> matrices are synthetically generated, or when some metadata about them is 
> known). In such cases, the user should be able to specify the dimensions upon 
> creating the DRM and the caching should be avoided. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)