[ https://issues.apache.org/jira/browse/MAHOUT-1884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15546663#comment-15546663 ]

Dmitriy Lyubimov edited comment on MAHOUT-1884 at 10/4/16 9:10 PM:
-------------------------------------------------------------------

drmWrap is not internal in the least (which is why it is not package-private). 
It is public and intended for plugging external general sources into the input 
barrier of the optimizer.
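
To illustrate, a minimal sketch of plugging an external source in through 
drmWrap against the Spark bindings (the nrow/ncol parameter names here are 
recalled from memory; check the sparkbindings package object for the exact 
signature):

{code}
import org.apache.mahout.math.{Vector => MVector}
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.sparkbindings._

// Any externally produced RDD of (key, row-vector) pairs can enter the
// optimizer's input barrier through drmWrap; sc is a live SparkContext.
val rdd = sc.parallelize(Seq[(Int, MVector)](
  0 -> dvec(1, 2, 3),
  1 -> dvec(4, 5, 6)))

// Supplying the geometry up front spares a counting pass over the data --
// but, per below, that is a hint to the optimizer, not a no-cache contract.
val drmA = drmWrap(rdd, nrow = 2, ncol = 3)
{code}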

Loading into memory would happen anyway. Caching is not necessarily going to 
happen -- but it is not guaranteed not to happen either; there's no such 
contract. 

Materially, it only makes any difference if the input is larger than available 
cluster capacity, which I have yet to encounter, as algebraic tasks are CPU- 
and IO-bound, not memory-bound. Usually we run out of IO and CPU much sooner 
than we run out of memory, which makes this situation pragmatically unrealistic. 

Note that the optimizer should -- and will -- retain control over caching. We 
don't have an explicit caching API except for checkpoint "hints", and even that 
is only a hint, not a guarantee. Giving the optimizer some heuristics about a 
dataset doesn't guarantee that it won't compute others, or won't cache or 
sample for some other reason, now or in the future. 
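
Concretely, the checkpoint hint is the only caching knob the DSL exposes; a 
sketch of how it is used, continuing with drmA from the sketch above:

{code}
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._

// The hint travels with the checkpoint. The optimizer may honor it, ignore
// it, or cache other intermediate results of its own accord -- it is a
// suggestion, not a contract.
val drmAtA = (drmA.t %*% drmA).checkpoint(CacheHint.MEMORY_ONLY)
{code}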

This situation is fine, as caching is one of the functions of the optimizer, as 
much as choosing degrees of parallelization, product task sizes, or operators 
to execute. Making those choices automatically is, actually, the point. As long 
as the optimizer makes good enough choices, that should be OK. 

Bottom line: I don't see harm in adding _optional_ ncol and nrow to drmDfsRead 
specifically. But I don't see a tangible benefit either. There's possibly only 
a slight benefit right now (there being no no-cache or no-sample guarantee), 
and it will likely only decrease in the future. I am fine with it as long as 
it's understood there's no "no-cache" contract anywhere.



> Allow specification of dimensions of a DRM
> ------------------------------------------
>
>                 Key: MAHOUT-1884
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1884
>             Project: Mahout
>          Issue Type: Improvement
>    Affects Versions: 0.12.2
>            Reporter: Sebastian Schelter
>            Assignee: Sebastian Schelter
>            Priority: Minor
>
> Currently, in many cases, a DRM must be read to compute its dimensions when a 
> user calls nrow or ncol. This also implicitly caches the corresponding DRM.
> In some cases, the user actually knows the matrix dimensions (e.g., when the 
> matrices are synthetically generated, or when some metadata about them is 
> known). In such cases, the user should be able to specify the dimensions upon 
> creating the DRM and the caching should be avoided. 


