> On Jul 16, 2014, at 8:34 AM, Anand Avati <[email protected]> wrote:
>
> On Wed, Jul 16, 2014 at 7:53 AM, Pat Ferrel <[email protected]> wrote:
>> There IS no issue with nrow being a lazy val. I never touch it; read below.
>
> The value itself may not be immutable. But it sounds like the same matrix
> would return different values for nrow() depending on when you called it.
> That sounds very much like a problem if the same matrix is part of two
> separate Spark graph hierarchies, where each is making a different assumption
> about its cardinality.
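The lazy-val concern quoted above can be illustrated with a minimal sketch. This is plain Scala, not Mahout code, and `ToyDrm` is a hypothetical stand-in: a lazily computed row count is frozen at first access, so two consumers reading it at different times can see different "cardinalities" for what is logically the same matrix.

```scala
// Sketch (hypothetical ToyDrm class, not the real CheckpointedDrmSpark):
// a lazy val is computed once, on first access, then cached forever.
class ToyDrm(var rows: Vector[Vector[Double]]) {
  lazy val nrow: Int = rows.length
}

object LazyValDemo extends App {
  val m = new ToyDrm(Vector(Vector(1.0), Vector(2.0)))

  // Mutate the backing data BEFORE nrow is ever read: nrow sees the new size.
  m.rows = m.rows :+ Vector(3.0)
  assert(m.nrow == 3)

  // Mutate AFTER nrow has been read: the cached value is now stale.
  m.rows = m.rows :+ Vector(4.0)
  assert(m.nrow == 3)        // still the cached 3, not 4
  assert(m.rows.length == 4) // the real row count has moved on
}
```

This is exactly the hazard Avati describes: whichever consumer touches nrow first fixes its value for everyone.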
My understanding of a CheckpointedDrmSpark is to assure that the rdd is checkpointed, so there is no optimizer graph at that point. Yes, it would produce different results, but so would the matrix pre- and post-rbind. There are several matrix reads that create one dictionary for all rows in all matrices. The length of this dictionary is the row cardinality for all matrices. Each CheckpointedDrmSpark has the rows with Int IDs/keys for the data it knows about, and so on. The cardinality is changed before any operation is performed.

> creating a new matrix val is fine if it doesn’t cause a new rdd to be created

I’ll look into that.

>> rbind as I read it requires me to construct the rows to be added. I don’t
>> know what their keys are and don’t want to calculate them. If I’m right about
>> how the math works the actual rows are not needed.
>
> See the example code to add empty rows. Numerical keys are auto computed
> based on the sizes.

I’m pretty sure that’s impossible for this case. The only way to calculate the true cardinality is to read in all data for all matrices. The only way to know which IDs are _missing_ from a single DRM is to read them all in. The internal Int IDs are different for each matrix, and there is no telling which is in which. I’d have to keep the unified row dictionary and some collection of keys for each matrix, then subtract the smaller from the unified to get the missing keys. Only then could I create empty rows with rbind using the missing keys.

The users you see for a specific interaction recorded in the several matrices may bear no relationship to any of the others; in fact, that’s what cooccurrence may tell you. Any operation performed on the drms before these hypothetical rbinds are performed will be wonky anyway, so I miss the point about that. If my solution works (forcing cardinality without actually adding empty rows) it is far, far simpler than keeping these extra collections and calculating the IDs for rbind. If I’m wrong, what’s described above should work.
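The bookkeeping described above can be sketched in a few lines. This is plain Scala collections, not the actual Mahout/Spark API, and all names (`unified`, `drmKeys`) are hypothetical: a unified row dictionary maps external user IDs to Int keys, each matrix knows only the keys it has rows for, and the missing keys for any one matrix are a set difference against the unified dictionary.

```scala
// Sketch of the two alternatives discussed (all names hypothetical).
object MissingKeysSketch extends App {
  // Unified dictionary built by reading all matrices: external ID -> Int key.
  val unified: Map[String, Int] = Map("u1" -> 0, "u2" -> 1, "u3" -> 2, "u4" -> 3)

  // Int keys actually present in one DRM (it saw only u1 and u3).
  val drmKeys: Set[Int] = Set(0, 2)

  // The rbind route: these keys would each need an empty row appended.
  val missing: Set[Int] = unified.values.toSet -- drmKeys
  assert(missing == Set(1, 3))

  // The shortcut route: the forced cardinality is just the unified
  // dictionary's size; no empty rows are ever materialized.
  val nrow: Int = unified.size
  assert(nrow == 4)
}
```

The cost difference is clear from the sketch: the rbind route requires keeping `drmKeys` per matrix and computing the difference, while the shortcut needs only `unified.size`.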
This is for one rather unique pipeline, so my shortcut may not be safe for any other use; that is my worry. We’ll find out.
