In the application, the number of rows will only ever be increased, by adding
blank rows. I don’t think a shuffle is necessary in this case because there are
no actual rows, no data in the DRM; the blank rows are needed only to make the
cardinalities match, and the row IDs take care of matching the data. Maybe
calling it something else is a good idea, to emphasize the special case it is
meant for. I went over this with Dmitriy and, though I haven’t checked actual
values on large datasets, it works.
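Roughly, in DSL terms, the padding amounts to rewrapping the same rows with a
larger nrow. A minimal sketch against the Spark bindings follows; the exact
drmWrap parameter list (including the canHaveMissingRows flag) is from memory
and should be checked against the bindings:

    import org.apache.mahout.math.drm._
    import org.apache.mahout.sparkbindings._

    // rows holds the real data, keyed by int row ids in 0 until n.
    // Wrapping the same RDD with nrow = m (m > n) yields a DRM whose
    // reported cardinality matches the other operand without
    // materializing any blank rows: rows n..m-1 simply have no
    // entries, and the int keys keep the real data lined up.
    def padRows(rows: DrmRdd[Int], m: Long, ncol: Int): CheckpointedDrm[Int] =
      drmWrap(rdd = rows, nrow = m, ncol = ncol, canHaveMissingRows = true)

No shuffle happens here because nothing moves; only the reported cardinality
changes.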
 
On Jul 14, 2014, at 11:04 AM, Anand Avati <[email protected]> wrote:

On Mon, Jul 14, 2014 at 10:58 AM, Ted Dunning <[email protected]> wrote:
On Mon, Jul 14, 2014 at 9:47 AM, Pat Ferrel <[email protected]> wrote:

> BTW that requires that drm.nrow be mutable. It is defined as immutable
> in the DSL, so this will require a change to several traits. I’ve done this
> but am still trying to decide on the cleanest approach.


Hmmm.... immutability has lots of virtues.  And changing nrow is just the
tip of the iceberg.  You also have to shuffle the rows to match the row
partitioning between the two matrices.

Or it requires more than one pass through the data.  Since you have to read
both matrices before you can deal with either, and since one matrix is
likely to be shuffled relative to the other, might it not be better to
either do two read passes or pay the cost of shuffling the matrices after
getting a consensus view?  Note that the second read pass will have to do a
shuffle anyway, so the only savings from doing two passes is decreased
memory usage.
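To make that trade-off concrete at the raw Spark level (plain keyed RDDs
rather than the DRM wrapper; the key type and partition count here are just
placeholders):

    import org.apache.spark.HashPartitioner
    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD
    import org.apache.mahout.math.Vector

    // Pay the shuffle cost once: move both row RDDs onto the same
    // partitioner (the "consensus view"), after which a row-wise join
    // is partition-local and triggers no further shuffle.
    def coPartition(a: RDD[(Int, Vector)], b: RDD[(Int, Vector)],
                    numParts: Int): RDD[(Int, (Vector, Vector))] = {
      val part = new HashPartitioner(numParts)
      val aP = a.partitionBy(part)  // shuffle #1
      val bP = b.partitionBy(part)  // shuffle #2
      aP.join(bP)                   // same partitioner on both sides: no shuffle
    }

Doing two read passes instead avoids holding both partitioned copies at once,
which is the memory savings I mean above.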

Anand,

I think I remember you were addressing a shuffle problem in some of your
earlier work.  What did you conclude?

I think the larger question is: what does it mean to make drm.nrow mutable? If 
it is changed to a smaller value, which rows do you "sacrifice"? Why not just do 
a RowRange operation to get a new DRM with fewer rows (instead of mutating the 
given DRM)? After that, if you care specifically about partitioning, the Par 
operator can shuffle the data for you.
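In sketch form (the row-range slice is the operation I mean; the par signature
here is from memory and may differ):

    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._

    // Instead of mutating drmA.nrow downward, take an immutable slice:
    // a RowRange op yields a new DRM over the first k rows and leaves
    // the original untouched.
    def shrink(drmA: DrmLike[Int], k: Int): DrmLike[Int] = {
      val drmFewer = drmA(0 until k, 0 until drmA.ncol)
      // If the partitioning matters, reshuffle explicitly with Par
      // (assumed min-partitions argument).
      drmFewer.par(min = 16)
    }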
