In the application, the number of rows will only ever be increased, by adding blank rows. I don’t think a shuffle is necessary in this case because there is no actual row, no data in the DRM; the change is only needed to make the cardinalities match, and the IDs will take care of matching the data. Maybe calling it something else is a good idea, to emphasize the special case it’s meant for. I went over this with Dmitriy and, though I haven’t checked actual values on large datasets, it works.
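
Roughly the shape of what I mean, sketched over the Spark bindings (padRows is just an illustrative name here, and this assumes the drmWrap(rdd, nrow, ncol) wrapper; the real change would live in the DSL traits rather than in a helper like this):

import org.apache.mahout.math.Vector
import org.apache.mahout.math.drm.CheckpointedDrm
import org.apache.mahout.sparkbindings._
import org.apache.spark.rdd.RDD

// Sketch only: grow a DRM's row cardinality without adding any physical rows.
// The DRM's rows are a sparse RDD of (rowKey, vector) pairs, so a "blank" row
// needs no representation at all; only the declared nrow changes, which is
// why no shuffle should be needed.
def padRows(rows: RDD[(Int, Vector)], ncol: Int, targetNRow: Long): CheckpointedDrm[Int] = {
  // In this use case rows are only ever added, never dropped.
  // Re-wrap the same, unmoved partitions while declaring the larger row count.
  drmWrap(rdd = rows, nrow = targetNRow, ncol = ncol)
}
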
On Jul 14, 2014, at 11:04 AM, Anand Avati <[email protected]> wrote:

> On Mon, Jul 14, 2014 at 10:58 AM, Ted Dunning <[email protected]> wrote:
>
>> On Mon, Jul 14, 2014 at 9:47 AM, Pat Ferrel <[email protected]> wrote:
>>
>>> BTW that requires that drm.nrow be mutable. That is defined as immutable
>>> in the DSL and so will require a change to several traits. I’ve done this
>>> but am still trying to decide the cleanest.
>>
>> Hmmm.... immutability has lots of virtues. And changing nrows is just the
>> tip of the iceberg. You also have to shuffle the rows to match the row
>> partitioning between the two matrices, or it requires more than one pass
>> through the data. Since you have to read both matrices before you can deal
>> with either, and since one matrix is likely to be shuffled relative to the
>> other, might it just be better either to do two read passes or to pay the
>> cost of shuffling the matrices after getting a consensus view? Note that the
>> second read pass will have to do a shuffle anyway, so the only saving from
>> doing two passes is reduced memory usage.
>>
>> *Anand,* I think I remember you were addressing a shuffle problem in some
>> of your earlier work. What did you conclude?
>
> I think the larger question is what it means to make drm.nrow mutable. If it
> is changed to a smaller value, which rows do you "sacrifice"? Why not just do
> a RowRange operation to get a new DRM with fewer rows (instead of mutating
> the given drm)? After that, if you care specifically about partitioning, the
> Par operator can shuffle the data for you.
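
In DSL terms, Anand's suggestion looks roughly like the sketch below (assuming the row-range slicing form drmA(rowRange, colRange) and a par(exact = ...) operator on DrmLike; shrinkTo is just an illustrative name):

import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._

// Sketch: rather than mutating drmA.nrow downward, take a row-range view.
// The slice is just another logical operator in the plan; drmA stays immutable.
def shrinkTo(drmA: DrmLike[Int], keepRows: Int, parts: Int): DrmLike[Int] = {
  // Row-range slice: keep rows [0, keepRows) and all columns.
  val sliced = drmA(0 until keepRows, 0 until drmA.ncol)
  // If the partitioning matters (e.g. to line it up with another matrix),
  // request the shuffle explicitly rather than getting one as a side effect.
  sliced.par(exact = parts)
}
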
