So let me see if I am following: you create an int-keyed matrix with gaps in the int key sequence, you feel that this is somehow a problem, and you are trying to insert the missing rows? Is that it?
If yes, why do you believe you need to insert missing rows?

On Mon, Jul 21, 2014 at 2:26 PM, Pat Ferrel <pat.fer...@gmail.com> wrote:

> rbind appends another matrix, which seems wrong in this case anyway since
> the missing rows can be anywhere in the Int key sequence.
>
> However, the existing CheckpointedDrm already has an rdd and non-empty
> partitions and so should work for %*% and .t, or there's a bug, right?
>
> This is why I wanted to create an op or method that only changes nrow. The
> current prototype/hack implementation in CheckpointedDrmSpark does this:
>
>   override def rowCardinality(n: Int): CheckpointedDrm[K] = {
>     assert(n > -1)
>     new CheckpointedDrmSpark[K](rdd, n, ncol, _cacheStorageLevel)
>   }
>
> Maybe it should be called something else or be an op, and maybe it should
> only work for Int keys. Given all that, doing the above should work or
> there's a bug in the math, right?
>
>
> On Jul 21, 2014, at 2:12 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>
> For the record, parallelizeEmpty does not create a partition-less rdd -- it
> does create empty rows. The reason is that partitions are not just data,
> they are task embodiment as well. So it is a way, e.g., to generate a
> random matrix in a distributed way.
>
> I am also not 100% positive that a lack of rows will not present a problem.
>
> I know that empty partitions present a problem -- and if any technique
> implies that row-less partitions result, this may be a problem (in some
> situations they are explicitly filtered out post-op).
>
>
> On Mon, Jul 21, 2014 at 2:08 PM, Pat Ferrel <pat.fer...@gmail.com> wrote:
>
> > I think we are straight on this finally.
> >
> > DRMs and rdds don't need to embody every row, at least when using
> > sequential Int keys; they are not corrupt if some rows are missing.
> >
> > Therefore rbind of drmParallelizeEmpty will work since it will only
> > create a CheckpointedDrm where nrow is modified. It will not modify
> > the rdd.
> >
> > If we had to modify the rdd, rbind would not work since the missing keys
> > are interspersed throughout the matrices, not all at the end. So the
> > hypothetically created rdd elements would have had the wrong Int keys.
> > But no need to worry about this now--no need to modify the rdd.
> >
> >
> > On Jul 21, 2014, at 1:05 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> >
> > Agree, rbind and cbind are the ways to tweak geometry.
> >
> >
> > On Mon, Jul 21, 2014 at 12:24 PM, Anand Avati <av...@gluster.org> wrote:
> >
> >> The summary of the discussion is:
> >>
> >> Pat encountered a scenario where matrix multiplication was erroring
> >> because of mismatching A's rows and B's cols. His solution was to
> >> fix up/fudge A's nrow value to force the multiplication to happen. I
> >> think such a fixup of rows is better done through an rbind() like
> >> operator (with an empty B matrix) instead of "editing" the nrow member.
> >> However, the problem seems to be that A's rows are fewer than desired
> >> because they have missing rows (i.e., the int key sequence has holes).
> >> I think such an object is corrupted to begin with. And even if you were
> >> to fudge nrow, OpAewScalar gives math errors (as demonstrated in the
> >> code example), and AewB and CbindAB give runtime exceptions on the
> >> cogroup() RDD api. I guess Pat still feels these errors/exceptions
> >> must be fixed by filing a Jira.
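To make the rbind route that Anand and Dmitriy describe concrete, a minimal sketch in the Samsara DSL might look like the following. The helper name is made up, and it assumes an rbind operator on Int-keyed DRMs plus drmParallelizeEmpty, as discussed above; exact package paths vary by Mahout version.

    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._

    // Hypothetical helper (not from the patch): pad drmA's row cardinality up to
    // targetRows by appending an all-zero DRM instead of editing nrow directly.
    def padRowCardinality(drmA: DrmLike[Int], targetRows: Int)
                         (implicit ctx: DistributedContext): DrmLike[Int] = {
      val missing = targetRows - drmA.nrow.toInt
      require(missing >= 0, "target cardinality must be >= current nrow")
      if (missing == 0) drmA
      else drmA rbind drmParallelizeEmpty(missing, drmA.ncol)
    }

As Pat points out above, this only appends zero rows at the end of the key range, so it fixes the declared cardinality but does nothing for keys that are missing in the middle of the sequence.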
> >>
> >>
> >> On Mon, Jul 21, 2014 at 11:49 AM, Dmitriy Lyubimov <dlie...@gmail.com>
> >> wrote:
> >>
> >>> Sorry, I did not bear with all the discussion, but this change doesn't
> >>> make sense to me.
> >>>
> >>> It is not algebraic, it is not R, and it also creates an algebraically
> >>> incorrect object.
> >>>
> >>> On the topic of the "empty" rows, remember they are not really empty;
> >>> they are matrices with 0.0 elements, and "emptiness" is just a
> >>> compaction scheme that also happens to have some optimization meaning
> >>> for various algebraic operations.
> >>>
> >>> So an "empty" matrix is really an absolutely valid matrix. It may cause
> >>> various mathematical exceptions since it is rank-deficient, but there
> >>> are no "mechanical" errors with that representation, so I am not sure
> >>> what this discussion was all about (but then again, I had no time to
> >>> read it all).
> >>>
> >>>
> >>> On Mon, Jul 21, 2014 at 11:06 AM, ASF GitHub Bot (JIRA) <j...@apache.org>
> >>> wrote:
> >>>
> >>>> [
> >>>> https://issues.apache.org/jira/browse/MAHOUT-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068936#comment-14068936
> >>>> ]
> >>>>
> >>>> ASF GitHub Bot commented on MAHOUT-1541:
> >>>> ----------------------------------------
> >>>>
> >>>> Github user pferrel commented on a diff in the pull request:
> >>>>
> >>>>     https://github.com/apache/mahout/pull/31#discussion_r15184775
> >>>>
> >>>> --- Diff:
> >>>> spark/src/main/scala/org/apache/mahout/sparkbindings/drm/CheckpointedDrmSpark.scala
> >>>> ---
> >>>> @@ -46,6 +46,19 @@ class CheckpointedDrmSpark[K: ClassTag](
> >>>>      private var cached: Boolean = false
> >>>>      override val context: DistributedContext = rdd.context
> >>>>
> >>>> +  /**
> >>>> +   * Adds the equivalent of blank rows to the sparse CheckpointedDrm, which only changes the
> >>>> +   * [[org.apache.mahout.sparkbindings.drm
> >>>> +.CheckpointedDrmSpark#nrow]] value.
> >>>> +   * No physical changes are made to the underlying rdd; no blank rows are added as would be done with rbind(blankRows).
> >>>> +   * @param n number to increase row cardinality by
> >>>> +   * @note should be done before any BLAS optimizer actions are performed on the matrix or you'll get unpredictable
> >>>> +   *       results.
> >>>> +   */
> >>>> +  override def addToRowCardinality(n: Int): CheckpointedDrm[K] = {
> >>>> +    assert(n > -1)
> >>>> +    new CheckpointedDrmSpark[K](rdd, nrow + n, ncol, _cacheStorageLevel)
> >>>> +  }
> >>>> --- End diff --
> >>>>
> >>>> I see no fundamental reason for these not to work, but it may not be
> >>>> part of the DRM contract. So maybe I'll make a feature request Jira to
> >>>> support this.
> >>>>
> >>>> In the meantime, rbind will not solve this because A will have missing
> >>>> rows at the end but B may have them throughout--let alone some future
> >>>> C. So I think reading all the data into one drm with one row and column
> >>>> id space and then chopping it into two or more drms based on column
> >>>> ranges should give us empty rows where they are needed (I certainly
> >>>> hope so or I'm in trouble). Will have to keep track of which column ids
> >>>> go in which slice, but that's doable.
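For context on how the method in the diff would be used, here is a minimal usage sketch; drmA and drmB are illustrative Int-keyed CheckpointedDrms sharing a row id space (they are not from the PR), and B is assumed to carry the larger declared row count.

    // Reconcile declared geometry before a transpose-times product.
    val missing = (drmB.nrow - drmA.nrow).toInt
    val drmAPadded = if (missing > 0) drmA.addToRowCardinality(missing) else drmA

    // Only nrow changes; the underlying rdd is untouched, so the "added" rows
    // behave as all-zero rows in the subsequent math.
    val drmAtB = drmAPadded.t %*% drmB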
> >>>>
> >>>>
> >>>>> Create CLI Driver for Spark Cooccurrence Analysis
> >>>>> -------------------------------------------------
> >>>>>
> >>>>>                 Key: MAHOUT-1541
> >>>>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1541
> >>>>>             Project: Mahout
> >>>>>          Issue Type: New Feature
> >>>>>          Components: CLI
> >>>>>            Reporter: Pat Ferrel
> >>>>>            Assignee: Pat Ferrel
> >>>>>
> >>>>> Create a CLI driver to import data in a flexible manner, create an
> >>>>> IndexedDataset with BiMap ID translation dictionaries, call the Spark
> >>>>> CooccurrenceAnalysis with the appropriate params, then write output
> >>>>> with external IDs optionally reattached.
> >>>>> Ultimately it should be able to read input as the legacy mr does but
> >>>>> will support reading externally defined IDs and flexible formats.
> >>>>> Output will be of the legacy format or text files of the user's
> >>>>> specification with reattached Item IDs.
> >>>>> Support for legacy formats is a question; users can always use the
> >>>>> legacy code if they want this. Internal to the IndexedDataset is a
> >>>>> Spark DRM, so pipelining can be accomplished without any writing to an
> >>>>> actual file, so the legacy sequence file output may not be needed.
> >>>>> Opinions?
> >>>>
> >>>>
> >>>> --
> >>>> This message was sent by Atlassian JIRA
> >>>> (v6.2#6252)
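On the BiMap ID translation dictionaries mentioned in the issue description: purely as an illustration of that idea (this is not the IndexedDataset code, and the names are invented), external string IDs could be mapped to contiguous int DRM keys and reattached on output with something like Guava's HashBiMap.

    import com.google.common.collect.HashBiMap

    // Illustrative dictionary: external item ID <-> int key used in the DRM.
    val itemDictionary: HashBiMap[String, Integer] = HashBiMap.create[String, Integer]()

    // Return the int key for an external ID, assigning the next free key if unseen.
    def keyFor(externalId: String): Int = {
      val existing = itemDictionary.get(externalId)
      if (existing != null) existing.intValue
      else {
        val next = itemDictionary.size()
        itemDictionary.put(externalId, Int.box(next))
        next
      }
    }

    // Reattach the external ID for a DRM row/column key when writing output.
    def externalIdFor(key: Int): String = itemDictionary.inverse().get(Int.box(key))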