So let me see if I am following: you create an int-keyed matrix with gaps in the int key sequence, you feel that this is somehow a problem, and you are trying to insert the missing rows? Is that it?
If yes, why do you believe you need to insert missing rows?

On Mon, Jul 21, 2014 at 2:26 PM, Pat Ferrel <pat.fer...@gmail.com> wrote:

> rbind appends another matrix, which seems wrong in this case anyway since
> the missing rows can be anywhere in the Int key sequence.
>
> However, the existing CheckpointedDrm already has an rdd and non-empty
> partitions and so should work for %*% and .t, or there's a bug, right?
>
> This is why I wanted to create an op or method that only changes nrow. The
> current prototype/hack implementation in CheckpointedDrmSpark does this:
>
>   override def rowCardinality(n: Int): CheckpointedDrm[K] = {
>     assert(n > -1)
>     new CheckpointedDrmSpark[K](rdd, n, ncol, _cacheStorageLevel)
>   }
>
> Maybe it should be called something else or be an op, and maybe it should
> only work for Int keys. Given all that, doing the above should work or
> there's a bug in the math, right?
>
>
> On Jul 21, 2014, at 2:12 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>
> For the record, parallelizeEmpty does not create a partition-less rdd -- it
> does create empty rows. The reason is that partitions are not just data,
> they are task embodiment as well. So it is a way, e.g., to generate a
> random matrix in a distributed way.
>
> I am also not 100% positive that a lack of rows will not present a problem.
>
> I know that empty partitions present a problem -- and if any technique
> implies that row-less partitions result, this may be a problem (in some
> situations they are explicitly filtered out post-op).
>
>
> On Mon, Jul 21, 2014 at 2:08 PM, Pat Ferrel <pat.fer...@gmail.com> wrote:
>
> > I think we are straight on this finally.
> >
> > DRMs and rdds don't need to embody every row, at least when using
> > sequential Int keys; they are not corrupt if some rows are missing.
> >
> > Therefore rbind of drmParallelizeEmpty will work since it will only
> > create a CheckpointedDrm where nrow is modified. It will not modify
> > the rdd.
> >
> > If we had to modify the rdd, rbind would not work since the missing keys
> > are interspersed throughout the matrices, not all at the end. So the
> > hypothetically created rdd elements would have had the wrong Int keys.
> > But no need to worry about this now--no need to modify the rdd.
> >
> >
> > On Jul 21, 2014, at 1:05 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> >
> > Agree, rbind and cbind are the ways to tweak geometry.
> >
> >
> > On Mon, Jul 21, 2014 at 12:24 PM, Anand Avati <av...@gluster.org> wrote:
> >
> >> The summary of the discussion is:
> >>
> >> Pat encountered a scenario where matrix multiplication was erroring
> >> because of mismatching A's rows and B's cols. His solution was to
> >> fix up/fudge A's nrow value to force the multiplication to happen. I
> >> think such a fixup of rows is better done through an rbind() like
> >> operator (with an empty B matrix) instead of "editing" the nrow member.
> >> However, the problem seems to be that A's rows are fewer than desired
> >> because they have missing rows (i.e., the int key sequence has holes).
> >> I think such an object is corrupted to begin with. And even if you were
> >> to fudge nrow, OpAewScalar gives math errors (as demonstrated in the
> >> code example), and AewB and CbindAB give runtime exceptions on the
> >> cogroup() RDD api. I guess Pat still feels these errors/exceptions
> >> must be fixed by filing a Jira.
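To make the rbind route that Anand and Dmitriy describe concrete, a minimal sketch in the Samsara DSL might look like the following. The helper name is made up, and it assumes an rbind operator on Int-keyed DRMs plus drmParallelizeEmpty, as discussed above; exact package paths vary by Mahout version.

    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._

    // Hypothetical helper (not from the patch): pad drmA's row cardinality up to
    // targetRows by appending an all-zero DRM instead of editing nrow directly.
    def padRowCardinality(drmA: DrmLike[Int], targetRows: Int)
                         (implicit ctx: DistributedContext): DrmLike[Int] = {
      val missing = targetRows - drmA.nrow.toInt
      require(missing >= 0, "target cardinality must be >= current nrow")
      if (missing == 0) drmA
      else drmA rbind drmParallelizeEmpty(missing, drmA.ncol)
    }

As Pat points out above, this only appends zero rows at the end of the key range, so it fixes the declared cardinality but does nothing for keys that are missing in the middle of the sequence.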
> >>
> >>
> >> On Mon, Jul 21, 2014 at 11:49 AM, Dmitriy Lyubimov <dlie...@gmail.com>
> >> wrote:
> >>
> >>> Sorry, I did not bear with all the discussion, but this change doesn't
> >>> make sense to me.
> >>>
> >>> It is not algebraic, it is not R, and it also creates an algebraically
> >>> incorrect object.
> >>>
> >>> On the topic of the "empty" rows, remember they are not really empty;
> >>> they are matrices with 0.0 elements, and "emptiness" is just a
> >>> compaction scheme that also happens to have some optimization meaning
> >>> for various algebraic operations.
> >>>
> >>> So an "empty" matrix is really an absolutely valid matrix. It may cause
> >>> various mathematical exceptions since it is rank-deficient, but there
> >>> are no "mechanical" errors with that representation, so I am not sure
> >>> what this discussion was all about (but then again, I had no time to
> >>> read it all).
> >>>
> >>>
> >>> On Mon, Jul 21, 2014 at 11:06 AM, ASF GitHub Bot (JIRA) <j...@apache.org>
> >>> wrote:
> >>>
> >>>> [
> >>>> https://issues.apache.org/jira/browse/MAHOUT-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068936#comment-14068936
> >>>> ]
> >>>>
> >>>> ASF GitHub Bot commented on MAHOUT-1541:
> >>>> ----------------------------------------
> >>>>
> >>>> Github user pferrel commented on a diff in the pull request:
> >>>>
> >>>>     https://github.com/apache/mahout/pull/31#discussion_r15184775
> >>>>
> >>>> --- Diff:
> >>>> spark/src/main/scala/org/apache/mahout/sparkbindings/drm/CheckpointedDrmSpark.scala
> >>>> ---
> >>>> @@ -46,6 +46,19 @@ class CheckpointedDrmSpark[K: ClassTag](
> >>>>      private var cached: Boolean = false
> >>>>      override val context: DistributedContext = rdd.context
> >>>>
> >>>> +  /**
> >>>> +   * Adds the equivalent of blank rows to the sparse CheckpointedDrm, which only changes the
> >>>> +   * [[org.apache.mahout.sparkbindings.drm
> >>>> +.CheckpointedDrmSpark#nrow]] value.
> >>>> +   * No physical changes are made to the underlying rdd; no blank rows are added as would be done with rbind(blankRows).
> >>>> +   * @param n number to increase row cardinality by
> >>>> +   * @note should be done before any BLAS optimizer actions are performed on the matrix or you'll get unpredictable
> >>>> +   *       results.
> >>>> +   */
> >>>> +  override def addToRowCardinality(n: Int): CheckpointedDrm[K] = {
> >>>> +    assert(n > -1)
> >>>> +    new CheckpointedDrmSpark[K](rdd, nrow + n, ncol, _cacheStorageLevel)
> >>>> +  }
> >>>> --- End diff --
> >>>>
> >>>> I see no fundamental reason for these not to work, but it may not be
> >>>> part of the DRM contract. So maybe I'll make a feature request Jira to
> >>>> support this.
> >>>>
> >>>> In the meantime, rbind will not solve this because A will have missing
> >>>> rows at the end but B may have them throughout--let alone some future
> >>>> C. So I think reading all the data into one drm with one row and column
> >>>> id space and then chopping it into two or more drms based on column
> >>>> ranges should give us empty rows where they are needed (I certainly
> >>>> hope so or I'm in trouble). Will have to keep track of which column ids
> >>>> go in which slice, but that's doable.
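For context on how the method in the diff would be used, here is a minimal usage sketch; drmA and drmB are illustrative Int-keyed CheckpointedDrms sharing a row id space (they are not from the PR), and B is assumed to carry the larger declared row count.

    // Reconcile declared geometry before a transpose-times product.
    val missing = (drmB.nrow - drmA.nrow).toInt
    val drmAPadded = if (missing > 0) drmA.addToRowCardinality(missing) else drmA

    // Only nrow changes; the underlying rdd is untouched, so the "added" rows
    // behave as all-zero rows in the subsequent math.
    val drmAtB = drmAPadded.t %*% drmB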
> >>>>
> >>>>
> >>>>> Create CLI Driver for Spark Cooccurrence Analysis
> >>>>> -------------------------------------------------
> >>>>>
> >>>>>                 Key: MAHOUT-1541
> >>>>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1541
> >>>>>             Project: Mahout
> >>>>>          Issue Type: New Feature
> >>>>>          Components: CLI
> >>>>>            Reporter: Pat Ferrel
> >>>>>            Assignee: Pat Ferrel
> >>>>>
> >>>>> Create a CLI driver to import data in a flexible manner, create an
> >>>>> IndexedDataset with BiMap ID translation dictionaries, call the Spark
> >>>>> CooccurrenceAnalysis with the appropriate params, then write output
> >>>>> with external IDs optionally reattached.
> >>>>> Ultimately it should be able to read input as the legacy mr does but
> >>>>> will support reading externally defined IDs and flexible formats.
> >>>>> Output will be of the legacy format or text files of the user's
> >>>>> specification with reattached Item IDs.
> >>>>> Support for legacy formats is a question; users can always use the
> >>>>> legacy code if they want this. Internal to the IndexedDataset is a
> >>>>> Spark DRM, so pipelining can be accomplished without any writing to an
> >>>>> actual file, so the legacy sequence file output may not be needed.
> >>>>> Opinions?
> >>>>
> >>>>
> >>>> --
> >>>> This message was sent by Atlassian JIRA
> >>>> (v6.2#6252)
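On the BiMap ID translation dictionaries mentioned in the issue description: purely as an illustration of that idea (this is not the IndexedDataset code, and the names are invented), external string IDs could be mapped to contiguous int DRM keys and reattached on output with something like Guava's HashBiMap.

    import com.google.common.collect.HashBiMap

    // Illustrative dictionary: external item ID <-> int key used in the DRM.
    val itemDictionary: HashBiMap[String, Integer] = HashBiMap.create[String, Integer]()

    // Return the int key for an external ID, assigning the next free key if unseen.
    def keyFor(externalId: String): Int = {
      val existing = itemDictionary.get(externalId)
      if (existing != null) existing.intValue
      else {
        val next = itemDictionary.size()
        itemDictionary.put(externalId, Int.box(next))
        next
      }
    }

    // Reattach the external ID for a DRM row/column key when writing output.
    def externalIdFor(key: Int): String = itemDictionary.inverse().get(Int.box(key))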