[
https://issues.apache.org/jira/browse/MAHOUT-1660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582402#comment-14582402
]
ASF GitHub Bot commented on MAHOUT-1660:
----------------------------------------
Github user andrewmusselman commented on a diff in the pull request:
https://github.com/apache/mahout/pull/135#discussion_r32257679
--- Diff: math-scala/src/main/scala/org/apache/mahout/math/drm/DistributedEngine.scala ---
@@ -73,20 +75,39 @@ trait DistributedEngine {
   def drmDfsRead(path: String, parMin: Int = 0)(implicit sc: DistributedContext): CheckpointedDrm[_]
   /** Parallelize in-core matrix as spark distributed matrix, using row ordinal indices as data set keys. */
-  def drmParallelizeWithRowIndices(m: Matrix, numPartitions: Int = 1)
-    (implicit sc: DistributedContext): CheckpointedDrm[Int]
+  def drmParallelizeWithRowIndices(m: Matrix, numPartitions: Int = 1)(implicit sc: DistributedContext):
+    CheckpointedDrm[Int]
   /** Parallelize in-core matrix as spark distributed matrix, using row labels as a data set keys. */
-  def drmParallelizeWithRowLabels(m: Matrix, numPartitions: Int = 1)
-    (implicit sc: DistributedContext): CheckpointedDrm[String]
+  def drmParallelizeWithRowLabels(m: Matrix, numPartitions: Int = 1)(implicit sc: DistributedContext):
+    CheckpointedDrm[String]
   /** This creates an empty DRM with specified number of partitions and cardinality. */
-  def drmParallelizeEmpty(nrow: Int, ncol: Int, numPartitions: Int = 10)
-    (implicit sc: DistributedContext): CheckpointedDrm[Int]
+  def drmParallelizeEmpty(nrow: Int, ncol: Int, numPartitions: Int = 10)(implicit sc: DistributedContext):
+    CheckpointedDrm[Int]
   /** Creates empty DRM with non-trivial height */
-  def drmParallelizeEmptyLong(nrow: Long, ncol: Int, numPartitions: Int = 10)
-    (implicit sc: DistributedContext): CheckpointedDrm[Long]
+  def drmParallelizeEmptyLong(nrow: Long, ncol: Int, numPartitions: Int = 10)(implicit sc: DistributedContext):
+    CheckpointedDrm[Long]
+
+  /**
+   * Convert a non-int-keyed matrix to an int-keyed one, optionally computing a mapping from the old keys
+   * to row indices in the new one. The mapping, if requested, is returned as a 1-column matrix.
+   */
+  def drm2IntKeyed[K: ClassTag](drmX: DrmLike[K], computeMap: Boolean = false): (DrmLike[Int], Option[DrmLike[K]])
+
+  /**
+   * (Optional) Sampling operation. Consistent with Spark semantics of the same.
+   * @param drmX
+   * @param fraction
+   * @param replacement
+   * @tparam K
+   * @return
+   */
+  def drmSampleRows[K: ClassTag](drmX: DrmLike[K], fraction: Double, replacement: Boolean = false): DrmLike[K]
+
+  def drmSampleKRows[K: ClassTag](drmX: DrmLike[K], numSamples: Int, replacement: Boolean = false): Matrix
--- End diff ---
Why does this return a Matrix whereas the previous one returns DrmLike[K],
and is there a default number of samples in the previous one?
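The asymmetry the comment asks about can be sketched in plain Scala (the function names and signatures below are hypothetical stand-ins, not the Mahout API). Spark-style fraction sampling keeps each row independently with probability `fraction`, so the result size is random and unbounded, which argues for keeping it distributed (`DrmLike[K]`); exact-k sampling fixes the result size up front, so returning the sample in-core (a `Matrix`) is at least feasible:

```scala
import scala.util.Random

// Bernoulli-style fraction sampling: each row is kept independently with
// probability `fraction`, so the result size is random (roughly fraction * n).
def sampleByFraction[A](rows: Vector[A], fraction: Double, rng: Random): Vector[A] =
  rows.filter(_ => rng.nextDouble() < fraction)

// Exact-k sampling: the caller fixes the number of rows, so the sample is
// small and bounded, making an in-core result representation plausible.
def sampleKRows[A](rows: Vector[A], k: Int, replacement: Boolean, rng: Random): Vector[A] =
  if (replacement) Vector.fill(k)(rows(rng.nextInt(rows.length)))
  else rng.shuffle(rows).take(k)

val rng  = new Random(42)
val data = (1 to 1000).toVector

val byFraction = sampleByFraction(data, 0.1, rng)               // size varies around 100
val exactK     = sampleKRows(data, 5, replacement = false, rng) // exactly 5 rows
```

Note also that neither sketch supplies a default sample count for the fixed-k variant, mirroring the second part of the question: `fraction` and `numSamples` are fundamentally different parameters, and only `replacement` defaults.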
> Hadoop1HDFSUtil.readDRMHEader should be taking Hadoop conf
> ----------------------------------------------------------
>
> Key: MAHOUT-1660
> URL: https://issues.apache.org/jira/browse/MAHOUT-1660
> Project: Mahout
> Issue Type: Bug
> Components: spark
> Affects Versions: 0.10.0
> Reporter: Suneel Marthi
> Assignee: Dmitriy Lyubimov
> Priority: Minor
> Fix For: 0.10.2
>
>
> Hadoop1HDFSUtil.readDRMHEader should be taking Hadoop configuration from
> Context and not ignore it
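The shape of the fix the issue describes can be sketched as follows (all class and method names here are simplified stand-ins, not the real Mahout or Hadoop signatures): the header reader should accept the configuration held by the caller's context instead of silently constructing a fresh default one, which discards caller settings such as `fs.defaultFS`.

```scala
// Hypothetical stand-ins for a Hadoop-style configuration and a context.
final case class Conf(entries: Map[String, String]) {
  def get(key: String, default: String): String = entries.getOrElse(key, default)
}
final case class Context(conf: Conf)

// Buggy shape: a fresh default config is built internally, so any
// configuration the caller holds (e.g. fs.defaultFS) is ignored.
def readDrmHeaderOld(path: String): String = {
  val conf = Conf(Map.empty)
  conf.get("fs.defaultFS", "file://") + path
}

// Fixed shape: the context's configuration is threaded through.
def readDrmHeader(path: String, ctx: Context): String =
  ctx.conf.get("fs.defaultFS", "file://") + path
```

With a context configured for `hdfs://nn:8020`, the fixed version resolves paths against the caller's filesystem while the old version always falls back to the local default.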
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)