Github user smurching commented on a diff in the pull request:
https://github.com/apache/spark/pull/19433#discussion_r146731101
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/TrainingInfo.scala ---
@@ -0,0 +1,144 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.tree.impl
+
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.spark.ml.tree.{LearningNode, Split}
+import org.apache.spark.util.collection.BitSet
+
+/**
+ * Maintains intermediate state of data (columns) and tree during local tree training.
+ * Primary local tree training data structure; contains all information required to describe
+ * the state of the algorithm at any point during learning.
+ *
+ * Nodes are indexed left-to-right along the periphery of the tree, with 0-based indices.
+ * The "periphery" is the set of leaf nodes (active and inactive).
+ *
+ * @param columns Array of columns.
+ * Each column is sorted first by nodes (left-to-right along the tree periphery);
+ * all columns share this first level of sorting.
+ * Within each node's group, each column is sorted based on feature value;
+ * this second level of sorting differs across columns.
+ * @param instanceWeights Array of weights for each training example
+ * @param nodeOffsets Offsets into the columns indicating the first level of sorting (by node).
+ * The rows corresponding to the node activeNodes(i) are in the range
+ * [nodeOffsets(i)(0), nodeOffsets(i)(1)).
+ * @param activeNodes Nodes which are active (still being split).
+ * Inactive nodes are known to be leaves in the final tree.
+ */
+private[impl] case class TrainingInfo(
+ columns: Array[FeatureVector],
+ instanceWeights: Array[Double],
--- End diff --
Good call, I'll move `instanceWeights` outside `TrainingInfo`.
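
For reference, here is a minimal sketch of how the layout described in the doc comment could be read with the weights kept outside the training state. It is not the PR's code: `Column`, `rowsForNode`, and `weightSum` are hypothetical names, and the `Array[Array[Int]]` type for `nodeOffsets` is only an assumption based on the `nodeOffsets(i)(0)` notation in the Scaladoc.

```scala
// Hypothetical, simplified stand-ins for illustration; not the PR's actual classes.
object NodeOffsetsSketch {
  // One feature column: values(k) is a feature value and indices(k) is the
  // original row index of the training example that value came from.
  final case class Column(values: Array[Double], indices: Array[Int])

  // Rows of the i-th active node occupy [nodeOffsets(i)(0), nodeOffsets(i)(1)).
  // Assumption: each entry of nodeOffsets is a two-element array (start, end).
  def rowsForNode(col: Column, nodeOffsets: Array[Array[Int]], i: Int): Column = {
    val from = nodeOffsets(i)(0)
    val until = nodeOffsets(i)(1)
    Column(col.values.slice(from, until), col.indices.slice(from, until))
  }

  // Weights live outside the training state and are looked up by original row index.
  def weightSum(col: Column, instanceWeights: Array[Double]): Double =
    col.indices.map(instanceWeights(_)).sum

  def main(args: Array[String]): Unit = {
    // Two active nodes: node 0 owns rows [0, 3), node 1 owns rows [3, 5).
    // Within each node's range the values are sorted in ascending order.
    val col = Column(Array(0.1, 0.4, 0.9, 0.2, 0.7), Array(2, 0, 4, 3, 1))
    val nodeOffsets = Array(Array(0, 3), Array(3, 5))
    val instanceWeights = Array(1.0, 2.0, 1.0, 0.5, 1.5)

    for (i <- nodeOffsets.indices) {
      val nodeRows = rowsForNode(col, nodeOffsets, i)
      println(s"node $i: weight sum = ${weightSum(nodeRows, instanceWeights)}")
    }
  }
}
```

Since the weights are looked up by original row index, they do not need to be re-sorted when the columns are, which is part of why keeping them out of `TrainingInfo` seems reasonable.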