tqchen commented on a change in pull request #8817:
URL: https://github.com/apache/tvm/pull/8817#discussion_r694853644
##########
File path: src/tir/schedule/concrete_schedule.cc
##########
@@ -208,6 +211,25 @@ Schedule ConcreteScheduleNode::Copy() const {
}
/******** Schedule: Schedule: Sampling ********/
+
+void ConcreteScheduleNode::Seed(support::LinearCongruentialEngine::TRandState seed) {
+  support::LinearCongruentialEngine(&rand_state_).Seed(seed == -1 ? std::random_device()() : seed);
+}
+support::LinearCongruentialEngine::TRandState ConcreteScheduleNode::ForkSeed() {
+  // In order for reproducibility, we computer the new seed using RNG's random state and a different
+  // set of parameters. Note that both 32767 and 1999999973 are prime numbers.
+  return (support::LinearCongruentialEngine(&rand_state_)() * 32767) % 1999999973;
+}
Review comment:
Coming late to the discussion. I read the thread yesterday evening and wanted
to spend more time thinking, so I did not reply immediately.
Please let me try to summarize a few points from the current discussion:
As we generate random numbers, the PRNG state traverses its space of states
and eventually returns to its initial state, forming a cycle.
When we split the state of a PRNG to start parallel generators, if the
split states are "too close" in terms of their traversal distance along that
cycle, then we get repetitions, or streams that are correlated with each other.
The following points hold for PRNGs:
- A0: When we split a PRNG in an ad-hoc way, there is a possibility
  that repeats can happen across different streams. Whether this happens
  depends on the splitting method, the seed, and the PRNG itself.
- A1: There is currently no proof of correctness for splitting an LCG in general.
Because of the above two points, we need to consider the implications of
using possibly correlated PRNG streams. In our case, we use the PRNG
to generate explorations of the search space and perform the following task:
- T0: explore the space and find possible maxima in the space
To make things simple, let us assume that two of the 32 threads hold streams
that are exactly identical to each other. In this case, we waste the
computation of one of the threads, because it will produce exactly the same
set of samples. Since our goal is to explore the space and find a maximum
point, the computation from that particular thread is wasted, and we get less
statistical efficiency from the sampling process.
At a high level, we have two parameters we can tweak: the number of sampling
steps n, and the number of threads K. What we are saying is that the
statistical efficiency of running uniform sampling over the search space
scales linearly with n, but perhaps less than linearly with K. For other,
more complicated tasks, such as estimating the density of certain regions,
we know that sampling and averaging over time (n) always gives us the
right estimate. Correlation across streams would make averaging over
streams (K) less efficient, because the random numbers are not independent,
but we will still get the correct mean as we increase n.
So in summary:
- A2: The impact of possibly correlated streams is that we could get less than
  full linear efficiency in terms of the number of threads K. The sampling
  still yields effective samples with respect to n (the number of samples we
  take in each thread).
Note that the real sampling scenario is much more complicated. As Junru's
experiments showed, repetition did not happen in the particular case for
quite a long period of time. Additionally, the sampling steps are dependent
on each other (they form a Markov chain), so it is hard to tell the end
effect of correlation between the random sequences (abcde) and (bcdef),
even if they are one step apart, while in most cases they are empirically
many steps apart. To summarize:
- A3: In this particular use case, we empirically get non-repeating
  sequences among the streams. This does not prove that correlation
  won't happen, but it does suggest that the correlation may not be
  large in most cases.
The end effect of A2 has a quite close analogy in parallel computing: as we
start to use K threads, we may not get exactly a Kx speedup. Except that in
this case the loss is not due to hardware; it is due to the possible
correlation. As in parallel computing, where we can run the program longer,
in our case we can increase n to compensate for the possible loss of
efficiency. In short, A2 and A3 together suggest that parallel stream
correlation may not be a problem we need to solve perfectly, as long as it
does not become very bad (e.g. all streams being the same).
My read is that @tkonolige is right about A0 and A1, and it seems we also
agree. Yesterday I did not think of A2 in particular, which might change our
perspective, so I would like to share this summary here. Would be great to
get your takes as well.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]