tqchen commented on a change in pull request #8817:
URL: https://github.com/apache/tvm/pull/8817#discussion_r694853644



##########
File path: src/tir/schedule/concrete_schedule.cc
##########
@@ -208,6 +211,25 @@ Schedule ConcreteScheduleNode::Copy() const {
   }
 
 /******** Schedule: Schedule: Sampling ********/
+
+void ConcreteScheduleNode::Seed(support::LinearCongruentialEngine::TRandState seed) {
+  support::LinearCongruentialEngine(&rand_state_).Seed(seed == -1 ? std::random_device()() : seed);
+}
+support::LinearCongruentialEngine::TRandState ConcreteScheduleNode::ForkSeed() {
+  // For reproducibility, we compute the new seed using the RNG's random state and a different
+  // set of parameters. Note that both 32767 and 1999999973 are prime numbers.
+  return (support::LinearCongruentialEngine(&rand_state_)() * 32767) % 1999999973;
+}
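
   For reference, the fork step in the patch can be sketched with a toy generator.
   This is only an illustration: `ToyLCG` below uses MINSTD constants (a = 48271,
   m = 2^31 - 1), which are assumed for the sketch and are not TVM's actual
   `support::LinearCongruentialEngine` parameters; only the
   `(next() * 32767) % 1999999973` fork step is taken from the diff.

   ```cpp
   #include <cassert>
   #include <cstdint>
   #include <iostream>

   // Toy stand-in for support::LinearCongruentialEngine. The MINSTD constants
   // below are illustrative only; TVM's engine uses its own parameters.
   struct ToyLCG {
     int64_t state;
     explicit ToyLCG(int64_t seed) : state(seed) {}
     int64_t Next() {
       state = (state * 48271) % 2147483647;  // a = 48271, m = 2^31 - 1
       return state;
     }
   };

   // Mirrors the ForkSeed step in the patch: derive a child seed from the
   // parent's next output. 32767 and 1999999973 are both prime.
   int64_t ForkSeed(ToyLCG* rng) {
     return (rng->Next() * 32767) % 1999999973;
   }

   int main() {
     ToyLCG parent(1);
     int64_t child_seed = ForkSeed(&parent);
     // With seed 1, the first MINSTD output is 48271, so the forked seed is
     // (48271 * 32767) % 1999999973 = 1581695857.
     assert(child_seed == 1581695857);
     std::cout << "forked seed: " << child_seed << "\n";
   }
   ```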

Review comment:
       Coming late to the discussion. I read the thread yesterday evening and
   wanted to spend more time thinking, so I did not reply immediately.
   
   Please let me try to summarize and share some thoughts
   
   As we generate random numbers, the PRNG state cycles through the space of states
   and eventually returns to its initial state, forming a cycle.
   When we split the state of a PRNG to start parallel generators, if the
   split states are "too close" in terms of their traversal distance,
   then we get repetitions, or streams that are correlated with each other.
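
   To make the "too close" failure mode concrete, here is a hypothetical sketch
   using a toy MINSTD-style LCG (assumed constants, not TVM's actual engine): if a
   second stream is started from a state only one step along the first stream's
   cycle, every output of the second stream is just the first stream shifted by one,
   i.e. the streams are maximally correlated.

   ```cpp
   #include <cassert>
   #include <cstdint>
   #include <vector>

   // Toy MINSTD-style LCG, illustrative only (not TVM's engine).
   struct ToyLCG {
     int64_t state;
     explicit ToyLCG(int64_t seed) : state(seed) {}
     int64_t Next() {
       state = (state * 48271) % 2147483647;  // a = 48271, m = 2^31 - 1
       return state;
     }
   };

   int main() {
     ToyLCG a(12345);
     ToyLCG b(a.state);
     b.Next();  // "split" b only one step along a's cycle: far too close

     std::vector<int64_t> sa, sb;
     for (int i = 0; i < 6; ++i) sa.push_back(a.Next());
     for (int i = 0; i < 5; ++i) sb.push_back(b.Next());

     // b is just a shifted copy of a.
     for (int i = 0; i < 5; ++i) assert(sb[i] == sa[i + 1]);
   }
   ```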
   
   The following points hold about PRNGs:
   
   - A0: When we split a PRNG in an ad hoc way, there is a "possibility"
         that repeats can happen in different streams. The likelihood of such
         repeats depends on the splitting method, the seed, and the PRNG itself.
   - A1: There is currently no proof of correctness for splitting an LCG in general.
   
   Because of the above two points, we need to consider the implications of
   using possibly correlated PRNG streams. In our case, we use the PRNG
   to generate explorations in the search space and perform the following task:
   - T0: explore the space and find possible maxima in the space
   
   To make things simple, let us assume that two of the streams across the 32 threads
   are exactly identical to each other. In this case, we waste the computation
   from one of the threads, because it will produce exactly the same set of samples.
   Since our goal is to explore the space and find a maximum point, the computation
   from that particular thread is wasted, and we get less statistical
   efficiency from the sampling process.
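
   The identical-streams case can be sketched directly. The toy LCG and the
   `SampleCandidates` helper below are hypothetical stand-ins (not TVM code):
   two "threads" seeded identically propose exactly the same candidates, so one
   thread's work adds nothing to the exploration.

   ```cpp
   #include <cassert>
   #include <cstdint>
   #include <vector>

   // Toy MINSTD-style LCG, illustrative only (not TVM's engine).
   struct ToyLCG {
     int64_t state;
     explicit ToyLCG(int64_t seed) : state(seed) {}
     int64_t Next() {
       state = (state * 48271) % 2147483647;
       return state;
     }
   };

   // Hypothetical helper: each "thread" samples n candidate indices
   // from a search space of the given size.
   std::vector<int> SampleCandidates(int64_t seed, int n, int space_size) {
     ToyLCG rng(seed);
     std::vector<int> out;
     for (int i = 0; i < n; ++i) {
       out.push_back(static_cast<int>(rng.Next() % space_size));
     }
     return out;
   }

   int main() {
     std::vector<int> thread0 = SampleCandidates(7, 100, 1000);
     std::vector<int> thread1 = SampleCandidates(7, 100, 1000);  // same seed
     // Identical streams: the second thread's computation is wasted.
     assert(thread0 == thread1);
   }
   ```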
   
   At a high level, we have two parameters we can tweak: the number of sampling steps n,
   and the number of threads K. What we are saying is that the statistical efficiency of
   running uniform sampling over the search space scales linearly with n, but perhaps
   less than linearly with K. For other, more complicated tasks, such as estimating the
   density of certain regions, we know that sampling and averaging over time (n) can
   always give us the right estimate. Correlation across streams would make averaging
   over streams (K) less efficient, because the random numbers are not independent,
   but we will still get the correct mean as we increase n.
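
   The averaging claim can be checked with a small Monte Carlo sketch (again using
   an assumed toy MINSTD-style LCG and a hypothetical `EstimateMean` helper, not
   TVM code): even when two of K = 4 streams are fully duplicated, averaging over
   a large n still lands close to the true mean of 0.5 for uniforms in [0, 1);
   the duplication only reduces the effective number of independent samples.

   ```cpp
   #include <cassert>
   #include <cmath>
   #include <cstdint>
   #include <vector>

   // Toy MINSTD-style LCG, illustrative only (not TVM's engine).
   struct ToyLCG {
     int64_t state;
     explicit ToyLCG(int64_t seed) : state(seed) {}
     int64_t Next() {
       state = (state * 48271) % 2147483647;
       return state;
     }
   };

   // Average n uniform [0, 1) draws from each of the given streams.
   double EstimateMean(const std::vector<int64_t>& seeds, int n) {
     double sum = 0.0;
     for (int64_t seed : seeds) {
       ToyLCG rng(seed);
       for (int i = 0; i < n; ++i) sum += rng.Next() / 2147483647.0;
     }
     return sum / (static_cast<double>(seeds.size()) * n);
   }

   int main() {
     // K = 4 streams, two of them identical (seed 3 appears twice).
     std::vector<int64_t> seeds = {3, 3, 11, 19};
     double est = EstimateMean(seeds, 100000);
     // Correlated streams lower statistical efficiency, but the estimate
     // still converges to the true mean 0.5 as n grows.
     assert(std::fabs(est - 0.5) < 0.02);
   }
   ```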
   
   So in summary:
   
   - A2: The impact of possibly correlated streams is that we could get less than
         full-linear efficiency in terms of the number of threads K. The sampling
         will still produce effective samples wrt n (the number of samples we take
         in each thread).
          
   Note that the real sampling scenario is much more complicated. As Junru's experiments
   showed, repetition did not happen in the particular case for quite a long period of time.
   Additionally, the sampling steps are dependent on each other (they form a Markov chain),
   so it is hard to tell the end effect of correlation between random sequences
   (abcde) and (bcdef), even if they are only one step apart; empirically, in most cases
   they are many steps apart. To summarize:
   
   
   - A3: In this particular use case, we empirically get non-repeating sequences
         among the streams. This does not preclude that correlation can happen
         (it is not a proof), but it does suggest that the correlation may not
         be large in most cases.
   
   The end effect of A2 has a quite close analogy in parallel computing: as we start
   to use K threads, we may not get exactly Kx speedups. Except that in this case it
   is not due to hardware reasons; it is due to the possible correlation. Just as in
   parallel computing we can run the program longer, in our case we can increase n to
   compensate for the possible loss of efficiency. In short, A2 and A3 together
   suggest that parallel stream correlation may not be a problem we need to solve
   perfectly, as long as it does not become very bad (e.g. all streams being the same).
   
   My read is that @tkonolige is right about A0 and A1, and it seems we also agree.
   Yesterday I did not think of A2 in particular, which might change our perspective,
   so I would like to share this summary here. It would be great to get your takes as well.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

