This is exactly the way the current streaming implementations work. In the
batch case, as Walter noted, r can be set very large (large enough to
encompass all possible data in a batch case). If you did not know the size
of your data a priori in a batch case (say, in processing in Spark with
data in hdfs), you can just use the queueStream option built into the
SparkStreaming responder to give the buffer-like capability that you
Thus, I think that between the streaming 'buffer' implementation and the
variable 'r' setting in the batch case, the functionality of what you have
described is taken care of.
On Thu, Sep 22, 2016 at 4:06 PM, Walter Ray-Dulany <raydul...@gmail.com>
> I would like to hear Ellison Anne's opinion on this, but I think such a
> change has no security implications.
> While it would be a definite change in how the algorithm operates, from a
> mathematical-security perspective this is no different than picking an r
> value that it simply enormous (recall that r is the number of responses
> that can be returned per query request period per search term). The value
> of r is a constrainer, not a constraint, in this system (i.e. we are forced
> to choose delta and b such that delta/b divides r, but we've got no
> restrictions on our choice of r).
> On Thu, Sep 22, 2016 at 11:23 AM, James Carr <jryancarrp...@gmail.com>
> > Hello all,
> > I've been working through the Wideskies paper and Walter's slides. I
> > think I have an idea that would improve usability by a lot, but I want to
> > run it past you guys to see if it would work.
> > Here's my understanding of how the Wideskies Responder works:
> > - It maintains the values Y_0, ..., Y_(r-1) which hold the encrypted data
> > returned by the algorithm. From a Comp Sci perspective, I see the
> > collection of these values functioning as, basically, a big encrypted
> > buffer holding our results. It also maintains a list of counters c_0,
> > c_t, one for each possible value of hash function H.
> > - When a pair (T, D) comes in, we hash T to get H(T), and break D into
> > 8-bit chunks. We then combine each of those chunks into our Y_i values
> > using E_H(T), our encrypted value for the hash of T, which is chosen in
> > such a way that a) if we aren't interested in T, adding this chunk to Y_i
> > does not change the value it decrypts to (we essentially multiply by
> > encrypted 1); or b) if we are interested in T, E_H(T) has been chosen in
> > such a way that adding this chunk to Y_i will not overwrite any byte
> > for *other* interesting terms that have also been written to Y_i.
> > - We then increment the counter c_T so that future items that hash to the
> > same value write to different Y_i, so that we do not overwrite the value
> > just wrote.
> > - If *any *of these counters c_H(T) would exceed r due to some new pair
> > (T,D), then we can no longer ensure that we won't overwrite data we
> > saw when we add the new D to Y_0, ..., Y_(r-1). From a Comp Sci
> > perspective, this looks like "our buffer is full," since we don't have
> > for new data. Therefore, we have to terminate the algorithm and return
> > list of Y's as they stand, without processing all of the data. This is
> > even if most of the terms that hashed to H(T) were ones we weren't
> > interested in; in other words, if ANY of the terms are extremely common,
> > even if we aren't interested in them, we run the risk of being unable to
> > process all of the data.
> > Please let me know if I'm off the mark on any of that :-)
> > So my idea is as follows: In programming world, when we are processing a
> > large amount of data and we fill up our current write buffer, we just
> > output that buffer, make a new one, and continue. Why wouldn't that work
> > here? In other words, when we encounter a term (T, D) that would cause
> > of the counters c_H(T) to exceed r, why couldn't we just push the current
> > Y_0, ..., Y_(r-1) entries onto a list, create a new set of entries Y'_0,
> > ..., Y'_(r-1), reset all the counters to 0, and continue processing (T,D)
> > as normal? Then the algorithm could return all of the buffers Y, Y', Y'',
> > etc. that it created instead of just Y, and we can ensure that we can
> > process all of the data.
> > As far as I could tell, this requires giving the responder no additional
> > information about our query terms that they didn't already have. So I
> > think it would weaken the privacy of the algorithm at all, but please
> > correct me if I'm missing something. On the other hand I think it would
> > make PIRK much easier to use, since we don't run the risk of a query
> > failing if we don't understand enough about the data it's running over to
> > be able to tune r appropriately. I can also imagine this leading to some
> > performance improvements, since we could let PIRK choose r automatically
> > that a result set could fit nicely within a hard disk page, network
> > etc.
> > What do you guys think?
> > Regards,
> > -Ryan Carr