Hello all, I've been working through the Wideskies paper and Walter's slides. I think I have an idea that would improve usability by a lot, but I want to run it past you guys to see if it would work.
Here's my understanding of how the Wideskies Responder works: - It maintains the values Y_0, ..., Y_(r-1) which hold the encrypted data returned by the algorithm. From a Comp Sci perspective, I see the collection of these values functioning as, basically, a big encrypted byte buffer holding our results. It also maintains a list of counters c_0, ..., c_t, one for each possible value of hash function H. - When a pair (T, D) comes in, we hash T to get H(T), and break D into 8-bit chunks. We then combine each of those chunks into our Y_i values using E_H(T), our encrypted value for the hash of T, which is chosen in such a way that a) if we aren't interested in T, adding this chunk to Y_i does not change the value it decrypts to (we essentially multiply by encrypted 1); or b) if we are interested in T, E_H(T) has been chosen in such a way that adding this chunk to Y_i will not overwrite any byte chunks for *other* interesting terms that have also been written to Y_i. - We then increment the counter c_T so that future items that hash to the same value write to different Y_i, so that we do not overwrite the value we just wrote. - If *any *of these counters c_H(T) would exceed r due to some new pair (T,D), then we can no longer ensure that we won't overwrite data we already saw when we add the new D to Y_0, ..., Y_(r-1). From a Comp Sci perspective, this looks like "our buffer is full," since we don't have room for new data. Therefore, we have to terminate the algorithm and return our list of Y's as they stand, without processing all of the data. This is true even if most of the terms that hashed to H(T) were ones we weren't interested in; in other words, if ANY of the terms are extremely common, even if we aren't interested in them, we run the risk of being unable to process all of the data. Please let me know if I'm off the mark on any of that :-) So my idea is as follows: In programming world, when we are processing a large amount of data and we fill up our current write buffer, we just output that buffer, make a new one, and continue. Why wouldn't that work here? In other words, when we encounter a term (T, D) that would cause one of the counters c_H(T) to exceed r, why couldn't we just push the current Y_0, ..., Y_(r-1) entries onto a list, create a new set of entries Y'_0, ..., Y'_(r-1), reset all the counters to 0, and continue processing (T,D) as normal? Then the algorithm could return all of the buffers Y, Y', Y'', etc. that it created instead of just Y, and we can ensure that we can process all of the data. As far as I could tell, this requires giving the responder no additional information about our query terms that they didn't already have. So I don't think it would weaken the privacy of the algorithm at all, but please correct me if I'm missing something. On the other hand I think it would make PIRK much easier to use, since we don't run the risk of a query failing if we don't understand enough about the data it's running over to be able to tune r appropriately. I can also imagine this leading to some performance improvements, since we could let PIRK choose r automatically so that a result set could fit nicely within a hard disk page, network frame, etc. What do you guys think? Regards, -Ryan Carr