Hi Tim,

More answers inline below.
On Thu, Jul 21, 2016 at 10:47 AM, Tim Ellison <[email protected]> wrote:

> On 20/07/16 23:02, Ellison Anne Williams wrote:
> > At the risk of making things too nested (and hence difficult to
> > read), I will answer inline below.
>
> Inlining responses is the only sane way to go ;-)
>
> See below
>
> > On Tue, Jul 19, 2016 at 6:22 AM, Tim Ellison <[email protected]> wrote:
> >
> >> I'm trying to write the simplest of examples using Pirk, so I can
> >> understand what is happening, and I'm stumbling a bit in some of the
> >> assumptions and side effects ...
> >>
> >> I have a simple data schema
> >> https://paste.apache.org/TcxK
> >>
> >> describing my data file
> >> https://paste.apache.org/8QDH
> >>
> >> The query schema is
> >> https://paste.apache.org/gV02
> >>
> >> And finally, here's my first attempt to create a query/querier
> >> https://paste.apache.org/1IpL
> >>
> > EAW: These look good ;)
>
> I've moved the code into Github just to make sharing it a bit easier.
> It's not a fully working project, pretty, etc., just there to show what
> I'm trying to do.
>
> https://github.com/tellison/pirk-example
>
EAW: Very nice -- looks like there is a secret interest in Bob's age :)

> Seems to me that a number of the parameters to the QueryInfo could take
> sensible defaults? Then I can drop all the statics that I have little
> clue what they are doing...
>
>   private static int keyedHashBitSize = 12;
>   private static String hashedkey = "SomeKey";
>   private static int dataPartitionBitSize = 8;
>   private static int paillierBitSize = 384;
>   private static int certainty = 128;
>
EAW: As discussed a bit before, I am in favor of multiple properties
files. The properties listed above, which are not all currently in
pirk.properties, could be moved to a querier.properties file (or
something similar) and defaulted there, if desired.
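One way such defaulting could work, sketched with plain java.util.Properties. The QuerierDefaults class, the querier.properties file name, and the property keys below are all hypothetical illustrations, not current Pirk API:

```java
import java.util.Properties;

// Sketch: built-in defaults for the querier parameters above, with any
// properties read from a (hypothetical) querier.properties layered on top.
public class QuerierDefaults
{
  public static Properties defaults()
  {
    Properties d = new Properties();
    d.setProperty("querier.keyedHashBitSize", "12");
    d.setProperty("querier.dataPartitionBitSize", "8");
    d.setProperty("querier.paillierBitSize", "384");
    d.setProperty("querier.certainty", "128");
    return d;
  }

  // Overrides (e.g. loaded from querier.properties) win; anything unset
  // falls through to the defaults via the Properties defaults mechanism.
  public static Properties withOverrides(Properties overrides)
  {
    Properties p = new Properties(defaults());
    overrides.forEach((k, v) -> p.setProperty((String) k, (String) v));
    return p;
  }
}
```

With this shape, a caller that sets nothing gets the documented defaults, and a querier.properties file only needs to list the values it changes.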
> >> Observations and questions so far:
> >>
> >> * I've had a stab at defining the xsd's for the schema files to help
> >>   me verify them. There is a PR in the queue for you to take a look
> >>   at to see if I got it right.
> >
> > EAW: These are spot on - thanks for putting these together.
> >
> >> * It seems I must put the schemas into a file. It would be useful to
> >>   have an API to define the schema directly.
> >>
> >>   - I can see why the data schema is likely to be fixed, and
> >>     therefore not unusual to be in a file, but for ad hoc queries I'm
> >>     assuming I may want to just send the query schema alongside the
> >>     Query to the responder?
> >>
> > EAW: Absolutely. At some point (soon), I think that we should roll
> > this into the code - I am in favor of opening a JIRA issue to
> > optionally incorporate the query schema (QuerySchema object) directly
> > into the Query (probably via QueryInfo, although there are several
> > ways to effectively accomplish this) instead of relying on it being
> > present in the Responder (right now, the Responder's filesystem,
> > local and/or hdfs). This would allow for ad-hoc query schemas to be
> > run without 'pre-coordination' between the Querier and Responder. The
> > only downside is that it may be redundant -- if every Query is
> > sending the same query schema to a given Responder, it's a bit
> > wasteful (but, in my opinion, insignificantly so compared to the size
> > of a decent query).
>
> Agreed, I don't know if the query schema could get large, but I would
> assume that the ability to send it with the query is good -- and maybe
> an option to refer to an existing file-based schema for those cases
> where it is large, or is well-defined in advance.
>
> I can envisage how that would work.
>
EAW: Agreed. When we optionally embed the QuerySchema in the Query, we
need to also give the Responder the ability to reject query schemas that
it doesn't know about a priori (back to the trust model).
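A minimal sketch of what such a Responder-side gate could look like; the class, method, and flag names here are invented for illustration and nothing like this exists in Pirk yet:

```java
import java.util.Set;

// Hypothetical sketch of a Responder-side policy for query schemas that
// arrive embedded in the Query rather than being registered a priori.
public class ResponderSchemaPolicy
{
  // Config flag: accept query schemas unknown to this Responder?
  private final boolean acceptUnknownQuerySchemas;
  private final Set<String> knownSchemaNames;

  public ResponderSchemaPolicy(boolean acceptUnknownQuerySchemas, Set<String> knownSchemaNames)
  {
    this.acceptUnknownQuerySchemas = acceptUnknownQuerySchemas;
    this.knownSchemaNames = knownSchemaNames;
  }

  // True if the Responder should process a query using the named schema:
  // either the schema was registered ahead of time, or the Responder is
  // configured to trust embedded, previously unknown schemas.
  public boolean accept(String querySchemaName)
  {
    return knownSchemaNames.contains(querySchemaName) || acceptUnknownQuerySchemas;
  }
}
```

The single boolean keeps the trust decision coarse, which matches the "start simple, evolve later" suggestion; a later version could replace the flag with per-schema or per-querier rules.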
We can do a simple boolean 'accept/reject unknown query schema' as a
Responder config option at this point and evolve it down the line (this
could stay simple or naturally become quite complex). I will make a JIRA
issue to this effect and we can start working on it. This will be
relatively straightforward (I'll take the initial responsibility).

> > Further, you could optionally send both the query and data schema
> > along in the Query object. Note that there is far less value in
> > sending the data schema in this manner as the Responder and Querier
> > should already be 'in sync' on the data - i.e., the Query is
> > dependent on the data schema and can only be successfully performed
> > by the Responder if the Responder has data with that schema.
>
> Right, sending the data schema to the responder doesn't seem natural.
> If anything I'd want to ask the responder what the schema is.
>
EAW: Agreed.

> >> * My data is in JSON format, but the schema is in XML - would be
> >>   useful to be able to specify the schema in a variety of formats,
> >>   e.g. json-schema for JSON data.
> >>
> >>   - Maybe this is one area where the schema provider can be more
> >>     flexible.
> >
> > EAW: Yes, the schema provider could be more flexible. XML was just a
> > quick and relatively standard way to get up and running. In
> > LoadDataSchemas, the XML is parsed to create a DataSchema object (one
> > for each data schema specified). The DataSchema object could be
> > created from other sources.
>
> Cool, so the plan would be to make the DataSchema class agnostic of its
> potential persistent representation, then have 'providers' that
> applications can use to load/store in a variety of formats.
>
EAW: Agreed. I will add a JIRA issue to this effect so we can put it
concretely on the map.

> >> * My first touch of the SystemConfiguration class (line: 31) causes
> >>   an attempt to read the schemas [1] before I get a chance to set
> >>   the required properties.
> >>   I am calling #initialize on the loaders
> >>   again to do the actual work.
> >>
> >>   - Why does SystemConfiguration<clinit> load the query schemas
> >>     before it can have any properties set?
> >>   - Would be helpful to have an API to define additional schema
> >>     incrementally at runtime. At the moment, I assume I must call
> >>     LoadQuerySchemas#getSchemaMap() and manipulate the map
> >>     directly [2].
> >
> > EAW: Currently, the assumption throughout the codebase is that data
> > and query schemas are 'statically' specified via the corresponding
> > XML files. We definitely can (and should!) change this to be more
> > dynamic. There are several ways to do this, but probably the easiest
> > within the current LoadQuerySchema construct is to add the ability to
> > add a QuerySchema to the schemaMap at any point. For me, this ties in
> > with providing the ability to specify 'ad-hoc' query schemas via the
> > Query object.
>
> Yep.
>
EAW: This is easy to do (allow the ability to add a QuerySchema to the
LoadQuerySchema schemaMap at any point); I'll create the JIRA issue.

> > The query and data schema XML files are specified via the
> > query.schemas and data.schemas properties in the pirk.properties
> > file. Before the schemas can be loaded (via LoadQuerySchema and
> > LoadDataSchema), the properties have to be parsed from the
> > pirk.properties file via SystemConfiguration. Thus, given the current
> > setup, the most natural thing to do seemed to be to statically
> > 'initialize' SystemConfiguration, loading the properties and then
> > immediately loading the query and data schemas (since they are read
> > from the properties and do not change).
> >
> > In your code, you didn't specify (at least that I can see) your
> > schema files in your copy of pirk.properties, thus they would not be
> > loaded automatically via the SystemConfiguration static code block.
> > Instead, you (very logically!)
> > explicitly set the data.schemas and query.schemas
> > properties with two calls to
> > SystemConfiguration.setProperty(<schemaProp>, <schemaFile>). This
> > necessitated extra calls to LoadDataSchemas.initialize() and
> > LoadQuerySchemas.initialize() in order to have the schemas loaded.
>
> Yep, I guess I had a different mental model of the parts that are
> static (defined once per long running instance of Pirk), and those that
> are part of each query/response interaction between various queriers
> and responders.
>
> > As to [2], the trust model for Pirk, that's an entirely separate
> > thread that we can start at some point, if desired. :)
>
> Ha, let's park that discussion for the moment :-)
>
> >> * The order of loading the schemas is important, must load the data
> >>   schemas before the query schemas (as there is a back reference
> >>   that is checked at load time), so it becomes
> >>
> >>     SystemConfiguration.setProperty("data.schemas", "...");
> >>     SystemConfiguration.setProperty("query.schemas", "...");
> >>     LoadDataSchemas.initialize();
> >>     LoadQuerySchemas.initialize();
> >>
> > EAW: Correct - query schemas are dependent on data schemas. Right
> > now, a query schema can only specify a single data schema over which
> > it can operate. We can add the option for a Query (via its query
> > schema) to run over multiple data types (specify multiple data
> > schemas) at one time.
>
> I wasn't thinking about the use of multiple data schemas, rather
> wondering if the query schema needs to be eager about having the data
> schema available - maybe it can defer it until the query is created. I
> may be over thinking it, and it's not a big deal.
>
EAW: Right now, the 'enforcing' behavior of the query-data schema
dependencies is 'universal' to both the Querier and Responder. The
Querier uses the data schema to obtain the bit size of the selector,
based on its type and the specified partitioner, as it would be
partitioned by the Responder.
This value dictates whether we embed the actual selector (obtaining,
BTW, a 0% false positive rate) or whether we embed its hash (which can
incur a slight, calculable false positive rate).

> >> * Now I create the QueryInfo object. No idea what a number of these
> >>   parameters are doing ;-) but they do seem to relate to the core
> >>   function, plus parts of the Paillier algorithm and Hadoop
> >>   integration too (just wondering if they should be there or kept
> >>   elsewhere?).
> >>
> >>   - If the QueryInfo is API then it needs more user level doc.
> >>
> >> I've not tried running the query yet! Just the first baby steps, so
> >> stop me if I'm heading in the wrong direction.
> >
> > EAW: You are heading in the right direction. As to running the query
> > - go for it! Happy to help with any issues you run into :)
>
> I ran into a couple of gotchas:
>  - the layout of the data file assumes one JSON definition per line;
>  - setting the "pir.outputData" config in my app had no effect
> (Did I mention that I'm falling out with the SystemConfiguration ;-)
>
EAW: Yes, I can see that you love SystemConfiguration. :) (yes, we can
change it...)

> Well I got a response! I've checked the debug info into my scratch repo
> for you to see
>
> https://github.com/tellison/pirk-example/tree/master/debug
>
EAW: w00t! :)

> My next step is to try and tidy up a bit, and suggest some Pirk API
> changes to make this a very simple usage example, from there I'll
> slowly move up the stack to try an example on Spark.
>
EAW: Sounds great.

> Regards,
> Tim
>
> >> [1]
> >> https://github.com/apache/incubator-pirk/blob/master/src/main/java/org/apache/pirk/utils/SystemConfiguration.java#L73
> >>
> >> [2] At some point I'd like to understand the trust model for Pirk,
> >> i.e. where the boundary is between trusted/untrusted code.
> >>
> >> I appreciate your patience!
> >>
> >> Regards,
> >> Tim
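As an aside on the embed-vs-hash trade-off mentioned above: with a keyed hash truncated to k bits, any non-matching selector collides with the target's hash with probability roughly 2^-k, which is where the "slight, calculable false positive rate" comes from. A toy sketch, using a simple stand-in hash (not Pirk's actual keyed hash, and the class name is invented):

```java
// Illustrates why a k-bit keyed hash of the selector can produce false
// positives, while embedding the selector itself cannot. The hash below
// is a deliberately crude stand-in for a real keyed hash.
public class SelectorHashSketch
{
  // Truncate a keyed hash of the selector to the low bitSize bits.
  public static int keyedHash(String key, int bitSize, String selector)
  {
    int h = (key + selector).hashCode();
    return h & ((1 << bitSize) - 1);
  }

  // Two distinct selectors land in the same k-bit bucket with probability
  // about 2^-k, so each non-matching record is a false positive with
  // roughly that probability (e.g. ~0.024% for k = 12).
  public static double falsePositiveRate(int bitSize)
  {
    return Math.pow(2.0, -bitSize);
  }
}
```

This also shows why keyedHashBitSize is a tuning knob: larger values shrink the false positive rate at the cost of more bits per embedded selector.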
