Hi Tim,

At the risk of making things too nested (and hence difficult to read), I will answer inline below.

On Tue, Jul 19, 2016 at 6:22 AM, Tim Ellison <[email protected]> wrote:

> I'm trying to write the simplest of examples using Pirk, so I can
> understand what is happening, and I'm stumbling a bit in some of the
> assumptions and side effects ...
>
> I have a simple data schema
> https://paste.apache.org/TcxK
>
> describing my data file
> https://paste.apache.org/8QDH
>
> The query schema is
> https://paste.apache.org/gV02
>
> And finally, here's my first attempt to create a query/querier
> https://paste.apache.org/1IpL

EAW: These look good ;)

> Observations and questions so far:
>
> * I've had a stab at defining the xsd's for the schema files to help me
> verify them. There is a PR in the queue for you to take a look at to
> see if I got it right.

EAW: These are spot on - thanks for putting these together.

> * It seems I must put the schemas into a file. It would be useful to
> have an API to define the schema directly.
>
> - I can see why the data schema is likely to be fixed, and therefore
> not unusual to be in a file, but for ad hoc queries I'm assuming I may
> want to just send the query schema alongside the Query to the responder?

EAW: Absolutely. At some point (soon), I think we should roll this into the code - I am in favor of opening a JIRA issue to optionally incorporate the query schema (the QuerySchema object) directly into the Query (probably via QueryInfo, although there are several ways to accomplish this) instead of relying on it being present on the Responder (right now, on the Responder's filesystem, local and/or HDFS). This would allow ad hoc query schemas to be run without 'pre-coordination' between the Querier and Responder. The only downside is potential redundancy - if every Query sends the same query schema to a given Responder, it's a bit wasteful (but, in my opinion, insignificantly so compared to the size of a decent query).

Further, you could optionally send both the query and data schemas along in the Query object. Note that there is far less value in sending the data schema this way, as the Responder and Querier should already be 'in sync' on the data - i.e., the Query depends on the data schema and can only be performed successfully by the Responder if the Responder actually holds data with that schema.
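To make the JIRA idea concrete, here's a rough sketch of the shape I have in mind. Note that setEmbeddedQuerySchema/getEmbeddedQuerySchema and buildAdHocSchema are hypothetical names (no such methods exist today), I've elided the existing QueryInfo constructor parameters, and I'm assuming the schemaMap is keyed by the query type carried in QueryInfo:

    // Querier side: build the QuerySchema programmatically and attach it
    // to the query itself, rather than relying on an XML file
    QuerySchema adHocSchema = buildAdHocSchema();      // hypothetical helper
    QueryInfo queryInfo = new QueryInfo(/* existing params elided */);
    queryInfo.setEmbeddedQuerySchema(adHocSchema);     // hypothetical setter

    // Responder side: prefer an embedded schema if one was sent,
    // otherwise fall back to the locally loaded schemaMap
    QuerySchema qSchema = queryInfo.getEmbeddedQuerySchema();  // hypothetical getter
    if (qSchema == null)
    {
      qSchema = LoadQuerySchemas.getSchemaMap().get(queryInfo.getQueryType());
    }

The fallback preserves today's behavior for Queriers and Responders that have pre-coordinated their schemas, while allowing fully ad hoc ones.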
> * My data is in JSON format, but the schema is in XML - would be useful
> to be able to specify the schema in a variety of formats, e.g.
> json-schema for JSON data.
>
> - Maybe this is one area where the schema provider can be more
> flexible.

EAW: Yes, the schema provider could be more flexible. XML was just a quick and relatively standard way to get up and running. In LoadDataSchemas, the XML is parsed to create a DataSchema object (one for each data schema specified). The DataSchema object could just as easily be created from other sources.
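For example, a JSON-backed loader might look roughly like this (sketch only - I'm assuming Jackson for the parsing, and the DataSchema constructor and addElement method shown here are hypothetical, since DataSchema is currently populated internally by LoadDataSchemas; error handling omitted):

    import java.io.File;
    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    // Build a DataSchema from a JSON description instead of XML
    JsonNode root = new ObjectMapper().readTree(new File("simple-data-schema.json"));
    DataSchema schema = new DataSchema(root.get("schemaName").asText());  // hypothetical ctor
    for (JsonNode element : root.get("elements"))
    {
      schema.addElement(element.get("name").asText(),      // hypothetical method
          element.get("type").asText(),
          element.get("isArray").asBoolean());
    }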
> * My first touch of the SystemConfiguration class (line: 31) causes an
> attempt to read the schemas [1] before I get a chance to set the
> required properties. I am calling #initialize on the loaders again to
> do the actual work.
>
> - Why does SystemConfiguration <clinit> load the query schemas before
> it can have any properties set?
>
> - Would be helpful to have an API to define additional schemas
> incrementally at runtime. At the moment, I assume I must call
> LoadQuerySchemas#getSchemaMap() and manipulate the map directly [2].

EAW: Currently, the assumption throughout the codebase is that data and query schemas are 'statically' specified via the corresponding XML files. We definitely can (and should!) change this to be more dynamic. There are several ways to do it, but probably the easiest within the current LoadQuerySchemas construct is to add the ability to add a QuerySchema to the schemaMap at any point. For me, this ties in with providing the ability to specify ad hoc query schemas via the Query object, as sketched above.
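Until there's a proper registration API, manipulating the map via getSchemaMap(), as you assumed, is indeed the stop-gap - something like this (I'm assuming the schemaMap is keyed by query type name, and QuerySchema construction is elided since its fields depend on your data schema):

    // Stop-gap: register an ad hoc QuerySchema at runtime by putting it
    // straight into the loaded schema map (no formal API for this yet)
    QuerySchema adHocSchema = buildAdHocSchema();  // hypothetical helper
    LoadQuerySchemas.getSchemaMap().put("myAdHocQueryType", adHocSchema);

A proper addSchema(QuerySchema) method on LoadQuerySchemas would presumably just wrap this, plus the data-schema back-reference validation that currently happens at load time.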
As to why the static initializer loads the schemas up front: the query and data schema XML files are specified via the query.schemas and data.schemas properties in the pirk.properties file. Before the schemas can be loaded (via LoadQuerySchemas and LoadDataSchemas), those properties have to be parsed from pirk.properties via SystemConfiguration. Thus, given the current setup, the most natural thing to do seemed to be to statically 'initialize' SystemConfiguration, loading the properties and then immediately loading the query and data schemas (since they are read from the properties and do not change).

In your code, you didn't specify (at least that I can see) your schema files in your copy of pirk.properties, so they would not be loaded automatically via the SystemConfiguration static code block. Instead, you (very logically!) explicitly set the data.schemas and query.schemas properties with two calls to SystemConfiguration.setProperty(<schemaProp>, <schemaFile>). This necessitated the extra calls to LoadDataSchemas.initialize() and LoadQuerySchemas.initialize() in order to have the schemas loaded.

As to [2], the trust model for Pirk - that's an entirely separate thread that we can start at some point, if desired. :)

> * The order of loading the schemas is important; you must load the data
> schemas before the query schemas (as there is a back reference that is
> checked at load time), so it becomes
>
> SystemConfiguration.setProperty("data.schemas", "...");
> SystemConfiguration.setProperty("query.schemas", "...");
> LoadDataSchemas.initialize();
> LoadQuerySchemas.initialize();

EAW: Correct - query schemas are dependent on data schemas. Right now, a query schema can only specify a single data schema over which it operates. We could add the option for a Query (via its query schema) to run over multiple data types (i.e., specify multiple data schemas) at one time.

> * Now I create the QueryInfo object. No idea what a number of these
> parameters are doing ;-) but they do seem to relate to the core
> function, plus parts of the Paillier algorithm and Hadoop integration
> too (just wondering if they should be there or kept elsewhere?).
>
> - If the QueryInfo is API then it needs more user-level doc.
>
> I've not tried running the query yet! Just the first baby steps, so
> stop me if I'm heading in the wrong direction.

EAW: You are heading in the right direction. As to running the query - go for it! Happy to help with any issues you run into :)

> [1]
> https://github.com/apache/incubator-pirk/blob/master/src/main/java/org/apache/pirk/utils/SystemConfiguration.java#L73
>
> [2] At some point I'd like to understand the trust model for Pirk, i.e.
> where the boundary is between trusted/untrusted code.
>
> I appreciate your patience!
>
> Regards,
> Tim