On 20/07/16 23:02, Ellison Anne Williams wrote:
> At the risk of making things too nested (and hence difficult to read), I
> will answer inline below.

Inlining responses is the only sane way to go ;-)

See below

> On Tue, Jul 19, 2016 at 6:22 AM, Tim Ellison <[email protected]> wrote:
> 
>> I'm trying to write the simplest of examples using Pirk, so I can
>> understand what is happening, and I'm stumbling a bit in some of the
>> assumptions and side effects ...
>>
>> I have a simple data schema
>>         https://paste.apache.org/TcxK
>>
>> describing my data file
>>         https://paste.apache.org/8QDH
>>
>> The query schema is
>>         https://paste.apache.org/gV02
>>
>> And finally, here's my first attempt to create a query/querier
>>         https://paste.apache.org/1IpL
>>
>>
>>
> EAW: These look good ;)

I've moved the code into GitHub just to make sharing it a bit easier.
It's not a fully working project, nor is it pretty; it's just there to
show what I'm trying to do.

https://github.com/tellison/pirk-example

It seems to me that a number of the parameters to QueryInfo could take
sensible defaults?  Then I could drop all of these statics, which I
have little clue about:
    private static int keyedHashBitSize = 12;     // bit size of the keyed hash of each selector
    private static String hashedkey = "SomeKey";  // key used for the keyed hash
    private static int dataPartitionBitSize = 8;  // bits per partition of each data element
    private static int paillierBitSize = 384;     // bit length of the Paillier modulus
    private static int certainty = 128;           // certainty of primality when generating Paillier primes
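
Purely for illustration -- no such factory exists in Pirk today -- I'm
imagining something like a builder that supplies documented defaults
for all of the cryptographic parameters, so a caller only overrides
what it needs (all names below are hypothetical):

    // Hypothetical sketch, not current Pirk API: QueryInfo built with
    // sensible defaults for the cryptographic parameters.
    QueryInfo queryInfo = QueryInfo.builder(queryNum, numSelectors, queryType)
        .paillierBitSize(3072)  // override only where the default won't do
        .build();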

>> Observations and questions so far:
>>
>>  * I've had a stab at defining the xsd's for the schema files to help me
>> verify them.  There is a PR in the queue for you to take a look at to
>> see if I got it right.
>>
> EAW: These are spot on - thanks for putting these together.
> 
>>  * It seems I must put the schemas into a file.  It would be useful to
>> have an API to define the schema directly.
>>
>>     - I can see why the data schema is likely to be fixed, and therefore
>> not unusual to be in a file, but for ad hoc queries I'm assuming I may
>> want to just send the query schema alongside the Query to the responder?
>>
>>
> EAW: Absolutely. At some point (soon), I think that we should roll this
> into the code - I am in favor of opening a JIRA issue to optionally
> incorporate the query schema (QuerySchema object) directly into the Query
> (probably via QueryInfo, although there are several ways to effectively
> accomplish this) instead of relying on it being present in the Responder
> (right now, the Responder's filesystem, local and/or hdfs). This would
> allow for ad-hoc query schemas to be run without 'pre-coordination' between
> the Querier and Responder. The only downside is that it may be redundant --
> if every Query is sending the same query schema to a given Responder, it's
> a bit wasteful (but, in my opinion, insignificantly so compared to the size
> of a decent query).

Agreed.  I don't know how large a query schema could get, but the
ability to send it with the query seems good -- perhaps with an option
to refer to an existing file-based schema for those cases where it is
large, or is well-defined in advance.
I can envisage how that would work.
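
Purely as a sketch of the option (the field and method names below are
hypothetical), I'm picturing the Query carrying an optional embedded
schema, with the Responder falling back to a named, file-based one when
it is absent:

    import java.io.Serializable;
    import java.util.Map;

    // Hypothetical sketch of an optionally embedded query schema.
    public class Query implements Serializable
    {
      private QueryInfo queryInfo;
      private QuerySchema embeddedSchema;  // optional; travels with the query

      // On the Responder: prefer the embedded schema, otherwise fall
      // back to one already known by name from the filesystem.
      public QuerySchema resolveSchema(Map<String,QuerySchema> knownSchemas)
      {
        return (embeddedSchema != null) ? embeddedSchema
                                        : knownSchemas.get(queryInfo.getQueryType());
      }
    }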

> Further, you could optionally send both the query and data schema along in
> the Query object. Note that there is far less value in sending the data
> schema in this manner as the Responder and Querier should already be 'in
> sync' on the data - i.e., the Query is dependent on the data schema and can
> only be successfully performed by the Responder if the Responder has data
> with that schema.

Right, sending the data schema to the responder doesn't seem natural.
If anything, I'd want to ask the responder what its data schema is.

>>  * My data is in JSON format, but the schema is in XML - would be useful
>> to be able to specify the schema in a variety of formats, e.g.
>> json-schema for JSON data.
>>
>>     - Maybe this is one area where the schema provider can be more
>> flexible.
> 
> EAW: Yes, the schema provider could be more flexible. XML was just a quick
> and relatively standard way to get up and running. In LoadDataSchemas, the
> XML is parsed to create a DataSchema object (one for each data schema
> specified). The DataSchema object could be created from other sources.

Cool, so the plan would be to make the DataSchema class agnostic of its
persistent representation, and then have 'providers' that applications
can use to load/store it in a variety of formats.
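
Roughly what I have in mind (the names below are hypothetical, just to
make the shape concrete):

    import java.io.IOException;
    import java.io.InputStream;

    // Hypothetical provider interface -- one implementation per format.
    public interface DataSchemaProvider
    {
      DataSchema load(InputStream in) throws IOException;
    }

e.g. an XmlDataSchemaProvider wrapping today's XML parsing, plus a
JsonDataSchemaProvider that validates JSON data against a json-schema.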

>>  * My first touch of the SystemConfiguration class (line: 31) causes an
>> attempt to read the schemas [1] before I get a chance to set the
>> required properties.  I am calling #initialize on the loaders again to
>> do the actual work.
> 
>>   - Why does SystemConfiguration<clinit> load the query schemas before
>> it can have any properties set?
>>   - Would be helpful to have a API to define additional schema
>> incrementally at runtime.  At the moment, I assume I must call
>> LoadQuerySchemas#getSchemaMap() and manipulate the map directly [2].
>>
> EAW: Currently, the assumption throughout the codebase is that data and
> query schemas are 'statically' specified via the corresponding XML files.
> We definitely can (and should!) change this to be more dynamic. There are
> several ways to do this, but probably the easiest within the current
> LoadQuerySchema construct is to add the ability to add a QuerySchema to the
> schemaMap at any point. For me, this ties in with providing the ability to
> specify 'ad-hoc' query schemas via the Query object.

Yep.
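
For concreteness, even something this small would cover my case (a
hypothetical method, assuming the existing schemaMap is keyed by
schema name):

    // Hypothetical addition to LoadQuerySchemas: register a schema
    // built in code, bypassing the XML files entirely.
    public static void addSchema(QuerySchema schema)
    {
      schemaMap.put(schema.getSchemaName(), schema);
    }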

> The query and data schema XML files are specified via the query.schemas and
> data.schemas properties in the pirk.properties file. Before the schemas can
> be loaded (via LoadQuerySchema and LoadDataSchema), the properties have to
> be parsed from the pirk.properties file via SystemConfiguration. Thus,
> given the current setup, the most natural thing to do seemed to be to
> statically 'initialize' SystemConfiguration, loading the properties and
> then immediately loading the query and data schemas (since they are read
> from the properties and do not change).
> 
> In your code, you didn't specify (at least that I can see) your schema
> files in your copy of pirk.properties, thus they would not be loaded
> automatically via the SystemConfiguration static code block. Instead, you
> (very logically!) explicitly set the data.schemas and query.schemas
> property with two calls to SystemConfiguration.setProperty(<schemaProp>,
> <schemaFile>). This necessitated extra calls to
> LoadDataSchemas.initialize() and LoadQuerySchemas.initialize() in order to
> have the schemas loaded.

Yep, I guess I had a different mental model of which parts are static
(defined once per long-running instance of Pirk), and which are part of
each query/response interaction between various queriers and
responders.

> As to [2], the trust model for Pirk, that's an entirely separate thread
> that we can start at some point, if desired. :)

Ha, let's park that discussion for the moment :-)

>>  * The order of loading the schemas is important, must load the data
>> schemas before the query schemas (as there is a back reference that is
>> checked at load time), so it becomes
>>     SystemConfiguration.setProperty("data.schemas", "...");
>>     SystemConfiguration.setProperty("query.schemas", "...");
>>     LoadDataSchemas.initialize();   // must come first: query schemas reference data schemas
>>     LoadQuerySchemas.initialize();  // the back reference is checked at load time
>>
> EAW: Correct - query schemas are dependent on data schemas. Right now, a
> query schema can only specify a single data schema over which it can
> operate. We can add the option for a Query (via its query schema) to run
> over multiple data types (specify multiple data schemas) at one time.

I wasn't thinking about the use of multiple data schemas; rather, I was
wondering whether the query schema needs to be eager about having the
data schema available -- maybe it can defer that until the query is
created.  I may be overthinking it, and it's not a big deal.
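
As a sketch of what I mean (assuming accessors roughly like these;
they are hypothetical otherwise):

    // Hypothetical: resolve the data schema when the query is created,
    // rather than validating the reference when the query schema loads.
    QuerySchema qSchema = LoadQuerySchemas.getSchema("simple-query");
    DataSchema dSchema = LoadDataSchemas.getSchema(qSchema.getDataSchemaName());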

>>  * Now I create the QueryInfo object.  No idea what a number of these
>> parameters are doing ;-)  but they do seem to relate to the core
>> function, plus parts of the Paillier algorithm and Hadoop integration
>> too (just wondering if they should be there or kept elsewhere?).
>>
>>   - If QueryInfo is part of the API then it needs more user-level doc.
>>
>> I've not tried running the query yet!  Just the first baby steps, so
>> stop me if I'm heading in the wrong direction.
> 
> EAW: You are heading in the right direction. As to running the query - go
> for it! Happy to help with any issues you run into :)

I ran into a couple of gotchas.
 - the layout of the data file assumes one JSON record per line (see
   the two-line sample below);
 - setting the "pir.outputData" config in my app had no effect
   (did I mention that I'm falling out with SystemConfiguration? ;-)
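
i.e., the input must be laid out like this -- each record a complete
JSON object on its own line, rather than one pretty-printed document
(the field names here are just for illustration):

    {"name":"alice","age":30}
    {"name":"bob","age":25}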

Well, I got a response!  I've checked the debug info into my scratch
repo for you to see:
https://github.com/tellison/pirk-example/tree/master/debug

My next step is to tidy up a bit and suggest some Pirk API changes to
make this a very simple usage example; from there I'll slowly move up
the stack to try an example on Spark.

Regards,
Tim

>> [1]
>>
>> https://github.com/apache/incubator-pirk/blob/master/src/main/java/org/apache/pirk/utils/SystemConfiguration.java#L73
>> [2] At some point I'd like to understand the trust model for Pirk, i.e.
>> where the boundary is between trusted/untrusted code.
>>
>> I appreciate your patience!
>>
>> Regards,
>> Tim