Hi Tim,

At the risk of making things too nested (and hence difficult to read), I will answer inline below.

On Tue, Jul 19, 2016 at 6:22 AM, Tim Ellison <[email protected]> wrote:

> I'm trying to write the simplest of examples using Pirk, so I can
> understand what is happening, and I'm stumbling a bit in some of the
> assumptions and side effects ...
>
> I have a simple data schema
> https://paste.apache.org/TcxK
>
> describing my data file
> https://paste.apache.org/8QDH
>
> The query schema is
> https://paste.apache.org/gV02
>
> And finally, here's my first attempt to create a query/querier
> https://paste.apache.org/1IpL

EAW: These look good ;)

> Observations and questions so far:
>
> * I've had a stab at defining the xsd's for the schema files to help me
> verify them. There is a PR in the queue for you to take a look at to
> see if I got it right.

EAW: These are spot on - thanks for putting these together.

> * It seems I must put the schemas into a file. It would be useful to
> have an API to define the schema directly.
>
> - I can see why the data schema is likely to be fixed, and therefore
> not unusual to be in a file, but for ad hoc queries I'm assuming I may
> want to just send the query schema alongside the Query to the responder?

EAW: Absolutely. At some point (soon), I think we should roll this into the code - I am in favor of opening a JIRA issue to optionally incorporate the query schema (the QuerySchema object) directly into the Query (probably via QueryInfo, although there are several ways to accomplish this) instead of relying on it being present on the Responder (right now, on the Responder's filesystem, local and/or HDFS). This would allow ad hoc query schemas to be run without 'pre-coordination' between the Querier and Responder. The only downside is potential redundancy - if every Query sends the same query schema to a given Responder, it's a bit wasteful (but, in my opinion, insignificantly so compared to the size of a decent query).

Further, you could optionally send both the query and data schemas along in the Query object. Note that there is far less value in sending the data schema this way, as the Responder and Querier should already be 'in sync' on the data - i.e., the Query depends on the data schema and can only be performed successfully by the Responder if the Responder actually holds data with that schema.
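To make the JIRA idea concrete, here's a rough sketch of the shape I have in mind. Note that setEmbeddedQuerySchema/getEmbeddedQuerySchema and buildAdHocSchema are hypothetical names (no such methods exist today), I've elided the existing QueryInfo constructor parameters, and I'm assuming the schemaMap is keyed by the query type carried in QueryInfo:

    // Querier side: build the QuerySchema programmatically and attach it
    // to the query itself, rather than relying on an XML file
    QuerySchema adHocSchema = buildAdHocSchema();      // hypothetical helper
    QueryInfo queryInfo = new QueryInfo(/* existing params elided */);
    queryInfo.setEmbeddedQuerySchema(adHocSchema);     // hypothetical setter

    // Responder side: prefer an embedded schema if one was sent,
    // otherwise fall back to the locally loaded schemaMap
    QuerySchema qSchema = queryInfo.getEmbeddedQuerySchema();  // hypothetical getter
    if (qSchema == null)
    {
      qSchema = LoadQuerySchemas.getSchemaMap().get(queryInfo.getQueryType());
    }

The fallback preserves today's behavior for Queriers and Responders that have pre-coordinated their schemas, while allowing fully ad hoc ones.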
> * My data is in JSON format, but the schema is in XML - would be useful
> to be able to specify the schema in a variety of formats, e.g.
> json-schema for JSON data.
>
> - Maybe this is one area where the schema provider can be more
> flexible.

EAW: Yes, the schema provider could be more flexible. XML was just a quick and relatively standard way to get up and running. In LoadDataSchemas, the XML is parsed to create a DataSchema object (one for each data schema specified). The DataSchema object could just as easily be created from other sources.
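For example, a JSON-backed loader might look roughly like this (sketch only - I'm assuming Jackson for the parsing, and the DataSchema constructor and addElement method shown here are hypothetical, since DataSchema is currently populated internally by LoadDataSchemas; error handling omitted):

    import java.io.File;
    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    // Build a DataSchema from a JSON description instead of XML
    JsonNode root = new ObjectMapper().readTree(new File("simple-data-schema.json"));
    DataSchema schema = new DataSchema(root.get("schemaName").asText());  // hypothetical ctor
    for (JsonNode element : root.get("elements"))
    {
      schema.addElement(element.get("name").asText(),      // hypothetical method
          element.get("type").asText(),
          element.get("isArray").asBoolean());
    }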
> * My first touch of the SystemConfiguration class (line: 31) causes an
> attempt to read the schemas [1] before I get a chance to set the
> required properties. I am calling #initialize on the loaders again to
> do the actual work.
>
> - Why does SystemConfiguration <clinit> load the query schemas before
> it can have any properties set?
>
> - Would be helpful to have an API to define additional schemas
> incrementally at runtime. At the moment, I assume I must call
> LoadQuerySchemas#getSchemaMap() and manipulate the map directly [2].

EAW: Currently, the assumption throughout the codebase is that data and query schemas are 'statically' specified via the corresponding XML files. We definitely can (and should!) change this to be more dynamic. There are several ways to do it, but probably the easiest within the current LoadQuerySchemas construct is to add the ability to add a QuerySchema to the schemaMap at any point. For me, this ties in with providing the ability to specify ad hoc query schemas via the Query object, as sketched above.
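Until there's a proper registration API, manipulating the map via getSchemaMap(), as you assumed, is indeed the stop-gap - something like this (I'm assuming the schemaMap is keyed by query type name, and QuerySchema construction is elided since its fields depend on your data schema):

    // Stop-gap: register an ad hoc QuerySchema at runtime by putting it
    // straight into the loaded schema map (no formal API for this yet)
    QuerySchema adHocSchema = buildAdHocSchema();  // hypothetical helper
    LoadQuerySchemas.getSchemaMap().put("myAdHocQueryType", adHocSchema);

A proper addSchema(QuerySchema) method on LoadQuerySchemas would presumably just wrap this, plus the data-schema back-reference validation that currently happens at load time.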
As to why the static initializer loads the schemas up front: the query and data schema XML files are specified via the query.schemas and data.schemas properties in the pirk.properties file. Before the schemas can be loaded (via LoadQuerySchemas and LoadDataSchemas), those properties have to be parsed from pirk.properties via SystemConfiguration. Thus, given the current setup, the most natural thing to do seemed to be to statically 'initialize' SystemConfiguration, loading the properties and then immediately loading the query and data schemas (since they are read from the properties and do not change).

In your code, you didn't specify (at least that I can see) your schema files in your copy of pirk.properties, so they would not be loaded automatically via the SystemConfiguration static code block. Instead, you (very logically!) explicitly set the data.schemas and query.schemas properties with two calls to SystemConfiguration.setProperty(<schemaProp>, <schemaFile>). This necessitated the extra calls to LoadDataSchemas.initialize() and LoadQuerySchemas.initialize() in order to have the schemas loaded.

As to [2], the trust model for Pirk - that's an entirely separate thread that we can start at some point, if desired. :)

> * The order of loading the schemas is important; you must load the data
> schemas before the query schemas (as there is a back reference that is
> checked at load time), so it becomes
>
> SystemConfiguration.setProperty("data.schemas", "...");
> SystemConfiguration.setProperty("query.schemas", "...");
> LoadDataSchemas.initialize();
> LoadQuerySchemas.initialize();

EAW: Correct - query schemas are dependent on data schemas. Right now, a query schema can only specify a single data schema over which it operates. We could add the option for a Query (via its query schema) to run over multiple data types (i.e., specify multiple data schemas) at one time.

> * Now I create the QueryInfo object. No idea what a number of these
> parameters are doing ;-) but they do seem to relate to the core
> function, plus parts of the Paillier algorithm and Hadoop integration
> too (just wondering if they should be there or kept elsewhere?).
>
> - If the QueryInfo is API then it needs more user-level doc.
>
> I've not tried running the query yet! Just the first baby steps, so
> stop me if I'm heading in the wrong direction.

EAW: You are heading in the right direction. As to running the query - go for it! Happy to help with any issues you run into :)

> [1]
> https://github.com/apache/incubator-pirk/blob/master/src/main/java/org/apache/pirk/utils/SystemConfiguration.java#L73
>
> [2] At some point I'd like to understand the trust model for Pirk, i.e.
> where the boundary is between trusted/untrusted code.
>
> I appreciate your patience!
>
> Regards,
> Tim