[ https://issues.apache.org/jira/browse/AVRO-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125714#comment-13125714 ]

Julien Muller commented on AVRO-923:
------------------------------------

Answers to the previous comment:

- It's slightly riskier to get the schema from the runtime than from the job
> This is correct, but it seems to me this risk is already taken for other 
> parameters such as "avro.mapper". For schemas, though, there is a second 
> safeguard: a mismatch between the input file's schema and the compiled 
> schema is detected when the job reads the data.

- not all schemas have a class that defines them
> If the schema is a primitive type (e.g. long or string), I don't see any 
> value in using the proposed mechanism; it would only apply to complex 
> schemas that are updated regularly. If the schema is a Pair based on simple 
> or complex types, we would still be able to generate the associated Avro 
> bean. I am not sure what the usual usage is.

- Why is it so painful to insert the schema there as a part of your job
> Let's say you have a schema used in 100 different jobs in 20 workflows: 
> changing a field to nullable implies modifying all these workflows and 
> rerunning their tests, with a risk of copy/paste errors. As the schema is 
> not human readable (compared to a class name), it is hard to identify all 
> the places where your schema is used (and in what version). We encountered 
> this about 3 times over a six-month period. If we were using 
> programmatically constructed JobConf, a simple recompilation of the jobs 
> would suffice.
> A side effect is that we have to maintain specifications of our workflows, 
> whereas the flows could otherwise be self-explanatory.

- A JobConf can be programmatically constructed
> This is totally correct. It can also be described with xml files, and the 
> whole point is to improve support for this second case. When using Avro as 
> part of a larger solution, together with Hadoop and Oozie, we can separate 
> responsibilities: developers implement Business Objects (Avro) and 
> MapReduce, and architects design the workflow pipelines in xml files.

- any added mechanism has the potential to create bugs and user confusion
> I try to address user confusion by honoring "avro.input.schema" first and 
> falling back to "avro.input.class" only when it is absent. A way to improve 
> this would be to put the mechanism behind the scenes and add an additional 
> signature, setInputSchema(JobConf job, Class c).
> This would still need to be tightened to something like 
> setInputSchema(JobConf job, Class<? extends SpecificRecord> c), but 
> getSchema() is an instance method and there is no simple way to ensure the 
> SCHEMA$ field would be present. 
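To illustrate the point about SCHEMA$, here is a minimal, self-contained sketch of the reflective lookup. Dummy stands in for an Avro-generated SpecificRecord class and the schema is kept as a JSON string, so the sketch has no Avro dependency; the class and method names are illustrative, not Avro API.

```java
// Sketch of the reflective SCHEMA$ lookup discussed above.
public class SchemaFallbackSketch {
    // Stand-in for a class generated by the Avro compiler, which exposes
    // a public static SCHEMA$ field.
    static class Dummy {
        public static final String SCHEMA$ = "{\"type\":\"string\"}";
    }

    // Fails fast with a clear message when SCHEMA$ is missing, which is
    // exactly the property the compiler cannot guarantee for us.
    static String schemaFor(Class<?> c) {
        try {
            return c.getDeclaredField("SCHEMA$").get(null).toString();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException(
                c.getName() + " does not expose a static SCHEMA$ field", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(schemaFor(Dummy.class));
    }
}
```

The runtime check replaces the compile-time guarantee that the generic signature alone cannot provide.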

Another approach would be to drop the need to set the schema in the xml 
configuration entirely: the AvroMapper knows the input schema, since it is 
compiled with it, and the RecordReader knows the schema of the underlying 
data. If a match is required, it should be between these two, rather than 
against an external schema string. I am not sure whether there is a technical 
limitation to this approach.
                
> Avro-MapRed: Provide a fallback using avro beans instead of schema in job 
> configuration
> ---------------------------------------------------------------------------------------
>
>                 Key: AVRO-923
>                 URL: https://issues.apache.org/jira/browse/AVRO-923
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.5.4
>         Environment: any
>            Reporter: Julien Muller
>             Fix For: 1.6.0
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> The current implementation of Avro MapRed is designed to use JobConf. While 
> it is possible to use a job.xml file, it is pretty painful, since you have 
> to copy/paste all the schemas for input and output. This is error prone and 
> time consuming. Also, any update to a bean requires copying the schema again 
> (if using JobConf, a simple recompile would be enough).
> A proposal to improve this while staying backward compatible would be to 
> introduce new keys in AvroJob that reference the actual Avro bean used. 
> This can be implemented as a fallback.
> New keys would be created:
> - avro.input.schema > avro.input.class
> - avro.map.output.schema > avro.map.output.class
> - avro.output.schema > avro.output.class
> Only 3 methods would be impacted in AvroJob:
> - getInputSchema(Configuration job) {
>       // Fall back to the class's static SCHEMA$ field (a Schema in
>       // generated beans) when no schema string is set in the job.
>       // Reflection exception handling is elided here.
>       String s = job.get(INPUT_SCHEMA);
>       if (s == null)
>           s = Class.forName(job.get(INPUT_CLASS))
>                   .getDeclaredField("SCHEMA$").get(null).toString();
>       return Schema.parse(s);
>   }
> - getMapOutputSchema()
> - getOutputSchema()
> Also, it would be more consistent to add new setters. This is not mandatory, 
> since in this use case the new keys are filled in directly in the job 
> configuration, not through AvroJob. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
