[jira] [Updated] (AVRO-1268) Add java-class, java-key-class and java-element-class support for stringable types to SpecificData
[ https://issues.apache.org/jira/browse/AVRO-1268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexandre Normand updated AVRO-1268:

    Attachment: AVRO-1268.patch

Thanks for the comments, Doug. I wasn't thinking about API compatibility, but I revisited the changes after reading your comments:

* I settled on leaving {{addStringable}} in ReflectData while keeping the default list of stringables in SpecificData. I think the use case for SpecificData would be one of the JDK classes that are stringable by default, or one could potentially write the schema to generate the {{@Stringable}} annotation. Using a SpecificData#addStringable is probably not a very common use case.
* I brought the PROP constants back into ReflectData but marked them as deprecated, with javadoc redirecting to SpecificData's constants.
* As for {{StringablesRecord}} not needing to be checked in, I'm missing something. The avdl files for that record are in {{java/compiler}} and {{java/maven-plugin}}, while the test for the {{SpecificDatumReader}} is in {{java/avro}}. The generated class for that record won't show up in {{avro}}, since {{compiler}} and {{maven-plugin}} both depend on {{avro}} having been built first. Also, the other specific record used in {{SpecificDatumReader}} is {{FooBarSpecificRecord}}, and that one is also in the source. What am I missing?
* I added a Specific test to {{Perf}} with the {{FooBarSpecificRecord}} (a good reference since that's the one that was there prior to any of my changes).
I ran the modified {{Perf}} with and without my changes and it's not looking great:

Before:
{code}
Executing tests:
[FooBarSpecificRecordTest]
 readTests:true
 writeTests:true
 cycles=800
                     test name       time   M entries/sec   M bytes/sec   bytes/cycle
  FooBarSpecificRecordTestRead:   16086 ms          1.036        60.413       1214805
 FooBarSpecificRecordTestWrite:    8794 ms          1.895       110.505       1214805
{code}

After:
{code}
Executing tests:
[FooBarSpecificRecordTest]
 readTests:true
 writeTests:true
 cycles=800
                     test name       time   M entries/sec   M bytes/sec   bytes/cycle
  FooBarSpecificRecordTestRead:   23937 ms          0.696        40.600       1214805
 FooBarSpecificRecordTestWrite:   11369 ms          1.466        85.481       1214805
{code}

I'm adding the latest patch with the changes above. It's possible that the performance hit could be smaller if I were to remove support for stringable array/map elements and stringable keys and just keep the {{@java-class}} support. I'd rather not do that, though, as this seems like a weird place to leave things. I would like it if we could find a solution to mitigate the performance drop, as I think this is still a desirable feature.

Thoughts?

> Add java-class, java-key-class and java-element-class support for stringable types to SpecificData
> ---------------------------------------------------------------------------------------------------
>
>                 Key: AVRO-1268
>                 URL: https://issues.apache.org/jira/browse/AVRO-1268
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.7.4
>            Reporter: Alexandre Normand
>            Assignee: Alexandre Normand
>            Priority: Minor
>             Fix For: 1.7.5
>
>         Attachments: AVRO-1268.patch, AVRO-1268.patch
>
>
> Stringable types are java classes that can be serialized through strings
> (which require a single string constructor and a valid toString()
> implementation). ReflectData currently has support for stringable types but
> it would be desirable to get this feature with SpecificData.
> The work involves changes to the SpecificCompiler (depends on {{@java-class}}
> support in AVRO-1267) to generate the specific sources with the proper java
> type as well as moving the ReflectDatumReader and ReflectDatumWriter to read
> the java-class/java-key-class and java-element-class properties.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
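The stringable contract quoted above (a constructor taking a single string, plus a toString() whose output that constructor accepts) is a Java convention. As a rough, language-neutral illustration only, the Python sketch below checks the analogous round trip for a few standard-library types. The helper name `roundtrips_through_string` is hypothetical and none of this is Avro API.

```python
from decimal import Decimal
from fractions import Fraction
from pathlib import PurePosixPath


def roundtrips_through_string(value):
    """Return True if type(value)(str(value)) rebuilds an equal value.

    This mirrors the Java idea of "new T(t.toString()).equals(t)".
    """
    cls = type(value)
    try:
        return cls(str(value)) == value
    except (TypeError, ValueError, ArithmeticError):
        # The type has no usable single-string constructor.
        return False


# Types whose string form is accepted back by their own constructor.
samples = [Decimal("12.50"), Fraction(3, 4), PurePosixPath("/tmp/data.avro")]
results = {type(v).__name__: roundtrips_through_string(v) for v in samples}
print(results)

# A list is not "stringable" in this sense: list("[1, 2]") splits characters.
print(roundtrips_through_string([1, 2]))
```

A type that fails this check would need a dedicated schema mapping rather than a plain string encoding.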
[jira] [Commented] (AVRO-1268) Add java-class, java-key-class and java-element-class support for stringable types to SpecificData
[ https://issues.apache.org/jira/browse/AVRO-1268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13595146#comment-13595146 ]

Doug Cutting commented on AVRO-1268:

Some quick thoughts on the patch:

- ReflectData#addStringable's result type changes incompatibly. This might be fixable with generics. We technically can include incompatible API changes in a 1.8 release, but I'd prefer not to, as they make it hard for folks to update to the latest version of Avro. We might fix this by just using a different method name in SpecificData and deprecating but keeping the method in ReflectData.
- Similarly, removing ReflectData.CLASS_PROP, etc. is an incompatible API change. We could just leave these constants where they are and use them from SpecificData, or we could deprecate them and add them to SpecificData, but we should not simply remove them if we want to maintain compatibility.
- StringablesRecord.java shouldn't need to be checked in. This should be generated by the build in the target/ directory.
- It might be good to check that performance isn't altered. We might modify Perf.java to be able to use SpecificDatumReader/Writer and see that this patch doesn't affect things adversely.
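One way to make the performance check concrete is to compare the timings of two Perf runs. The sketch below uses the read and write timings, the cycle count, and the bytes/cycle figure reported in this thread for FooBarSpecificRecordTest. The function names are hypothetical, and "M bytes/sec" is assumed to mean 10^6 bytes per second.

```python
CYCLES = 800
BYTES_PER_CYCLE = 1214805


def throughput_mb_per_s(elapsed_ms, cycles, bytes_per_cycle):
    """Total bytes processed divided by elapsed seconds, in 10^6 bytes/s."""
    total_bytes = cycles * bytes_per_cycle
    return total_bytes / (elapsed_ms / 1000.0) / 1e6


def slowdown_percent(before_ms, after_ms):
    """How much longer the 'after' run took, as a percentage of 'before'."""
    return (after_ms / before_ms - 1.0) * 100.0


# Elapsed times in milliseconds, as reported for the two runs.
runs = {
    "read": (16086, 23937),
    "write": (8794, 11369),
}

for name, (before_ms, after_ms) in runs.items():
    before = throughput_mb_per_s(before_ms, CYCLES, BYTES_PER_CYCLE)
    after = throughput_mb_per_s(after_ms, CYCLES, BYTES_PER_CYCLE)
    print("%-5s %7.2f -> %7.2f M bytes/sec (%.1f%% slower)"
          % (name, before, after, slowdown_percent(before_ms, after_ms)))
```

With these inputs the read test takes about 49% longer and the write test about 29% longer after the patch, and the computed throughputs agree with the reported M bytes/sec column to within rounding.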
[jira] [Resolved] (AVRO-1271) Hadoop Streaming mangles Python-produced Avro
[ https://issues.apache.org/jira/browse/AVRO-1271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Hanna resolved AVRO-1271.
------------------------------

    Resolution: Later

Moving this to avro-users.

> Hadoop Streaming mangles Python-produced Avro
> ---------------------------------------------
>
>                 Key: AVRO-1271
>                 URL: https://issues.apache.org/jira/browse/AVRO-1271
>             Project: Avro
>          Issue Type: Bug
>          Components: java, python
>    Affects Versions: 1.7.4
>         Environment: Linux
>            Reporter: Alex Hanna
>
> I've got a rather simple script that takes Twitter data in JSON and turns it
> into an Avro file.
>
> from avro import schema, datafile, io
> import json, sys
> from types import *
>
> def main():
>     if len(sys.argv) < 2:
>         print "Usage: cat input.json | python2.7 JSONtoAvro.py output"
>         return
>
>     s = schema.parse(open("tweet.avsc").read())
>     f = open(sys.argv[1], 'wb')
>
>     writer = datafile.DataFileWriter(f, io.DatumWriter(), s, codec = 'deflate')
>     failed = 0
>
>     for line in sys.stdin:
>         line = line.strip()
>
>         try:
>             data = json.loads(line)
>         except ValueError as detail:
>             continue
>
>         try:
>             writer.append(data)
>         except io.AvroTypeException as detail:
>             print line
>             failed += 1
>
>     writer.close()
>
>     print str(failed) + " failed in schema"
>
> if __name__ == '__main__':
>     main()
>
> From there, I use this to feed a basic Hadoop Streaming script (also in
> Python) which just pulls out certain elements of the tweets. However, when I
> do this, it appears that the input for the script is mangled JSON. Usually
> the JSON fails with some errant \u in the middle of the tweet body or
> user-defined description.
>
> The Streaming script is rather basic -- it reads from sys.stdin and attempts
> to parse the JSON string using the json package.
>
> Here is the bash script I use to invoke Hadoop Streaming:
>
> jars=/usr/lib/hadoop/lib/avro-1.7.1.cloudera.2.jar,/usr/lib/hive/lib/avro-mapred-1.7.1.cloudera.2.jar
> hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
>     -files $jars,$HOME/sandbox/hadoop/streaming/map/tweetMapper.py,$HOME/sandbox/hadoop/streaming/data/keywords.txt,$HOME/sandbox/hadoop/streaming/data/follow-r3.txt \
>     -libjars $jars \
>     -input /user/ahanna/avrotest/avrotest.json.avro \
>     -output output \
>     -mapper "tweetMapper.py -a" \
>     -reducer org.apache.hadoop.mapred.lib.IdentityReducer \
>     -inputformat org.apache.avro.mapred.AvroAsTextInputFormat \
>     -numReduceTasks 1
>
> I'm starting to think this is a bug with
> org.apache.avro.mapred.AvroAsTextInputFormat?
[jira] [Created] (AVRO-1271) Hadoop Streaming mangles Python-produced Avro
Alex Hanna created AVRO-1271:
--------------------------------

             Summary: Hadoop Streaming mangles Python-produced Avro
                 Key: AVRO-1271
                 URL: https://issues.apache.org/jira/browse/AVRO-1271
             Project: Avro
          Issue Type: Bug
          Components: java, python
    Affects Versions: 1.7.4
         Environment: Linux
            Reporter: Alex Hanna

I've got a rather simple script that takes Twitter data in JSON and turns it into an Avro file.

from avro import schema, datafile, io
import json, sys
from types import *

def main():
    if len(sys.argv) < 2:
        print "Usage: cat input.json | python2.7 JSONtoAvro.py output"
        return

    s = schema.parse(open("tweet.avsc").read())
    f = open(sys.argv[1], 'wb')

    writer = datafile.DataFileWriter(f, io.DatumWriter(), s, codec = 'deflate')
    failed = 0

    for line in sys.stdin:
        line = line.strip()

        try:
            data = json.loads(line)
        except ValueError as detail:
            continue

        try:
            writer.append(data)
        except io.AvroTypeException as detail:
            print line
            failed += 1

    writer.close()

    print str(failed) + " failed in schema"

if __name__ == '__main__':
    main()

From there, I use this to feed a basic Hadoop Streaming script (also in Python) which just pulls out certain elements of the tweets. However, when I do this, it appears that the input for the script is mangled JSON. Usually the JSON fails with some errant \u in the middle of the tweet body or user-defined description.

The Streaming script is rather basic -- it reads from sys.stdin and attempts to parse the JSON string using the json package.
Here is the bash script I use to invoke Hadoop Streaming:

jars=/usr/lib/hadoop/lib/avro-1.7.1.cloudera.2.jar,/usr/lib/hive/lib/avro-mapred-1.7.1.cloudera.2.jar
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -files $jars,$HOME/sandbox/hadoop/streaming/map/tweetMapper.py,$HOME/sandbox/hadoop/streaming/data/keywords.txt,$HOME/sandbox/hadoop/streaming/data/follow-r3.txt \
    -libjars $jars \
    -input /user/ahanna/avrotest/avrotest.json.avro \
    -output output \
    -mapper "tweetMapper.py -a" \
    -reducer org.apache.hadoop.mapred.lib.IdentityReducer \
    -inputformat org.apache.avro.mapred.AvroAsTextInputFormat \
    -numReduceTasks 1

I'm starting to think this is a bug with org.apache.avro.mapred.AvroAsTextInputFormat?
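To narrow down where the JSON gets mangled, it can help to have the mapper report exactly which input lines fail to parse and where. The following is a minimal diagnostic sketch using only the standard library, not the reporter's tweetMapper.py. It assumes each line on stdin carries one JSON document as the key, possibly followed by a tab and an empty value field; the function names are hypothetical.

```python
import json


def key_field(line):
    """Return the text before the first tab, without the trailing newline.

    Assumption: the record's JSON text arrives as the streaming key.
    """
    return line.rstrip("\n").split("\t", 1)[0]


def diagnose(lines):
    """Yield (line_number, error, excerpt) for each line that is not valid JSON."""
    for number, line in enumerate(lines, start=1):
        text = key_field(line)
        if not text:
            continue  # skip blank lines
        try:
            json.loads(text)
        except ValueError as err:
            # JSONDecodeError carries the failing offset in .pos
            pos = getattr(err, "pos", 0)
            yield number, str(err), text[max(0, pos - 20):pos + 20]


# Small self-check: the second line has a truncated \u escape.
sample = ['{"text": "ok"}\t\n', '{"text": "bad \\u12"}\t\n', '\n']
problems = list(diagnose(sample))
for number, error, excerpt in problems:
    print("line %d: %s near %r" % (number, error, excerpt))
```

In an actual streaming job, `diagnose(sys.stdin)` could be called from the mapper with the report written to `sys.stderr`, so that the offending excerpts end up in the task logs instead of the job output.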
[jira] [Commented] (AVRO-1266) Fix mapred AvroMultipleOutputs class to write the schema to Jobconf rather than private Hashmap
[ https://issues.apache.org/jira/browse/AVRO-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13594782#comment-13594782 ]

Pierre Mariani commented on AVRO-1266:
--------------------------------------

schemaA is not null. I confirmed that I am not getting the NullPointerException with the new patch, but bugs and difficulties in configuring the job in my own code are preventing me from confirming that it works. I am still trying and will update if I get anywhere.

> Fix mapred AvroMultipleOutputs class to write the schema to Jobconf rather than private Hashmap
> ------------------------------------------------------------------------------------------------
>
>                 Key: AVRO-1266
>                 URL: https://issues.apache.org/jira/browse/AVRO-1266
>             Project: Avro
>          Issue Type: Bug
>          Components: java
>            Reporter: Ashish Nagavaram
>            Assignee: Ashish Nagavaram
>             Fix For: 1.7.5
>
>         Attachments: AVRO-1266.patch, AVRO-1266.patch, AVRO-1266-v1.patch
>
>
> The current version of mapred AvroMultipleOutputs stores schemas in a private
> hashmap, which has issues when run in mapreduce code.