[jira] [Updated] (AVRO-1268) Add java-class, java-key-class and java-element-class support for stringable types to SpecificData

2013-03-06 Thread Alexandre Normand (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-1268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexandre Normand updated AVRO-1268:


Attachment: AVRO-1268.patch

Thanks for the comments, Doug. I wasn't thinking about API compatibility, but I 
revisited the changes after reading your comments: 
  * I settled on leaving addStringable in ReflectData while keeping the 
default list of stringables in SpecificData. I think the use case for 
SpecificData is either one of the JDK classes that are stringable by 
default, or a schema written so that the {{@Stringable}} annotation is 
generated. Using a SpecificData#addStringable is probably not a very common 
use case.

  * I brought the PROP constants back into ReflectData but marked them as 
deprecated, with javadoc that redirects to SpecificData's constants. 

 * As for the {{StringablesRecord}} not needing to be checked in, I'm missing 
something. The avdl files for that record are in {{java/compiler}} and 
{{java/maven-plugin}} while the test for the {{SpecificDatumReader}} is in 
{{java/avro}}. The generated class for that record won't show up in {{avro}} 
since {{compiler}} and {{maven-plugin}} both depend on {{avro}} to have built 
first. Also, the other specific record used in {{SpecificDatumReader}} is 
{{FooBarSpecificRecord}} and that one is also in the source. What am I missing?

  * I added a Specific test to {{Perf}} with the {{FooBarSpecificRecord}} (a 
good reference since that's the one that was there prior to any of my changes). 
I ran the modified {{Perf}} with and without my changes and it's not looking 
great:

Before
{code}
Executing tests: 
[FooBarSpecificRecordTest]
 readTests:true
 writeTests:true
 cycles=800
                    test name     time    M entries/sec   M bytes/sec  bytes/cycle
 FooBarSpecificRecordTestRead:  16086 ms       1.036         60.413      1214805
FooBarSpecificRecordTestWrite:   8794 ms       1.895        110.505      1214805
{code}

After:
{code}
Executing tests: 
[FooBarSpecificRecordTest]
 readTests:true
 writeTests:true
 cycles=800
                    test name     time    M entries/sec   M bytes/sec  bytes/cycle
 FooBarSpecificRecordTestRead:  23937 ms       0.696         40.600      1214805
FooBarSpecificRecordTestWrite:  11369 ms       1.466         85.481      1214805
{code}
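
For scale, the relative change can be worked out from the elapsed times in the 
two runs above. This is only a small sketch of the arithmetic; the helper name 
slowdown_percent is illustrative and is not part of Perf.

```python
# Elapsed times copied from the two Perf runs above (milliseconds, 800 cycles).
before_ms = {"read": 16086, "write": 8794}
after_ms = {"read": 23937, "write": 11369}


def slowdown_percent(before, after):
    """Percentage increase in elapsed time going from `before` to `after`."""
    return (after - before) / float(before) * 100.0


for name in ("read", "write"):
    pct = slowdown_percent(before_ms[name], after_ms[name])
    print("%s: %.1f%% slower" % (name, pct))
```

With these numbers, reads come out roughly 48.8% slower and writes roughly 
29.3% slower.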

I'm attaching the latest patch with the changes above. The performance hit 
might be smaller if I removed support for stringable array/map elements and 
stringable keys and kept only the {{@java-class}} support. I'd rather not do 
that, though, as it seems like a weird place to leave things. 

I would like us to find a way to mitigate the performance drop, as I 
think this is still a desirable feature.

Thoughts?

> Add java-class, java-key-class and java-element-class support for stringable 
> types to SpecificData
> --
>
> Key: AVRO-1268
> URL: https://issues.apache.org/jira/browse/AVRO-1268
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.7.4
>Reporter: Alexandre Normand
>Assignee: Alexandre Normand
>Priority: Minor
> Fix For: 1.7.5
>
> Attachments: AVRO-1268.patch, AVRO-1268.patch
>
>
> Stringable types are java classes that can be serialized through strings 
> (which require a single string constructor and a valid toString() 
> implementation). ReflectData currently has support for stringable types but 
> it would be desirable to get this feature with SpecificData. 
> The work involves changes to the SpecificCompiler (depends on {{@java-class}} 
> support in AVRO-1267) to generate the specific sources with the proper java 
> type as well as moving the ReflectDatumReader and ReflectDatumWriter to read 
> the java-class/java-key-class and java-element-class properties. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (AVRO-1268) Add java-class, java-key-class and java-element-class support for stringable types to SpecificData

2013-03-06 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13595146#comment-13595146
 ] 

Doug Cutting commented on AVRO-1268:


Some quick thoughts on the patch:
 - ReflectData#addStringable's result type changes incompatibly.  This might be 
fixable with generics.  We technically can include incompatible API changes in 
a 1.8 release, but I'd prefer not to, as they make it hard for folks to update 
to the latest version of Avro.  We might fix this by just using a different 
method name in SpecificData and deprecating but keeping the method in 
ReflectData.
 - Similarly, removing ReflectData.CLASS_PROP, etc. is an incompatible API 
change.  We could just leave these constants where they are and use them from 
SpecificData, or we could deprecate them and add them to Specific, but we 
should not simply remove them if we want to maintain compatibility.
 - StringablesRecord.java shouldn't need to be checked in.  This should be 
generated by the build in the target/ directory.
 - It might be good to check that performance isn't altered.  We might modify 
Perf.java to be able to use SpecificDatumReader/Writer and see that this patch 
doesn't affect things adversely.

> Add java-class, java-key-class and java-element-class support for stringable 
> types to SpecificData
> --
>
> Key: AVRO-1268
> URL: https://issues.apache.org/jira/browse/AVRO-1268
> Project: Avro
>  Issue Type: Improvement
>  Components: java
>Affects Versions: 1.7.4
>Reporter: Alexandre Normand
>Assignee: Alexandre Normand
>Priority: Minor
> Fix For: 1.7.5
>
> Attachments: AVRO-1268.patch
>
>
> Stringable types are java classes that can be serialized through strings 
> (which require a single string constructor and a valid toString() 
> implementation). ReflectData currently has support for stringable types but 
> it would be desirable to get this feature with SpecificData. 
> The work involves changes to the SpecificCompiler (depends on {{@java-class}} 
> support in AVRO-1267) to generate the specific sources with the proper java 
> type as well as moving the ReflectDatumReader and ReflectDatumWriter to read 
> the java-class/java-key-class and java-element-class properties. 



[jira] [Resolved] (AVRO-1271) Hadoop Streaming mangles Python-produced Avro

2013-03-06 Thread Alex Hanna (JIRA)

 [ 
https://issues.apache.org/jira/browse/AVRO-1271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Hanna resolved AVRO-1271.
--

Resolution: Later

Moving this to avro-users.

> Hadoop Streaming mangles Python-produced Avro
> -
>
> Key: AVRO-1271
> URL: https://issues.apache.org/jira/browse/AVRO-1271
> Project: Avro
>  Issue Type: Bug
>  Components: java, python
>Affects Versions: 1.7.4
> Environment: Linux
>Reporter: Alex Hanna
>
> I've got a rather simple script that takes Twitter data in JSON and turns it 
> into an Avro file.
> from avro import schema, datafile, io
> import json, sys
> from types import *
> def main():
>     if len(sys.argv) < 2:
>         print "Usage: cat input.json | python2.7 JSONtoAvro.py output"
>         return
>     s = schema.parse(open("tweet.avsc").read())
>     f = open(sys.argv[1], 'wb')
>     writer = datafile.DataFileWriter(f, io.DatumWriter(), s, codec = 'deflate')
>     failed = 0
>     for line in sys.stdin:
>         line = line.strip()
>         try:
>             data = json.loads(line)
>         except ValueError as detail:
>             continue
>         try:
>             writer.append(data)
>         except io.AvroTypeException as detail:
>             print line
>             failed += 1
>     writer.close()
>     print str(failed) + " failed in schema"
> if __name__ == '__main__':
>     main()
> From there, I use this to feed a basic Hadoop Streaming script (also in 
> Python) which just pulls out certain elements of the tweets. However, when I 
> do this, it appears that the input for the script is mangled JSON. Usually 
> the JSON fails with some errant \u in the middle of the tweet body or 
> user-defined description.
> The Streaming script is rather basic -- it reads from sys.stdin and attempts 
> to parse the JSON string using the json package.
> Here is the bash script I use to invoke Hadoop Streaming:
> jars=/usr/lib/hadoop/lib/avro-1.7.1.cloudera.2.jar,/usr/lib/hive/lib/avro-mapred-1.7.1.cloudera.2.jar
> hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
>  -files $jars,$HOME/sandbox/hadoop/streaming/map/tweetMapper.py,$HOME/sandbox/hadoop/streaming/data/keywords.txt,$HOME/sandbox/hadoop/streaming/data/follow-r3.txt \
>  -libjars $jars \
>  -input /user/ahanna/avrotest/avrotest.json.avro \
>  -output output \
>  -mapper "tweetMapper.py -a" \
>  -reducer org.apache.hadoop.mapred.lib.IdentityReducer \
>  -inputformat org.apache.avro.mapred.AvroAsTextInputFormat \
>  -numReduceTasks 1
> I'm starting to think this is a bug with 
> org.apache.avro.mapred.AvroAsTextInputFormat?



[jira] [Created] (AVRO-1271) Hadoop Streaming mangles Python-produced Avro

2013-03-06 Thread Alex Hanna (JIRA)
Alex Hanna created AVRO-1271:


 Summary: Hadoop Streaming mangles Python-produced Avro
 Key: AVRO-1271
 URL: https://issues.apache.org/jira/browse/AVRO-1271
 Project: Avro
  Issue Type: Bug
  Components: java, python
Affects Versions: 1.7.4
 Environment: Linux
Reporter: Alex Hanna


I've got a rather simple script that takes Twitter data in JSON and turns it 
into an Avro file.

from avro import schema, datafile, io
import json, sys
from types import *

def main():
    if len(sys.argv) < 2:
        print "Usage: cat input.json | python2.7 JSONtoAvro.py output"
        return

    s = schema.parse(open("tweet.avsc").read())
    f = open(sys.argv[1], 'wb')

    writer = datafile.DataFileWriter(f, io.DatumWriter(), s, codec = 'deflate')

    failed = 0

    for line in sys.stdin:
        line = line.strip()

        try:
            data = json.loads(line)
        except ValueError as detail:
            continue

        try:
            writer.append(data)
        except io.AvroTypeException as detail:
            print line
            failed += 1

    writer.close()

    print str(failed) + " failed in schema"

if __name__ == '__main__':
    main()

From there, I use this to feed a basic Hadoop Streaming script (also in 
Python) which just pulls out certain elements of the tweets. However, when I 
do this, it appears that the input for the script is mangled JSON. Usually the 
JSON fails with some errant \u in the middle of the tweet body or user-defined 
description.

The Streaming script is rather basic -- it reads from sys.stdin and attempts to 
parse the JSON string using the json package.
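
The mapper itself is not included in this report. As a rough illustration only, 
a script of that shape might look like the sketch below. The names 
parse_records and main are hypothetical, and it assumes that each input line 
carries one JSON record, possibly followed by a trailing tab. Failing lines are 
written to stderr with repr() so that stray escape sequences are easy to spot.

```python
import json
import sys


def parse_records(lines):
    """Split input lines into parsed JSON records and lines that failed to parse."""
    parsed, failed = [], []
    for line in lines:
        # Assumption: one JSON record per line; strip() also drops a trailing tab.
        text = line.strip()
        if not text:
            continue
        try:
            parsed.append(json.loads(text))
        except ValueError as detail:
            failed.append((text, str(detail)))
    return parsed, failed


def main(stream=sys.stdin):
    parsed, failed = parse_records(stream)
    for record in parsed:
        # Placeholder output: re-emit the record; a real mapper would pick fields.
        sys.stdout.write(json.dumps(record) + "\n")
    for text, reason in failed:
        # repr() makes unusual or truncated escape sequences visible in the logs.
        sys.stderr.write("bad JSON (%s): %r\n" % (reason, text))
```

To use it as a mapper, main() would be called under the usual 
if __name__ == '__main__' guard.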

Here is the bash script I use to invoke Hadoop Streaming:

jars=/usr/lib/hadoop/lib/avro-1.7.1.cloudera.2.jar,/usr/lib/hive/lib/avro-mapred-1.7.1.cloudera.2.jar

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
 -files $jars,$HOME/sandbox/hadoop/streaming/map/tweetMapper.py,$HOME/sandbox/hadoop/streaming/data/keywords.txt,$HOME/sandbox/hadoop/streaming/data/follow-r3.txt \
 -libjars $jars \
 -input /user/ahanna/avrotest/avrotest.json.avro \
 -output output \
 -mapper "tweetMapper.py -a" \
 -reducer org.apache.hadoop.mapred.lib.IdentityReducer \
 -inputformat org.apache.avro.mapred.AvroAsTextInputFormat \
 -numReduceTasks 1

I'm starting to think this is a bug with 
org.apache.avro.mapred.AvroAsTextInputFormat?



[jira] [Commented] (AVRO-1266) Fix mapred AvroMultipleOutputs class to write the schema to Jobconf rather than private Hashmap

2013-03-06 Thread Pierre Mariani (JIRA)

[ 
https://issues.apache.org/jira/browse/AVRO-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13594782#comment-13594782
 ] 

Pierre Mariani commented on AVRO-1266:
--

schemaA is not null.

I confirmed that I am not getting the NullPointerException with the new patch, 
but bugs and difficulties in configuring the job in my own code are preventing 
me from confirming that it works. I am still trying and will update if I get 
anywhere.

> Fix mapred AvroMultipleOutputs class to write the schema to Jobconf rather 
> than private Hashmap
> ---
>
> Key: AVRO-1266
> URL: https://issues.apache.org/jira/browse/AVRO-1266
> Project: Avro
>  Issue Type: Bug
>  Components: java
>Reporter: Ashish Nagavaram
>Assignee: Ashish Nagavaram
> Fix For: 1.7.5
>
> Attachments: AVRO-1266.patch, AVRO-1266.patch, AVRO-1266-v1.patch
>
>
> The current version of mapred AvroMultipleOutputs stores schemas in a private 
> hashmap, which causes problems when run in a MapReduce job. 
