So a concrete example of the workaround is to change:
{code}
public void browseName() throws IOException {
  List<Field> fields = new ArrayList<Field>();
  fields.add(new Field("First", Schema.create(Type.STRING), null, null));
  fields.add(new Field("Last", Schema.create(Type.STRING), null, null));
  Schema extractSchema = Schema.createRecord(fields);
  DataFileReader<Record> reader = new DataFileReader<Record>(
      new File(fileName), new GenericDatumReader<Record>(extractSchema));
  try {
    while (reader.hasNext()) {
      Record person = reader.next();
      System.out.print(person.get("First").toString() + " "
          + person.get("Last").toString() + "\t");
    }
  } finally {
    reader.close();
  }
}
{code}
to:
{code}
public void browseName() throws IOException {
  DataFileReader<Record> reader = new DataFileReader<Record>(
      new File(fileName), new GenericDatumReader<Record>());
  try {
    while (reader.hasNext()) {
      Record person = reader.next();
      System.out.print(person.get("First").toString() + " "
          + person.get("Last").toString() + "\t");
    }
  } finally {
    reader.close();
  }
}
{code}
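As I understand it, this works because, with no expected schema set, the GenericDatumReader reads with the schema stored in the file itself, so no schema resolution happens and the ResolvingDecoder bug below is never triggered; the code still only looks at the "First" and "Last" fields of each record.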
On Apr 15, 2010, at 10:15 AM, Scott Carey wrote:
> It appears there is a bug in the ResolvingDecoder when the actual schema has
> trailing fields not in the expected schema. I have not had time to track it
> down. I filed a JIRA ticket: https://issues.apache.org/jira/browse/AVRO-517.
>
> I have a suggested work-around. You probably don't want to explicitly use a
> reader schema different from the file's schema. The primary use case of
> schema resolution is schema evolution and migration; most of the time a
> single version of an application will want to use a single schema to
> represent the data.
> If you simply want to read two of the three fields, read two of the three
> fields from the full schema -- don't define a schema with only two of them.
> Every client can use the full "Person" schema, but wrapper classes or helper
> methods can read the subset of the fields they want to.
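For instance, a thin wrapper along these lines can expose just the names while every client keeps reading with the full schema (a sketch only; the PersonNames class is an illustrative name of mine, not something from the code below):
{code}
import org.apache.avro.generic.GenericData.Record;

// Sketch: wraps a Record read with the full "Person" schema and
// exposes only the two name fields; callers never see "Age".
public class PersonNames {
  private final Record person; // a record read with the full schema

  public PersonNames(Record person) {
    this.person = person;
  }

  public String first() {
    return person.get("First").toString();
  }

  public String last() {
    return person.get("Last").toString();
  }
}
{code}
browseName() could then print new PersonNames(reader.next()).first() and .last(), with only personSchema in play.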
>
> In the example below, browseAge() and browseName() can use personSchema;
> there is no need to create or manage the other two schemas. I am not sure
> whether that applies to your real-world usage, but it likely does.
>
> -Scott
>
> On Apr 13, 2010, at 12:04 AM, Lurga wrote:
>
>> Hi,
>> Here is the output, followed by the stack trace:
>> 27
>> 20
>> 31
>> Dante Hicks
>> Exception in thread "main" org.apache.avro.AvroRuntimeException: java.io.EOFException
>>     at org.apache.avro.file.DataFileStream.next(DataFileStream.java:184)
>>     at cn.znest.test.avro.AddressBook.browseName(AddressBook.java:91)
>>     at cn.znest.test.avro.AddressBook.main(AddressBook.java:43)
>> Caused by: java.io.EOFException
>>     at org.apache.avro.io.BinaryDecoder.readInt(BinaryDecoder.java:163)
>>     at org.apache.avro.io.BinaryDecoder.readString(BinaryDecoder.java:262)
>>     at org.apache.avro.io.ValidatingDecoder.readString(ValidatingDecoder.java:93)
>>     at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:277)
>>     at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:271)
>>     at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:83)
>>     at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:105)
>>     at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:77)
>>     at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:70)
>>     at org.apache.avro.file.DataFileStream.next(DataFileStream.java:195)
>>     at org.apache.avro.file.DataFileStream.next(DataFileStream.java:182)
>>     ... 2 more
>>
>> My code is below. In this example, I create three record schemas: person
>> (three fields: First, Last, Age), age (Age only), and extract (First, Last).
>> The age schema contains the last field of "Person", so
>> AddressBook.browseAge() executes successfully. But the extract schema does
>> not contain the last field of "Person", so executing
>> AddressBook.browseName() causes an exception.
>> In avro/c, read_record (datum_read.c) loops over every writer-schema field.
>> In avro/java, GenericDatumReader.readRecord loops over every reader-schema field.
>> I think that's the point.
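If I follow the explanation above, the difference between the two loops is roughly this (a sketch only, not the actual Avro source; the FieldIO interface is a hypothetical stand-in for the decoder's per-field calls):
{code}
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.Schema.Field;
import org.apache.avro.generic.GenericData.Record;

public class ReadLoopSketch {
  // Hypothetical stand-in for the decoder's per-field read/skip calls.
  interface FieldIO {
    Object read(Field f) throws IOException;
    void skip(Field f) throws IOException;
  }

  // Writer-driven (what avro/c does): every field present in the stream
  // is consumed, so a trailing field the reader omits is skipped cleanly.
  static void writerDriven(Record into, Schema writer, Schema reader,
                           FieldIO io) throws IOException {
    for (Field f : writer.getFields()) {
      if (reader.getField(f.name()) != null) {
        into.put(f.name(), io.read(f));
      } else {
        io.skip(f); // consume the bytes, drop the value
      }
    }
  }

  // Reader-driven (the failing Java path): the trailing "Age" bytes are
  // never consumed, so the next read starts mid-record and hits EOF.
  static void readerDriven(Record into, Schema reader, FieldIO io)
      throws IOException {
    for (Field f : reader.getFields()) {
      into.put(f.name(), io.read(f));
    }
  }
}
{code}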
>>
>> import java.io.File;
>> import java.io.IOException;
>>
>> import org.apache.avro.Schema;
>> import org.apache.avro.file.DataFileReader;
>> import org.apache.avro.file.DataFileWriter;
>> import org.apache.avro.generic.GenericData;
>> import org.apache.avro.generic.GenericDatumReader;
>> import org.apache.avro.generic.GenericDatumWriter;
>> import org.apache.avro.generic.GenericData.Record;
>> import org.apache.avro.util.Utf8;
>>
>> public class AddressBook {
>>   String fileName = "AddressBook.db";
>>   String prefix = "{\"type\":\"record\",\"name\":\"Person\",\"fields\":[";
>>   String suffix = "]}";
>>   String fieldFirst = "{\"name\":\"First\",\"type\":\"string\"}";
>>   String fieldLast = "{\"name\":\"Last\",\"type\":\"string\"}";
>>   String fieldAge = "{\"name\":\"Age\",\"type\":\"int\"}";
>>   Schema personSchema = Schema.parse(prefix + fieldFirst + ","
>>       + fieldLast + "," + fieldAge + suffix);
>>   Schema ageSchema = Schema.parse(prefix + fieldAge + suffix);
>>   Schema extractSchema = Schema.parse(prefix + fieldFirst + ","
>>       + fieldLast + suffix);
>>
>>   /**
>>    * @param args
>>    * @throws IOException
>>    */
>>   public static void main(String[] args) throws IOException {
>>     AddressBook ab = new AddressBook();
>>     ab.init();
>>     ab.browseAge();
>>     ab.browseName();
>>   }
>>
>>   public void init() throws IOException {
>>     DataFileWriter<Record> writer = new DataFileWriter<Record>(
>>         new GenericDatumWriter<Record>(personSchema)).create(
>>         personSchema, new File(fileName));
>>     try {
>>       writer.append(createPerson("Dante", "Hicks", 27));
>>       writer.append(createPerson("Randal", "Graves", 20));
>>       writer.append(createPerson("Steve", "Jobs", 31));
>>     } finally {
>>       writer.close();
>>     }
>>   }
>>
>>   private Record createPerson(String first, String last, int age) {
>>     Record person = new GenericData.Record(personSchema);
>>     person.put("First", new Utf8(first));
>>     person.put("Last", new Utf8(last));
>>     person.put("Age", age);
>>     return person;
>>   }
>>
>>   public void browseAge() throws IOException {
>>     GenericDatumReader<Record> dr = new GenericDatumReader<Record>();
>>     dr.setExpected(ageSchema);
>>     DataFileReader<Record> reader = new DataFileReader<Record>(
>>         new File(fileName), dr);
>>     try {
>>       while (reader.hasNext()) {
>>         Record person = reader.next();
>>         System.out.println(person.get("Age").toString());
>>       }
>>     } finally {
>>       reader.close();
>>     }
>>   }
>>
>>   public void browseName() throws IOException {
>>     GenericDatumReader<Record> dr = new GenericDatumReader<Record>();
>>     dr.setExpected(extractSchema);
>>     DataFileReader<Record> reader = new DataFileReader<Record>(
>>         new File(fileName), dr);
>>     try {
>>       while (reader.hasNext()) {
>>         Record person = reader.next();
>>         System.out.println(person.get("First").toString() + " "
>>             + person.get("Last").toString() + "\t");
>>       }
>>     } finally {
>>       reader.close();
>>     }
>>   }
>> }
>>
>>
>> 2010-04-13
>> Lurga
>>
>> From: Scott Carey
>> Sent: 2010-04-13 10:55:41
>> To: [email protected]
>> Cc:
>> Subject: Re: Question on writing/reading file with different schema
>>
>> On Apr 12, 2010, at 7:42 PM, Scott Carey wrote:
>>> Try something like:
>>>
>>> GenericDatumReader<Record> dr = new GenericDatumReader<Record>();
>>> dr.setExpected(extractSchema);
>>> DataFileReader<Record> reader = new DataFileReader<Record>(
>>>     new File(fileName), dr);
>>>
>> If the above still fails, please provide the full stack trace that results.
>>>
>>>> Regards,
>>>>
>>>> 2010-04-13
>>>> Lurga
>>>>
>>>
>