Summary of the issue:
Using ParquetWriter vs. Hive INSERT OVERWRITE to convert Avro to Parquet

In some versions of Hive, the columns do not line up.
Presto also does not seem to like the output of ParquetWriter.


I am using the 1.8.1 Maven package for the Java below.


Is this a bug?
I am using the following to convert Avro -> Parquet:
import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

DatumReader<GenericRecord> datumReader = new GenericDatumReader<>();
DataFileReader<GenericRecord> dataFileReader = new DataFileReader<>(new File(args[0]), datumReader);
Schema schema = dataFileReader.getSchema();

// Override the embedded schema with one read from a .avsc file.
byte[] schemaBytes = Files.readAllBytes(Paths.get("/var/tmp/1.avsc"));
String schemaString = new String(schemaBytes, StandardCharsets.UTF_8);
schema = new Schema.Parser().parse(schemaString);
System.out.println(schema.toString(true));

ParquetWriter<GenericRecord> writer = new AvroParquetWriter<GenericRecord>(
    new org.apache.hadoop.fs.Path(args[1]), schema, compressionCodecName, blockSize, pageSize);

GenericRecord record = null;
while (dataFileReader.hasNext()) {
  record = dataFileReader.next(record);
  writer.write(record);
}

// Closing the writer is required; the Parquet footer is written on close().
writer.close();
dataFileReader.close();

I am getting an error when querying the converted data using Hive.
When I convert the Avro to Parquet using INSERT OVERWRITE, it works.
The difference between the two files is shown below.
Note that the object in Java/Avro is an array of structs.


Using the ParquetWriter:

optional group rtb_bidders (LIST) {
  repeated group array {            <------------- This does not appear to work
    optional binary bidder_id (UTF8);
    optional binary result (UTF8);
    optional double bid_cpm;
    optional int64 bid_time;
    optional binary creative_url (UTF8);
    optional binary third_party_cookie_id (UTF8);
    optional binary deal_id (UTF8);
    optional int32 error_code;
    optional int32 campaign_id;
    optional binary rtb_creative_id (UTF8);
    optional binary rtb_creative_url (UTF8);
    optional double advised_floor_lift;
    optional binary advised_floor_source (UTF8);
    optional double winning_price_paid;
    optional binary seat_id (UTF8);
    optional binary first_adinstance (UTF8);
    optional binary tag_key (UTF8);
    optional binary rtb_creative_size (UTF8);
    optional int32 rtb_creative_width;
    optional int32 rtb_creative_height;
  }
}
Using INSERT OVERWRITE in Hive SQL to convert to Parquet:

optional group rtb_bidders (LIST) {
  repeated group bag {              <------------------ (this looks correct)
    optional group array_element {
      optional binary bidder_id (UTF8);
      optional binary result (UTF8);
      optional double bid_cpm;
      optional int64 bid_time;
      optional binary creative_url (UTF8);
      optional binary third_party_cookie_id (UTF8);
      optional binary deal_id (UTF8);
      optional int32 error_code;
      optional int32 campaign_id;
      optional binary rtb_creative_id (UTF8);
      optional binary rtb_creative_url (UTF8);
      optional double advised_floor_lift;
      optional binary advised_floor_source (UTF8);
      optional double winning_price_paid;
      optional binary seat_id (UTF8);
      optional binary first_adinstance (UTF8);
      optional binary tag_key (UTF8);
      optional binary rtb_creative_size (UTF8);
      optional int32 rtb_creative_width;
      optional int32 rtb_creative_height;
    }
  }
}
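The two dumps differ in how the list is encoded: parquet-avro defaults to the legacy two-level list layout (a repeated group named "array" holding the fields directly), while Hive writes the three-level layout (LIST, then a repeated group, then an element group), which is what Presto and newer Hive readers expect. If I understand the parquet-avro 1.8.x behavior correctly, the writer can be switched to the three-level layout via the "parquet.avro.write-old-list-structure" setting; a hedged sketch using the builder API, reusing the same schema, compressionCodecName, and args as above:

```java
// Sketch only, assuming parquet-avro 1.8.x, where
// "parquet.avro.write-old-list-structure" defaults to true
// (the legacy two-level list encoding shown in the first dump).
Configuration conf = new Configuration();
conf.setBoolean("parquet.avro.write-old-list-structure", false);

ParquetWriter<GenericRecord> writer = AvroParquetWriter
    .<GenericRecord>builder(new org.apache.hadoop.fs.Path(args[1]))
    .withSchema(schema)
    .withCompressionCodec(compressionCodecName)
    .withConf(conf)   // ask for the three-level (Hive-style) list layout
    .build();
```

If that produces the "repeated group" + element-group structure, the resulting file should match what INSERT OVERWRITE generates.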



