[jira] [Created] (AVRO-1339) AvroSequenceFile is always uncompressed

David Arthur (JIRA) Thu, 23 May 2013 06:03:23 -0700

David Arthur created AVRO-1339:
----------------------------------

             Summary: AvroSequenceFile is always uncompressed
                 Key: AVRO-1339
                 URL: https://issues.apache.org/jira/browse/AVRO-1339
             Project: Avro
          Issue Type: Bug
            Reporter: David Arthur



AvroSequenceFile does not appear to pass the compression type or codec down 
to the SequenceFile.Writer. The cause is that AvroSequenceFile.Writer calls 
SequenceFile.Writer's public constructor directly instead of using one of the 
SequenceFile.createWriter factory methods:

https://github.com/apache/avro/blob/trunk/lang/java/mapred/src/main/java/org/apache/avro/hadoop/io/AvroSequenceFile.java#L532
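
To illustrate the diagnosis, here is a simplified, library-free model of the 
pattern. It assumes that the factory method picks the writer implementation 
from the requested compression type, while the base constructor always yields 
a plain writer. All class and method names below are hypothetical stand-ins, 
not the actual Hadoop or Avro source.

```java
// Hypothetical model of the reported pattern; not the Hadoop/Avro source.
class WriterFactoryDemo {
    enum CompressionType { NONE, RECORD, BLOCK }

    /** Base writer: never compresses, whatever the caller asked for. */
    static class Writer {
        String describe() { return "uncompressed"; }
    }

    static class RecordCompressWriter extends Writer {
        @Override String describe() { return "record-compressed"; }
    }

    static class BlockCompressWriter extends Writer {
        @Override String describe() { return "block-compressed"; }
    }

    /** Factory path: the compression type selects the implementation. */
    static Writer createWriter(CompressionType type) {
        switch (type) {
            case RECORD: return new RecordCompressWriter();
            case BLOCK:  return new BlockCompressWriter();
            default:     return new Writer();
        }
    }

    /** Direct-constructor path: the requested type is silently dropped. */
    static Writer viaConstructor(CompressionType type) {
        return new Writer();
    }

    public static void main(String[] args) {
        // prints "uncompressed"
        System.out.println(viaConstructor(CompressionType.BLOCK).describe());
        // prints "block-compressed"
        System.out.println(createWriter(CompressionType.BLOCK).describe());
    }
}
```

In this model the caller asks for BLOCK compression on both paths, but only 
the factory path honors the request, which matches the symptom described here.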

Here is the workaround code I came up with:

{code:java}
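import org.apache.avro.hadoop.io.AvroSequenceFile;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.Metadata;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.util.Progressable;

// hdfsInfo and configuration are application-specific objects,
// not part of Avro or Hadoop.
AvroSequenceFile.Writer.Options options = new AvroSequenceFile.Writer.Options()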
  .withConfiguration(hdfsInfo.getConf())
  .withFileSystem(hdfsInfo.getFileSystem())
  .withOutputPath(hdfsInfo.getPath())
  .withCompressionType(configuration.getCompressionType())
  .withCompressionCodec(configuration.getCompressionCodec().getCodec())
  .withProgressable(new Progressable() {
      @Override
      public void progress() {
        // No-op: no progress reporting in this workaround
      }
  })
  .withKeySchema(configuration.getKeySchema())
  .withValueSchema(configuration.getValueSchema());

// Have to do this here b/c it's hidden in a private method :(
Metadata metadata = options.getMetadata();
if (null != configuration.getKeySchema()) {
  metadata.set(AvroSequenceFile.METADATA_FIELD_KEY_SCHEMA,
      new Text(configuration.getKeySchema().toString()));
}
if (null != configuration.getValueSchema()) {
  metadata.set(AvroSequenceFile.METADATA_FIELD_VALUE_SCHEMA,
      new Text(configuration.getValueSchema().toString()));
}

return SequenceFile.createWriter(
    options.getFileSystem(),
    options.getConfigurationWithAvroSerialization(),
    options.getOutputPath(),
    options.getKeyClass(),
    options.getValueClass(),
    options.getBufferSizeBytes(),
    options.getReplicationFactor(),
    options.getBlockSizeBytes(),
    options.getCompressionType(),
    options.getCompressionCodec(),
    options.getProgressable(),
    metadata);
{code}

I used this code to write a BZIP2 block-compressed sequence file and was able 
to read it back with the Avro mapreduce classes without problems.

