Janosch Woschitz created AVRO-1720:
--------------------------------------

             Summary: Add an avro-tool to count records in an avro file
                 Key: AVRO-1720
                 URL: https://issues.apache.org/jira/browse/AVRO-1720
             Project: Avro
          Issue Type: New Feature
          Components: java
            Reporter: Janosch Woschitz
            Priority: Minor


If you're dealing with larger Avro files (>100MB), it would be nice to have a 
way to quickly count the number of records contained in that file.

With the current state of avro-tools, the only way to achieve this (to my 
knowledge) is to dump the data to JSON and count the records in the output. 
For larger files this can take a while, because every record has to be decoded 
and then serialized to JSON.

I added a new tool that is optimized for counting records: it does not 
decode the records and reads only the record count stored at the start of each 
block.
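
To illustrate the idea, here is a minimal sketch of block-level counting written against the Avro object container file layout using only the JDK. It is not the code from the patch, which would more likely rely on Avro's own DataFileStream; the class name AvroBlockCounter and its helper methods are hypothetical. It assumes a well-formed file: magic bytes, a metadata map, a 16-byte sync marker, then blocks that each start with a record count and a byte size.

```java
import java.io.BufferedInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper (not part of Avro): counts records from block headers only. */
final class AvroBlockCounter {
    private static final byte[] MAGIC = {'O', 'b', 'j', 1};
    private static final int SYNC_SIZE = 16;

    /** Decodes a zigzag variable-length long whose first byte was already read. */
    static long readLong(InputStream in, int first) throws IOException {
        long raw = 0;
        int shift = 0;
        int b = first;
        while (true) {
            if (b < 0) throw new EOFException("truncated varint");
            raw |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) break;
            shift += 7;
            if (shift > 63) throw new IOException("varint too long");
            b = in.read();
        }
        return (raw >>> 1) ^ -(raw & 1); // undo zigzag encoding
    }

    static long readLong(InputStream in) throws IOException {
        return readLong(in, in.read());
    }

    /** Skips exactly n bytes, since InputStream.skip may skip fewer. */
    static void skipFully(InputStream in, long n) throws IOException {
        while (n > 0) {
            long skipped = in.skip(n);
            if (skipped <= 0) {
                if (in.read() < 0) throw new EOFException("truncated input");
                skipped = 1;
            }
            n -= skipped;
        }
    }

    static long countRecords(InputStream in) throws IOException {
        for (byte m : MAGIC) {
            if (in.read() != m) throw new IOException("not an Avro container file");
        }
        // Header metadata is a map<string, bytes>: groups of pairs, ended by a 0 count.
        for (long pairs = readLong(in); pairs != 0; pairs = readLong(in)) {
            if (pairs < 0) {       // a negative count is followed by a byte size
                pairs = -pairs;
                readLong(in);
            }
            for (long i = 0; i < pairs; i++) {
                skipFully(in, readLong(in)); // key
                skipFully(in, readLong(in)); // value
            }
        }
        skipFully(in, SYNC_SIZE);

        long total = 0;
        for (int first = in.read(); first >= 0; first = in.read()) {
            total += readLong(in, first);      // records in this block
            long size = readLong(in);          // size of the (possibly compressed) data
            skipFully(in, size + SYNC_SIZE);   // skip data and sync marker undecoded
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]))) {
            System.out.println(countRecords(in));
        }
    }
}
```

Because the block size in the header is the size after any codec has been applied, this approach never needs to decompress or decode a block, which is where the time savings come from.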

{panel:title=Naive benchmark}
# the input file had a size of ~300MB
$ du -sh sample.avro 
323M    sample.avro

# using the new count tool
$ time java -jar avro-tools.jar count sample.avro
331439

real    0m4.670s
user    0m6.167s
sys     0m0.513s

# the current way of counting records
$ time java -jar avro-tools.jar tojson sample.avro | wc
331439 54904484 1838231743

real    0m52.760s
user    1m42.317s
sys     0m3.209s

# the overhead of wc is rather minor
$ time java -jar avro-tools.jar tojson sample.avro > /dev/null

real    0m47.834s
user    0m53.317s
sys     0m1.194s
{panel}

This tool uses the HDFS API to handle files from any supported filesystem. I 
added the unit tests to the existing TestDataFileTools, since it provides 
convenient utility functions that I could reuse for my test scenarios.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
