[ https://issues.apache.org/jira/browse/AVRO-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14711047#comment-14711047 ]
Niels Basjes commented on AVRO-1720:
------------------------------------

Ok, apparently this is more complex than it seems.
Please make sure the unit test includes the situation where there is at least a 'second block' in the test file. The current test file you (re)used seems to have only 10 records in a single block.

> Add an avro-tool to count records in an avro file
> -------------------------------------------------
>
>                 Key: AVRO-1720
>                 URL: https://issues.apache.org/jira/browse/AVRO-1720
>             Project: Avro
>          Issue Type: New Feature
>          Components: java
>            Reporter: Janosch Woschitz
>            Priority: Minor
>         Attachments: AVRO-1720.patch
>
>
> If you're dealing with bigger avro files (>100MB) it would be nice to have a
> way to quickly count the number of records contained within that file.
> With the current state of avro-tools the only way to achieve this (to my
> current knowledge) is to dump the data to json and count the number of
> records. For bigger files this might take a while due to the serialization
> overhead and since every record needs to be looked at.
> I added a new tool which is optimized for counting records: it does not
> serialize the records and reads only the block count for each block.
> {panel:title=Naive benchmark}
> {noformat}
> # the input file had a size of ~300MB
> $ du -sh sample.avro
> 323M sample.avro
> # using the new count tool
> $ time java -jar avro-tools.jar count sample.avro
> 331439
> real 0m4.670s
> user 0m6.167s
> sys 0m0.513s
> # the current way of counting records
> $ time java -jar avro-tools.jar tojson sample.avro | wc
> 331439 54904484 1838231743
> real 0m52.760s
> user 1m42.317s
> sys 0m3.209s
> # the overhead of wc is rather minor
> $ time java -jar avro-tools.jar tojson sample.avro > /dev/null
> real 0m47.834s
> user 0m53.317s
> sys 0m1.194s
> {noformat}
> {panel}
> This tool uses the HDFS API to handle files from any supported filesystem. I
> added the unit tests to the already existing TestDataFileTools since it
> provided convenient utility functions which I could reuse for my test
> scenarios.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
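
For illustration only, here is a minimal sketch of the "read only the block count for each block" idea, exercised against a file with more than one block. It is not the code from AVRO-1720.patch: it is a hand-rolled, standard-library-only reader that follows the object container layout in the Avro specification (magic, metadata map, 16-byte sync marker, then blocks of record count, byte size, payload, sync marker). The class name `AvroBlockCounter` and all of its helpers are hypothetical. The synthetic test file uses zero-filled payloads and an all-zero sync marker, and the sketch does not verify sync markers, which a real reader should do.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; not part of avro-tools.
final class AvroBlockCounter {

    /** Reads one zig-zag varint long; returns null on a clean EOF before its first byte. */
    static Long readLong(InputStream in) throws IOException {
        int b = in.read();
        if (b < 0) return null;
        long raw = 0;
        for (int shift = 0; ; shift += 7) {
            if (shift > 63) throw new IOException("varint too long");
            raw |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) break;            // last byte of the varint
            b = in.read();
            if (b < 0) throw new EOFException("truncated varint");
        }
        return (raw >>> 1) ^ -(raw & 1);           // undo zig-zag encoding
    }

    static long requireLong(InputStream in) throws IOException {
        Long v = readLong(in);
        if (v == null) throw new EOFException("unexpected end of file");
        return v;
    }

    static void skipFully(InputStream in, long n) throws IOException {
        while (n > 0) {
            long skipped = in.skip(n);
            if (skipped <= 0) {                    // skip() may make no progress; fall back to read()
                if (in.read() < 0) throw new EOFException("truncated file");
                skipped = 1;
            }
            n -= skipped;
        }
    }

    /** Sums the per-block record counts without decoding any record. */
    static long countRecords(InputStream in) throws IOException {
        byte[] magic = new byte[4];
        new DataInputStream(in).readFully(magic);
        if (magic[0] != 'O' || magic[1] != 'b' || magic[2] != 'j' || magic[3] != 1)
            throw new IOException("not an Avro container file");
        // Skip the metadata map: runs of key/value pairs, terminated by a zero count.
        for (long n = requireLong(in); n != 0; n = requireLong(in)) {
            if (n < 0) { skipFully(in, requireLong(in)); continue; }  // negative count: byte size follows
            for (long i = 0; i < 2 * n; i++) skipFully(in, requireLong(in)); // length-prefixed key, then value
        }
        skipFully(in, 16);                         // header sync marker
        long total = 0;
        for (Long count = readLong(in); count != null; count = readLong(in)) {
            long size = requireLong(in);
            skipFully(in, size + 16);              // payload plus trailing sync marker
            total += count;
        }
        return total;
    }

    static void writeLong(ByteArrayOutputStream out, long v) {
        long z = (v << 1) ^ (v >> 63);             // zig-zag encode
        while ((z & ~0x7FL) != 0) { out.write((int) ((z & 0x7F) | 0x80)); z >>>= 7; }
        out.write((int) z);
    }

    static void writeBytes(ByteArrayOutputStream out, String s) {
        byte[] b = s.getBytes(StandardCharsets.UTF_8);
        writeLong(out, b.length);
        out.write(b, 0, b.length);
    }

    /** Builds a fake container file with one block per entry in counts (opaque zero payloads). */
    static byte[] syntheticFile(long[] counts, int payloadBytes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write('O'); out.write('b'); out.write('j'); out.write(1);
        writeLong(out, 1);                         // one metadata entry
        writeBytes(out, "avro.codec");
        writeBytes(out, "null");
        writeLong(out, 0);                         // end of metadata map
        out.write(new byte[16], 0, 16);            // sync marker (all zeros here)
        for (long c : counts) {
            writeLong(out, c);
            writeLong(out, payloadBytes);
            out.write(new byte[payloadBytes], 0, payloadBytes);
            out.write(new byte[16], 0, 16);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] file = syntheticFile(new long[] {10, 7}, 32);   // two blocks
        System.out.println(countRecords(new ByteArrayInputStream(file))); // prints 17
    }
}
```

Because each block is skipped by its declared byte size, the cost depends on the number of blocks rather than the number of records, which is consistent with the gap between `count` and `tojson` in the benchmark above. A test file with a single block would still pass if the loop stopped after the first block, so a multi-block input is the case worth covering.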