[
https://issues.apache.org/jira/browse/AVRO-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Janosch Woschitz updated AVRO-1720:
-----------------------------------
Description:
If you're dealing with larger Avro files (>100MB), it would be nice to have a
way to quickly count the number of records contained in the file.
With the current state of avro-tools, the only way to achieve this (to my
knowledge) is to dump the data to JSON and count the resulting records.
For larger files this can take a while because of the serialization overhead
and because every record needs to be looked at.
I added a new tool that is optimized for counting records: it does not
serialize the records and reads only the block count for each block.
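For illustration only (this is not the code in AVRO-1720.patch), the block-count idea can be sketched with nothing but the JDK. The sketch assumes the object container layout from the Avro specification: the magic bytes, a metadata map, a 16-byte sync marker, then blocks that each start with a zigzag-varint object count and byte size. The class name AvroBlockCounter and its helper methods are hypothetical, and the sync markers are skipped rather than validated.

```java
import java.io.BufferedInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper for illustration; not the tool from the patch. */
final class AvroBlockCounter {
    private static final byte[] MAGIC = {'O', 'b', 'j', 1};
    private static final int SYNC_SIZE = 16;

    /** Sums the per-block object counts without decoding any record. */
    static long countRecords(InputStream in) throws IOException {
        for (byte expected : MAGIC) {
            if (in.read() != expected) {
                throw new IOException("not an Avro object container file");
            }
        }
        skipMetadata(in);
        skipFully(in, SYNC_SIZE); // header sync marker (not validated here)

        long total = 0;
        int first;
        while ((first = in.read()) != -1) {       // -1 means no more blocks
            long blockCount = readLong(in, first);
            long blockSize = readLong(in, in.read());
            skipFully(in, blockSize + SYNC_SIZE); // payload + trailing sync marker
            total += blockCount;
        }
        return total;
    }

    /** Skips the header metadata, an Avro map of string keys to bytes values. */
    private static void skipMetadata(InputStream in) throws IOException {
        long entries;
        while ((entries = readLong(in, in.read())) != 0) {
            if (entries < 0) {               // negative count: a byte size follows
                entries = -entries;
                readLong(in, in.read());
            }
            for (long i = 0; i < entries; i++) {
                skipFully(in, readLong(in, in.read())); // key
                skipFully(in, readLong(in, in.read())); // value
            }
        }
    }

    /** Decodes one zigzag varint long whose first byte was already read. */
    private static long readLong(InputStream in, int first) throws IOException {
        long raw = 0;
        int shift = 0;
        int b = first;
        while (true) {
            if (b < 0) throw new EOFException("truncated file");
            raw |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) break;
            shift += 7;
            if (shift > 63) throw new IOException("invalid varint");
            b = in.read();
        }
        return (raw >>> 1) ^ -(raw & 1);
    }

    /** Skips exactly n bytes, failing if the stream ends early. */
    private static void skipFully(InputStream in, long n) throws IOException {
        while (n > 0) {
            long skipped = in.skip(n);
            if (skipped <= 0) {
                if (in.read() == -1) throw new EOFException("truncated file");
                skipped = 1;
            }
            n -= skipped;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java AvroBlockCounter.java <file.avro>");
            return;
        }
        try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]))) {
            System.out.println(countRecords(in));
        }
    }
}
```

Because the count and size sit outside the (possibly compressed) block payload, a reader like this never has to decompress or decode a record, which is where the time goes in the tojson approach.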
{panel:title=Naive benchmark}
{noformat}
# the input file had a size of ~300MB
$ du -sh sample.avro
323M sample.avro
# using the new count tool
$ time java -jar avro-tools.jar count sample.avro
331439
real 0m4.670s
user 0m6.167s
sys 0m0.513s
# the current way of counting records
$ time java -jar avro-tools.jar tojson sample.avro | wc
331439 54904484 1838231743
real 0m52.760s
user 1m42.317s
sys 0m3.209s
# the overhead of wc is rather minor
$ time java -jar avro-tools.jar tojson sample.avro > /dev/null
real 0m47.834s
user 0m53.317s
sys 0m1.194s
{noformat}
{panel}
This tool uses the HDFS API to handle files from any supported filesystem. I
added the unit tests to the existing TestDataFileTools, since it provides
convenient utility functions that I could reuse for my test scenarios.
was:
If you're dealing with larger Avro files (>100MB), it would be nice to have a
way to quickly count the number of records contained in the file.
With the current state of avro-tools, the only way to achieve this (to my
knowledge) is to dump the data to JSON and count the resulting records.
For larger files this can take a while because of the serialization overhead
and because every record needs to be looked at.
I added a new tool that is optimized for counting records: it does not
serialize the records and reads only the block count for each block.
{panel:title=Naive benchmark}
# the input file had a size of ~300MB
$ du -sh sample.avro
323M sample.avro
# using the new count tool
$ time java -jar avro-tools.jar count sample.avro
331439
real 0m4.670s
user 0m6.167s
sys 0m0.513s
# the current way of counting records
$ time java -jar avro-tools.jar tojson sample.avro | wc
331439 54904484 1838231743
real 0m52.760s
user 1m42.317s
sys 0m3.209s
# the overhead of wc is rather minor
$ time java -jar avro-tools.jar tojson sample.avro > /dev/null
real 0m47.834s
user 0m53.317s
sys 0m1.194s
{panel}
This tool uses the HDFS API to handle files from any supported filesystem. I
added the unit tests to the existing TestDataFileTools, since it provides
convenient utility functions that I could reuse for my test scenarios.
> Add an avro-tool to count records in an avro file
> -------------------------------------------------
>
> Key: AVRO-1720
> URL: https://issues.apache.org/jira/browse/AVRO-1720
> Project: Avro
> Issue Type: New Feature
> Components: java
> Reporter: Janosch Woschitz
> Priority: Minor
> Attachments: AVRO-1720.patch
>
>