[ https://issues.apache.org/jira/browse/AVRO-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vincenz Priesnitz updated AVRO-1307:
------------------------------------
    Attachment: AVRO-1307.patch

> Add an avro-tool to extract samples from avro files
> ---------------------------------------------------
>
>                 Key: AVRO-1307
>                 URL: https://issues.apache.org/jira/browse/AVRO-1307
>             Project: Avro
>          Issue Type: New Feature
>          Components: java
>         Environment: java
>            Reporter: Vincenz Priesnitz
>            Priority: Minor
>         Attachments: AVRO-1307.patch
>
>
> It would be nice to have an avro-tool that picks only some records from Avro files.
>
> I implemented a new avro-tool, cat, which takes a list of Avro files with identical schemas and concatenates them into a single file. It has options to discard the first n records (--offset), to limit the number of output records (--limit), and to collect records at a given sample rate (--samplerate).
> This tool allows a quicker peek into large Avro files, e.g.:
> {code}
> java -jar avro-tools.jar cat input.avro output.avro --offset 50 --limit 10
> # creates output.avro that contains records
> # 51 to 60 from input.avro.
> {code}
> {code}
> java -jar avro-tools.jar cat input.avro output.avro --offset 1000 --limit 100 --samplerate .01
> # samples every hundredth record from input,
> # skipping the first 1000 records and limiting
> # the output to 100 records.
> {code}
> The tool accepts multiple input files or folders; for a folder, all files inside it are used as input.
> {code}
> java -jar avro-tools.jar cat data_folder output.avro --samplerate .01
> # reads all the files from the data folder and
> # writes every 100th record into the output file.
> {code}
> This tool uses the Hadoop FileSystem API to handle files from any supported filesystem.
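For illustration, the record selection described in this issue (skip the first n records, keep records at a sample rate, stop at a limit) could be sketched as below. This is not the code from AVRO-1307.patch: the class `RecordSampler`, the method `select`, and the rule that a sample rate r keeps every round(1/r)-th record starting with the first record after the offset are all assumptions made for the example. It works on a plain `Iterable` so it runs without Avro on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of avro-tools or of AVRO-1307.patch.
final class RecordSampler {

    // Skips the first `offset` records, then keeps every round(1/sampleRate)-th
    // record (assumed interpretation of --samplerate), up to `limit` records.
    static <T> List<T> select(Iterable<T> records, long offset, long limit, double sampleRate) {
        if (offset < 0 || limit < 0 || sampleRate <= 0.0 || sampleRate > 1.0) {
            throw new IllegalArgumentException("offset/limit must be >= 0 and 0 < sampleRate <= 1");
        }
        long step = Math.round(1.0 / sampleRate); // e.g. 0.01 -> every 100th record
        List<T> out = new ArrayList<>();
        long position = 0;
        for (T record : records) {
            long current = position++;
            if (current < offset) {
                continue; // still inside the discarded prefix
            }
            if (out.size() >= limit) {
                break; // output limit reached
            }
            if ((current - offset) % step == 0) {
                out.add(record);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> records = new ArrayList<>();
        for (int i = 1; i <= 100; i++) {
            records.add(i);
        }
        // Mirrors "--offset 50 --limit 10": prints [51, 52, 53, 54, 55, 56, 57, 58, 59, 60]
        System.out.println(select(records, 50, 10, 1.0));
    }
}
```

In a real tool the `Iterable` would presumably be an Avro file reader and the selected records would be appended to an output file writer instead of being collected in a list; the sketch only covers the counting logic.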