Vincenz Priesnitz created AVRO-1307:
---------------------------------------

             Summary: Add an avro-tool to extract samples from avro files
                 Key: AVRO-1307
                 URL: https://issues.apache.org/jira/browse/AVRO-1307
             Project: Avro
          Issue Type: New Feature
          Components: java
         Environment: java
            Reporter: Vincenz Priesnitz
            Priority: Minor

It would be nice to have an avro-tool that extracts only a subset of the records from avro files.

I implemented a new avro-tool, cat, which takes a list of avro files with identical schemas and concatenates them into a single file. It has options to discard the first n records (--offset), to limit the number of records written (--limit), and to collect records at a given sample rate (--samplerate).

This tool allows a quicker peek into large avro files, e.g.:
{code}
java -jar avro-tools.jar cat input.avro output.avro --offset 50 --limit 10
# creates output.avro that contains records
# 51 to 60 from input.avro.
{code}

{code}
java -jar avro-tools.jar cat input.avro output.avro --offset 1000 --limit 100 --samplerate .01
# samples every hundredth record from input.avro
# after skipping the first 1000 records, and limits
# the output to 100 records.
{code}

The tool accepts multiple input files or folders. When a folder is given, all files inside it are used as input.
{code}
java -jar avro-tools.jar cat data_folder output.avro --samplerate .01
# reads all the files from the data folder and
# writes every 100th record into the output file.
{code}

This tool uses the hadoop FileSystem api to handle files from any supported filesystem.
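
To make the interplay of the three options concrete, here is a minimal sketch of the selection logic in plain Java. It is not the code from the patch: the class RecordSampler and its method select are hypothetical names, it works on any Iterable rather than on Avro files, and it assumes that a sample rate of .01 means "keep the last record of every block of 100 records, counted after the offset". The actual tool may pick a different record within each block.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of avro-tools.
final class RecordSampler {
    private RecordSampler() {}

    /**
     * Skips the first `offset` records, then keeps one record per stride
     * (stride = round(1 / samplerate)) until `limit` records are collected.
     */
    static <T> List<T> select(Iterable<T> records, long offset, long limit, double samplerate) {
        if (offset < 0 || limit < 0 || samplerate <= 0.0 || samplerate > 1.0) {
            throw new IllegalArgumentException("offset/limit must be >= 0 and 0 < samplerate <= 1");
        }
        // Assumption: the rate is turned into an integer stride, so .01 -> every 100th record.
        long stride = Math.max(1L, Math.round(1.0 / samplerate));
        List<T> out = new ArrayList<>();
        long skipped = 0; // records discarded because of the offset
        long seen = 0;    // records seen after the offset
        for (T record : records) {
            if (out.size() >= limit) {
                break; // output limit reached, stop reading
            }
            if (skipped < offset) {
                skipped++;
                continue;
            }
            seen++;
            if (seen % stride == 0) {
                out.add(record);
            }
        }
        return out;
    }
}
```

Under these assumptions, RecordSampler.select(records, 50, 10, 1.0) returns records 51 to 60, matching the first command above. Working with an integer stride instead of repeatedly adding the rate avoids floating-point drift in the sampling decision.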