[jira] [Updated] (AVRO-1307) Add an avro-tool to extract samples from avro files

Vincenz Priesnitz (JIRA) Wed, 24 Apr 2013 01:54:30 -0700

     [ 
https://issues.apache.org/jira/browse/AVRO-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Vincenz Priesnitz updated AVRO-1307:
------------------------------------

    Attachment: AVRO-1307-added UnitTests.patch

I added a JUnit test that covers reading file input, reading directory input, 
reading inputs with incompatible schemas, offset accuracy, and 
offsets/samplerates that jump beyond EOF.

I also moved the opening and closing of files into the Util class. To that 
end I added a method to Util that takes a list of filenames and returns a 
list of the corresponding Hadoop Paths: a name that refers to a file yields 
that file's Path, and a name that refers to a directory yields the Paths of 
all the files inside it. I think this could be a useful method, since 
MapReduce tasks often create folders of Avro files with matching schemas.
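
As a rough illustration of the directory-expanding method described above 
(this is not the code from the patch): the sketch below uses java.nio.file 
instead of Hadoop's FileSystem and Path so that it runs without Hadoop on the 
classpath. The class name InputExpander, the method name expandInputs, the 
non-recursive listing and the sorted order are all assumptions made for the 
example.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the Util method; java.nio replaces Hadoop here.
final class InputExpander {

  /**
   * Expands each name into concrete file paths: a regular file maps to
   * itself, a directory maps to the regular files directly inside it.
   * Assumption: directories are not descended into recursively.
   */
  static List<Path> expandInputs(List<String> names) throws IOException {
    List<Path> result = new ArrayList<>();
    for (String name : names) {
      Path p = Paths.get(name);
      if (Files.isDirectory(p)) {
        List<Path> children = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(p)) {
          for (Path child : stream) {
            if (Files.isRegularFile(child)) {
              children.add(child);
            }
          }
        }
        // Directory listings have no guaranteed order; sort for determinism.
        Collections.sort(children);
        result.addAll(children);
      } else if (Files.isRegularFile(p)) {
        result.add(p);
      } else {
        throw new FileNotFoundException(name);
      }
    }
    return result;
  }
}
```

With Hadoop, the same shape would presumably be built on 
FileSystem.listStatus and FileStatus.getPath; the attached patch is the 
reference for how the tool actually does it.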



                
> Add an avro-tool to extract samples from avro files
> ---------------------------------------------------
>
>                 Key: AVRO-1307
>                 URL: https://issues.apache.org/jira/browse/AVRO-1307
>             Project: Avro
>          Issue Type: New Feature
>          Components: java
>         Environment: java
>            Reporter: Vincenz Priesnitz
>            Priority: Minor
>         Attachments: AVRO-1307-added UnitTests.patch, AVRO-1307.patch
>
>
> It would be nice to have an avro-tool that picks only some records from avro 
> files.
>  
> I implemented a new avro-tool cat, which takes a list of avro files with 
> identical schemas and concatenates them into a single file, with options to 
> discard the first n records, to limit the output size and to collect records 
> at a certain samplerate.
> This tool allows a quicker peek into large avro files, e.g.:
> {code}
> java -jar avro-tools.jar cat input.avro output.avro --offset 50 --limit 10
> # creates output.avro that contains records
> # 51 to 60 from input.avro.
> {code}
> {code}
> java -jar avro-tools.jar cat input.avro output.avro --offset 1000 --limit 100 --samplerate .01
> # skips the first 1000 records, then samples
> # every hundredth record from the input,
> # limiting the output to 100 records.
> {code}
> The tool accepts multiple input files or folders; for a folder, all files 
> inside it are used as input.
> {code}
> java -jar avro-tools.jar cat data_folder output.avro --samplerate .01
> # reads all the files from the data folder and
> # writes every 100th record into the output file.
> {code}
> This tool uses the Hadoop FileSystem API to handle files from any supported 
> filesystem.
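
To make the interplay of --offset, --limit and --samplerate from the 
description concrete, here is a minimal sketch of the selection logic over a 
plain iterator. It is not taken from the patch: the class name RecordSampler, 
the method name sample, the rounding of 1/samplerate to a whole step, and the 
choice to keep the first record after the offset are all assumptions.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper; mirrors the options described for the cat tool.
final class RecordSampler {

  /**
   * Skips the first `offset` records, then keeps every step-th record
   * (step = round(1 / samplerate)) until `limit` records are collected
   * or the input is exhausted.
   */
  static <T> List<T> sample(Iterator<T> records, long offset, long limit,
                            double samplerate) {
    if (offset < 0 || limit < 0) {
      throw new IllegalArgumentException("offset and limit must be >= 0");
    }
    if (!(samplerate > 0 && samplerate <= 1)) {
      throw new IllegalArgumentException("samplerate must be in (0, 1]");
    }
    long step = Math.max(1, Math.round(1.0 / samplerate));
    List<T> kept = new ArrayList<>();

    // Discard the first `offset` records; running past EOF is not an error.
    for (long i = 0; i < offset && records.hasNext(); i++) {
      records.next();
    }

    // Assumption: the first record after the offset is always kept.
    long seen = 0;
    while (records.hasNext() && kept.size() < limit) {
      T record = records.next();
      if (seen % step == 0) {
        kept.add(record);
      }
      seen++;
    }
    return kept;
  }
}
```

Under these assumptions, an offset of 50 with a limit of 10 and a samplerate 
of 1 keeps records 51 to 60, which matches the first example in the 
description.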
