[ 
https://issues.apache.org/jira/browse/AVRO-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fokko Driesprong resolved AVRO-2802.
------------------------------------
    Fix Version/s: 1.10.0
       Resolution: Fixed

> Pre-Size List in AvroInputFormat Avro File Lookup
> -------------------------------------------------
>
>                 Key: AVRO-2802
>                 URL: https://issues.apache.org/jira/browse/AVRO-2802
>             Project: Apache Avro
>          Issue Type: Improvement
>            Reporter: David Mollitor
>            Assignee: David Mollitor
>            Priority: Minor
>             Fix For: 1.10.0
>
>
> {code:java}
>     if (job.getBoolean(IGNORE_FILES_WITHOUT_EXTENSION_KEY, IGNORE_INPUTS_WITHOUT_EXTENSION_DEFAULT)) {
>       List<FileStatus> result = new ArrayList<>();
>       for (FileStatus file : super.listStatus(job))
>         if (file.getPath().getName().endsWith(AvroOutputFormat.EXT))
>           result.add(file);
>       return result.toArray(new FileStatus[0]);
>     } else {
>       return super.listStatus(job);
>     }
> {code}
> When a user runs an Avro MR job against a directory, it silently filters out
> files without an avro file extension. Fair enough. However, anecdotally,
> directories that contain only Avro files are the primary use scenario, so
> this code probably does not filter out many files.
> I suggest that this {{ArrayList}} be pre-sized. If there are a lot of files,
> and all of them have the avro file extension (the common case), this
> {{ArrayList}} will have to be expanded multiple times (costing time and GC).
> If there is a large list and it gets filtered down a lot, a few hundred
> bytes are wasted.
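
For illustration, the pre-sizing suggestion above can be sketched as follows. This is a minimal, self-contained sketch and not the committed change: the class name PreSizedFilter, the method filterByExtension, and the use of plain String file names in place of Hadoop FileStatus objects are all assumptions made so that the example runs without Hadoop on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the filtering logic quoted in the issue.
class PreSizedFilter {
    // Assumed to mirror AvroOutputFormat.EXT.
    static final String EXT = ".avro";

    // Keeps only the names ending in EXT. Plain strings stand in for FileStatus.
    static String[] filterByExtension(String[] names) {
        // The filtered list can never hold more entries than the input,
        // so sizing it to names.length avoids every intermediate resize.
        List<String> result = new ArrayList<>(names.length);
        for (String name : names) {
            if (name.endsWith(EXT)) {
                result.add(name);
            }
        }
        return result.toArray(new String[0]);
    }
}
```

Calling PreSizedFilter.filterByExtension with the names part-0.avro, _SUCCESS and part-1.avro returns the two .avro names in their original order. In the real method this would mean storing the result of super.listStatus(job) in a local variable first, so that its length is known before the list is allocated.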



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
