I've seen a couple of recent pieces of work on generating new readers/writers for Arrow (Avro, and discussion of CSV). I'd like to propose a few guidelines to help ensure a high quality bar:
1. Design review first - Before someone starts implementing a particular reader/writer, let's ask for a basic design outline in Jira, a Google doc, etc.

2. High bar for implementation - Having more readers for the sake of more readers should not be the goal of the project. Instead, people should expect Arrow Java readers to be high quality and faster than other readers (even if the consumer has to do a final conversion from the Arrow representation to their current internal representation). As such, I propose the following bars as part of design work:

  1. Field selection support as part of reads - Make sure that each implementation supports field selection (which columns to materialize) as part of the interface.

  2. Configurable target batch size - Different systems will want to control the target size of batch data.

  3. Minimize use of heap memory - Most of the core existing Arrow Java libraries have been very focused on minimizing on-heap memory consumption, and while some remains, we continue to try to keep the footprint as small as possible. We should target the same standard for new readers/writers. For example, the current Avro reader PR relies heavily on the Java Avro project's reader implementation, which has very poor heap characteristics.

  4. Industry-leading performance - People should expect anything carrying the Arrow name to be very fast, and releasing something under this banner means we should focus on hitting that kind of target. To pick on the Avro reader again: our previous analysis has shown that the Java Avro project's reader (not the Arrow-connected implementation) is frequently an order of magnitude or more slower than other open source Avro readers (such as Impala's implementation), especially when applying any predicates or projections.

  5. (Ideally) Predicate application as part of reads - In 99% of the workloads we've seen, a user is applying one or more predicates when reading data. Whatever performance you gain from a strong read implementation will in most cases be drowned out if you fail to apply predicates as part of reading (and thus have to materialize far more records than you ultimately need).

3. Propose a generalized "reader" interface, as opposed to making each reader have a different way to package/integrate. (A rough sketch of what this could look like is at the end of this message.)

What do other people think?
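To make items 2.1, 2.2, 2.5 and point 3 concrete, here's a rough sketch of what a generalized reader contract could look like. All of the names (ArrowBatchReader, ReadOptions, ArrowReaderFactory, Predicate) are hypothetical and just for discussion; the shape is modeled loosely on the existing ArrowReader in arrow-vector, not on any agreed design:

  import java.io.IOException;
  import java.util.List;
  import org.apache.arrow.vector.VectorSchemaRoot;

  /** Hypothetical placeholder for a pushed-down filter expression (item 2.5). */
  interface Predicate {}

  /** Options negotiated up front, before any rows are materialized. */
  final class ReadOptions {
    final List<String> selectedFields; // item 2.1: which columns to materialize
    final int targetBatchSize;         // item 2.2: rows per batch, chosen by the consumer
    final Predicate predicate;         // item 2.5: optional filter, may be null

    ReadOptions(List<String> selectedFields, int targetBatchSize, Predicate predicate) {
      this.selectedFields = selectedFields;
      this.targetBatchSize = targetBatchSize;
      this.predicate = predicate;
    }
  }

  /** One reader contract shared by every format (point 3). */
  interface ArrowBatchReader extends AutoCloseable {
    /** The root holding the vectors for the current batch. */
    VectorSchemaRoot getVectorSchemaRoot();

    /** Loads the next batch of up to targetBatchSize rows; false when exhausted. */
    boolean loadNextBatch() throws IOException;

    @Override
    void close() throws IOException;
  }

  /** Each format (Avro, CSV, ...) would supply one of these. */
  interface ArrowReaderFactory {
    ArrowBatchReader open(ReadOptions options) throws IOException;
  }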
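And a sketch of how a consumer might drive it (avroFactory here is a stand-in for whatever factory a given format provides):

  import java.util.Arrays;
  import org.apache.arrow.vector.VectorSchemaRoot;

  class ScanSketch {
    static void scan(ArrowReaderFactory avroFactory) throws Exception {
      ReadOptions options = new ReadOptions(
          Arrays.asList("user_id", "event_time"), // materialize only two columns
          4096,                                   // consumer-controlled batch size
          null);                                  // no predicate in this example
      try (ArrowBatchReader reader = avroFactory.open(options)) {
        while (reader.loadNextBatch()) {
          VectorSchemaRoot root = reader.getVectorSchemaRoot();
          // consume root.getRowCount() rows of the selected columns here
        }
      }
    }
  }

The point isn't these exact signatures; it's that projection, batch size, and (ideally) predicates are part of the contract rather than bolted onto each format differently.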