I've seen a couple of recent pieces of work on building new
readers/writers for Arrow (Avro, and discussion of CSV). I'd like to
propose some guidelines to help ensure a high quality bar:

   1. Design review first - Before someone starts implementing a particular
   reader/writer, let's ask for a basic design outline in Jira, Google Docs,
   etc.
   2. High bar for implementation - Having more readers for the sake of more
   readers should not be the goal of the project. Instead, people should
   expect Arrow Java readers to be high quality and faster than other readers
   (even if the consumer has to do a final conversion to move from the Arrow
   representation to their current internal representation). As such, I
   propose the following bars as part of design work:
      1. Field selection support as part of reads - Make sure that each
      implementation supports field selection (which columns to materialize)
      as part of the interface.
      2. Configurable target batch size - Different systems will want to
      control the target size of batch data. (Both of these appear in the
      interface sketch after this list.)
      3. Minimize use of heap memory - Most of the core existing Arrow Java
      libraries have been very focused on minimizing on-heap memory
      consumption. While some remains, we continue to try to reduce the
      footprint to be as small as possible, and new readers/writers should
      target the same standard. For example, the current Avro reader PR
      relies heavily on the Java Avro project's reader implementation, which
      has very poor heap characteristics. (A short allocation sketch follows
      the list below.)
      4. Industry-leading performance - People should expect that anything
      released under the Arrow banner is very fast, so we should focus on
      achieving that kind of target. To pick on the Avro reader again: our
      previous analysis has shown that the Java Avro project's reader (not
      the Arrow-connected implementation) is frequently an order of magnitude
      or more slower than other open source Avro readers (such as Impala's
      implementation), especially when applying any predicates or
      projections.
      5. (Ideally) Predicate application as part of reads - In 99% of the
      workloads we've seen, a user applies one or more predicates when
      reading data. Whatever performance you gain from a strong read
      implementation will be drowned out in most cases if you fail to apply
      predicates as part of reading (and thus have to materialize far more
      records than you ultimately need). (Predicate pushdown also appears in
      the interface sketch below.)
   3. Propose a generalized "reader" interface, as opposed to having each
   reader packaged/integrated in a different way; one possible shape is
   sketched immediately below.
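
To make the field selection, batch size, and predicate bars concrete, here
is a minimal sketch of what a generalized reader contract could look like.
All of the names (ArrowBatchReader, ReadOptions, ColumnPredicate) are
hypothetical, not an existing Arrow API:

import java.util.List;
import org.apache.arrow.vector.VectorSchemaRoot;

/**
 * Hypothetical generalized reader contract; each format (Avro, CSV, ...)
 * would implement it so consumers integrate every reader the same way.
 */
public interface ArrowBatchReader extends AutoCloseable {

  /** Options every implementation would honor; names are illustrative. */
  final class ReadOptions {
    public final List<String> selectedFields;       // columns to materialize
    public final int targetBatchSize;               // rows per output batch
    public final List<ColumnPredicate> predicates;  // pushed-down filters

    public ReadOptions(List<String> selectedFields, int targetBatchSize,
                       List<ColumnPredicate> predicates) {
      this.selectedFields = selectedFields;
      this.targetBatchSize = targetBatchSize;
      this.predicates = predicates;
    }
  }

  /**
   * A simple pushed-down comparison the reader evaluates on decoded values
   * before materializing a record into the output batch.
   */
  final class ColumnPredicate {
    public enum Op { EQ, LT, GT }

    public final String column;
    public final Op op;
    public final Object value;

    public ColumnPredicate(String column, Op op, Object value) {
      this.column = column;
      this.op = op;
      this.value = value;
    }
  }

  /** Advance to the next batch; returns false once the source is exhausted. */
  boolean loadNextBatch() throws Exception;

  /** The current batch, valid until the next loadNextBatch() call. */
  VectorSchemaRoot getCurrentBatch();
}

A consumer that only needs two columns, 4K-row batches, and rows where
amount > 100 would then construct something like new
ReadOptions(Arrays.asList("id", "amount"), 4096, Arrays.asList(new
ColumnPredicate("amount", ColumnPredicate.Op.GT, 100))), and every reader
would honor it the same way.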

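On the heap point specifically, Arrow Java already keeps vector data
off-heap by routing all allocation through a BufferAllocator. The classes
below are real Arrow APIs, but the usage is only a sketch of how a reader
would fill its output vectors:

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;

public class OffHeapSketch {
  public static void main(String[] args) throws Exception {
    // Vectors draw from a BufferAllocator, which hands out direct
    // (off-heap) buffers rather than Java heap objects.
    try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
         IntVector amounts = new IntVector("amount", allocator)) {
      amounts.allocateNew(4096);
      for (int i = 0; i < 4096; i++) {
        amounts.setSafe(i, i);  // values land in direct memory
      }
      amounts.setValueCount(4096);
      // The allocator tracks every off-heap byte, so a reader built this
      // way adds almost nothing to the Java heap per record.
      System.out.println("off-heap bytes: " + allocator.getAllocatedMemory());
    }
  }
}

A reader that instead decodes through intermediate on-heap objects (as the
Java Avro reader does) loses this property, which is why the bar above
matters.
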
What do other people think?
