Please provide any feedback on the start of the writer interface described
in the attached document. It should become a more formalized interface in
the next few days.
-Jason
---------- Forwarded message ----------
From: Jason Altekruse <[email protected]>
Date: Wed, Oct 2, 2013 at 1:12 PM
Subject: Fwd: Writer interface start
To: [email protected]
---------- Forwarded message ----------
From: Jason Altekruse <[email protected]>
Date: Wed, Oct 2, 2013 at 12:31 PM
Subject: Writer interface start
To: Jacques Nadeau <[email protected]>, Ben Becker <[email protected]>,
Steven Phillips <[email protected]>
A quick update on the status of the writer interface. I haven't written it
formally yet, but I have put together a document describing the important
design considerations for the various formats, trying to be as general as
possible. It should be fleshed out in more detail in the next few days.
See attached
-Jason
Considerations for Drill writer interface:
- two major types of output formats
  - column major
    - ORC and RCFile
    - Parquet
    - block-compressed sequence files
  - row major
    - CSV
    - basic sequence files
    - JSON
Considerations for column major:
- much more state heavy for reading/writing
  - two states to manage: the fill level of the Value Vectors, and the
    section of the file already on disk (or in a buffer, ready to be
    written to disk when a complete section of the file is full)
    - at some level the file will be buffered in sections; keeping track
      of when new buffering needs to take place has to happen alongside
      management of the fill level of the Value Vectors
  - keeping track of the number of records processed
  - handling cutoffs for less frequent split points in files
    - with Parquet there is a range of acceptable sizes for each
      sub-component of the file
    - we have to stay within these ranges while working within the Drill
      model of not knowing how much data will come in the next batch
- need to build up larger in-memory structures to be written to disk all
  at once (see the sketch after this list)
  - sizes of the different parts of the file need to be recorded in the
    file-level metadata, as well as the schema (Parquet; might be
    applicable for others)
  - schema changes can result in new file creation, or a refactor of
    previously written batches to include nulls for columns newly
    discovered in later batches
    - this would involve parsing the written part of the file, adding a
      column or columns full of nulls, and re-writing
  - need to balance good file sizes with minimizing re-processing time
- for efficient processing we should always define a translation between
  Drill's columnar in-memory Value Vector structures and the various
  formats
  - we do not want to pull values out of one format and place them
    individually into the other
  - this applies to reading as well as writing
- various compression algorithms
  - value compression: Run Length Encoding (RLE), bit-packing (see the
    Parquet documentation for more specific information on these
    techniques)
  - general compression algorithms: Snappy, gzip, etc.
    - applied to blocks of values, which may be value-compressed already
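To make the buffering point above concrete, here is a minimal sketch (in
Java) of the two pieces of state a column-major writer would have to manage
together: the fill level of the in-memory buffer and the cutoff at which a
complete file section (e.g. a Parquet row group) gets flushed. All names
here (ColumnMajorWriterSketch, TARGET_SECTION_BYTES, flushSection) are made
up for illustration and are not part of any existing Drill or Parquet API.

/**
 * Hypothetical sketch of the state a column-major writer has to juggle:
 * the fill level of the buffered data and the point at which a complete
 * file section (e.g. a Parquet row group) is flushed to disk.
 */
public class ColumnMajorWriterSketch {

  // Illustrative target size for one buffered section of the file.
  private static final long TARGET_SECTION_BYTES = 128 * 1024 * 1024;

  private long bufferedBytes = 0;   // bytes accumulated since the last flush
  private long bufferedRecords = 0; // records accumulated since the last flush

  /**
   * Called once per incoming batch. Drill does not know in advance how much
   * data the next batch will hold, so the cutoff check happens after each one.
   */
  public void writeBatch(long batchBytes, long batchRecords) {
    // Appending the batch to the in-memory section buffer is omitted here.
    bufferedBytes += batchBytes;
    bufferedRecords += batchRecords;

    // Cut a new section once the buffered data reaches the target size.
    if (bufferedBytes >= TARGET_SECTION_BYTES) {
      flushSection();
    }
  }

  /** Write the buffered section to disk and record its size for the metadata. */
  private void flushSection() {
    // Encoding, compression, and footer bookkeeping would happen here.
    bufferedBytes = 0;
    bufferedRecords = 0;
  }
}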
Considerations for row major formats:
- will obviously have to pull individual values out of the Value Vectors
- depending on existing APIs, we might have to create objects to pass into
  existing writer code for a given format
  - should try to avoid new object creation for each record
  - try to reuse objects, or simply handle individual primitives (see the
    sketch below)
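As a small illustration of the object-reuse point above, a row-major writer
could keep one reusable buffer per writer instead of allocating a new object
per record. The class below is a hypothetical sketch (CSV is used only
because it is the simplest example), not existing Drill or library code.

/**
 * Hypothetical sketch of object reuse in a row-major writer: a single
 * StringBuilder is reused for every record rather than allocating a new
 * object per row.
 */
public class RowMajorWriterSketch {

  private final StringBuilder rowBuffer = new StringBuilder();

  /** Formats one record into the reused buffer and returns the CSV line. */
  public String writeRecord(Object[] values) {
    rowBuffer.setLength(0); // reset the buffer instead of allocating a new one
    for (int i = 0; i < values.length; i++) {
      if (i > 0) {
        rowBuffer.append(',');
      }
      rowBuffer.append(values[i]);
    }
    return rowBuffer.toString();
  }
}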
General considerations:
- try to use as much existing code as possible while keeping writing
  efficient
- need to define a syntax for specifying encodings, compression types,
  columns to be written, destination, etc. in the logical and physical
  plans (a rough sketch of what such options might look like follows)
  - this will be very format specific
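One possible shape for those format-specific options is a plain holder
object that the logical/physical plan could carry per writer. The fields
below are purely illustrative assumptions, not a proposed final syntax.

/**
 * Hypothetical holder for format-specific write options carried in the
 * plan; every field name here is illustrative only.
 */
public class WriterOptionsSketch {
  public String format;                   // e.g. "parquet", "csv", "json"
  public String destination;              // output path
  public String compressionCodec;         // e.g. "snappy", "gzip", or null
  public String valueEncoding;            // e.g. "rle", "bit-packed", or null
  public java.util.List<String> columns;  // columns to be written
}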
===============================
current record reader interface
===============================
- the two interfaces involved are RecordReader and SchemaProvider:
public interface SchemaProvider {
  static final org.slf4j.Logger logger =
      org.slf4j.LoggerFactory.getLogger(SchemaProvider.class);

  public Object getSelectionBaseOnName(String tableName);
}
public interface RecordReader {

  /**
   * Configure the RecordReader with the provided schema and the record batch
   * that should be written to.
   *
   * @param output
   *          The place where output for a particular scan should be written.
   *          The record reader is responsible for mutating the set of schema
   *          values for that particular record.
   * @throws ExecutionSetupException
   */
  public abstract void setup(OutputMutator output) throws
      ExecutionSetupException;

  /**
   * Increment record reader forward, writing into the provided output batch.
   *
   * @return The number of additional records added to the output.
   */
  public abstract int next();

  public abstract void cleanup();
}
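For comparison with the reader interface above, the writer interface might
end up looking roughly symmetric. The sketch below is only a guess at the
shape; RecordWriter, the setup arguments, and writeBatch are all assumed
names rather than a settled design (ExecutionSetupException is the same
exception type used by RecordReader above).

public interface RecordWriter {

  /**
   * Configure the RecordWriter with the destination and any format-specific
   * options carried in the plan.
   *
   * @throws ExecutionSetupException
   */
  public abstract void setup(/* destination and format options */) throws
      ExecutionSetupException;

  /**
   * Write the records currently available in the incoming batch.
   *
   * @return The number of records written from the batch.
   */
  public abstract int writeBatch();

  public abstract void cleanup();
}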