Github user paul-rogers commented on a diff in the pull request:
https://github.com/apache/drill/pull/1047#discussion_r159799621
--- Diff: exec/vector/src/main/codegen/templates/BaseWriter.java ---
@@ -106,37 +114,37 @@
MapOrListWriter list(String name);
boolean isMapWriter();
boolean isListWriter();
- UInt1Writer uInt1(String name);
- UInt2Writer uInt2(String name);
- UInt4Writer uInt4(String name);
- UInt8Writer uInt8(String name);
- VarCharWriter varChar(String name);
- Var16CharWriter var16Char(String name);
- TinyIntWriter tinyInt(String name);
- SmallIntWriter smallInt(String name);
- IntWriter integer(String name);
- BigIntWriter bigInt(String name);
- Float4Writer float4(String name);
- Float8Writer float8(String name);
- BitWriter bit(String name);
- VarBinaryWriter varBinary(String name);
+ UInt1Writer uInt1(String name, TypeProtos.DataMode dataMode);
--- End diff --
Really not sure we want to do this. These writers are also used by the JSON
reader, for every field in every object. With this change, every request to get
a writer has to pass the mode. This seems like we are making the problem
far, far more complex than necessary.
The current code has rules for choosing the mode. In general, the mode is
OPTIONAL for single (scalar) values and REPEATED for repeated (array) values.
The mode passed here *cannot* be REPEATED: that just won't work. So, can we
pass REQUIRED?
Let's think about how these are used in JSON. In JSON, we discover fields
as we read each object. A field need not appear in the first object; it might
first appear 20 objects in. Then, after the 25th object, it may never appear again.
So, in JSON, we can *never* use REQUIRED; we *must* use OPTIONAL. Key
reason: JSON provides no schema and Drill cannot predict what the schema will
turn out to be.
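To make the discovery problem concrete, here is a tiny standalone sketch (plain Java collections, not Drill code; the names are invented for illustration) of the situation described above: a field that appears in only some objects can never be REQUIRED.

```java
import java.util.List;
import java.util.Map;

// Hypothetical illustration of JSON schema discovery. A field that first
// appears mid-read, or later disappears, forces its column to be nullable:
// rows seen while the field is absent have no value for it.
public class JsonDiscovery {
  public static void main(String[] args) {
    List<Map<String, Object>> objects = List.of(
        Map.of("a", 1),            // object 1: only "a" is present
        Map.of("a", 2, "b", "x"),  // object 2: "b" appears for the first time
        Map.of("a", 3));           // object 3: "b" is gone again
    // "b" is missing from objects 1 and 3, so its column cannot be REQUIRED.
    long missing = objects.stream().filter(o -> !o.containsKey("b")).count();
    System.out.println("objects missing b: " + missing);
  }
}
```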
Now, let's move to Parquet. Parquet does have a schema. In fact, with
Parquet, we know the schema before we read the first row. And, in Parquet, a
scalar column can be REQUIRED or OPTIONAL.
All of this suggests a better solution. (Indeed, this is the solution
implemented in the new column writer layer.) Allow Parquet to declare an
"early" schema. That is, prior to the first row, call methods that declare each
column with its cardinality.
Then, when reading the field, always pass just the name, as today. If the field
is new, it *must* be OPTIONAL. Otherwise, it uses whatever mode was declared
before.
Let's say this another way. Below these methods is a call to an
`addOrGet()` method. In Parquet, call those methods before the first row to do
the "add" part. Then, later, the method will do only the "get" part.
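A minimal sketch of that flow (the `ColumnRegistry` class and its method names are invented here for illustration; they are not Drill APIs): Parquet declares each column's mode before the first row, per-row lookups keep today's name-only signature, and a field first seen mid-read defaults to OPTIONAL.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the "early schema" idea discussed above.
public class EarlySchemaSketch {
  enum DataMode { REQUIRED, OPTIONAL, REPEATED }

  static class ColumnRegistry {
    private final Map<String, DataMode> modes = new HashMap<>();

    // The "add" part of addOrGet(): Parquet, which knows the schema before
    // the first row, declares each column's cardinality up front.
    void declare(String name, DataMode mode) {
      if (mode == DataMode.REPEATED) {
        // A scalar writer can never be REPEATED; reject it at declaration.
        throw new IllegalArgumentException("scalar column cannot be REPEATED");
      }
      modes.put(name, mode);
    }

    // The "get" part: per-row lookups pass only the name, as today.
    // A field first seen mid-read (the JSON case) must be OPTIONAL.
    DataMode modeFor(String name) {
      return modes.computeIfAbsent(name, n -> DataMode.OPTIONAL);
    }
  }

  public static void main(String[] args) {
    ColumnRegistry registry = new ColumnRegistry();
    registry.declare("id", DataMode.REQUIRED);    // Parquet-style early schema
    System.out.println(registry.modeFor("id"));   // declared mode is reused
    System.out.println(registry.modeFor("late")); // undeclared field: OPTIONAL
  }
}
```

With this shape, the mode never travels through the per-field writer lookup, which is why the existing method signatures can stay unchanged.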
The result is that you won't have to modify so many files, won't have to
complicate the APIs, and won't have to worry about a client passing REPEATED
mode for a scalar column.
---