Github user paul-rogers commented on a diff in the pull request:
https://github.com/apache/drill/pull/1047#discussion_r159799621
--- Diff: exec/vector/src/main/codegen/templates/BaseWriter.java ---
@@ -106,37 +114,37 @@
MapOrListWriter list(String name);
boolean isMapWriter();
boolean isListWriter();
- UInt1Writer uInt1(String name);
- UInt2Writer uInt2(String name);
- UInt4Writer uInt4(String name);
- UInt8Writer uInt8(String name);
- VarCharWriter varChar(String name);
- Var16CharWriter var16Char(String name);
- TinyIntWriter tinyInt(String name);
- SmallIntWriter smallInt(String name);
- IntWriter integer(String name);
- BigIntWriter bigInt(String name);
- Float4Writer float4(String name);
- Float8Writer float8(String name);
- BitWriter bit(String name);
- VarBinaryWriter varBinary(String name);
+ UInt1Writer uInt1(String name, TypeProtos.DataMode dataMode);
--- End diff --
Really not sure we want to do this. These writers are also used by the JSON
reader, for every field in every object. With this change, every request to get
a writer has to pass the mode. This seems like we are making the problem
far, far more complex than necessary.
The current code has rules for choosing the mode. In general, the mode is
OPTIONAL for single (scalar) values and REPEATED for repeated (array) values.
The mode passed here *cannot* be REPEATED: that just won't work. So, can we
pass REQUIRED?
Let's think about how these are used in JSON. In JSON, we discover fields
as we read each object. A field need not appear in the first object; it might
first appear 20 objects in. Then, after the 25th object, it may never appear again.
So, in JSON, we can *never* use REQUIRED; we *must* use OPTIONAL. Key
reason: JSON provides no schema and Drill cannot predict what the schema will
turn out to be.
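To make the discovery problem concrete, here is a tiny standalone sketch (plain Java collections, not Drill code; the names are invented for illustration) of the situation described above: a field that appears in only some objects can never be REQUIRED.

```java
import java.util.List;
import java.util.Map;

// Hypothetical illustration of JSON schema discovery. A field that first
// appears mid-read, or later disappears, forces its column to be nullable:
// rows seen while the field is absent have no value for it.
public class JsonDiscovery {
  public static void main(String[] args) {
    List<Map<String, Object>> objects = List.of(
        Map.of("a", 1),            // object 1: only "a" is present
        Map.of("a", 2, "b", "x"),  // object 2: "b" appears for the first time
        Map.of("a", 3));           // object 3: "b" is gone again
    // "b" is missing from objects 1 and 3, so its column cannot be REQUIRED.
    long missing = objects.stream().filter(o -> !o.containsKey("b")).count();
    System.out.println("objects missing b: " + missing);
  }
}
```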
Now, let's move to Parquet. Parquet does have a schema. In fact, with
Parquet, we know the schema before we read the first row. And, in Parquet, a
scalar column can be REQUIRED or OPTIONAL.
All of this suggests a better solution. (Indeed, this is the solution
implemented in the new column writer layer.) Allow Parquet to declare an
"early" schema. That is, prior to the first row, call methods that declare each
column with its cardinality.
Then, when reading the field, always pass just the name, as today. If the field
is new, it *must* be OPTIONAL. Otherwise, it uses whatever mode was declared
before.
Let's say this another way. Below these methods is a call to an
`addOrGet()` method. In Parquet, call those methods before the first row to do
the "add" part. Then, later, the method will do only the "get" part.
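A minimal sketch of that flow (the `ColumnRegistry` class and its method names are invented here for illustration; they are not Drill APIs): Parquet declares each column's mode before the first row, per-row lookups keep today's name-only signature, and a field first seen mid-read defaults to OPTIONAL.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the "early schema" idea discussed above.
public class EarlySchemaSketch {
  enum DataMode { REQUIRED, OPTIONAL, REPEATED }

  static class ColumnRegistry {
    private final Map<String, DataMode> modes = new HashMap<>();

    // The "add" part of addOrGet(): Parquet, which knows the schema before
    // the first row, declares each column's cardinality up front.
    void declare(String name, DataMode mode) {
      if (mode == DataMode.REPEATED) {
        // A scalar writer can never be REPEATED; reject it at declaration.
        throw new IllegalArgumentException("scalar column cannot be REPEATED");
      }
      modes.put(name, mode);
    }

    // The "get" part: per-row lookups pass only the name, as today.
    // A field first seen mid-read (the JSON case) must be OPTIONAL.
    DataMode modeFor(String name) {
      return modes.computeIfAbsent(name, n -> DataMode.OPTIONAL);
    }
  }

  public static void main(String[] args) {
    ColumnRegistry registry = new ColumnRegistry();
    registry.declare("id", DataMode.REQUIRED);    // Parquet-style early schema
    System.out.println(registry.modeFor("id"));   // declared mode is reused
    System.out.println(registry.modeFor("late")); // undeclared field: OPTIONAL
  }
}
```

With this shape, the mode never travels through the per-field writer lookup, which is why the existing method signatures can stay unchanged.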
The result is that you won't have to modify so many files, won't have to
complicate the APIs, and won't have to worry about a client passing REPEATED
mode for a scalar column.
---