Github user srowen commented on a diff in the pull request:
https://github.com/apache/spark/pull/21975#discussion_r207542400
--- Diff:
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java
---
@@ -174,13 +174,15 @@ void readBatch(int total, WritableColumnVector column) throws IOException {
       // TIMESTAMP_MILLIS encoded as INT64 can't be lazily decoded as we need to post process
       // the values to add microseconds precision.
+      PrimitiveType.PrimitiveTypeName typeName =
+        descriptor.getPrimitiveType().getPrimitiveTypeName();
       if (column.hasDictionary() || (rowId == 0 &&
-        (descriptor.getType() == PrimitiveType.PrimitiveTypeName.INT32 ||
-        (descriptor.getType() == PrimitiveType.PrimitiveTypeName.INT64 &&
+        (typeName == PrimitiveType.PrimitiveTypeName.INT32 ||
--- End diff --
Same comment about Parquet, though I remain a little uneasy about it. It's somewhat more likely for a different, older Parquet to come in at runtime, and if we're changing to use a newer method that doesn't exist in older versions, that could break. That said, I only left it alone because I didn't investigate. I also think that if an older Parquet is on the classpath we potentially have bigger problems. Especially if the newer field/method has existed for several versions, this seems OK.
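For what it's worth, one way to sanity-check this kind of risk (a sketch only, not something Spark does here; the helper name is mine) is to probe reflectively for the newer method before relying on it. An older jar on the classpath would make the lookup fail with `NoSuchMethodException` rather than blowing up later at the call site:

```java
import java.lang.reflect.Method;

public class ApiCheck {
  // Hypothetical helper: returns true iff the named class on the current
  // classpath declares a public no-arg method with the given name. Useful
  // for detecting whether a newer API (e.g. getPrimitiveType()) is present.
  static boolean hasNoArgMethod(String className, String methodName) {
    try {
      Class<?> cls = Class.forName(className);
      Method m = cls.getMethod(methodName);  // throws if the method is absent
      return m != null;
    } catch (ClassNotFoundException | NoSuchMethodException e) {
      // Class or method not on the classpath: an older library version.
      return false;
    }
  }

  public static void main(String[] args) {
    // java.lang.String has length() but no method named "bogus".
    System.out.println(hasNoArgMethod("java.lang.String", "length")); // true
    System.out.println(hasNoArgMethod("java.lang.String", "bogus"));  // false
  }
}
```

In practice a check like this would run once at startup so a version mismatch fails fast with a clear message, instead of surfacing as a `NoSuchMethodError` mid-query.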
---