shardulm94 commented on a change in pull request #1271:
URL: https://github.com/apache/iceberg/pull/1271#discussion_r464729558



##########
File path: data/src/main/java/org/apache/iceberg/data/orc/GenericOrcWriters.java
##########
@@ -288,8 +289,10 @@ public void nonNullWrite(int rowId, LocalDate data, ColumnVector output) {
     @Override
     public void nonNullWrite(int rowId, OffsetDateTime data, ColumnVector output) {
       TimestampColumnVector cv = (TimestampColumnVector) output;
-      cv.time[rowId] = data.toInstant().toEpochMilli(); // millis
-      cv.nanos[rowId] = (data.getNano() / 1_000) * 1_000; // truncate nanos to only keep microsecond precision
+      // millis
+      cv.time[rowId] = data.toInstant().toEpochMilli();
+      // truncate nanos to only keep microsecond precision
+      cv.nanos[rowId] = Math.floorDiv(data.getNano(), 1_000) * 1_000;

Review comment:
       `data.getNano()` always returns a non-negative integer, so is this change required?
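       For reference, a small standalone illustration (not PR code) of why the two forms only differ for negative inputs, while `getNano()` stays in `0..999_999_999`:

```java
// Standalone illustration only (not PR code): for the non-negative range
// returned by getNano(), truncating division and Math.floorDiv agree, so the
// change would only matter for (hypothetical) negative inputs.
public class FloorDivDemo {
  public static void main(String[] args) {
    int nanos = 123_456_789;
    System.out.println((nanos / 1_000) * 1_000);                 // 123456000
    System.out.println(Math.floorDiv(nanos, 1_000) * 1_000);     // 123456000

    int negative = -1_500;                                       // hypothetical negative value
    System.out.println((negative / 1_000) * 1_000);              // -1000: '/' truncates toward zero
    System.out.println(Math.floorDiv(negative, 1_000) * 1_000);  // -2000: floorDiv rounds toward -infinity
  }
}
```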

##########
File path: spark/src/main/java/org/apache/iceberg/spark/data/SparkOrcValueReaders.java
##########
@@ -195,7 +196,12 @@ public Long nonNullRead(ColumnVector vector, int row) {
     @Override
     public Decimal nonNullRead(ColumnVector vector, int row) {
       HiveDecimalWritable value = ((DecimalColumnVector) vector).vector[row];
-      return new Decimal().set(value.serialize64(value.scale()), value.precision(), value.scale());
+      BigDecimal decimal = new BigDecimal(BigInteger.valueOf(value.serialize64(value.scale())), value.scale());

Review comment:
       `value.serialize64()` takes an expected scale as a parameter, so I think the only change required to the original code is to pass our expected reader scale into `value.serialize64()` instead of `value.scale()`.
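       Roughly like this (a sketch only, assuming the reader keeps its expected `precision` and `scale` in fields as in the surrounding class):

```java
// Sketch of the suggestion: let serialize64() rescale to the reader's expected
// scale instead of the value's own scale, keeping the original long-based path.
HiveDecimalWritable value = ((DecimalColumnVector) vector).vector[row];
return new Decimal().set(value.serialize64(scale), precision, scale);
```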

##########
File path: data/src/main/java/org/apache/iceberg/data/orc/GenericOrcWriters.java
##########
@@ -324,14 +329,24 @@ public void nonNullWrite(int rowId, LocalDateTime data, ColumnVector output) {
 
     @Override
     public void nonNullWrite(int rowId, BigDecimal data, ColumnVector output) {
-      // TODO: validate precision and scale from schema
+      Preconditions.checkArgument(data.scale() == scale,
+          "Cannot write value as decimal(%s,%s), wrong scale: %s", precision, scale, data);
+      Preconditions.checkArgument(data.precision() <= precision && precision <= 18,

Review comment:
       Nit: `precision <= 18` check can be moved into the constructor
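       For example (constructor shape and names are assumed here, just to show where the static check could live):

```java
// Sketch only: the precision bound is known at construction time, so it can be
// validated once instead of on every write. Class and field names are assumed.
Decimal18Writer(int precision, int scale) {
  Preconditions.checkArgument(precision <= 18,
      "Invalid precision for a long-backed decimal writer: %s", precision);
  this.precision = precision;
  this.scale = scale;
}
```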

##########
File path: spark/src/main/java/org/apache/iceberg/spark/data/SparkOrcWriter.java
##########
@@ -237,9 +239,14 @@ public void addValue(int rowId, int column, SpecializedGetters data,
         output.noNulls = false;
         output.isNull[rowId] = true;
       } else {
+        Decimal decimal = data.getDecimal(column, precision, scale);
+        Preconditions.checkArgument(scale == decimal.scale(),
+            "Cannot write value as decimal(%s,%s), wrong scale: %s", precision, scale, decimal);
+        Preconditions.checkArgument(decimal.precision() <= precision && precision <= 18,
+            "Cannot write value as decimal(%s,%s), invalid precision: %s", decimal);

Review comment:
       This check seems redundant to me. If we are already passing our expected precision and scale to `data.getDecimal()`, won't the scale and precision of the returned decimal always match?

##########
File path: data/src/main/java/org/apache/iceberg/data/orc/GenericOrcWriters.java
##########
@@ -288,8 +289,10 @@ public void nonNullWrite(int rowId, LocalDate data, ColumnVector output) {
     @Override
     public void nonNullWrite(int rowId, OffsetDateTime data, ColumnVector output) {
       TimestampColumnVector cv = (TimestampColumnVector) output;
-      cv.time[rowId] = data.toInstant().toEpochMilli(); // millis
-      cv.nanos[rowId] = (data.getNano() / 1_000) * 1_000; // truncate nanos to only keep microsecond precision
+      // millis
+      cv.time[rowId] = data.toInstant().toEpochMilli();
+      // truncate nanos to only keep microsecond precision
+      cv.nanos[rowId] = Math.floorDiv(data.getNano(), 1_000) * 1_000;

Review comment:
       If it is indeed needed, we should also update `TimestampWriter`.

##########
File path: spark/src/main/java/org/apache/iceberg/spark/data/SparkOrcWriter.java
##########
@@ -261,9 +268,14 @@ public void addValue(int rowId, int column, SpecializedGetters data,
         output.isNull[rowId] = true;
       } else {
         output.isNull[rowId] = false;
-        ((DecimalColumnVector) output).vector[rowId].set(
-            HiveDecimal.create(data.getDecimal(column, precision, scale)
-                .toJavaBigDecimal()));
+
+        Decimal decimal = data.getDecimal(column, precision, scale);
+        Preconditions.checkArgument(scale == decimal.scale(),
+            "Cannot write value as decimal(%s,%s), wrong scale: %s", precision, scale, decimal);
+        Preconditions.checkArgument(decimal.precision() <= precision && precision <= 38,
+            "Cannot write value as decimal(%s,%s), invalid precision: %s", precision, scale, decimal);

Review comment:
       This check seems redundant to me. If we are already passing our expected precision and scale to `data.getDecimal()`, won't the scale and precision of the returned decimal always match?
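       If so, the write path could stay close to the original code from this hunk, e.g. (sketch only):

```java
// Sketch, assuming the precision/scale checks are dropped as redundant:
// getDecimal() is already asked for the expected precision and scale.
Decimal decimal = data.getDecimal(column, precision, scale);
((DecimalColumnVector) output).vector[rowId]
    .set(HiveDecimal.create(decimal.toJavaBigDecimal()));
```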

##########
File path: spark/src/main/java/org/apache/iceberg/spark/data/SparkOrcValueReaders.java
##########
@@ -212,6 +218,10 @@ public Decimal nonNullRead(ColumnVector vector, int row) {
     public Decimal nonNullRead(ColumnVector vector, int row) {
       BigDecimal value = ((DecimalColumnVector) vector).vector[row]
           .getHiveDecimal().bigDecimalValue();
+
+      Preconditions.checkArgument(value.precision() <= precision && precision <= 38,

Review comment:
       Nit: `precision <= 38` check can be moved into the constructor

##########
File path: data/src/main/java/org/apache/iceberg/data/orc/GenericOrcWriters.java
##########
@@ -340,7 +355,11 @@ public void nonNullWrite(int rowId, BigDecimal data, ColumnVector output) {
 
     @Override
     public void nonNullWrite(int rowId, BigDecimal data, ColumnVector output) {
-      // TODO: validate precision and scale from schema
+      Preconditions.checkArgument(data.scale() == scale,
+          "Cannot write value as decimal(%s,%s), wrong scale: %s", precision, scale, data);
+      Preconditions.checkArgument(data.precision() <= precision && precision <= 38,

Review comment:
       Nit: `precision <= 38` check can be moved into the constructor

##########
File path: spark/src/main/java/org/apache/iceberg/spark/data/SparkOrcValueReaders.java
##########
@@ -195,7 +196,12 @@ public Long nonNullRead(ColumnVector vector, int row) {
     @Override
     public Decimal nonNullRead(ColumnVector vector, int row) {
       HiveDecimalWritable value = ((DecimalColumnVector) vector).vector[row];
-      return new Decimal().set(value.serialize64(value.scale()), value.precision(), value.scale());
+      BigDecimal decimal = new BigDecimal(BigInteger.valueOf(value.serialize64(value.scale())), value.scale());
+
+      Preconditions.checkArgument(value.precision() <= precision && precision <= 18,

Review comment:
       Nit: `precision <= 18` check can be moved into the constructor




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


