[GitHub] drill pull request #805: Drill-4139: Exception while trying to prune partiti...

2017-09-30 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/drill/pull/805


---


[GitHub] drill pull request #805: Drill-4139: Exception while trying to prune partiti...

2017-09-20 Thread vvysotskyi
Github user vvysotskyi commented on a diff in the pull request:

https://github.com/apache/drill/pull/805#discussion_r140057495
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java ---
@@ -1054,8 +1057,36 @@ public void setMax(Object max) {
       return nulls;
     }

-    @Override public boolean hasSingleValue() {
-      return (max != null && min != null && max.equals(min));
+    /**
+     * Checks that the column chunk has a single value.
+     * Returns {@code true} if {@code min} and {@code max} are the same but not null
+     * and nulls count is 0 or equal to the rows count.
+     * <p>
+     * Returns {@code true} if {@code min} and {@code max} are null and the number of null values
+     * in the column chunk is equal to the rows count.
+     * <p>
+     * Comparison of nulls and rows count is needed for the cases:
+     * <ul>
+     * <li>column with primitive type has single value and null values</li>
+     * <li>column with primitive type has only null values, min/max couldn't be null,
+     * but column has single value</li>
+     * </ul>
+     *
+     * @param rowCount rows count in column chunk
+     * @return true if column has single value
+     */
+    @Override
+    public boolean hasSingleValue(long rowCount) {
+      if (nulls != null) {
+        if (min != null) {
+          // Objects.deepEquals() is used here, since min and max may be byte arrays
+          return Objects.deepEquals(min, max) && (nulls == 0 || nulls == rowCount);
--- End diff --

Statistics [1] for most parquet types use Java primitive types to store min and max values, so min/max cannot be null even if the table has null values.

[1] https://github.com/apache/parquet-mr/tree/e54ca615f213f5db6d34d9163c97eec98920d7a7/parquet-column/src/main/java/org/apache/parquet/column/statistics
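Putting the pieces of this thread together, the null-aware check can be sketched as a standalone class (hypothetical field types, not Drill's actual `ColumnMetadata`; it also folds in the reviewer's point that a non-null `min` rules out `nulls == rowCount`):

```java
import java.util.Objects;

// Hypothetical standalone sketch; field and parameter names mirror the
// diff, but this is not Drill's actual class.
class ColumnChunkStats {
  Object min;   // boxed primitive or byte[], null when no values were recorded
  Object max;
  Long nulls;   // null count, or null when unknown

  boolean hasSingleValue(long rowCount) {
    if (nulls != null) {
      if (min != null) {
        // Objects.deepEquals(), since min and max may be byte arrays;
        // a non-null min implies at least one non-null row, so only
        // nulls == 0 can make the chunk single-valued
        return Objects.deepEquals(min, max) && nulls == 0;
      }
      // no recorded min/max: single-valued only when every row is null
      return nulls == rowCount;
    }
    // null count unknown: fall back to the plain min/max comparison
    return min != null && Objects.deepEquals(min, max);
  }
}
```

The all-null branch is what the original `max != null && min != null` check missed: a chunk produced from a non-existent column has no min/max at all, yet still holds a single (null) value.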


---


[GitHub] drill pull request #805: Drill-4139: Exception while trying to prune partiti...

2017-09-20 Thread sachouche
Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/805#discussion_r140055857
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java ---
@@ -1054,8 +1057,36 @@ public void setMax(Object max) {
       return nulls;
     }

-    @Override public boolean hasSingleValue() {
-      return (max != null && min != null && max.equals(min));
+    /**
+     * Checks that the column chunk has a single value.
+     * Returns {@code true} if {@code min} and {@code max} are the same but not null
+     * and nulls count is 0 or equal to the rows count.
+     * <p>
+     * Returns {@code true} if {@code min} and {@code max} are null and the number of null values
+     * in the column chunk is equal to the rows count.
+     * <p>
+     * Comparison of nulls and rows count is needed for the cases:
+     * <ul>
+     * <li>column with primitive type has single value and null values</li>
+     * <li>column with primitive type has only null values, min/max couldn't be null,
+     * but column has single value</li>
+     * </ul>
+     *
+     * @param rowCount rows count in column chunk
+     * @return true if column has single value
+     */
+    @Override
+    public boolean hasSingleValue(long rowCount) {
+      if (nulls != null) {
+        if (min != null) {
+          // Objects.deepEquals() is used here, since min and max may be byte arrays
+          return Objects.deepEquals(min, max) && (nulls == 0 || nulls == rowCount);
--- End diff --

- if (min != null), then nulls cannot be equal to rowCount
- In this case, only nulls == 0 should be checked 


---


[GitHub] drill pull request #805: Drill-4139: Exception while trying to prune partiti...

2017-08-18 Thread vvysotskyi
Github user vvysotskyi commented on a diff in the pull request:

https://github.com/apache/drill/pull/805#discussion_r133927602
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
@@ -548,23 +567,57 @@ public void populatePruningVector(ValueVector v, int index, SchemaPath column, S
 NullableVarCharVector varCharVector = (NullableVarCharVector) v;
 Object s = partitionValueMap.get(f).get(column);
 byte[] bytes;
-if (s instanceof String) { // if the metadata was read from a JSON cache file it maybe a string type
-  bytes = ((String) s).getBytes();
-} else if (s instanceof Binary) {
-  bytes = ((Binary) s).getBytes();
-} else if (s instanceof byte[]) {
-  bytes = (byte[]) s;
+if (s == null) {
+  varCharVector.getMutator().setNull(index);
+  return;
 } else {
-  throw new UnsupportedOperationException("Unable to create column data for type: " + type);
+  bytes = getBytes(type, s);
 }
 varCharVector.getMutator().setSafe(index, bytes, 0, bytes.length);
 return;
   }
+  case INTERVAL: {
+NullableIntervalVector intervalVector = (NullableIntervalVector) v;
+Object s = partitionValueMap.get(f).get(column);
+byte[] bytes;
+if (s == null) {
+  intervalVector.getMutator().setNull(index);
+  return;
+} else {
+  bytes = getBytes(type, s);
+}
+intervalVector.getMutator().setSafe(index, 1,
+  ParquetReaderUtility.getIntFromLEBytes(bytes, 0),
+  ParquetReaderUtility.getIntFromLEBytes(bytes, 4),
+  ParquetReaderUtility.getIntFromLEBytes(bytes, 8));
+return;
+  }
   default:
throw new UnsupportedOperationException("Unsupported type: " + type);
 }
   }
 
+  /**
+   * Returns the sequence of bytes received from {@code Object source}.
+   *
+   * @param type   the column type
+   * @param source the source of the bytes sequence
+   * @return bytes sequence obtained from {@code Object source}
+   */
+  private byte[] getBytes(MinorType type, Object source) {
+byte[] bytes;
+if (source instanceof String) { // if the metadata was read from a JSON cache file it maybe a string type
+  bytes = Base64.decodeBase64(((String) source).getBytes());
--- End diff --

> Note, however, that the casting is unfortunate. The caller knows the type. Better to have:

The caller does not know the type, since `source` is retrieved from a map of objects, so casting has to be used there.

> Seems we'd also need a to/from Base64 method so that, on read, we do:

The value vector mutator interface does not have a method that allows passing bytes. Such methods are implemented in the concrete mutator implementations, so we could not create such a to/from Base64 method.

> Strings need not be Base64 encoded

They should be Base64 encoded, since the byte array may have an encoding that differs from UTF-8.

I moved this conversion from the string to a byte array into the `ParquetReaderUtility.correctBinaryInMetadataCache()` method. So now we consider the metadata version and choose how to decode the string.
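The UTF-8 point can be demonstrated directly: round-tripping arbitrary bytes through a `String` is lossy, while Base64 is not (a minimal sketch using `java.util.Base64` rather than the commons-codec `Base64` from the diff):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

public class EncodingDemo {
  public static void main(String[] args) {
    // A byte sequence that is not valid UTF-8 (e.g. Latin-1 or raw binary).
    byte[] raw = {(byte) 0xC3, (byte) 0x28, (byte) 0xFF};

    // Round-tripping through a UTF-8 String is lossy: invalid sequences
    // are replaced with U+FFFD before being re-encoded.
    byte[] viaString = new String(raw, StandardCharsets.UTF_8)
        .getBytes(StandardCharsets.UTF_8);
    System.out.println(Arrays.equals(raw, viaString)); // prints "false"

    // Base64 round-trips arbitrary bytes exactly.
    byte[] viaBase64 = Base64.getDecoder()
        .decode(Base64.getEncoder().encodeToString(raw));
    System.out.println(Arrays.equals(raw, viaBase64)); // prints "true"
  }
}
```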


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #805: Drill-4139: Exception while trying to prune partiti...

2017-08-18 Thread vvysotskyi
Github user vvysotskyi commented on a diff in the pull request:

https://github.com/apache/drill/pull/805#discussion_r126637301
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
@@ -444,123 +478,235 @@ public MajorType getTypeForColumn(SchemaPath schemaPath) {

  public void populatePruningVector(ValueVector v, int index, SchemaPath column, String file) {
    String f = Path.getPathWithoutSchemeAndAuthority(new Path(file)).toString();
-    MinorType type = getTypeForColumn(column).getMinorType();
+    MajorType majorType = getTypeForColumn(column);
+    MinorType type = majorType.getMinorType();
 switch (type) {
+  case BIT: {
+NullableBitVector bitVector = (NullableBitVector) v;
+Boolean value = (Boolean) partitionValueMap.get(f).get(column);
+if (value == null) {
+  bitVector.getMutator().setNull(index);
+} else {
+  bitVector.getMutator().setSafe(index, value ? 1 : 0);
+}
+return;
+  }
   case INT: {
 NullableIntVector intVector = (NullableIntVector) v;
 Integer value = (Integer) partitionValueMap.get(f).get(column);
-intVector.getMutator().setSafe(index, value);
+if (value == null) {
+  intVector.getMutator().setNull(index);
+} else {
+  intVector.getMutator().setSafe(index, value);
+}
 return;
   }
   case SMALLINT: {
 NullableSmallIntVector smallIntVector = (NullableSmallIntVector) v;
 Integer value = (Integer) partitionValueMap.get(f).get(column);
-smallIntVector.getMutator().setSafe(index, value.shortValue());
+if (value == null) {
+  smallIntVector.getMutator().setNull(index);
+} else {
+  smallIntVector.getMutator().setSafe(index, value.shortValue());
+}
 return;
   }
   case TINYINT: {
 NullableTinyIntVector tinyIntVector = (NullableTinyIntVector) v;
 Integer value = (Integer) partitionValueMap.get(f).get(column);
-tinyIntVector.getMutator().setSafe(index, value.byteValue());
+if (value == null) {
+  tinyIntVector.getMutator().setNull(index);
+} else {
+  tinyIntVector.getMutator().setSafe(index, value.byteValue());
+}
 return;
   }
   case UINT1: {
 NullableUInt1Vector intVector = (NullableUInt1Vector) v;
 Integer value = (Integer) partitionValueMap.get(f).get(column);
-intVector.getMutator().setSafe(index, value.byteValue());
+if (value == null) {
+  intVector.getMutator().setNull(index);
+} else {
+  intVector.getMutator().setSafe(index, value.byteValue());
+}
 return;
   }
   case UINT2: {
 NullableUInt2Vector intVector = (NullableUInt2Vector) v;
 Integer value = (Integer) partitionValueMap.get(f).get(column);
-intVector.getMutator().setSafe(index, (char) value.shortValue());
+if (value == null) {
+  intVector.getMutator().setNull(index);
+} else {
+  intVector.getMutator().setSafe(index, (char) value.shortValue());
+}
 return;
   }
   case UINT4: {
 NullableUInt4Vector intVector = (NullableUInt4Vector) v;
 Integer value = (Integer) partitionValueMap.get(f).get(column);
-intVector.getMutator().setSafe(index, value);
+if (value == null) {
+  intVector.getMutator().setNull(index);
+} else {
+  intVector.getMutator().setSafe(index, value);
+}
 return;
   }
   case BIGINT: {
 NullableBigIntVector bigIntVector = (NullableBigIntVector) v;
 Long value = (Long) partitionValueMap.get(f).get(column);
-bigIntVector.getMutator().setSafe(index, value);
+if (value == null) {
+  bigIntVector.getMutator().setNull(index);
+} else {
+  bigIntVector.getMutator().setSafe(index, value);
+}
 return;
   }
   case FLOAT4: {
 NullableFloat4Vector float4Vector = (NullableFloat4Vector) v;
 Float value = (Float) partitionValueMap.get(f).get(column);
-float4Vector.getMutator().setSafe(index, value);
+if (value == null) {
+  float4Vector.getMutator().setNull(index);
+} else {
+  float4Vector.getMutator().setSafe(index, value);
+}
 return;
   }
   case FLOAT8: {
 NullableFloat8Vector float8Vector = 

[GitHub] drill pull request #805: Drill-4139: Exception while trying to prune partiti...

2017-08-18 Thread vvysotskyi
Github user vvysotskyi commented on a diff in the pull request:

https://github.com/apache/drill/pull/805#discussion_r133933369
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java ---
@@ -181,6 +186,101 @@ else if (parquetTableMetadata instanceof Metadata.ParquetTableMetadata_v2 &&
   }
 
   /**
+   * Checks that the metadata file has version less than
+   * the version where was changed the serialization of BINARY values
+   * and assigns byte arrays to min/max values obtained from the deserialized string.
+   *
+   * @param parquetTableMetadata table metadata that should be corrected
+   */
--- End diff --

The `ColumnMetadata_v1` class serializes / deserializes the `primitiveType` field, so it is not hard to check whether a field has the `BINARY` type.
But `ColumnMetadata_v2` and `ColumnMetadata_v3` do not serialize this field. They have the classes `ColumnTypeMetadata_v2` and `ColumnTypeMetadata_v3` respectively, which contain information about the field name, primitive and original types. In this method, we obtain the fields with the `BINARY` type by taking information about the field type from `ColumnTypeMetadata_v2` or `ColumnTypeMetadata_v3`.
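A loose sketch of the lookup described above; the class and field names are illustrative stand-ins, not Drill's actual `ColumnTypeMetadata_v2`/`_v3` API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

// Illustrative only: loosely mirrors what ColumnTypeMetadata carries,
// not Drill's actual classes.
class ColumnTypeInfo {
  final String[] name;          // column path, split into segments
  final String primitiveType;   // e.g. "BINARY", "INT32"

  ColumnTypeInfo(String[] name, String primitiveType) {
    this.name = name;
    this.primitiveType = primitiveType;
  }
}

public class BinaryColumnLookup {
  // Collect the columns whose primitive type is BINARY, as described above.
  static List<List<String>> binaryColumns(Collection<ColumnTypeInfo> types) {
    List<List<String>> result = new ArrayList<>();
    for (ColumnTypeInfo t : types) {
      if ("BINARY".equals(t.primitiveType)) {
        result.add(Arrays.asList(t.name));
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<ColumnTypeInfo> types = Arrays.asList(
        new ColumnTypeInfo(new String[]{"name"}, "BINARY"),
        new ColumnTypeInfo(new String[]{"id"}, "INT64"));
    System.out.println(binaryColumns(types)); // prints "[[name]]"
  }
}
```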


---


[GitHub] drill pull request #805: Drill-4139: Exception while trying to prune partiti...

2017-08-18 Thread vvysotskyi
Github user vvysotskyi commented on a diff in the pull request:

https://github.com/apache/drill/pull/805#discussion_r126640837
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
@@ -444,123 +478,235 @@ public MajorType getTypeForColumn(SchemaPath schemaPath) {

  public void populatePruningVector(ValueVector v, int index, SchemaPath column, String file) {
    String f = Path.getPathWithoutSchemeAndAuthority(new Path(file)).toString();
-    MinorType type = getTypeForColumn(column).getMinorType();
+    MajorType majorType = getTypeForColumn(column);
+    MinorType type = majorType.getMinorType();
 switch (type) {
+  case BIT: {
+NullableBitVector bitVector = (NullableBitVector) v;
+Boolean value = (Boolean) partitionValueMap.get(f).get(column);
+if (value == null) {
+  bitVector.getMutator().setNull(index);
+} else {
+  bitVector.getMutator().setSafe(index, value ? 1 : 0);
+}
+return;
+  }
   case INT: {
 NullableIntVector intVector = (NullableIntVector) v;
 Integer value = (Integer) partitionValueMap.get(f).get(column);
-intVector.getMutator().setSafe(index, value);
+if (value == null) {
+  intVector.getMutator().setNull(index);
+} else {
+  intVector.getMutator().setSafe(index, value);
+}
 return;
   }
   case SMALLINT: {
 NullableSmallIntVector smallIntVector = (NullableSmallIntVector) v;
 Integer value = (Integer) partitionValueMap.get(f).get(column);
-smallIntVector.getMutator().setSafe(index, value.shortValue());
+if (value == null) {
+  smallIntVector.getMutator().setNull(index);
+} else {
+  smallIntVector.getMutator().setSafe(index, value.shortValue());
+}
 return;
   }
   case TINYINT: {
 NullableTinyIntVector tinyIntVector = (NullableTinyIntVector) v;
 Integer value = (Integer) partitionValueMap.get(f).get(column);
-tinyIntVector.getMutator().setSafe(index, value.byteValue());
+if (value == null) {
+  tinyIntVector.getMutator().setNull(index);
+} else {
+  tinyIntVector.getMutator().setSafe(index, value.byteValue());
+}
 return;
   }
   case UINT1: {
 NullableUInt1Vector intVector = (NullableUInt1Vector) v;
 Integer value = (Integer) partitionValueMap.get(f).get(column);
-intVector.getMutator().setSafe(index, value.byteValue());
+if (value == null) {
+  intVector.getMutator().setNull(index);
+} else {
+  intVector.getMutator().setSafe(index, value.byteValue());
+}
 return;
   }
   case UINT2: {
 NullableUInt2Vector intVector = (NullableUInt2Vector) v;
 Integer value = (Integer) partitionValueMap.get(f).get(column);
-intVector.getMutator().setSafe(index, (char) value.shortValue());
+if (value == null) {
+  intVector.getMutator().setNull(index);
+} else {
+  intVector.getMutator().setSafe(index, (char) value.shortValue());
+}
 return;
   }
   case UINT4: {
 NullableUInt4Vector intVector = (NullableUInt4Vector) v;
 Integer value = (Integer) partitionValueMap.get(f).get(column);
-intVector.getMutator().setSafe(index, value);
+if (value == null) {
+  intVector.getMutator().setNull(index);
+} else {
+  intVector.getMutator().setSafe(index, value);
+}
 return;
   }
   case BIGINT: {
 NullableBigIntVector bigIntVector = (NullableBigIntVector) v;
 Long value = (Long) partitionValueMap.get(f).get(column);
-bigIntVector.getMutator().setSafe(index, value);
+if (value == null) {
+  bigIntVector.getMutator().setNull(index);
+} else {
+  bigIntVector.getMutator().setSafe(index, value);
+}
 return;
   }
   case FLOAT4: {
 NullableFloat4Vector float4Vector = (NullableFloat4Vector) v;
 Float value = (Float) partitionValueMap.get(f).get(column);
-float4Vector.getMutator().setSafe(index, value);
+if (value == null) {
+  float4Vector.getMutator().setNull(index);
+} else {
+  float4Vector.getMutator().setSafe(index, value);
+}
 return;
   }
   case FLOAT8: {
 NullableFloat8Vector float8Vector = 

[GitHub] drill pull request #805: Drill-4139: Exception while trying to prune partiti...

2017-07-10 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/805#discussion_r126565179
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
@@ -444,123 +478,235 @@ public MajorType getTypeForColumn(SchemaPath schemaPath) {

  public void populatePruningVector(ValueVector v, int index, SchemaPath column, String file) {
    String f = Path.getPathWithoutSchemeAndAuthority(new Path(file)).toString();
-    MinorType type = getTypeForColumn(column).getMinorType();
+    MajorType majorType = getTypeForColumn(column);
+    MinorType type = majorType.getMinorType();
 switch (type) {
+  case BIT: {
+NullableBitVector bitVector = (NullableBitVector) v;
+Boolean value = (Boolean) partitionValueMap.get(f).get(column);
+if (value == null) {
+  bitVector.getMutator().setNull(index);
+} else {
+  bitVector.getMutator().setSafe(index, value ? 1 : 0);
+}
+return;
+  }
   case INT: {
 NullableIntVector intVector = (NullableIntVector) v;
 Integer value = (Integer) partitionValueMap.get(f).get(column);
-intVector.getMutator().setSafe(index, value);
+if (value == null) {
+  intVector.getMutator().setNull(index);
+} else {
+  intVector.getMutator().setSafe(index, value);
+}
 return;
   }
   case SMALLINT: {
 NullableSmallIntVector smallIntVector = (NullableSmallIntVector) v;
 Integer value = (Integer) partitionValueMap.get(f).get(column);
-smallIntVector.getMutator().setSafe(index, value.shortValue());
+if (value == null) {
+  smallIntVector.getMutator().setNull(index);
+} else {
+  smallIntVector.getMutator().setSafe(index, value.shortValue());
+}
 return;
   }
   case TINYINT: {
 NullableTinyIntVector tinyIntVector = (NullableTinyIntVector) v;
 Integer value = (Integer) partitionValueMap.get(f).get(column);
-tinyIntVector.getMutator().setSafe(index, value.byteValue());
+if (value == null) {
+  tinyIntVector.getMutator().setNull(index);
+} else {
+  tinyIntVector.getMutator().setSafe(index, value.byteValue());
+}
 return;
   }
   case UINT1: {
 NullableUInt1Vector intVector = (NullableUInt1Vector) v;
 Integer value = (Integer) partitionValueMap.get(f).get(column);
-intVector.getMutator().setSafe(index, value.byteValue());
+if (value == null) {
+  intVector.getMutator().setNull(index);
+} else {
+  intVector.getMutator().setSafe(index, value.byteValue());
+}
 return;
   }
   case UINT2: {
 NullableUInt2Vector intVector = (NullableUInt2Vector) v;
 Integer value = (Integer) partitionValueMap.get(f).get(column);
-intVector.getMutator().setSafe(index, (char) value.shortValue());
+if (value == null) {
+  intVector.getMutator().setNull(index);
+} else {
+  intVector.getMutator().setSafe(index, (char) value.shortValue());
+}
 return;
   }
   case UINT4: {
 NullableUInt4Vector intVector = (NullableUInt4Vector) v;
 Integer value = (Integer) partitionValueMap.get(f).get(column);
-intVector.getMutator().setSafe(index, value);
+if (value == null) {
+  intVector.getMutator().setNull(index);
+} else {
+  intVector.getMutator().setSafe(index, value);
+}
 return;
   }
   case BIGINT: {
 NullableBigIntVector bigIntVector = (NullableBigIntVector) v;
 Long value = (Long) partitionValueMap.get(f).get(column);
-bigIntVector.getMutator().setSafe(index, value);
+if (value == null) {
+  bigIntVector.getMutator().setNull(index);
+} else {
+  bigIntVector.getMutator().setSafe(index, value);
+}
 return;
   }
   case FLOAT4: {
 NullableFloat4Vector float4Vector = (NullableFloat4Vector) v;
 Float value = (Float) partitionValueMap.get(f).get(column);
-float4Vector.getMutator().setSafe(index, value);
+if (value == null) {
+  float4Vector.getMutator().setNull(index);
+} else {
+  float4Vector.getMutator().setSafe(index, value);
+}
 return;
   }
   case FLOAT8: {
 NullableFloat8Vector float8Vector = 

[GitHub] drill pull request #805: Drill-4139: Exception while trying to prune partiti...

2017-07-10 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/805#discussion_r126568096
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java ---
@@ -181,6 +186,101 @@ else if (parquetTableMetadata instanceof Metadata.ParquetTableMetadata_v2 &&
   }
 
   /**
+   * Checks that the metadata file has version less than
+   * the version where was changed the serialization of BINARY values
+   * and assigns byte arrays to min/max values obtained from the deserialized string.
+   *
+   * @param parquetTableMetadata table metadata that should be corrected
+   */
--- End diff --

This is very hard to follow for those of us not familiar with the details 
of the metadata file format. Please include a comment in the PR conversation 
that shows an example of the old and new formats. That will make it easy to 
double-check that the logic makes sense.
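For illustration only, the requested before/after might look like this (field names and values are invented, not taken from a real metadata cache file):

```
// Hypothetical excerpt of a metadata cache entry for a BINARY column.

// Older versions: the min/max bytes were written as a raw JSON string,
// which is lossy for non-UTF-8 data:
"minValue" : "Ann", "maxValue" : "Zoe"

// Newer versions: the bytes are Base64 encoded before serialization:
"minValue" : "QW5u", "maxValue" : "Wm9l"
```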


---


[GitHub] drill pull request #805: Drill-4139: Exception while trying to prune partiti...

2017-07-10 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/805#discussion_r126567484
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
@@ -548,23 +567,57 @@ public void populatePruningVector(ValueVector v, int index, SchemaPath column, S
 NullableVarCharVector varCharVector = (NullableVarCharVector) v;
 Object s = partitionValueMap.get(f).get(column);
 byte[] bytes;
-if (s instanceof String) { // if the metadata was read from a JSON cache file it maybe a string type
-  bytes = ((String) s).getBytes();
-} else if (s instanceof Binary) {
-  bytes = ((Binary) s).getBytes();
-} else if (s instanceof byte[]) {
-  bytes = (byte[]) s;
+if (s == null) {
+  varCharVector.getMutator().setNull(index);
+  return;
 } else {
-  throw new UnsupportedOperationException("Unable to create column data for type: " + type);
+  bytes = getBytes(type, s);
 }
 varCharVector.getMutator().setSafe(index, bytes, 0, bytes.length);
 return;
   }
+  case INTERVAL: {
+NullableIntervalVector intervalVector = (NullableIntervalVector) v;
+Object s = partitionValueMap.get(f).get(column);
+byte[] bytes;
+if (s == null) {
+  intervalVector.getMutator().setNull(index);
+  return;
+} else {
+  bytes = getBytes(type, s);
+}
+intervalVector.getMutator().setSafe(index, 1,
+  ParquetReaderUtility.getIntFromLEBytes(bytes, 0),
+  ParquetReaderUtility.getIntFromLEBytes(bytes, 4),
+  ParquetReaderUtility.getIntFromLEBytes(bytes, 8));
+return;
+  }
   default:
throw new UnsupportedOperationException("Unsupported type: " + type);
 }
   }
 
+  /**
+   * Returns the sequence of bytes received from {@code Object source}.
+   *
+   * @param type   the column type
+   * @param source the source of the bytes sequence
+   * @return bytes sequence obtained from {@code Object source}
+   */
+  private byte[] getBytes(MinorType type, Object source) {
+byte[] bytes;
+if (source instanceof String) { // if the metadata was read from a JSON cache file it maybe a string type
+  bytes = Base64.decodeBase64(((String) source).getBytes());
--- End diff --

The data here should be assumed to be binary. Binary can have any 
characters, even JSON symbols. For example:

"value:" : "look a quote",\n"bogus" : { 10, "foo" }"

The above will make a mash of any JSON structure. Better to encode binary 
data as Base 64, which is designed to fit within a text format.

Note, however, that the casting is unfortunate. The caller knows the type. Better to have:

byte[] getBytes(String value) ...
byte[] getBytes(byte[] value) ...
byte[] getBytes(Binary value) ...

Seems we'd also need a to/from Base64 method so that, on read, we do:

* Get string value.
* Pass string value to Base64 decoder, get bytes
* Pass bytes to vector, using vector's code to understand byte meaning

On encoding:

* Get the vector value as a byte array
* Base64 encode the array
* Write it to JSON as a string

All that said, Strings need not be Base64 encoded if we have a JSON encoder 
that will escape JSON special characters. This works as long as the JSON file 
itself is UTF-8 encoded because then each Unicode string has a UTF-8 equivalent.
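The read and write steps listed above can be sketched with `java.util.Base64` (a stand-in here; the diff itself uses commons-codec's `Base64`):

```java
import java.util.Arrays;
import java.util.Base64;

public class BinaryMetadataRoundTrip {
  // On write: encode the raw min/max bytes so they survive JSON serialization.
  static String encodeForJson(byte[] raw) {
    return Base64.getEncoder().encodeToString(raw);
  }

  // On read: recover the original bytes from the stored JSON string value.
  static byte[] decodeFromJson(String stored) {
    return Base64.getDecoder().decode(stored);
  }

  public static void main(String[] args) {
    // Bytes that are not valid UTF-8 and would corrupt a raw JSON string.
    byte[] raw = {(byte) 0xC0, '"', '\\', (byte) 0xFF, 0x00};
    String json = encodeForJson(raw);
    byte[] back = decodeFromJson(json);
    System.out.println(Arrays.equals(raw, back)); // prints "true"
  }
}
```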


---


[GitHub] drill pull request #805: Drill-4139: Exception while trying to prune partiti...

2017-07-10 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/805#discussion_r126564587
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
@@ -444,123 +478,235 @@ public MajorType getTypeForColumn(SchemaPath schemaPath) {

  public void populatePruningVector(ValueVector v, int index, SchemaPath column, String file) {
    String f = Path.getPathWithoutSchemeAndAuthority(new Path(file)).toString();
-    MinorType type = getTypeForColumn(column).getMinorType();
+    MajorType majorType = getTypeForColumn(column);
+    MinorType type = majorType.getMinorType();
 switch (type) {
+  case BIT: {
+NullableBitVector bitVector = (NullableBitVector) v;
+Boolean value = (Boolean) partitionValueMap.get(f).get(column);
+if (value == null) {
+  bitVector.getMutator().setNull(index);
+} else {
+  bitVector.getMutator().setSafe(index, value ? 1 : 0);
+}
+return;
+  }
   case INT: {
 NullableIntVector intVector = (NullableIntVector) v;
 Integer value = (Integer) partitionValueMap.get(f).get(column);
-intVector.getMutator().setSafe(index, value);
+if (value == null) {
+  intVector.getMutator().setNull(index);
+} else {
+  intVector.getMutator().setSafe(index, value);
+}
 return;
   }
   case SMALLINT: {
 NullableSmallIntVector smallIntVector = (NullableSmallIntVector) v;
 Integer value = (Integer) partitionValueMap.get(f).get(column);
-smallIntVector.getMutator().setSafe(index, value.shortValue());
+if (value == null) {
+  smallIntVector.getMutator().setNull(index);
+} else {
+  smallIntVector.getMutator().setSafe(index, value.shortValue());
+}
 return;
   }
   case TINYINT: {
 NullableTinyIntVector tinyIntVector = (NullableTinyIntVector) v;
 Integer value = (Integer) partitionValueMap.get(f).get(column);
-tinyIntVector.getMutator().setSafe(index, value.byteValue());
+if (value == null) {
+  tinyIntVector.getMutator().setNull(index);
+} else {
+  tinyIntVector.getMutator().setSafe(index, value.byteValue());
+}
 return;
   }
   case UINT1: {
 NullableUInt1Vector intVector = (NullableUInt1Vector) v;
 Integer value = (Integer) partitionValueMap.get(f).get(column);
-intVector.getMutator().setSafe(index, value.byteValue());
+if (value == null) {
+  intVector.getMutator().setNull(index);
+} else {
+  intVector.getMutator().setSafe(index, value.byteValue());
+}
 return;
   }
   case UINT2: {
 NullableUInt2Vector intVector = (NullableUInt2Vector) v;
 Integer value = (Integer) partitionValueMap.get(f).get(column);
-intVector.getMutator().setSafe(index, (char) value.shortValue());
+if (value == null) {
+  intVector.getMutator().setNull(index);
+} else {
+  intVector.getMutator().setSafe(index, (char) value.shortValue());
+}
 return;
   }
   case UINT4: {
 NullableUInt4Vector intVector = (NullableUInt4Vector) v;
 Integer value = (Integer) partitionValueMap.get(f).get(column);
-intVector.getMutator().setSafe(index, value);
+if (value == null) {
+  intVector.getMutator().setNull(index);
+} else {
+  intVector.getMutator().setSafe(index, value);
+}
 return;
   }
   case BIGINT: {
 NullableBigIntVector bigIntVector = (NullableBigIntVector) v;
 Long value = (Long) partitionValueMap.get(f).get(column);
-bigIntVector.getMutator().setSafe(index, value);
+if (value == null) {
+  bigIntVector.getMutator().setNull(index);
+} else {
+  bigIntVector.getMutator().setSafe(index, value);
+}
 return;
   }
   case FLOAT4: {
 NullableFloat4Vector float4Vector = (NullableFloat4Vector) v;
 Float value = (Float) partitionValueMap.get(f).get(column);
-float4Vector.getMutator().setSafe(index, value);
+if (value == null) {
+  float4Vector.getMutator().setNull(index);
+} else {
+  float4Vector.getMutator().setSafe(index, value);
+}
 return;
   }
   case FLOAT8: {
 NullableFloat8Vector float8Vector = 

[GitHub] drill pull request #805: Drill-4139: Exception while trying to prune partiti...

2017-07-10 Thread vvysotskyi
Github user vvysotskyi commented on a diff in the pull request:

https://github.com/apache/drill/pull/805#discussion_r126443099
  
--- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetMetadataCache.java ---
@@ -445,12 +448,148 @@ public void testMetadataCacheAbsolutePaths() throws Exception {
   test("drop table if exists %s", RELATIVE_PATHS_METADATA);
 }
   }
+  @Test // DRILL-4139
+  public void testBooleanPartitionPruning() throws Exception {
--- End diff --

Thanks, fixed.


---


[GitHub] drill pull request #805: Drill-4139: Exception while trying to prune partiti...

2017-07-10 Thread vvysotskyi
Github user vvysotskyi commented on a diff in the pull request:

https://github.com/apache/drill/pull/805#discussion_r126438246
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java ---
@@ -1008,8 +1008,24 @@ public void setMax(Object max) {
   return nulls;
 }
 
-@Override public boolean hasSingleValue() {
-  return (max != null && min != null && max.equals(min));
+/**
+ * Checks that the column chunk has single value.
+ * Returns true if min and max are the same, but not null.
+ * Returns true if min and max are null and the number of null values
+ * in the column chunk is greater than 0.
+ *
+ * @return true if column has single value
--- End diff --

Thanks for such a detailed analysis. This bug is also reproduced on the master version, where it appears due to the wrong result of this method. Added a comparison of `nulls` with the rows count to fix this bug and to allow returning true for the case when the field has only nulls.


---


[GitHub] drill pull request #805: Drill-4139: Exception while trying to prune partiti...

2017-07-05 Thread jinfengni
Github user jinfengni commented on a diff in the pull request:

https://github.com/apache/drill/pull/805#discussion_r125785209
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetMetadataCache.java
 ---
@@ -445,12 +448,148 @@ public void testMetadataCacheAbsolutePaths() throws 
Exception {
   test("drop table if exists %s", RELATIVE_PATHS_METADATA);
 }
   }
+  @Test // DRILL-4139
+  public void testBooleanPartitionPruning() throws Exception {
--- End diff --

In addition to the "=" comparison, can you please try with an "is null" / "is not null" predicate, since one of this PR's objectives is to fix partition pruning dealing with null values. I saw another incorrect query result, although I'm not sure whether it's a bug in the base code or a bug in this PR.

```
create table dfs.tmp.`t6/a` as select col_notexist as mykey from 
cp.`tpch/region.parquet`;
create table dfs.tmp.`t6/b` as select 100 as mykey from 
cp.`tpch/region.parquet`;
select mykey from dfs.tmp.t6;
+--------+
| mykey  |
+--------+
| null   |
| null   |
| null   |
| null   |
| null   |
| 100    |
| 100    |
| 100    |
| 100    |
| 100    |
+--------+

select mykey from dfs.tmp.t6 where mykey is null;
+--------+
| mykey  |
+--------+
+--------+

``` 

Notice that the last query returns 0 rows, whereas it should return 5 rows.




[GitHub] drill pull request #805: Drill-4139: Exception while trying to prune partiti...

2017-07-05 Thread jinfengni
Github user jinfengni commented on a diff in the pull request:

https://github.com/apache/drill/pull/805#discussion_r125784283
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java 
---
@@ -1696,9 +1712,24 @@ public ColumnMetadata_v3(String[] name, 
PrimitiveTypeName primitiveType, Object
   return nulls;
 }
 
+/**
+ * Checks that the column chunk has single value.
+ * Returns true if minValue and maxValue are the same, but not null.
+ * Returns true if minValue and maxValue are null and the number of 
null values
+ * in the column chunk is greater than 0.
+ *
+ * @return true if column has single value
+ */
 @Override
 public boolean hasSingleValue() {
-  return (minValue !=null && maxValue != null && 
minValue.equals(maxValue));
+  if (nulls != null) {
+if (minValue != null) {
+  return minValue.equals(maxValue);
--- End diff --

Same comment as above. I think in both v1 & v3, we should check nulls = 0 (there are no nulls in the column).




[GitHub] drill pull request #805: Drill-4139: Exception while trying to prune partiti...

2017-07-05 Thread jinfengni
Github user jinfengni commented on a diff in the pull request:

https://github.com/apache/drill/pull/805#discussion_r125784066
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java 
---
@@ -1008,8 +1008,24 @@ public void setMax(Object max) {
   return nulls;
 }
 
-@Override public boolean hasSingleValue() {
-  return (max != null && min != null && max.equals(min));
+/**
+ * Checks that the column chunk has single value.
+ * Returns true if min and max are the same, but not null.
+ * Returns true if min and max are null and the number of null values
+ * in the column chunk is greater than 0.
+ *
+ * @return true if column has single value
--- End diff --

My understanding is that hasSingleValue() returns true if the column metadata shows only one single value. A null value is also counted as a value distinct from any non-null value.

Therefore, for the case where a column has min != null && max != null && min.equals(max) && nulls != null && nulls > 0, it should return false. However, in both the v1 and v3 implementations, it returns true.

That would actually lead to wrong query results. A simple reproduction:

```
create table dfs.tmp.`t5/a` as select 100 as mykey from 
cp.`tpch/nation.parquet` union all select col_notexist from 
cp.`tpch/region.parquet`;

create table dfs.tmp.`t5/b` as select 200 as mykey from 
cp.`tpch/nation.parquet` union all select col_notexist from 
cp.`tpch/region.parquet`;
```

We got two files, each having one single unique non-null value, plus null 
values. Now query the two files:

```
select mykey from dfs.tmp.`t5` where mykey = 100;
+--------+
| mykey  |
+--------+
| 100    |
| 100    |
| 100    |
| 100    |
| 100    |
| 100    |
| 100    |
| 100    |
| 100    |
| 100    |
| 100    |
| 100    |
| 100    |
| 100    |
| 100    |
| 100    |
| 100    |
| 100    |
| 100    |
| 100    |
| 100    |
| 100    |
| 100    |
| 100    |
| 100    |
| null   |
| null   |
| null   |
| null   |
| null   |
+--------+
30 rows selected (0.246 seconds)

```
Apparently, those 5 nulls should not be returned. 

I applied the 3 commits in this PR on top of today's master branch.

```
select * from sys.version;

+------------------+-------------------------------------------+-------------------------------------------------------------------------------+----------------------------+------------------+----------------------------+
| version          | commit_id                                 | commit_message                                                                | commit_time                | build_email      | build_time                 |
+------------------+-------------------------------------------+-------------------------------------------------------------------------------+----------------------------+------------------+----------------------------+
| 1.11.0-SNAPSHOT  | cad6e4dc950aa4a95ad20515ce5abd9c546d3e5d  | DRILL-4139: Fix loss of scale value for DECIMAL in parquet partition pruning  | 05.07.2017 @ 12:05:25 PDT  | j...@apache.org  | 05.07.2017 @ 12:06:07 PDT  |
+------------------+-------------------------------------------+-------------------------------------------------------------------------------+----------------------------+------------------+----------------------------+
```




[GitHub] drill pull request #805: Drill-4139: Exception while trying to prune partiti...

2017-05-24 Thread vvysotskyi
Github user vvysotskyi commented on a diff in the pull request:

https://github.com/apache/drill/pull/805#discussion_r118204069
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetMetadataCache.java
 ---
@@ -398,10 +399,115 @@ public void testDrill4877() throws Exception {
 
   }
 
+  @Test // DRILL-4139
+  public void testBooleanPartitionPruning() throws Exception {
+final String boolPartitionTable = 
"dfs_test.tmp.`interval_bool_partition`";
+try {
+  test("create table %s partition by (col_bln) as " +
+"select * from cp.`parquet/alltypes_required.parquet`", 
boolPartitionTable);
+  test("refresh table metadata %s", boolPartitionTable);
--- End diff --

Thanks, modified tests to cover those cases.




[GitHub] drill pull request #805: Drill-4139: Exception while trying to prune partiti...

2017-05-24 Thread vvysotskyi
Github user vvysotskyi commented on a diff in the pull request:

https://github.com/apache/drill/pull/805#discussion_r117748249
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java
 ---
@@ -548,23 +567,57 @@ public void populatePruningVector(ValueVector v, int 
index, SchemaPath column, S
 NullableVarCharVector varCharVector = (NullableVarCharVector) v;
 Object s = partitionValueMap.get(f).get(column);
 byte[] bytes;
-if (s instanceof String) { // if the metadata was read from a JSON 
cache file it maybe a string type
-  bytes = ((String) s).getBytes();
-} else if (s instanceof Binary) {
-  bytes = ((Binary) s).getBytes();
-} else if (s instanceof byte[]) {
-  bytes = (byte[]) s;
+if (s == null) {
+  varCharVector.getMutator().setNull(index);
+  return;
 } else {
-  throw new UnsupportedOperationException("Unable to create column 
data for type: " + type);
+  bytes = getBytes(type, s);
 }
 varCharVector.getMutator().setSafe(index, bytes, 0, bytes.length);
 return;
   }
+  case INTERVAL: {
+NullableIntervalVector intervalVector = (NullableIntervalVector) v;
+Object s = partitionValueMap.get(f).get(column);
+byte[] bytes;
+if (s == null) {
+  intervalVector.getMutator().setNull(index);
+  return;
+} else {
+  bytes = getBytes(type, s);
+}
+intervalVector.getMutator().setSafe(index, 1,
+  ParquetReaderUtility.getIntFromLEBytes(bytes, 0),
+  ParquetReaderUtility.getIntFromLEBytes(bytes, 4),
+  ParquetReaderUtility.getIntFromLEBytes(bytes, 8));
+return;
+  }
   default:
 throw new UnsupportedOperationException("Unsupported type: " + 
type);
 }
   }
 
+  /**
+   * Returns the sequence of bytes received from {@code Object source}.
+   *
+   * @param type   the column type
+   * @param source the source of the bytes sequence
+   * @return bytes sequence obtained from {@code Object source}
+   */
+  private byte[] getBytes(MinorType type, Object source) {
+byte[] bytes;
+if (source instanceof String) { // if the metadata was read from a 
JSON cache file it maybe a string type
+  bytes = Base64.decodeBase64(((String) source).getBytes());
--- End diff --

It is necessary for the cases when the value's bytes are not valid UTF-8 (for example, values encoded with little-endian byte order). When a string is created from such a byte array without manually setting the charset, getBytes() would return a byte array that differs from the one passed into the String constructor.
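This round-trip problem can be demonstrated directly: bytes that are not valid UTF-8 do not survive a String round-trip (invalid sequences become U+FFFD replacement characters), while Base64 preserves arbitrary bytes exactly. A small standalone illustration, using `java.util.Base64` rather than the Commons Codec class from the diff:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

public class BinaryRoundTrip {

    /** byte[] -> String -> byte[]; lossy for bytes that are not valid UTF-8. */
    public static byte[] stringRoundTrip(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
    }

    /** byte[] -> Base64 string -> byte[]; lossless for arbitrary bytes. */
    public static byte[] base64RoundTrip(byte[] bytes) {
        return Base64.getDecoder().decode(Base64.getEncoder().encodeToString(bytes));
    }

    public static void main(String[] args) {
        // 1000 as a little-endian 32-bit int; 0xE8 followed by 0x03 is malformed UTF-8
        byte[] original = {(byte) 0xE8, 0x03, 0x00, 0x00};
        System.out.println(Arrays.equals(original, stringRoundTrip(original)));  // false
        System.out.println(Arrays.equals(original, base64RoundTrip(original)));  // true
    }
}
```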




[GitHub] drill pull request #805: Drill-4139: Exception while trying to prune partiti...

2017-05-24 Thread vvysotskyi
Github user vvysotskyi commented on a diff in the pull request:

https://github.com/apache/drill/pull/805#discussion_r118203458
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java
 ---
@@ -181,6 +185,71 @@ else if (parquetTableMetadata instanceof 
Metadata.ParquetTableMetadata_v2 &&
   }
 
   /**
+   * Checks that the metadata file was created by drill with version less 
than
+   * the version where was changed the serialization of BINARY values
+   * and assigns byte arrays to min/max values obtained from the 
deserialized string.
+   *
+   * @param parquetTableMetadata table metadata that should be corrected
+   */
+  public static void 
correctBinaryInMetadataCache(Metadata.ParquetTableMetadataBase 
parquetTableMetadata) {
+if (hasOldBinarySerialization(parquetTableMetadata)) {
+  Set names = Sets.newHashSet();
+  if (parquetTableMetadata instanceof 
Metadata.ParquetTableMetadata_v2) {
--- End diff --

Added a check for ParquetTableMetadata_v3, thanks. In the ColumnMetadata_v1 class the `primitiveType` field is serialized to the metadata cache file, so it is easier to check the column type for this metadata version.




[GitHub] drill pull request #805: Drill-4139: Exception while trying to prune partiti...

2017-05-24 Thread vvysotskyi
Github user vvysotskyi commented on a diff in the pull request:

https://github.com/apache/drill/pull/805#discussion_r117750872
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java
 ---
@@ -181,6 +185,71 @@ else if (parquetTableMetadata instanceof 
Metadata.ParquetTableMetadata_v2 &&
   }
 
   /**
+   * Checks that the metadata file was created by drill with version less 
than
+   * the version where was changed the serialization of BINARY values
--- End diff --

Yes, it happens only in the metadata cache file. Serialization was wrong at least for interval and decimal types, since they are encoded using little-endian and big-endian byte orders.
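For context, Parquet's INTERVAL logical type packs three little-endian unsigned 32-bit integers (months, days, milliseconds) into a 12-byte value, which is why the byte order matters when these values go through a string in the cache file. A minimal little-endian reader in the spirit of `ParquetReaderUtility.getIntFromLEBytes` (this standalone class is illustrative, not Drill's implementation):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class LEIntervalReader {

    /** Reads a 32-bit little-endian int starting at the given offset. */
    public static int getIntFromLEBytes(byte[] bytes, int offset) {
        return ByteBuffer.wrap(bytes, offset, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
    }

    public static void main(String[] args) {
        // 12-byte Parquet INTERVAL value: months=1, days=2, millis=3
        byte[] interval = {1, 0, 0, 0, 2, 0, 0, 0, 3, 0, 0, 0};
        System.out.println(getIntFromLEBytes(interval, 0));  // months
        System.out.println(getIntFromLEBytes(interval, 4));  // days
        System.out.println(getIntFromLEBytes(interval, 8));  // millis
    }
}
```

This mirrors the three-field read at offsets 0, 4 and 8 in the INTERVAL branch of populatePruningVector quoted earlier in this thread.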




[GitHub] drill pull request #805: Drill-4139: Exception while trying to prune partiti...

2017-05-24 Thread vvysotskyi
Github user vvysotskyi commented on a diff in the pull request:

https://github.com/apache/drill/pull/805#discussion_r117737970
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java
 ---
@@ -508,21 +516,32 @@ public void populatePruningVector(ValueVector v, int 
index, SchemaPath column, S
 NullableVarBinaryVector varBinaryVector = 
(NullableVarBinaryVector) v;
 Object s = partitionValueMap.get(f).get(column);
 byte[] bytes;
-if (s instanceof Binary) {
-  bytes = ((Binary) s).getBytes();
-} else if (s instanceof String) {
-  bytes = ((String) s).getBytes();
-} else if (s instanceof byte[]) {
-  bytes = (byte[]) s;
+if (s == null) {
+  varBinaryVector.getMutator().setNull(index);
+  return;
 } else {
-  throw new UnsupportedOperationException("Unable to create column 
data for type: " + type);
+  bytes = getBytes(type, s);
 }
 varBinaryVector.getMutator().setSafe(index, bytes, 0, 
bytes.length);
 return;
   }
   case DECIMAL18: {
 NullableDecimal18Vector decimalVector = (NullableDecimal18Vector) 
v;
-Long value = (Long) partitionValueMap.get(f).get(column);
+Object s = partitionValueMap.get(f).get(column);
+byte[] bytes;
+if (s == null) {
+  decimalVector.getMutator().setNull(index);
+  return;
+} else if (s instanceof Integer) {
+  decimalVector.getMutator().setSafe(index, (Integer) s);
+  return;
+} else if (s instanceof Long) {
+  decimalVector.getMutator().setSafe(index, (Long) s);
+  return;
+} else {
+  bytes = getBytes(type, s);
--- End diff --

Bytes will be received from partitionValueMap when decimal values are stored in the parquet file as fixed_len_byte_array or binary. There are unit tests that cover this case as well as the cases when decimal values are stored as int32 and int64.
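When a DECIMAL is written as fixed_len_byte_array or binary, the Parquet format stores the unscaled value as big-endian two's-complement bytes, so turning those bytes into a long for the Decimal18 vector looks roughly like the sketch below (an illustrative helper, not Drill's exact code):

```java
import java.math.BigInteger;

public class DecimalBytes {

    /** Converts a big-endian two's-complement unscaled value to a long. */
    public static long unscaledFromBytes(byte[] bytes) {
        // BigInteger's byte[] constructor interprets exactly this encoding;
        // longValueExact throws if the value does not fit in 64 bits
        return new BigInteger(bytes).longValueExact();
    }

    public static void main(String[] args) {
        // 1234 as minimal big-endian two's complement: 0x04 0xD2
        System.out.println(unscaledFromBytes(new byte[]{0x04, (byte) 0xD2}));
        // negative values work too: -1 is a single 0xFF byte
        System.out.println(unscaledFromBytes(new byte[]{(byte) 0xFF}));
    }
}
```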




[GitHub] drill pull request #805: Drill-4139: Exception while trying to prune partiti...

2017-05-24 Thread vvysotskyi
Github user vvysotskyi commented on a diff in the pull request:

https://github.com/apache/drill/pull/805#discussion_r117736018
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java
 ---
@@ -508,21 +516,32 @@ public void populatePruningVector(ValueVector v, int 
index, SchemaPath column, S
 NullableVarBinaryVector varBinaryVector = 
(NullableVarBinaryVector) v;
 Object s = partitionValueMap.get(f).get(column);
 byte[] bytes;
-if (s instanceof Binary) {
-  bytes = ((Binary) s).getBytes();
-} else if (s instanceof String) {
-  bytes = ((String) s).getBytes();
-} else if (s instanceof byte[]) {
-  bytes = (byte[]) s;
+if (s == null) {
+  varBinaryVector.getMutator().setNull(index);
+  return;
 } else {
-  throw new UnsupportedOperationException("Unable to create column 
data for type: " + type);
+  bytes = getBytes(type, s);
 }
 varBinaryVector.getMutator().setSafe(index, bytes, 0, 
bytes.length);
 return;
   }
   case DECIMAL18: {
 NullableDecimal18Vector decimalVector = (NullableDecimal18Vector) 
v;
-Long value = (Long) partitionValueMap.get(f).get(column);
+Object s = partitionValueMap.get(f).get(column);
--- End diff --

Done.




[GitHub] drill pull request #805: Drill-4139: Exception while trying to prune partiti...

2017-05-24 Thread vvysotskyi
Github user vvysotskyi commented on a diff in the pull request:

https://github.com/apache/drill/pull/805#discussion_r118203553
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java
 ---
@@ -181,6 +185,71 @@ else if (parquetTableMetadata instanceof 
Metadata.ParquetTableMetadata_v2 &&
   }
 
   /**
+   * Checks that the metadata file was created by drill with version less 
than
+   * the version where was changed the serialization of BINARY values
+   * and assigns byte arrays to min/max values obtained from the 
deserialized string.
+   *
+   * @param parquetTableMetadata table metadata that should be corrected
+   */
+  public static void 
correctBinaryInMetadataCache(Metadata.ParquetTableMetadataBase 
parquetTableMetadata) {
+if (hasOldBinarySerialization(parquetTableMetadata)) {
+  Set names = Sets.newHashSet();
+  if (parquetTableMetadata instanceof 
Metadata.ParquetTableMetadata_v2) {
+for (Metadata.ColumnTypeMetadata_v2 columnTypeMetadata :
+((Metadata.ParquetTableMetadata_v2) 
parquetTableMetadata).columnTypeInfo.values()) {
+  if (columnTypeMetadata.primitiveType == 
PrimitiveTypeName.BINARY) {
+names.add(Arrays.asList(columnTypeMetadata.name));
+  }
+}
+  }
+  for (Metadata.ParquetFileMetadata file : 
parquetTableMetadata.getFiles()) {
+// Drill has only ever written a single row group per file, only 
need to correct the statistics
+// on the first row group
+Metadata.RowGroupMetadata rowGroupMetadata = 
file.getRowGroups().get(0);
--- End diff --

Thanks, fixed it.




[GitHub] drill pull request #805: Drill-4139: Exception while trying to prune partiti...

2017-05-19 Thread jinfengni
Github user jinfengni commented on a diff in the pull request:

https://github.com/apache/drill/pull/805#discussion_r117582970
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java
 ---
@@ -508,21 +516,32 @@ public void populatePruningVector(ValueVector v, int 
index, SchemaPath column, S
 NullableVarBinaryVector varBinaryVector = 
(NullableVarBinaryVector) v;
 Object s = partitionValueMap.get(f).get(column);
 byte[] bytes;
-if (s instanceof Binary) {
-  bytes = ((Binary) s).getBytes();
-} else if (s instanceof String) {
-  bytes = ((String) s).getBytes();
-} else if (s instanceof byte[]) {
-  bytes = (byte[]) s;
+if (s == null) {
+  varBinaryVector.getMutator().setNull(index);
+  return;
 } else {
-  throw new UnsupportedOperationException("Unable to create column 
data for type: " + type);
+  bytes = getBytes(type, s);
 }
 varBinaryVector.getMutator().setSafe(index, bytes, 0, 
bytes.length);
 return;
   }
   case DECIMAL18: {
 NullableDecimal18Vector decimalVector = (NullableDecimal18Vector) 
v;
-Long value = (Long) partitionValueMap.get(f).get(column);
+Object s = partitionValueMap.get(f).get(column);
--- End diff --

If the patch also changes DECIMAL18's partition pruning, please modify the JIRA title to reflect that change.




[GitHub] drill pull request #805: Drill-4139: Exception while trying to prune partiti...

2017-05-19 Thread jinfengni
Github user jinfengni commented on a diff in the pull request:

https://github.com/apache/drill/pull/805#discussion_r117583804
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java
 ---
@@ -181,6 +185,71 @@ else if (parquetTableMetadata instanceof 
Metadata.ParquetTableMetadata_v2 &&
   }
 
   /**
+   * Checks that the metadata file was created by drill with version less 
than
+   * the version where was changed the serialization of BINARY values
+   * and assigns byte arrays to min/max values obtained from the 
deserialized string.
+   *
+   * @param parquetTableMetadata table metadata that should be corrected
+   */
+  public static void 
correctBinaryInMetadataCache(Metadata.ParquetTableMetadataBase 
parquetTableMetadata) {
+if (hasOldBinarySerialization(parquetTableMetadata)) {
+  Set names = Sets.newHashSet();
+  if (parquetTableMetadata instanceof 
Metadata.ParquetTableMetadata_v2) {
+for (Metadata.ColumnTypeMetadata_v2 columnTypeMetadata :
+((Metadata.ParquetTableMetadata_v2) 
parquetTableMetadata).columnTypeInfo.values()) {
+  if (columnTypeMetadata.primitiveType == 
PrimitiveTypeName.BINARY) {
+names.add(Arrays.asList(columnTypeMetadata.name));
+  }
+}
+  }
+  for (Metadata.ParquetFileMetadata file : 
parquetTableMetadata.getFiles()) {
+// Drill has only ever written a single row group per file, only 
need to correct the statistics
+// on the first row group
+Metadata.RowGroupMetadata rowGroupMetadata = 
file.getRowGroups().get(0);
--- End diff --

It's true that parquet files created by Drill have a single row group per file. But the metadata file could be created from parquet files from other sources, so this assumption may not hold.




[GitHub] drill pull request #805: Drill-4139: Exception while trying to prune partiti...

2017-05-19 Thread jinfengni
Github user jinfengni commented on a diff in the pull request:

https://github.com/apache/drill/pull/805#discussion_r117585451
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetMetadataCache.java
 ---
@@ -398,10 +399,115 @@ public void testDrill4877() throws Exception {
 
   }
 
+  @Test // DRILL-4139
+  public void testBooleanPartitionPruning() throws Exception {
+final String boolPartitionTable = 
"dfs_test.tmp.`interval_bool_partition`";
+try {
+  test("create table %s partition by (col_bln) as " +
+"select * from cp.`parquet/alltypes_required.parquet`", 
boolPartitionTable);
+  test("refresh table metadata %s", boolPartitionTable);
--- End diff --

We probably want to cover the cases where the metadata cache file is not created (run the query before calling "refresh table metadata").




[GitHub] drill pull request #805: Drill-4139: Exception while trying to prune partiti...

2017-05-19 Thread jinfengni
Github user jinfengni commented on a diff in the pull request:

https://github.com/apache/drill/pull/805#discussion_r117584586
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java
 ---
@@ -181,6 +185,71 @@ else if (parquetTableMetadata instanceof 
Metadata.ParquetTableMetadata_v2 &&
   }
 
   /**
+   * Checks that the metadata file was created by drill with version less 
than
+   * the version where was changed the serialization of BINARY values
+   * and assigns byte arrays to min/max values obtained from the 
deserialized string.
+   *
+   * @param parquetTableMetadata table metadata that should be corrected
+   */
+  public static void 
correctBinaryInMetadataCache(Metadata.ParquetTableMetadataBase 
parquetTableMetadata) {
+if (hasOldBinarySerialization(parquetTableMetadata)) {
+  Set names = Sets.newHashSet();
+  if (parquetTableMetadata instanceof 
Metadata.ParquetTableMetadata_v2) {
--- End diff --

Any reason you only check v2 here? 




[GitHub] drill pull request #805: Drill-4139: Exception while trying to prune partiti...

2017-05-19 Thread jinfengni
Github user jinfengni commented on a diff in the pull request:

https://github.com/apache/drill/pull/805#discussion_r117583114
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java
 ---
@@ -508,21 +516,32 @@ public void populatePruningVector(ValueVector v, int 
index, SchemaPath column, S
 NullableVarBinaryVector varBinaryVector = 
(NullableVarBinaryVector) v;
 Object s = partitionValueMap.get(f).get(column);
 byte[] bytes;
-if (s instanceof Binary) {
-  bytes = ((Binary) s).getBytes();
-} else if (s instanceof String) {
-  bytes = ((String) s).getBytes();
-} else if (s instanceof byte[]) {
-  bytes = (byte[]) s;
+if (s == null) {
+  varBinaryVector.getMutator().setNull(index);
+  return;
 } else {
-  throw new UnsupportedOperationException("Unable to create column 
data for type: " + type);
+  bytes = getBytes(type, s);
 }
 varBinaryVector.getMutator().setSafe(index, bytes, 0, 
bytes.length);
 return;
   }
   case DECIMAL18: {
 NullableDecimal18Vector decimalVector = (NullableDecimal18Vector) 
v;
-Long value = (Long) partitionValueMap.get(f).get(column);
+Object s = partitionValueMap.get(f).get(column);
+byte[] bytes;
+if (s == null) {
+  decimalVector.getMutator().setNull(index);
+  return;
+} else if (s instanceof Integer) {
+  decimalVector.getMutator().setSafe(index, (Integer) s);
+  return;
+} else if (s instanceof Long) {
+  decimalVector.getMutator().setSafe(index, (Long) s);
+  return;
+} else {
+  bytes = getBytes(type, s);
--- End diff --

For DECIMAL18, under what kind of scenarios would we get bytes from partitionValueMap?




[GitHub] drill pull request #805: Drill-4139: Exception while trying to prune partiti...

2017-05-19 Thread jinfengni
Github user jinfengni commented on a diff in the pull request:

https://github.com/apache/drill/pull/805#discussion_r117584380
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java
 ---
@@ -181,6 +185,71 @@ else if (parquetTableMetadata instanceof 
Metadata.ParquetTableMetadata_v2 &&
   }
 
   /**
+   * Checks that the metadata file was created by drill with version less 
than
+   * the version where was changed the serialization of BINARY values
--- End diff --

Can you explain a bit more about why the serialization of BINARY values is wrong in the prior version? Is it only happening in the metadata cache file?




[GitHub] drill pull request #805: Drill-4139: Exception while trying to prune partiti...

2017-05-19 Thread jinfengni
Github user jinfengni commented on a diff in the pull request:

https://github.com/apache/drill/pull/805#discussion_r117578486
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java
 ---
@@ -548,23 +567,57 @@ public void populatePruningVector(ValueVector v, int 
index, SchemaPath column, S
 NullableVarCharVector varCharVector = (NullableVarCharVector) v;
 Object s = partitionValueMap.get(f).get(column);
 byte[] bytes;
-if (s instanceof String) { // if the metadata was read from a JSON 
cache file it maybe a string type
-  bytes = ((String) s).getBytes();
-} else if (s instanceof Binary) {
-  bytes = ((Binary) s).getBytes();
-} else if (s instanceof byte[]) {
-  bytes = (byte[]) s;
+if (s == null) {
+  varCharVector.getMutator().setNull(index);
+  return;
 } else {
-  throw new UnsupportedOperationException("Unable to create column 
data for type: " + type);
+  bytes = getBytes(type, s);
 }
 varCharVector.getMutator().setSafe(index, bytes, 0, bytes.length);
 return;
   }
+  case INTERVAL: {
+NullableIntervalVector intervalVector = (NullableIntervalVector) v;
+Object s = partitionValueMap.get(f).get(column);
+byte[] bytes;
+if (s == null) {
+  intervalVector.getMutator().setNull(index);
+  return;
+} else {
+  bytes = getBytes(type, s);
+}
+intervalVector.getMutator().setSafe(index, 1,
+  ParquetReaderUtility.getIntFromLEBytes(bytes, 0),
+  ParquetReaderUtility.getIntFromLEBytes(bytes, 4),
+  ParquetReaderUtility.getIntFromLEBytes(bytes, 8));
+return;
+  }
   default:
 throw new UnsupportedOperationException("Unsupported type: " + 
type);
 }
   }
 
+  /**
+   * Returns the sequence of bytes received from {@code Object source}.
+   *
+   * @param type   the column type
+   * @param source the source of the bytes sequence
+   * @return bytes sequence obtained from {@code Object source}
+   */
+  private byte[] getBytes(MinorType type, Object source) {
+byte[] bytes;
+if (source instanceof String) { // if the metadata was read from a 
JSON cache file it maybe a string type
+  bytes = Base64.decodeBase64(((String) source).getBytes());
--- End diff --

Any reason you call Base64.decodeBase64 instead of calling String.getBytes()?




[GitHub] drill pull request #805: Drill-4139: Exception while trying to prune partiti...

2017-04-21 Thread arina-ielchiieva
Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/805#discussion_r112660688
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java
 ---
@@ -181,6 +185,71 @@ else if (parquetTableMetadata instanceof 
Metadata.ParquetTableMetadata_v2 &&
   }
 
   /**
+   * Checks that the metadata file was created by drill with version less 
that
+   * the version where was changed the serialization of BINARY values
+   * and assigns byte arrays to min/max values obtained from the 
deserialized string.
+   *
+   * @param parquetTableMetadata table metadata that should be corrected
+   */
+  public static void 
correctBinaryInMetadataCache(Metadata.ParquetTableMetadataBase 
parquetTableMetadata) {
+if (hasOldBinarySerialization(parquetTableMetadata)) {
+  Set names = Sets.newHashSet();
+  if (parquetTableMetadata instanceof 
Metadata.ParquetTableMetadata_v2) {
+for (Metadata.ColumnTypeMetadata_v2 columnTypeMetadata :
+((Metadata.ParquetTableMetadata_v2) 
parquetTableMetadata).columnTypeInfo.values()) {
+  if (columnTypeMetadata.primitiveType == 
PrimitiveTypeName.BINARY) {
+names.add(Arrays.asList(columnTypeMetadata.name));
+  }
+}
+  }
+  for (Metadata.ParquetFileMetadata file : 
parquetTableMetadata.getFiles()) {
+// Drill has only ever written a single row group per file, only 
need to correct the statistics
+// on the first row group
+Metadata.RowGroupMetadata rowGroupMetadata = 
file.getRowGroups().get(0);
+for (Metadata.ColumnMetadata columnMetadata : 
rowGroupMetadata.getColumns()) {
+  // Setting Min/Max values for ParquetTableMetadata_v1
+  if (parquetTableMetadata instanceof 
Metadata.ParquetTableMetadata_v1
+|| parquetTableMetadata instanceof 
Metadata.ParquetTableMetadata_v3) {
+PrimitiveTypeName primitiveType = 
columnMetadata.getPrimitiveType();
+if (primitiveType == PrimitiveTypeName.BINARY && 
columnMetadata.hasSingleValue()) {
+  Object minValue = columnMetadata.getMinValue();
+  if (minValue != null && minValue instanceof String) {
+byte[] bytes = ((String) minValue).getBytes();
+columnMetadata.setMax(bytes);
+columnMetadata.setMin(bytes);
+  }
+}
+  }
+  // Setting Max values for ParquetTableMetadata_v2
+  else if (parquetTableMetadata instanceof 
Metadata.ParquetTableMetadata_v2
+&& columnMetadata.hasSingleValue()
+&& 
names.contains(Arrays.asList(columnMetadata.getName()))) {
+Object maxValue = columnMetadata.getMaxValue();
+if (maxValue != null && maxValue instanceof String) {
+  byte[] bytes = ((String) maxValue).getBytes();
+  columnMetadata.setMax(bytes);
+}
+  }
+}
+  }
+}
+  }
+
+  /**
+   * Checks that the metadata file was created by a Drill version older than
+   * the version in which the serialization of BINARY values was changed.
+   *
+   * @param parquetTableMetadata the source of the Drill version
+   * @return true if the metadata file was created by a Drill version older than
+   * the version in which the serialization of BINARY values was changed
+   */
+  public static boolean 
hasOldBinarySerialization(Metadata.ParquetTableMetadataBase 
parquetTableMetadata) {
+String drillVersion = parquetTableMetadata.getDrillVersion();
+return drillVersion == null
+|| new ComparableVersion(drillVersion).compareTo(new 
ComparableVersion("1.11.0-SNAPSHOT")) < 0;
--- End diff --

I am not sure that using a version with SNAPSHOT is correct.
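To make the concern concrete, here is a minimal stand-in for Maven's ComparableVersion ordering (a sketch of its semantics for this check only, not the real class): numeric parts compare numerically, and a -SNAPSHOT qualifier sorts below the corresponding release. The class and method names are illustrative:

```java
public class VersionCheck {
    // Sketch of Maven version ordering for this case: compare dotted
    // numeric parts; a -SNAPSHOT qualifier sorts below the same release.
    static int compare(String a, String b) {
        String[] pa = a.split("-", 2), pb = b.split("-", 2);
        String[] na = pa[0].split("\\."), nb = pb[0].split("\\.");
        for (int i = 0; i < Math.max(na.length, nb.length); i++) {
            int x = i < na.length ? Integer.parseInt(na[i]) : 0;
            int y = i < nb.length ? Integer.parseInt(nb[i]) : 0;
            if (x != y) return Integer.compare(x, y);
        }
        boolean snapA = pa.length > 1, snapB = pb.length > 1;
        if (snapA == snapB) return 0;
        return snapA ? -1 : 1; // SNAPSHOT precedes its release
    }

    public static void main(String[] args) {
        // Metadata written by 1.10.0 predates the serialization change:
        System.out.println(compare("1.10.0", "1.11.0-SNAPSHOT") < 0);  // true
        // The 1.11.0 release itself sorts above 1.11.0-SNAPSHOT, so
        // anchoring the cutoff on a SNAPSHOT version still excludes it:
        System.out.println(compare("1.11.0", "1.11.0-SNAPSHOT") > 0);  // true
    }
}
```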


---


[GitHub] drill pull request #805: Drill-4139: Exception while trying to prune partiti...

2017-04-21 Thread vvysotskyi
Github user vvysotskyi commented on a diff in the pull request:

https://github.com/apache/drill/pull/805#discussion_r112651025
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java
 ---
@@ -508,7 +511,10 @@ public void populatePruningVector(ValueVector v, int 
index, SchemaPath column, S
 NullableVarBinaryVector varBinaryVector = 
(NullableVarBinaryVector) v;
 Object s = partitionValueMap.get(f).get(column);
 byte[] bytes;
-if (s instanceof Binary) {
+if (s == null) {
+  varBinaryVector.getMutator().setNull(index);
--- End diff --

We can apply this setNull() only for types that use BinaryStatistics [1] 
for the reason described in my previous comment. So we apply setNull() to 
VARBINARY, DECIMAL18, VARCHAR and INTERVAL types.

[1] 
https://github.com/apache/parquet-mr/blob/e54ca615f213f5db6d34d9163c97eec98920d7a7/parquet-column/src/main/java/org/apache/parquet/column/statistics/BinaryStatistics.java#L25


---


[GitHub] drill pull request #805: Drill-4139: Exception while trying to prune partiti...

2017-04-21 Thread vvysotskyi
Github user vvysotskyi commented on a diff in the pull request:

https://github.com/apache/drill/pull/805#discussion_r112649665
  
--- Diff: 
exec/java-exec/src/main/codegen/templates/CastVarBinaryInterval.java ---
@@ -0,0 +1,72 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+<@pp.dropOutputFile />
+
+<#list cast.types as type>
+<#if type.major == "VarBinaryInterval">  <#-- Template to convert from 
VarBinary to Interval, IntervalYear, IntervalDay -->
+
+<@pp.changeOutputFile 
name="/org/apache/drill/exec/expr/fn/impl/gcast/Cast${type.from}To${type.to}.java"
 />
+
+<#include "/@includes/license.ftl" />
+
+package org.apache.drill.exec.expr.fn.impl.gcast;
+
+import io.netty.buffer.ByteBuf;
+
+import org.apache.drill.exec.expr.DrillSimpleFunc;
+import org.apache.drill.exec.expr.annotations.FunctionTemplate;
+import 
org.apache.drill.exec.expr.annotations.FunctionTemplate.NullHandling;
+import org.apache.drill.exec.expr.annotations.Output;
+import org.apache.drill.exec.expr.annotations.Param;
+import org.apache.drill.exec.expr.holders.IntervalDayHolder;
+import org.apache.drill.exec.expr.holders.IntervalHolder;
+import org.apache.drill.exec.expr.holders.IntervalYearHolder;
+import org.apache.drill.exec.expr.holders.VarBinaryHolder;
+
+/*
+ * This class is generated using freemarker and the ${.template_name} 
template.
+ */
+@SuppressWarnings("unused")
+@FunctionTemplate(name = "cast${type.to?upper_case}",
+  scope = FunctionTemplate.FunctionScope.SIMPLE,
+  nulls = NullHandling.NULL_IF_NULL)
+public class Cast${type.from}To${type.to} implements DrillSimpleFunc {
+
+  @Param ${type.from}Holder in;
+  @Output ${type.to}Holder out;
+
+  public void setup() {
+  }
+
+  public void eval() {
+final byte[] buf = new byte[in.end - in.start];
+in.buffer.getBytes(in.start, buf, 0, in.end - in.start);
+  <#if type.to == "Interval">
+out.months = 
org.apache.drill.exec.store.parquet.ParquetReaderUtility.getIntFromLEBytes(buf, 
0);
--- End diff --

I concluded that the Interval type is encoded in this way when I found that 
IntervalVector stores its value in the DrillBuf data as 3 int values.

But I have seen that a lot of other types are encoded as you 
described, so you are right, thanks for the clarification.


---


[GitHub] drill pull request #805: Drill-4139: Exception while trying to prune partiti...

2017-04-21 Thread vvysotskyi
Github user vvysotskyi commented on a diff in the pull request:

https://github.com/apache/drill/pull/805#discussion_r112650691
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java 
---
@@ -1668,9 +1666,25 @@ public ColumnMetadata_v3(String[] name, 
PrimitiveTypeName primitiveType, Object
   return nulls;
 }
 
+/**
+ * Checks that the column chunk has a single value.
+ * Returns true if minValue and maxValue are the same, but not null,
+ * and all column chunk values are not null.
+ * Returns true if minValue and maxValue are null and the null values 
count in
+ * the column chunk is greater than 0.
+ *
+ * @return true if column has single value
+ */
 @Override
 public boolean hasSingleValue() {
-  return (minValue !=null && maxValue != null && 
minValue.equals(maxValue));
+  if (nulls != null) {
--- End diff --

Yes, we should apply it to ColumnMetadata_v1, thanks.
In ColumnMetadata_v2 maxValue is set only in the case when the min value is the 
same as the max value. So we could not be sure that if maxValue == null and the 
nulls count is greater than zero, the column has a single value.

Also I have changed the result for the second case described in your 
previous comment. Statistics [1] for most parquet types use Java primitive 
types to store min and max values, so min/max cannot be null even if the table 
has null values. So I removed the nulls == 0 check.

[1] 
https://github.com/apache/parquet-mr/tree/e54ca615f213f5db6d34d9163c97eec98920d7a7/parquet-column/src/main/java/org/apache/parquet/column/statistics


---


[GitHub] drill pull request #805: Drill-4139: Exception while trying to prune partiti...

2017-04-13 Thread jinfengni
Github user jinfengni commented on a diff in the pull request:

https://github.com/apache/drill/pull/805#discussion_r111501238
  
--- Diff: 
exec/java-exec/src/main/codegen/templates/CastVarBinaryInterval.java ---
@@ -0,0 +1,72 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+<@pp.dropOutputFile />
+
+<#list cast.types as type>
+<#if type.major == "VarBinaryInterval">  <#-- Template to convert from 
VarBinary to Interval, IntervalYear, IntervalDay -->
+
+<@pp.changeOutputFile 
name="/org/apache/drill/exec/expr/fn/impl/gcast/Cast${type.from}To${type.to}.java"
 />
+
+<#include "/@includes/license.ftl" />
+
+package org.apache.drill.exec.expr.fn.impl.gcast;
+
+import io.netty.buffer.ByteBuf;
+
+import org.apache.drill.exec.expr.DrillSimpleFunc;
+import org.apache.drill.exec.expr.annotations.FunctionTemplate;
+import 
org.apache.drill.exec.expr.annotations.FunctionTemplate.NullHandling;
+import org.apache.drill.exec.expr.annotations.Output;
+import org.apache.drill.exec.expr.annotations.Param;
+import org.apache.drill.exec.expr.holders.IntervalDayHolder;
+import org.apache.drill.exec.expr.holders.IntervalHolder;
+import org.apache.drill.exec.expr.holders.IntervalYearHolder;
+import org.apache.drill.exec.expr.holders.VarBinaryHolder;
+
+/*
+ * This class is generated using freemarker and the ${.template_name} 
template.
+ */
+@SuppressWarnings("unused")
+@FunctionTemplate(name = "cast${type.to?upper_case}",
+  scope = FunctionTemplate.FunctionScope.SIMPLE,
+  nulls = NullHandling.NULL_IF_NULL)
+public class Cast${type.from}To${type.to} implements DrillSimpleFunc {
+
+  @Param ${type.from}Holder in;
+  @Output ${type.to}Holder out;
+
+  public void setup() {
+  }
+
+  public void eval() {
+final byte[] buf = new byte[in.end - in.start];
+in.buffer.getBytes(in.start, buf, 0, in.end - in.start);
+  <#if type.to == "Interval">
+out.months = 
org.apache.drill.exec.store.parquet.ParquetReaderUtility.getIntFromLEBytes(buf, 
0);
--- End diff --

I'm a bit confused.  Are you assuming VarBinary is encoded as 3 int values 
: # of months, # of days and # of milliseconds?  If varbinary is encoded in such 
a way, which is what parquet uses to encode the logical type "interval", why do we 
have to cast ? 

My understanding is varbinary looks similar to varchar, except that the 
former contains bytes while the latter contains characters.   

 


---


[GitHub] drill pull request #805: Drill-4139: Exception while trying to prune partiti...

2017-04-13 Thread jinfengni
Github user jinfengni commented on a diff in the pull request:

https://github.com/apache/drill/pull/805#discussion_r111505132
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java
 ---
@@ -508,7 +511,10 @@ public void populatePruningVector(ValueVector v, int 
index, SchemaPath column, S
 NullableVarBinaryVector varBinaryVector = 
(NullableVarBinaryVector) v;
 Object s = partitionValueMap.get(f).get(column);
 byte[] bytes;
-if (s instanceof Binary) {
+if (s == null) {
+  varBinaryVector.getMutator().setNull(index);
--- End diff --

Should we apply this setNull() to all other types? 


---


[GitHub] drill pull request #805: Drill-4139: Exception while trying to prune partiti...

2017-04-13 Thread jinfengni
Github user jinfengni commented on a diff in the pull request:

https://github.com/apache/drill/pull/805#discussion_r111504302
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java 
---
@@ -1668,9 +1666,25 @@ public ColumnMetadata_v3(String[] name, 
PrimitiveTypeName primitiveType, Object
   return nulls;
 }
 
+/**
+ * Checks that the column chunk has a single value.
+ * Returns true if minValue and maxValue are the same, but not null,
+ * and all column chunk values are not null.
+ * Returns true if minValue and maxValue are null and the null values 
count in
+ * the column chunk is greater than 0.
+ *
+ * @return true if column has single value
+ */
 @Override
 public boolean hasSingleValue() {
-  return (minValue !=null && maxValue != null && 
minValue.equals(maxValue));
+  if (nulls != null) {
--- End diff --

Should this apply to v2 and v1 of ColumnMetadata? 


---


[GitHub] drill pull request #805: Drill-4139: Exception while trying to prune partiti...

2017-04-05 Thread vvysotskyi
Github user vvysotskyi commented on a diff in the pull request:

https://github.com/apache/drill/pull/805#discussion_r109618786
  
--- Diff: 
exec/java-exec/src/main/codegen/templates/CastVarBinaryInterval.java ---
@@ -0,0 +1,72 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+<@pp.dropOutputFile />
+
+<#list cast.types as type>
+<#if type.major == "VarBinaryInterval">  <#-- Template to convert from 
VarBinary to Interval, IntervalYear, IntervalDay -->
+
+<@pp.changeOutputFile 
name="/org/apache/drill/exec/expr/fn/impl/gcast/Cast${type.from}To${type.to}.java"
 />
+
+<#include "/@includes/license.ftl" />
+
+package org.apache.drill.exec.expr.fn.impl.gcast;
+
+import io.netty.buffer.ByteBuf;
+
+import org.apache.drill.exec.expr.DrillSimpleFunc;
+import org.apache.drill.exec.expr.annotations.FunctionTemplate;
+import 
org.apache.drill.exec.expr.annotations.FunctionTemplate.NullHandling;
+import org.apache.drill.exec.expr.annotations.Output;
+import org.apache.drill.exec.expr.annotations.Param;
+import org.apache.drill.exec.expr.holders.IntervalDayHolder;
+import org.apache.drill.exec.expr.holders.IntervalHolder;
+import org.apache.drill.exec.expr.holders.IntervalYearHolder;
+import org.apache.drill.exec.expr.holders.VarBinaryHolder;
+
+/*
+ * This class is generated using freemarker and the ${.template_name} 
template.
+ */
+@SuppressWarnings("unused")
+@FunctionTemplate(name = "cast${type.to?upper_case}",
+  scope = FunctionTemplate.FunctionScope.SIMPLE,
+  nulls = NullHandling.NULL_IF_NULL)
+public class Cast${type.from}To${type.to} implements DrillSimpleFunc {
+
+  @Param ${type.from}Holder in;
+  @Output ${type.to}Holder out;
+
+  public void setup() {
+  }
+
+  public void eval() {
+final byte[] buf = new byte[in.end - in.start];
+in.buffer.getBytes(in.start, buf, 0, in.end - in.start);
+  <#if type.to == "Interval">
+out.months = 
org.apache.drill.exec.store.parquet.ParquetReaderUtility.getIntFromLEBytes(buf, 
0);
--- End diff --

1. We are using this function to read an int from bytes in little-endian 
encoding. VarBinary data may come from other data sources.
2. No, we can't. Interval types are stored as three int values in binary 
form in the varbinary vector. 
3. The varchar representation of interval types contains special literals. 
Inside the UDF which casts varchar to interval, a JodaTime function parses the 
varchar string, so it does not make sense to apply the change to the existing 
function implementation.
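The layout described in point 2 (three 4-byte little-endian ints: months, days, milliseconds) can be decoded with java.nio directly. A sketch of what a helper like getIntFromLEBytes presumably does (the method body is my reconstruction, not Drill's source):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class IntervalDecode {
    // Reads a 4-byte little-endian int at the given offset, mirroring the
    // described behavior of ParquetReaderUtility.getIntFromLEBytes.
    static int getIntFromLEBytes(byte[] buf, int offset) {
        return ByteBuffer.wrap(buf, offset, 4)
                         .order(ByteOrder.LITTLE_ENDIAN)
                         .getInt();
    }

    public static void main(String[] args) {
        // Encode months=14, days=3, millis=5000 as three LE ints.
        ByteBuffer bb = ByteBuffer.allocate(12).order(ByteOrder.LITTLE_ENDIAN);
        bb.putInt(14).putInt(3).putInt(5000);
        byte[] buf = bb.array();

        System.out.println(getIntFromLEBytes(buf, 0)); // 14 (months)
        System.out.println(getIntFromLEBytes(buf, 4)); // 3 (days)
        System.out.println(getIntFromLEBytes(buf, 8)); // 5000 (millis)
    }
}
```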


---


[GitHub] drill pull request #805: Drill-4139: Exception while trying to prune partiti...

2017-04-05 Thread vvysotskyi
Github user vvysotskyi commented on a diff in the pull request:

https://github.com/apache/drill/pull/805#discussion_r109908576
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java 
---
@@ -1668,9 +1666,16 @@ public ColumnMetadata_v3(String[] name, 
PrimitiveTypeName primitiveType, Object
   return nulls;
 }
 
+/**
+ * Checks that the column chunk has a single value.
+ * Returns true if minValue and maxValue are null and
+ * column statistics were available.
+ *
+ * @return true if column has single value
+ */
 @Override
 public boolean hasSingleValue() {
-  return (minValue !=null && maxValue != null && 
minValue.equals(maxValue));
+  return minValue != null ? minValue.equals(maxValue) : (maxValue == 
null && isMetadataSet());
--- End diff --

1. It is clear that we should just compare minValue and maxValue.
2. In this case we should check that nulls == 0. Fixed that. Now for this 
case the function returns false.
3. We should additionally check that nulls > 0. Fixed that.


---


[GitHub] drill pull request #805: Drill-4139: Exception while trying to prune partiti...

2017-04-03 Thread jinfengni
Github user jinfengni commented on a diff in the pull request:

https://github.com/apache/drill/pull/805#discussion_r109532543
  
--- Diff: 
exec/java-exec/src/main/codegen/templates/CastVarBinaryInterval.java ---
@@ -0,0 +1,72 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+<@pp.dropOutputFile />
+
+<#list cast.types as type>
+<#if type.major == "VarBinaryInterval">  <#-- Template to convert from 
VarBinary to Interval, IntervalYear, IntervalDay -->
+
+<@pp.changeOutputFile 
name="/org/apache/drill/exec/expr/fn/impl/gcast/Cast${type.from}To${type.to}.java"
 />
+
+<#include "/@includes/license.ftl" />
+
+package org.apache.drill.exec.expr.fn.impl.gcast;
+
+import io.netty.buffer.ByteBuf;
+
+import org.apache.drill.exec.expr.DrillSimpleFunc;
+import org.apache.drill.exec.expr.annotations.FunctionTemplate;
+import 
org.apache.drill.exec.expr.annotations.FunctionTemplate.NullHandling;
+import org.apache.drill.exec.expr.annotations.Output;
+import org.apache.drill.exec.expr.annotations.Param;
+import org.apache.drill.exec.expr.holders.IntervalDayHolder;
+import org.apache.drill.exec.expr.holders.IntervalHolder;
+import org.apache.drill.exec.expr.holders.IntervalYearHolder;
+import org.apache.drill.exec.expr.holders.VarBinaryHolder;
+
+/*
+ * This class is generated using freemarker and the ${.template_name} 
template.
+ */
+@SuppressWarnings("unused")
+@FunctionTemplate(name = "cast${type.to?upper_case}",
+  scope = FunctionTemplate.FunctionScope.SIMPLE,
+  nulls = NullHandling.NULL_IF_NULL)
+public class Cast${type.from}To${type.to} implements DrillSimpleFunc {
+
+  @Param ${type.from}Holder in;
+  @Output ${type.to}Holder out;
+
+  public void setup() {
+  }
+
+  public void eval() {
+final byte[] buf = new byte[in.end - in.start];
+in.buffer.getBytes(in.start, buf, 0, in.end - in.start);
+  <#if type.to == "Interval">
+out.months = 
org.apache.drill.exec.store.parquet.ParquetReaderUtility.getIntFromLEBytes(buf, 
0);
--- End diff --

Why are we using a function from the Parquet library to do the conversion from 
varbinary to interval? Do we assume that the varbinary data always comes from a 
parquet data source?

The current code base has an implementation of the cast from varchar to interval 
(CastVarCharInterval.java).  Can we leverage the existing implementation here? 

If you feel the new approach (using the parquet library's function) is better, 
does it make sense to apply the change to the existing implementation of 
varchar -> interval?



---


[GitHub] drill pull request #805: Drill-4139: Exception while trying to prune partiti...

2017-04-03 Thread jinfengni
Github user jinfengni commented on a diff in the pull request:

https://github.com/apache/drill/pull/805#discussion_r109536003
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java 
---
@@ -1668,9 +1666,16 @@ public ColumnMetadata_v3(String[] name, 
PrimitiveTypeName primitiveType, Object
   return nulls;
 }
 
+/**
+ * Checks that the column chunk has single value.
+ * Returns true if minValue and maxValue are null and
+ * column statistics was available.
+ *
+ * @return true if column has single value
+ */
 @Override
 public boolean hasSingleValue() {
-  return (minValue !=null && maxValue != null && 
minValue.equals(maxValue));
+  return minValue != null ? minValue.equals(maxValue) : (maxValue == 
null && isMetadataSet());
--- End diff --

We probably need a better definition of "column has a single value" before 
going ahead and modifying this function.  Cases we need to consider:

1) minValue / maxValue != null and there is no "null" in the column.
2) minValue / maxValue != null and there is "null" in the column. 
3) minValue/maxValue == null, and there is "null" in the column
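One way to pin down these cases is a sketch of the single-value check along the lines the review converges on (compare the hasSingleValue(rowCount) version discussed elsewhere in this thread), assuming the statistics expose a nulls count and the caller supplies the row count; the class and method names here are illustrative:

```java
import java.util.Objects;

public class SingleValueCheck {
    // A chunk holds a single distinct value when min == max and the chunk is
    // either all non-null (case 1) or all null (case 3); a mix of one value
    // plus some nulls (case 2) is not a single value.
    static boolean hasSingleValue(Object min, Object max, Long nulls, long rowCount) {
        if (nulls == null) {
            return false; // statistics unavailable
        }
        if (min != null) {
            // deepEquals handles byte[] min/max values
            return Objects.deepEquals(min, max) && (nulls == 0 || nulls == rowCount);
        }
        // min and max both null: a single "value" only if every row is null
        return max == null && nulls == rowCount;
    }

    public static void main(String[] args) {
        System.out.println(hasSingleValue("a", "a", 0L, 10));    // case 1: true
        System.out.println(hasSingleValue("a", "a", 3L, 10));    // case 2: false
        System.out.println(hasSingleValue(null, null, 10L, 10)); // case 3: true
    }
}
```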




---


[GitHub] drill pull request #805: Drill 4139: Exception while trying to prune partiti...

2017-04-03 Thread vvysotskyi
GitHub user vvysotskyi opened a pull request:

https://github.com/apache/drill/pull/805

Drill 4139: Exception while trying to prune partition. 
java.lang.UnsupportedOperationException: Unsupported type: BIT & Interval



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/vvysotskyi/drill DRILL-4139

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/805.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #805


commit 14d5d50b072c891db248eba9b093ea63a32fa678
Author: Przemyslaw Maciolek 
Date:   2016-05-31T18:46:53Z

DRILL-4139: Add missing BIT support for Parquet partition pruning

commit bbf37ec96b5bd885bb69e6d47f838355b99fad23
Author: Volodymyr Vysotskyi 
Date:   2017-03-24T16:02:07Z

DRILL-4139: Add missing Interval, VarBinary and Varchar with nulls 
partition pruning support, unit tests




---