[GitHub] drill pull request #805: Drill-4139: Exception while trying to prune partiti...
Github user asfgit closed the pull request at: https://github.com/apache/drill/pull/805 ---
Github user vvysotskyi commented on a diff in the pull request: https://github.com/apache/drill/pull/805#discussion_r140057495

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java ---
```
@@ -1054,8 +1057,36 @@ public void setMax(Object max) {
       return nulls;
     }

-    @Override public boolean hasSingleValue() {
-      return (max != null && min != null && max.equals(min));
+    /**
+     * Checks that the column chunk has a single value.
+     * Returns {@code true} if {@code min} and {@code max} are the same but not null
+     * and nulls count is 0 or equal to the rows count.
+     *
+     * Returns {@code true} if {@code min} and {@code max} are null and the number of null values
+     * in the column chunk is equal to the rows count.
+     *
+     * Comparison of nulls and rows count is needed for the cases:
+     * - column with primitive type has single value and null values
+     * - column with primitive type has only null values; min/max couldn't be null,
+     *   but column has single value
+     *
+     * @param rowCount rows count in column chunk
+     * @return true if column has single value
+     */
+    @Override
+    public boolean hasSingleValue(long rowCount) {
+      if (nulls != null) {
+        if (min != null) {
+          // Objects.deepEquals() is used here, since min and max may be byte arrays
+          return Objects.deepEquals(min, max) && (nulls == 0 || nulls == rowCount);
```
--- End diff ---

Statistics [1] for most Parquet types use Java primitive types to store min and max values, so min/max cannot be null even if the table has null values.

[1] https://github.com/apache/parquet-mr/tree/e54ca615f213f5db6d34d9163c97eec98920d7a7/parquet-column/src/main/java/org/apache/parquet/column/statistics

---
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/805#discussion_r140055857

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java --- (the same `hasSingleValue(long rowCount)` hunk as quoted above)

- if (min != null), then nulls cannot be equal to rowCount
- in this case, only nulls == 0 should be checked

---
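The semantics under discussion can be sketched as a standalone method. This is a hypothetical stand-in mirroring the diffed `hasSingleValue(long)` logic, not Drill's actual `ColumnMetadata` class; `min`, `max`, and `nulls` are plain parameters here:

```java
import java.util.Objects;

public class SingleValueCheck {

    /**
     * Sketch of the single-value check from the diff: a column chunk is
     * single-valued when min/max agree and there are either no nulls or
     * only nulls, or when no min/max is recorded and every row is null.
     */
    static boolean hasSingleValue(Object min, Object max, Long nulls, long rowCount) {
        if (nulls != null) {
            if (min != null) {
                // deepEquals, since min and max may be byte arrays
                return Objects.deepEquals(min, max) && (nulls == 0 || nulls == rowCount);
            }
            // no min/max recorded: single-valued only if the whole chunk is null
            return nulls == rowCount;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasSingleValue(100, 100, 0L, 25));    // single non-null value: true
        System.out.println(hasSingleValue(null, null, 25L, 25)); // all nulls: true
        System.out.println(hasSingleValue(100, 100, 5L, 25));    // a value plus nulls: false
        System.out.println(hasSingleValue(100, 200, 0L, 25));    // two values: false
    }
}
```

The third case is exactly jinfengni's reproduce below: one non-null value plus nulls is two distinct values, so pruning must not treat the chunk as single-valued.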
Github user vvysotskyi commented on a diff in the pull request: https://github.com/apache/drill/pull/805#discussion_r133927602

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
```
@@ -548,23 +567,57 @@ public void populatePruningVector(ValueVector v, int index, SchemaPath column, S
         NullableVarCharVector varCharVector = (NullableVarCharVector) v;
         Object s = partitionValueMap.get(f).get(column);
         byte[] bytes;
-        if (s instanceof String) { // if the metadata was read from a JSON cache file it maybe a string type
-          bytes = ((String) s).getBytes();
-        } else if (s instanceof Binary) {
-          bytes = ((Binary) s).getBytes();
-        } else if (s instanceof byte[]) {
-          bytes = (byte[]) s;
+        if (s == null) {
+          varCharVector.getMutator().setNull(index);
+          return;
         } else {
-          throw new UnsupportedOperationException("Unable to create column data for type: " + type);
+          bytes = getBytes(type, s);
         }
         varCharVector.getMutator().setSafe(index, bytes, 0, bytes.length);
         return;
       }
+      case INTERVAL: {
+        NullableIntervalVector intervalVector = (NullableIntervalVector) v;
+        Object s = partitionValueMap.get(f).get(column);
+        byte[] bytes;
+        if (s == null) {
+          intervalVector.getMutator().setNull(index);
+          return;
+        } else {
+          bytes = getBytes(type, s);
+        }
+        intervalVector.getMutator().setSafe(index, 1,
+            ParquetReaderUtility.getIntFromLEBytes(bytes, 0),
+            ParquetReaderUtility.getIntFromLEBytes(bytes, 4),
+            ParquetReaderUtility.getIntFromLEBytes(bytes, 8));
+        return;
+      }
       default:
         throw new UnsupportedOperationException("Unsupported type: " + type);
     }
   }

+  /**
+   * Returns the sequence of bytes received from {@code Object source}.
+   *
+   * @param type the column type
+   * @param source the source of the bytes sequence
+   * @return bytes sequence obtained from {@code Object source}
+   */
+  private byte[] getBytes(MinorType type, Object source) {
+    byte[] bytes;
+    if (source instanceof String) { // if the metadata was read from a JSON cache file it maybe a string type
+      bytes = Base64.decodeBase64(((String) source).getBytes());
```
--- End diff ---

> Note, however, that the casting is unfortunate. The caller knows the type. Better to have:

The caller does not know the type, since `source` is received from a map of objects, so the cast is needed there.

> Seems we'd also need a to/from Base64 method so that, on read, we do:

The value vector mutator interface does not have a method that allows passing bytes. Such methods are implemented in the concrete mutator implementations, so we cannot create such a to/from Base64 method.

> Strings need not be Base64 encoded

They should be Base64 encoded, since the byte array may have an encoding that differs from UTF-8. I moved this conversion from the string to a byte array into the `ParquetReaderUtility.correctBinaryInMetadataCache()` method, so now we consider the metadata version and choose how to decode the string.

---
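The encoding concern in this exchange can be shown with a small round-trip. This is an illustrative sketch using the JDK's `java.util.Base64` rather than the commons-codec `Base64` the PR uses; the byte values are arbitrary examples chosen to be invalid UTF-8:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

public class BinaryRoundTrip {

    // Base64 text survives a trip through a JSON string field unchanged
    static byte[] roundTripBase64(byte[] in) {
        return Base64.getDecoder().decode(Base64.getEncoder().encodeToString(in));
    }

    // a naive bytes -> String -> bytes trip substitutes U+FFFD for invalid UTF-8 sequences
    static byte[] roundTripString(byte[] in) {
        return new String(in, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // arbitrary binary min/max bytes that are NOT valid UTF-8
        byte[] binary = {(byte) 0xC3, 0x28, 0x00, (byte) 0xFF};
        System.out.println(Arrays.equals(binary, roundTripBase64(binary))); // true
        System.out.println(Arrays.equals(binary, roundTripString(binary))); // false
    }
}
```

This is why the metadata-cache fix Base64-encodes binary min/max values rather than storing them as raw strings: the lossy String round-trip is exactly what corrupted the cached statistics.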
Github user vvysotskyi commented on a diff in the pull request: https://github.com/apache/drill/pull/805#discussion_r126637301

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
```
@@ -444,123 +478,235 @@ public MajorType getTypeForColumn(SchemaPath schemaPath) {
   public void populatePruningVector(ValueVector v, int index, SchemaPath column, String file) {
     String f = Path.getPathWithoutSchemeAndAuthority(new Path(file)).toString();
-    MinorType type = getTypeForColumn(column).getMinorType();
+    MajorType majorType = getTypeForColumn(column);
+    MinorType type = majorType.getMinorType();
     switch (type) {
+      case BIT: {
+        NullableBitVector bitVector = (NullableBitVector) v;
+        Boolean value = (Boolean) partitionValueMap.get(f).get(column);
+        if (value == null) {
+          bitVector.getMutator().setNull(index);
+        } else {
+          bitVector.getMutator().setSafe(index, value ? 1 : 0);
+        }
+        return;
+      }
       case INT: {
         NullableIntVector intVector = (NullableIntVector) v;
         Integer value = (Integer) partitionValueMap.get(f).get(column);
-        intVector.getMutator().setSafe(index, value);
+        if (value == null) {
+          intVector.getMutator().setNull(index);
+        } else {
+          intVector.getMutator().setSafe(index, value);
+        }
         return;
       }
       case SMALLINT: {
         NullableSmallIntVector smallIntVector = (NullableSmallIntVector) v;
         Integer value = (Integer) partitionValueMap.get(f).get(column);
-        smallIntVector.getMutator().setSafe(index, value.shortValue());
+        if (value == null) {
+          smallIntVector.getMutator().setNull(index);
+        } else {
+          smallIntVector.getMutator().setSafe(index, value.shortValue());
+        }
         return;
       }
       case TINYINT: {
         NullableTinyIntVector tinyIntVector = (NullableTinyIntVector) v;
         Integer value = (Integer) partitionValueMap.get(f).get(column);
-        tinyIntVector.getMutator().setSafe(index, value.byteValue());
+        if (value == null) {
+          tinyIntVector.getMutator().setNull(index);
+        } else {
+          tinyIntVector.getMutator().setSafe(index, value.byteValue());
+        }
         return;
       }
       case UINT1: {
         NullableUInt1Vector intVector = (NullableUInt1Vector) v;
         Integer value = (Integer) partitionValueMap.get(f).get(column);
-        intVector.getMutator().setSafe(index, value.byteValue());
+        if (value == null) {
+          intVector.getMutator().setNull(index);
+        } else {
+          intVector.getMutator().setSafe(index, value.byteValue());
+        }
         return;
       }
       case UINT2: {
         NullableUInt2Vector intVector = (NullableUInt2Vector) v;
         Integer value = (Integer) partitionValueMap.get(f).get(column);
-        intVector.getMutator().setSafe(index, (char) value.shortValue());
+        if (value == null) {
+          intVector.getMutator().setNull(index);
+        } else {
+          intVector.getMutator().setSafe(index, (char) value.shortValue());
+        }
         return;
       }
       case UINT4: {
         NullableUInt4Vector intVector = (NullableUInt4Vector) v;
         Integer value = (Integer) partitionValueMap.get(f).get(column);
-        intVector.getMutator().setSafe(index, value);
+        if (value == null) {
+          intVector.getMutator().setNull(index);
+        } else {
+          intVector.getMutator().setSafe(index, value);
+        }
         return;
       }
       case BIGINT: {
         NullableBigIntVector bigIntVector = (NullableBigIntVector) v;
         Long value = (Long) partitionValueMap.get(f).get(column);
-        bigIntVector.getMutator().setSafe(index, value);
+        if (value == null) {
+          bigIntVector.getMutator().setNull(index);
+        } else {
+          bigIntVector.getMutator().setSafe(index, value);
+        }
         return;
       }
       case FLOAT4: {
         NullableFloat4Vector float4Vector = (NullableFloat4Vector) v;
         Float value = (Float) partitionValueMap.get(f).get(column);
-        float4Vector.getMutator().setSafe(index, value);
+        if (value == null) {
+          float4Vector.getMutator().setNull(index);
+        } else {
+          float4Vector.getMutator().setSafe(index, value);
+        }
         return;
       }
       case FLOAT8: {
         NullableFloat8Vector float8Vector =
```
---
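Every case in this hunk repeats the same shape. A minimal stand-alone sketch of that pattern, where the `IntMutator` interface and its array-backed implementation are hypothetical simplifications rather than Drill's vector API:

```java
public class NullAwarePopulate {

    // hypothetical simplification of a nullable vector's mutator
    interface IntMutator {
        void setNull(int index);
        void setSafe(int index, int value);
    }

    // toy array-backed implementation: a null slot stands for a null value
    static class ArrayMutator implements IntMutator {
        final Integer[] slots;
        ArrayMutator(int size) { slots = new Integer[size]; }
        public void setNull(int index) { slots[index] = null; }
        public void setSafe(int index, int value) { slots[index] = value; }
    }

    /** The per-case pattern from the diff: null partition value -> setNull, else setSafe. */
    static void populate(IntMutator mutator, int index, Integer value) {
        if (value == null) {
            mutator.setNull(index);
        } else {
            mutator.setSafe(index, value);
        }
    }

    public static void main(String[] args) {
        ArrayMutator m = new ArrayMutator(2);
        populate(m, 0, 100);
        // before the fix, passing a null Integer straight to setSafe would
        // throw a NullPointerException on unboxing
        populate(m, 1, null);
        System.out.println(java.util.Arrays.toString(m.slots)); // [100, null]
    }
}
```

The null check is what the fix adds: the old code unboxed the `Integer` (or called `value.shortValue()` etc.) unconditionally, which is the exception DRILL-4139 reports when a partition column contains only nulls.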
Github user vvysotskyi commented on a diff in the pull request: https://github.com/apache/drill/pull/805#discussion_r133933369

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java ---
```
@@ -181,6 +186,101 @@ else if (parquetTableMetadata instanceof Metadata.ParquetTableMetadata_v2 &&
   }

   /**
+   * Checks that the metadata file has version less than
+   * the version where was changed the serialization of BINARY values
+   * and assigns byte arrays to min/max values obtained from the deserialized string.
+   *
+   * @param parquetTableMetadata table metadata that should be corrected
+   */
```
--- End diff ---

The `ColumnMetadata_v1` class serializes / deserializes the `primitiveType` field, so it is not hard to check whether a field has the `BINARY` type. But `ColumnMetadata_v2` and `ColumnMetadata_v3` do not serialize this field. They have the classes `ColumnTypeMetadata_v2` and `ColumnTypeMetadata_v3` respectively, which contain information about the field name and the primitive and original types. In this method, we find the fields with `BINARY` type by taking the type information from `ColumnTypeMetadata_v2` or `ColumnTypeMetadata_v3`.

---
Github user vvysotskyi commented on a diff in the pull request: https://github.com/apache/drill/pull/805#discussion_r126640837

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java --- (the same `populatePruningVector` hunk as quoted above)

---
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/805#discussion_r126565179

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java --- (the same `populatePruningVector` hunk as quoted above)

---
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/805#discussion_r126568096

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java --- (the same javadoc hunk as quoted above)

This is very hard to follow for those of us not familiar with the details of the metadata file format. Please include a comment in the PR conversation that shows an example of the old and new formats. That will make it easy to double-check that the logic makes sense.

---
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/805#discussion_r126567484

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java --- (the same `getBytes(MinorType, Object)` hunk as quoted above)

The data here should be assumed to be binary. Binary can have any characters, even JSON symbols. For example:

```
"value:" : "look a quote",\n"bogus" : { 10, "foo" }"
```

The above will make a mash of any JSON structure. Better to encode binary data as Base64, which is designed to fit within a text format.

Note, however, that the casting is unfortunate. The caller knows the type. Better to have:

```
byte[] getBytes(String value) ...
byte[] getBytes(byte[] value) ...
byte[] getBytes(Binary value) ...
```

Seems we'd also need a to/from Base64 method so that, on read, we do:
* Get the string value.
* Pass the string value to a Base64 decoder, get bytes.
* Pass the bytes to the vector, using the vector's code to understand the byte meaning.

On encoding:
* Get the vector value as a byte array.
* Base64 encode the array.
* Write it to JSON as a string.

All that said, strings need not be Base64 encoded if we have a JSON encoder that will escape JSON special characters. This works as long as the JSON file itself is UTF-8 encoded, because then each Unicode string has a UTF-8 equivalent.

---
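The overload suggestion above can be sketched briefly. This is a hypothetical shape only: the parquet `Binary` overload is omitted so the sketch stays self-contained, and the JDK's `java.util.Base64` stands in for the commons-codec `Base64` used in the PR:

```java
import java.util.Base64;

public class GetBytesOverloads {

    // string form: the metadata came from the JSON cache, so it is Base64 text
    static byte[] getBytes(String value) {
        return Base64.getDecoder().decode(value);
    }

    // raw form: already bytes, pass through unchanged
    static byte[] getBytes(byte[] value) {
        return value;
    }

    public static void main(String[] args) {
        byte[] raw = {1, 2, 3};
        String cached = Base64.getEncoder().encodeToString(raw); // "AQID"
        // both paths yield the same bytes; the compiler picks the overload
        System.out.println(java.util.Arrays.equals(getBytes(cached), getBytes(raw))); // true
    }
}
```

The trade-off discussed in the thread: overloads push the type decision to the caller at compile time, but here the value comes out of a `Map<SchemaPath, Object>`, so the runtime `instanceof` dispatch in the PR cannot simply be replaced.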
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/805#discussion_r126564587

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java --- (the same `populatePruningVector` hunk as quoted above)

---
Github user vvysotskyi commented on a diff in the pull request: https://github.com/apache/drill/pull/805#discussion_r126443099

--- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetMetadataCache.java ---
```
@@ -445,12 +448,148 @@ public void testMetadataCacheAbsolutePaths() throws Exception {
       test("drop table if exists %s", RELATIVE_PATHS_METADATA);
     }
   }
+  @Test // DRILL-4139
+  public void testBooleanPartitionPruning() throws Exception {
```
--- End diff ---

Thanks, fixed.

---
Github user vvysotskyi commented on a diff in the pull request: https://github.com/apache/drill/pull/805#discussion_r126438246

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java ---
```
@@ -1008,8 +1008,24 @@ public void setMax(Object max) {
       return nulls;
     }

-    @Override public boolean hasSingleValue() {
-      return (max != null && min != null && max.equals(min));
+    /**
+     * Checks that the column chunk has single value.
+     * Returns true if min and max are the same, but not null.
+     * Returns true if min and max are null and the number of null values
+     * in the column chunk is greater than 0.
+     *
+     * @return true if column has single value
```
--- End diff ---

Thanks for such a detailed analysis. This bug is also reproduced on the master version, but there it appears due to the wrong result of this method. I added a comparison of `nulls` with the rows count to fix this bug and to allow returning true for the case when the field has only nulls.

---
Github user jinfengni commented on a diff in the pull request: https://github.com/apache/drill/pull/805#discussion_r125785209

--- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetMetadataCache.java ---
```
@@ -445,12 +448,148 @@ public void testMetadataCacheAbsolutePaths() throws Exception {
       test("drop table if exists %s", RELATIVE_PATHS_METADATA);
     }
   }
+  @Test // DRILL-4139
+  public void testBooleanPartitionPruning() throws Exception {
```
--- End diff ---

In addition to the "=" comparison, can you please try the "is null" / "is not null" predicates, since one of this PR's objectives is to fix partition pruning dealing with null values. I saw another incorrect query result, although I'm not sure if it's a bug in the base code or a bug in this PR.

```
create table dfs.tmp.`t6/a` as select col_notexist as mykey from cp.`tpch/region.parquet`;
create table dfs.tmp.`t6/b` as select 100 as mykey from cp.`tpch/region.parquet`;

select mykey from dfs.tmp.t6;
+--------+
| mykey  |
+--------+
| null   |
| null   |
| null   |
| null   |
| null   |
| 100    |
| 100    |
| 100    |
| 100    |
| 100    |
+--------+

select mykey from dfs.tmp.t6 where mykey is null;
+--------+
| mykey  |
+--------+
+--------+
```

Notice that the last query returns 0 rows, where it should return 5 rows.

---
Github user jinfengni commented on a diff in the pull request: https://github.com/apache/drill/pull/805#discussion_r125784283 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java --- @@ -1696,9 +1712,24 @@ public ColumnMetadata_v3(String[] name, PrimitiveTypeName primitiveType, Object return nulls; } +/** + * Checks that the column chunk has single value. + * Returns true if minValue and maxValue are the same, but not null. + * Returns true if minValue and maxValue are null and the number of null values + * in the column chunk is greater than 0. + * + * @return true if column has single value + */ @Override public boolean hasSingleValue() { - return (minValue !=null && maxValue != null && minValue.equals(maxValue)); + if (nulls != null) { +if (minValue != null) { + return minValue.equals(maxValue); --- End diff -- Same comment as above. I think in both v1 & v3, we should check nulls == 0 (there are no nulls in the column). ---
Github user jinfengni commented on a diff in the pull request: https://github.com/apache/drill/pull/805#discussion_r125784066 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java --- @@ -1008,8 +1008,24 @@ public void setMax(Object max) { return nulls; } -@Override public boolean hasSingleValue() { - return (max != null && min != null && max.equals(min)); +/** + * Checks that the column chunk has single value. + * Returns true if min and max are the same, but not null. + * Returns true if min and max are null and the number of null values + * in the column chunk is greater than 0. + * + * @return true if column has single value --- End diff -- My understanding is that hasSingleValue() returns true if the column metadata shows only one single value. A null value is also counted as a value different from any non-null value. Therefore, for the case where the column has min != null && max != null && min.equals(max) && nulls != null && nulls > 0, it should return false. However, in both the v1 and v3 implementations, it would return true. That would actually lead to wrong query results. A simple reproduce:
```
create table dfs.tmp.`t5/a` as select 100 as mykey from cp.`tpch/nation.parquet` union all select col_notexist from cp.`tpch/region.parquet`;
create table dfs.tmp.`t5/b` as select 200 as mykey from cp.`tpch/nation.parquet` union all select col_notexist from cp.`tpch/region.parquet`;
```
We got two files, each having one single unique non-null value, plus null values. Now query the two files:
```
select mykey from dfs.tmp.`t5` where mykey = 100;
+--------+
| mykey  |
+--------+
| 100    |
| 100    |
| 100    |
| 100    |
| 100    |
| 100    |
| 100    |
| 100    |
| 100    |
| 100    |
| 100    |
| 100    |
| 100    |
| 100    |
| 100    |
| 100    |
| 100    |
| 100    |
| 100    |
| 100    |
| 100    |
| 100    |
| 100    |
| 100    |
| 100    |
| null   |
| null   |
| null   |
| null   |
| null   |
+--------+
30 rows selected (0.246 seconds)
```
Apparently, those 5 nulls should not be returned. I applied the 3 commits in this PR on top of today's master branch.
```
select * from sys.version;
+------------------+-------------------------------------------+-------------------------------------------------------------------------------+----------------------------+------------------+----------------------------+
| version          | commit_id                                 | commit_message                                                                | commit_time                | build_email      | build_time                 |
+------------------+-------------------------------------------+-------------------------------------------------------------------------------+----------------------------+------------------+----------------------------+
| 1.11.0-SNAPSHOT  | cad6e4dc950aa4a95ad20515ce5abd9c546d3e5d  | DRILL-4139: Fix loss of scale value for DECIMAL in parquet partition pruning  | 05.07.2017 @ 12:05:25 PDT  | j...@apache.org   | 05.07.2017 @ 12:06:07 PDT  |
+------------------+-------------------------------------------+-------------------------------------------------------------------------------+----------------------------+------------------+----------------------------+
```
---
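The corrected single-value check that this discussion converges on can be sketched as follows. This is a hedged, standalone sketch rather than Drill's actual `ColumnMetadata` implementation (the class name and method signature here are illustrative): a column chunk is single-valued only when every row carries the same value, i.e. either all rows hold the one non-null value (`nulls == 0`), or the single value is null itself (`nulls == rowCount`).

```java
import java.util.Objects;

public class SingleValueCheck {

    // min/max may be boxed primitives or byte arrays, hence deepEquals.
    static boolean hasSingleValue(Object min, Object max, Long nulls, long rowCount) {
        if (nulls == null) {
            return false; // null count unknown: cannot prove a single value
        }
        if (min != null) {
            // single-valued only if min == max AND nulls don't add a second distinct value
            return Objects.deepEquals(min, max) && (nulls == 0 || nulls == rowCount);
        }
        // min == null: single-valued only if every row is null
        return max == null && nulls == rowCount;
    }

    public static void main(String[] args) {
        // the t5 reproduce above: 25 rows of 100 plus 5 nulls is NOT single-valued
        System.out.println(hasSingleValue(100, 100, 5L, 30L)); // false
        // an all-null column chunk IS single-valued
        System.out.println(hasSingleValue(null, null, 5L, 5L)); // true
        // a plain constant column without nulls
        System.out.println(hasSingleValue(100, 100, 0L, 25L)); // true
    }
}
```

Under this logic the wrong pruning result above disappears: the file with one constant plus nulls no longer pretends to hold a single value.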
Github user vvysotskyi commented on a diff in the pull request: https://github.com/apache/drill/pull/805#discussion_r118204069 --- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetMetadataCache.java --- @@ -398,10 +399,115 @@ public void testDrill4877() throws Exception { } + @Test // DRILL-4139 + public void testBooleanPartitionPruning() throws Exception { +final String boolPartitionTable = "dfs_test.tmp.`interval_bool_partition`"; +try { + test("create table %s partition by (col_bln) as " + +"select * from cp.`parquet/alltypes_required.parquet`", boolPartitionTable); + test("refresh table metadata %s", boolPartitionTable); --- End diff -- Thanks, modified the tests to cover those cases. ---
Github user vvysotskyi commented on a diff in the pull request: https://github.com/apache/drill/pull/805#discussion_r117748249 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java --- @@ -548,23 +567,57 @@ public void populatePruningVector(ValueVector v, int index, SchemaPath column, S NullableVarCharVector varCharVector = (NullableVarCharVector) v; Object s = partitionValueMap.get(f).get(column); byte[] bytes; -if (s instanceof String) { // if the metadata was read from a JSON cache file it maybe a string type - bytes = ((String) s).getBytes(); -} else if (s instanceof Binary) { - bytes = ((Binary) s).getBytes(); -} else if (s instanceof byte[]) { - bytes = (byte[]) s; +if (s == null) { + varCharVector.getMutator().setNull(index); + return; } else { - throw new UnsupportedOperationException("Unable to create column data for type: " + type); + bytes = getBytes(type, s); } varCharVector.getMutator().setSafe(index, bytes, 0, bytes.length); return; } + case INTERVAL: { +NullableIntervalVector intervalVector = (NullableIntervalVector) v; +Object s = partitionValueMap.get(f).get(column); +byte[] bytes; +if (s == null) { + intervalVector.getMutator().setNull(index); + return; +} else { + bytes = getBytes(type, s); +} +intervalVector.getMutator().setSafe(index, 1, + ParquetReaderUtility.getIntFromLEBytes(bytes, 0), + ParquetReaderUtility.getIntFromLEBytes(bytes, 4), + ParquetReaderUtility.getIntFromLEBytes(bytes, 8)); +return; + } default: throw new UnsupportedOperationException("Unsupported type: " + type); } } + /** + * Returns the sequence of bytes received from {@code Object source}. 
+ * + * @param type the column type + * @param source the source of the bytes sequence + * @return bytes sequence obtained from {@code Object source} + */ + private byte[] getBytes(MinorType type, Object source) { +byte[] bytes; +if (source instanceof String) { // if the metadata was read from a JSON cache file it maybe a string type + bytes = Base64.decodeBase64(((String) source).getBytes()); --- End diff -- It is necessary for the cases where the value has an encoding other than UTF-8. When a string is created from a byte array (for example, one with little-endian byte order) without manually setting the charset, getBytes() would return a byte array that differs from the one passed into the String constructor. ---
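For context on the byte-order handling in the INTERVAL branch quoted above: Parquet stores INTERVAL as a 12-byte fixed_len_byte_array holding three little-endian unsigned ints (months, days, milliseconds), which is what the `getIntFromLEBytes` calls at offsets 0, 4 and 8 unpack. A minimal standalone sketch of that decoding, using `java.nio` rather than Drill's `ParquetReaderUtility` (the values below are made up for illustration):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class IntervalBytes {

    // Read a little-endian int at the given offset, analogous to what
    // ParquetReaderUtility.getIntFromLEBytes does in the diff above.
    static int intFromLEBytes(byte[] bytes, int offset) {
        return ByteBuffer.wrap(bytes, offset, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
    }

    public static void main(String[] args) {
        // a hypothetical interval of 14 months, 3 days, 5000 milliseconds,
        // laid out as three little-endian ints in a 12-byte array
        byte[] interval = ByteBuffer.allocate(12).order(ByteOrder.LITTLE_ENDIAN)
                .putInt(14).putInt(3).putInt(5000).array();

        System.out.println(intFromLEBytes(interval, 0)); // months: 14
        System.out.println(intFromLEBytes(interval, 4)); // days: 3
        System.out.println(intFromLEBytes(interval, 8)); // millis: 5000
    }
}
```

Because these bytes are raw little-endian data and not text, they only survive a trip through a JSON string if they are Base64-encoded, which motivates the `decodeBase64` call being discussed.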
Github user vvysotskyi commented on a diff in the pull request: https://github.com/apache/drill/pull/805#discussion_r118203458 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java --- @@ -181,6 +185,71 @@ else if (parquetTableMetadata instanceof Metadata.ParquetTableMetadata_v2 && } /** + * Checks that the metadata file was created by drill with version less than + * the version where was changed the serialization of BINARY values + * and assigns byte arrays to min/max values obtained from the deserialized string. + * + * @param parquetTableMetadata table metadata that should be corrected + */ + public static void correctBinaryInMetadataCache(Metadata.ParquetTableMetadataBase parquetTableMetadata) { +if (hasOldBinarySerialization(parquetTableMetadata)) { + Setnames = Sets.newHashSet(); + if (parquetTableMetadata instanceof Metadata.ParquetTableMetadata_v2) { --- End diff -- Added a check for ParquetTableMetadata_v3, thanks. In the ColumnMetadata_v1 class the field `primitiveType` is serialized to the metadata cache file, so it is easier to check the column type for this metadata version. ---
Github user vvysotskyi commented on a diff in the pull request: https://github.com/apache/drill/pull/805#discussion_r117750872 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java --- @@ -181,6 +185,71 @@ else if (parquetTableMetadata instanceof Metadata.ParquetTableMetadata_v2 && } /** + * Checks that the metadata file was created by drill with version less than + * the version where was changed the serialization of BINARY values --- End diff -- Yes, it is happening only in the metadata cache file. The serialization was wrong at least for the interval and decimal types, since they are encoded using little-endian and big-endian byte orders. ---
Github user vvysotskyi commented on a diff in the pull request: https://github.com/apache/drill/pull/805#discussion_r117737970 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java --- @@ -508,21 +516,32 @@ public void populatePruningVector(ValueVector v, int index, SchemaPath column, S NullableVarBinaryVector varBinaryVector = (NullableVarBinaryVector) v; Object s = partitionValueMap.get(f).get(column); byte[] bytes; -if (s instanceof Binary) { - bytes = ((Binary) s).getBytes(); -} else if (s instanceof String) { - bytes = ((String) s).getBytes(); -} else if (s instanceof byte[]) { - bytes = (byte[]) s; +if (s == null) { + varBinaryVector.getMutator().setNull(index); + return; } else { - throw new UnsupportedOperationException("Unable to create column data for type: " + type); + bytes = getBytes(type, s); } varBinaryVector.getMutator().setSafe(index, bytes, 0, bytes.length); return; } case DECIMAL18: { NullableDecimal18Vector decimalVector = (NullableDecimal18Vector) v; -Long value = (Long) partitionValueMap.get(f).get(column); +Object s = partitionValueMap.get(f).get(column); +byte[] bytes; +if (s == null) { + decimalVector.getMutator().setNull(index); + return; +} else if (s instanceof Integer) { + decimalVector.getMutator().setSafe(index, (Integer) s); + return; +} else if (s instanceof Long) { + decimalVector.getMutator().setSafe(index, (Long) s); + return; +} else { + bytes = getBytes(type, s); --- End diff -- Bytes will be received from partitionValueMap when decimal values are stored in a parquet file as fixed_len_byte_array or binary. There are unit tests that cover this case and the cases where decimal values are stored as int32 and int64. ---
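As background for the binary-backed decimal case mentioned in the reply: when Parquet stores a DECIMAL as fixed_len_byte_array or binary, the unscaled value is a big-endian two's-complement integer, so bytes coming out of `partitionValueMap` ultimately have to be turned back into a number. A hedged sketch of that conversion (not the actual Drill code path; the class and method names are illustrative):

```java
import java.math.BigInteger;

public class DecimalBytes {

    // Recover the unscaled long from big-endian two's-complement bytes,
    // the representation Parquet uses for binary-backed DECIMAL values.
    static long unscaledFromBytes(byte[] bytes) {
        return new BigInteger(bytes).longValue();
    }

    public static void main(String[] args) {
        // 0x01F4 == 500; with a scale of 2 this would represent 5.00
        System.out.println(unscaledFromBytes(new byte[]{0x01, (byte) 0xF4})); // 500
        // negative values also round-trip thanks to two's complement
        System.out.println(unscaledFromBytes(new byte[]{(byte) 0xFF, (byte) 0xFF})); // -1
    }
}
```

Note the big-endian order here versus the little-endian layout of INTERVAL bytes: mixing up the two byte orders is exactly the kind of serialization bug this PR corrects in the metadata cache.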
Github user vvysotskyi commented on a diff in the pull request: https://github.com/apache/drill/pull/805#discussion_r117736018 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java --- @@ -508,21 +516,32 @@ public void populatePruningVector(ValueVector v, int index, SchemaPath column, S NullableVarBinaryVector varBinaryVector = (NullableVarBinaryVector) v; Object s = partitionValueMap.get(f).get(column); byte[] bytes; -if (s instanceof Binary) { - bytes = ((Binary) s).getBytes(); -} else if (s instanceof String) { - bytes = ((String) s).getBytes(); -} else if (s instanceof byte[]) { - bytes = (byte[]) s; +if (s == null) { + varBinaryVector.getMutator().setNull(index); + return; } else { - throw new UnsupportedOperationException("Unable to create column data for type: " + type); + bytes = getBytes(type, s); } varBinaryVector.getMutator().setSafe(index, bytes, 0, bytes.length); return; } case DECIMAL18: { NullableDecimal18Vector decimalVector = (NullableDecimal18Vector) v; -Long value = (Long) partitionValueMap.get(f).get(column); +Object s = partitionValueMap.get(f).get(column); --- End diff -- Done. ---
Github user vvysotskyi commented on a diff in the pull request: https://github.com/apache/drill/pull/805#discussion_r118203553 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java --- @@ -181,6 +185,71 @@ else if (parquetTableMetadata instanceof Metadata.ParquetTableMetadata_v2 && } /** + * Checks that the metadata file was created by drill with version less than + * the version where was changed the serialization of BINARY values + * and assigns byte arrays to min/max values obtained from the deserialized string. + * + * @param parquetTableMetadata table metadata that should be corrected + */ + public static void correctBinaryInMetadataCache(Metadata.ParquetTableMetadataBase parquetTableMetadata) { +if (hasOldBinarySerialization(parquetTableMetadata)) { + Setnames = Sets.newHashSet(); + if (parquetTableMetadata instanceof Metadata.ParquetTableMetadata_v2) { +for (Metadata.ColumnTypeMetadata_v2 columnTypeMetadata : +((Metadata.ParquetTableMetadata_v2) parquetTableMetadata).columnTypeInfo.values()) { + if (columnTypeMetadata.primitiveType == PrimitiveTypeName.BINARY) { +names.add(Arrays.asList(columnTypeMetadata.name)); + } +} + } + for (Metadata.ParquetFileMetadata file : parquetTableMetadata.getFiles()) { +// Drill has only ever written a single row group per file, only need to correct the statistics +// on the first row group +Metadata.RowGroupMetadata rowGroupMetadata = file.getRowGroups().get(0); --- End diff -- Thanks, fixed it. ---
Github user jinfengni commented on a diff in the pull request: https://github.com/apache/drill/pull/805#discussion_r117582970 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java --- @@ -508,21 +516,32 @@ public void populatePruningVector(ValueVector v, int index, SchemaPath column, S NullableVarBinaryVector varBinaryVector = (NullableVarBinaryVector) v; Object s = partitionValueMap.get(f).get(column); byte[] bytes; -if (s instanceof Binary) { - bytes = ((Binary) s).getBytes(); -} else if (s instanceof String) { - bytes = ((String) s).getBytes(); -} else if (s instanceof byte[]) { - bytes = (byte[]) s; +if (s == null) { + varBinaryVector.getMutator().setNull(index); + return; } else { - throw new UnsupportedOperationException("Unable to create column data for type: " + type); + bytes = getBytes(type, s); } varBinaryVector.getMutator().setSafe(index, bytes, 0, bytes.length); return; } case DECIMAL18: { NullableDecimal18Vector decimalVector = (NullableDecimal18Vector) v; -Long value = (Long) partitionValueMap.get(f).get(column); +Object s = partitionValueMap.get(f).get(column); --- End diff -- If the patch also changes DECIMAL18's partition pruning, please modify the title of the JIRA to reflect that change. ---
Github user jinfengni commented on a diff in the pull request: https://github.com/apache/drill/pull/805#discussion_r117583804 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java --- @@ -181,6 +185,71 @@ else if (parquetTableMetadata instanceof Metadata.ParquetTableMetadata_v2 && } /** + * Checks that the metadata file was created by drill with version less than + * the version where was changed the serialization of BINARY values + * and assigns byte arrays to min/max values obtained from the deserialized string. + * + * @param parquetTableMetadata table metadata that should be corrected + */ + public static void correctBinaryInMetadataCache(Metadata.ParquetTableMetadataBase parquetTableMetadata) { +if (hasOldBinarySerialization(parquetTableMetadata)) { + Setnames = Sets.newHashSet(); + if (parquetTableMetadata instanceof Metadata.ParquetTableMetadata_v2) { +for (Metadata.ColumnTypeMetadata_v2 columnTypeMetadata : +((Metadata.ParquetTableMetadata_v2) parquetTableMetadata).columnTypeInfo.values()) { + if (columnTypeMetadata.primitiveType == PrimitiveTypeName.BINARY) { +names.add(Arrays.asList(columnTypeMetadata.name)); + } +} + } + for (Metadata.ParquetFileMetadata file : parquetTableMetadata.getFiles()) { +// Drill has only ever written a single row group per file, only need to correct the statistics +// on the first row group +Metadata.RowGroupMetadata rowGroupMetadata = file.getRowGroups().get(0); --- End diff -- It's true that parquet files created by Drill have a single row group per file. But the metadata file could be created from parquet files from another source, so that assumption may not hold. ---
Github user jinfengni commented on a diff in the pull request: https://github.com/apache/drill/pull/805#discussion_r117585451 --- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetMetadataCache.java --- @@ -398,10 +399,115 @@ public void testDrill4877() throws Exception { } + @Test // DRILL-4139 + public void testBooleanPartitionPruning() throws Exception { +final String boolPartitionTable = "dfs_test.tmp.`interval_bool_partition`"; +try { + test("create table %s partition by (col_bln) as " + +"select * from cp.`parquet/alltypes_required.parquet`", boolPartitionTable); + test("refresh table metadata %s", boolPartitionTable); --- End diff -- We probably want to cover the cases where the metadata cache file is not created (run the query before calling "refresh table metadata"). ---
Github user jinfengni commented on a diff in the pull request: https://github.com/apache/drill/pull/805#discussion_r117584586 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java --- @@ -181,6 +185,71 @@ else if (parquetTableMetadata instanceof Metadata.ParquetTableMetadata_v2 && } /** + * Checks that the metadata file was created by drill with version less than + * the version where was changed the serialization of BINARY values + * and assigns byte arrays to min/max values obtained from the deserialized string. + * + * @param parquetTableMetadata table metadata that should be corrected + */ + public static void correctBinaryInMetadataCache(Metadata.ParquetTableMetadataBase parquetTableMetadata) { +if (hasOldBinarySerialization(parquetTableMetadata)) { + Setnames = Sets.newHashSet(); + if (parquetTableMetadata instanceof Metadata.ParquetTableMetadata_v2) { --- End diff -- Any reason you only check v2 here? ---
Github user jinfengni commented on a diff in the pull request: https://github.com/apache/drill/pull/805#discussion_r117583114 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java --- @@ -508,21 +516,32 @@ public void populatePruningVector(ValueVector v, int index, SchemaPath column, S NullableVarBinaryVector varBinaryVector = (NullableVarBinaryVector) v; Object s = partitionValueMap.get(f).get(column); byte[] bytes; -if (s instanceof Binary) { - bytes = ((Binary) s).getBytes(); -} else if (s instanceof String) { - bytes = ((String) s).getBytes(); -} else if (s instanceof byte[]) { - bytes = (byte[]) s; +if (s == null) { + varBinaryVector.getMutator().setNull(index); + return; } else { - throw new UnsupportedOperationException("Unable to create column data for type: " + type); + bytes = getBytes(type, s); } varBinaryVector.getMutator().setSafe(index, bytes, 0, bytes.length); return; } case DECIMAL18: { NullableDecimal18Vector decimalVector = (NullableDecimal18Vector) v; -Long value = (Long) partitionValueMap.get(f).get(column); +Object s = partitionValueMap.get(f).get(column); +byte[] bytes; +if (s == null) { + decimalVector.getMutator().setNull(index); + return; +} else if (s instanceof Integer) { + decimalVector.getMutator().setSafe(index, (Integer) s); + return; +} else if (s instanceof Long) { + decimalVector.getMutator().setSafe(index, (Long) s); + return; +} else { + bytes = getBytes(type, s); --- End diff -- For DECIMAL18, under what kind of scenarios would we get bytes from partitionValueMap? ---
Github user jinfengni commented on a diff in the pull request: https://github.com/apache/drill/pull/805#discussion_r117584380 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java --- @@ -181,6 +185,71 @@ else if (parquetTableMetadata instanceof Metadata.ParquetTableMetadata_v2 && } /** + * Checks that the metadata file was created by drill with version less than + * the version where was changed the serialization of BINARY values --- End diff -- Can you explain a bit more about why the serialization of BINARY values is wrong in prior versions? Is it only happening in the metadata cache file? ---
Github user jinfengni commented on a diff in the pull request: https://github.com/apache/drill/pull/805#discussion_r117578486 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java --- @@ -548,23 +567,57 @@ public void populatePruningVector(ValueVector v, int index, SchemaPath column, S NullableVarCharVector varCharVector = (NullableVarCharVector) v; Object s = partitionValueMap.get(f).get(column); byte[] bytes; -if (s instanceof String) { // if the metadata was read from a JSON cache file it maybe a string type - bytes = ((String) s).getBytes(); -} else if (s instanceof Binary) { - bytes = ((Binary) s).getBytes(); -} else if (s instanceof byte[]) { - bytes = (byte[]) s; +if (s == null) { + varCharVector.getMutator().setNull(index); + return; } else { - throw new UnsupportedOperationException("Unable to create column data for type: " + type); + bytes = getBytes(type, s); } varCharVector.getMutator().setSafe(index, bytes, 0, bytes.length); return; } + case INTERVAL: { +NullableIntervalVector intervalVector = (NullableIntervalVector) v; +Object s = partitionValueMap.get(f).get(column); +byte[] bytes; +if (s == null) { + intervalVector.getMutator().setNull(index); + return; +} else { + bytes = getBytes(type, s); +} +intervalVector.getMutator().setSafe(index, 1, + ParquetReaderUtility.getIntFromLEBytes(bytes, 0), + ParquetReaderUtility.getIntFromLEBytes(bytes, 4), + ParquetReaderUtility.getIntFromLEBytes(bytes, 8)); +return; + } default: throw new UnsupportedOperationException("Unsupported type: " + type); } } + /** + * Returns the sequence of bytes received from {@code Object source}. 
+ * + * @param type the column type + * @param source the source of the bytes sequence + * @return bytes sequence obtained from {@code Object source} + */ + private byte[] getBytes(MinorType type, Object source) { +byte[] bytes; +if (source instanceof String) { // if the metadata was read from a JSON cache file it maybe a string type + bytes = Base64.decodeBase64(((String) source).getBytes()); --- End diff -- Any reason you call Base64.decodeBase64 instead of calling String.getBytes()? ---
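The reply further up in the thread explains the reason: pushing raw binary through a String is lossy when the bytes are not valid text, while Base64 text survives the JSON round trip intact. A small standalone sketch of the difference (this uses `java.util.Base64`, whereas the patch uses commons-codec's `Base64.decodeBase64`; the byte values are illustrative):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

public class BinaryRoundTrip {
    public static void main(String[] args) {
        // bytes such as 0xFF and 0x80 do not form a valid UTF-8 sequence
        byte[] original = {0x01, 0x00, (byte) 0xFF, (byte) 0x80};

        // lossy: undecodable bytes become U+FFFD replacement characters,
        // so re-encoding does not reproduce the original array
        byte[] viaString = new String(original, StandardCharsets.UTF_8)
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(original, viaString)); // false

        // lossless: Base64 text is JSON-safe and decodes to the exact bytes
        String encoded = Base64.getEncoder().encodeToString(original);
        byte[] decoded = Base64.getDecoder().decode(encoded);
        System.out.println(Arrays.equals(original, decoded)); // true
    }
}
```

This is exactly why little-endian INTERVAL bytes and big-endian DECIMAL bytes in the metadata cache need the Base64 treatment rather than a plain `String.getBytes()` round trip.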
Github user arina-ielchiieva commented on a diff in the pull request: https://github.com/apache/drill/pull/805#discussion_r112660688 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java --- @@ -181,6 +185,71 @@ else if (parquetTableMetadata instanceof Metadata.ParquetTableMetadata_v2 && } /** + * Checks that the metadata file was created by drill with version less that + * the version where was changed the serialization of BINARY values + * and assigns byte arrays to min/max values obtained from the deserialized string. + * + * @param parquetTableMetadata table metadata that should be corrected + */ + public static void correctBinaryInMetadataCache(Metadata.ParquetTableMetadataBase parquetTableMetadata) { +if (hasOldBinarySerialization(parquetTableMetadata)) { + Setnames = Sets.newHashSet(); + if (parquetTableMetadata instanceof Metadata.ParquetTableMetadata_v2) { +for (Metadata.ColumnTypeMetadata_v2 columnTypeMetadata : +((Metadata.ParquetTableMetadata_v2) parquetTableMetadata).columnTypeInfo.values()) { + if (columnTypeMetadata.primitiveType == PrimitiveTypeName.BINARY) { +names.add(Arrays.asList(columnTypeMetadata.name)); + } +} + } + for (Metadata.ParquetFileMetadata file : parquetTableMetadata.getFiles()) { +// Drill has only ever written a single row group per file, only need to correct the statistics +// on the first row group +Metadata.RowGroupMetadata rowGroupMetadata = file.getRowGroups().get(0); +for (Metadata.ColumnMetadata columnMetadata : rowGroupMetadata.getColumns()) { + // Setting Min/Max values for ParquetTableMetadata_v1 + if (parquetTableMetadata instanceof Metadata.ParquetTableMetadata_v1 +|| parquetTableMetadata instanceof Metadata.ParquetTableMetadata_v3) { +PrimitiveTypeName primitiveType = columnMetadata.getPrimitiveType(); +if (primitiveType == PrimitiveTypeName.BINARY && columnMetadata.hasSingleValue()) { + Object minValue = columnMetadata.getMinValue(); + if (minValue != null && minValue 
instanceof String) { +byte[] bytes = ((String) minValue).getBytes(); +columnMetadata.setMax(bytes); +columnMetadata.setMin(bytes); + } +} + } + // Setting Max values for ParquetTableMetadata_v2 + else if (parquetTableMetadata instanceof Metadata.ParquetTableMetadata_v2 +&& columnMetadata.hasSingleValue() +&& names.contains(Arrays.asList(columnMetadata.getName( { +Object maxValue = columnMetadata.getMaxValue(); +if (maxValue != null && maxValue instanceof String) { + byte[] bytes = ((String) maxValue).getBytes(); + columnMetadata.setMax(bytes); +} + } +} + } +} + } + + /** + * Checks that the metadata file was created by drill with version less that + * the version where was changed the serialization of BINARY values. + * + * @param parquetTableMetadata the source of drill version + * @return true if metadata file was created by drill with version less that + * the version where was changed the serialization of BINARY values + */ + public static boolean hasOldBinarySerialization(Metadata.ParquetTableMetadataBase parquetTableMetadata) { +String drillVersion = parquetTableMetadata.getDrillVersion(); +return drillVersion == null +|| new ComparableVersion(drillVersion).compareTo(new ComparableVersion("1.11.0-SNAPSHOT")) < 0; --- End diff -- I am not sure that using a version with SNAPSHOT is correct. ---
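The version gate being reviewed can be sketched without the Maven dependency. Note that this simplified numeric compare deliberately treats any 1.11.x string, including `1.11.0-SNAPSHOT`, as new, whereas Maven's `ComparableVersion` (which the patch uses) orders `1.11.0-SNAPSHOT` before `1.11.0`; that ordering subtlety is exactly what the review comment questions. The class and method names below are illustrative:

```java
public class DrillVersionCheck {

    // true if the metadata cache predates the 1.11.0 change to BINARY
    // min/max serialization (hedged sketch; the patch itself compares
    // against "1.11.0-SNAPSHOT" via Maven's ComparableVersion)
    static boolean hasOldBinarySerialization(String drillVersion) {
        if (drillVersion == null) {
            return true; // no version recorded: written before the fix
        }
        // drop a qualifier such as "-SNAPSHOT" for this numeric compare
        String[] parts = drillVersion.split("-")[0].split("\\.");
        int major = Integer.parseInt(parts[0]);
        int minor = parts.length > 1 ? Integer.parseInt(parts[1]) : 0;
        return major < 1 || (major == 1 && minor < 11);
    }

    public static void main(String[] args) {
        System.out.println(hasOldBinarySerialization("1.10.0"));          // true
        System.out.println(hasOldBinarySerialization("1.11.0"));          // false
        System.out.println(hasOldBinarySerialization("1.11.0-SNAPSHOT")); // false
        System.out.println(hasOldBinarySerialization(null));              // true
    }
}
```

Whether a snapshot build of 1.11.0 wrote old or new serialization is precisely the ambiguity a plain release-version threshold would avoid.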
Github user vvysotskyi commented on a diff in the pull request: https://github.com/apache/drill/pull/805#discussion_r112651025 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java --- @@ -508,7 +511,10 @@ public void populatePruningVector(ValueVector v, int index, SchemaPath column, S NullableVarBinaryVector varBinaryVector = (NullableVarBinaryVector) v; Object s = partitionValueMap.get(f).get(column); byte[] bytes; -if (s instanceof Binary) { +if (s == null) { + varBinaryVector.getMutator().setNull(index); --- End diff -- We can apply this setNull() only for types that use BinaryStatistics [1] for the reason described in my previous comment. So we apply setNull() to VARBINARY, DECIMAL18, VARCHAR and INTERVAL types. [1] https://github.com/apache/parquet-mr/blob/e54ca615f213f5db6d34d9163c97eec98920d7a7/parquet-column/src/main/java/org/apache/parquet/column/statistics/BinaryStatistics.java#L25 ---
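The null branch under discussion can be mimicked with a plain list standing in for the vector. Everything in this sketch (class and method names, the `byte[]` fallback) is hypothetical and only illustrates the point of the diff: a null partition value is recorded as a NULL entry, rather than skipped or converted to garbage bytes.

```java
import java.util.ArrayList;
import java.util.List;

public class PruningVectorSketch {

    // Minimal stand-in for NullableVarBinaryVector: a null list entry
    // models what varBinaryVector.getMutator().setNull(index) does.
    static List<byte[]> populate(Object[] partitionValues) {
        List<byte[]> vector = new ArrayList<>();
        for (Object s : partitionValues) {
            if (s == null) {
                vector.add(null);               // missing value becomes SQL NULL
            } else if (s instanceof byte[]) {
                vector.add((byte[]) s);         // already binary, stored as-is
            } else {
                vector.add(s.toString().getBytes()); // hypothetical fallback
            }
        }
        return vector;
    }
}
```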
[GitHub] drill pull request #805: Drill-4139: Exception while trying to prune partiti...
Github user vvysotskyi commented on a diff in the pull request: https://github.com/apache/drill/pull/805#discussion_r112649665 --- Diff: exec/java-exec/src/main/codegen/templates/CastVarBinaryInterval.java --- @@ -0,0 +1,72 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +<@pp.dropOutputFile /> + +<#list cast.types as type> +<#if type.major == "VarBinaryInterval"> <#-- Template to convert from VarBinary to Interval, IntervalYear, IntervalDay --> + +<@pp.changeOutputFile name="/org/apache/drill/exec/expr/fn/impl/gcast/Cast${type.from}To${type.to}.java" /> + +<#include "/@includes/license.ftl" /> + +package org.apache.drill.exec.expr.fn.impl.gcast; + +import io.netty.buffer.ByteBuf; + +import org.apache.drill.exec.expr.DrillSimpleFunc; +import org.apache.drill.exec.expr.annotations.FunctionTemplate; +import org.apache.drill.exec.expr.annotations.FunctionTemplate.NullHandling; +import org.apache.drill.exec.expr.annotations.Output; +import org.apache.drill.exec.expr.annotations.Param; +import org.apache.drill.exec.expr.holders.IntervalDayHolder; +import org.apache.drill.exec.expr.holders.IntervalHolder; +import org.apache.drill.exec.expr.holders.IntervalYearHolder; +import org.apache.drill.exec.expr.holders.VarBinaryHolder; + +/* + * This class is generated using freemarker and the ${.template_name} template. + */ +@SuppressWarnings("unused") +@FunctionTemplate(name = "cast${type.to?upper_case}", + scope = FunctionTemplate.FunctionScope.SIMPLE, + nulls = NullHandling.NULL_IF_NULL) +public class Cast${type.from}To${type.to} implements DrillSimpleFunc { + + @Param ${type.from}Holder in; + @Output ${type.to}Holder out; + + public void setup() { + } + + public void eval() { +final byte[] buf = new byte[in.end - in.start]; +in.buffer.getBytes(in.start, buf, 0, in.end - in.start); + <#if type.to == "Interval"> +out.months = org.apache.drill.exec.store.parquet.ParquetReaderUtility.getIntFromLEBytes(buf, 0); --- End diff -- I concluded that the Interval type is encoded in this way when I found that IntervalVector stores its value in DrillBuf data as 3 int values. But I have seen that a lot of other types are encoded as you described, so you are right, thanks for the clarification.
---
[GitHub] drill pull request #805: Drill-4139: Exception while trying to prune partiti...
Github user vvysotskyi commented on a diff in the pull request: https://github.com/apache/drill/pull/805#discussion_r112650691 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java --- @@ -1668,9 +1666,25 @@ public ColumnMetadata_v3(String[] name, PrimitiveTypeName primitiveType, Object return nulls; } +/** + * Checks that the column chunk has single value. + * Returns true if minValue and maxValue are the same, but not null, + * and all column chunk values are not null. + * Returns true if minValue and maxValue are null and null values count in + * the column chunk is greater than 0. + * + * @return true if column has single value + */ @Override public boolean hasSingleValue() { - return (minValue !=null && maxValue != null && minValue.equals(maxValue)); + if (nulls != null) { --- End diff -- Yes, we should apply it to ColumnMetadata_v1, thanks. In ColumnMetadata_v2, maxValue is set only in the case when the min value is the same as the max value, so we cannot be sure that the column has a single value when maxValue == null and the nulls count is greater than zero. Also, I have changed the result for the second case described in your previous comment. Statistics [1] for most parquet types use java primitive types to store min and max values, so min/max cannot be null even if the table has null values. So I removed the check nulls == 0. [1] https://github.com/apache/parquet-mr/tree/e54ca615f213f5db6d34d9163c97eec98920d7a7/parquet-column/src/main/java/org/apache/parquet/column/statistics ---
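Condensed as a standalone predicate, the single-value check discussed in this thread looks roughly like the sketch below. The signature is simplified and hypothetical (the column-chunk fields are passed in explicitly instead of being read from the metadata object); the branch logic mirrors the rowCount-aware version that the discussion converged on, where `min`/`max` may be byte arrays, hence `Objects.deepEquals`.

```java
import java.util.Objects;

public class SingleValueCheck {

    // Sketch of a hasSingleValue(rowCount)-style predicate.
    // min and max may be boxed primitives or byte[], hence Objects.deepEquals.
    static boolean hasSingleValue(Object min, Object max, Long nulls, long rowCount) {
        if (nulls != null) {
            if (min != null) {
                // Primitive-typed statistics never carry null min/max, so accept
                // either "no nulls" or "every row is null" alongside min == max.
                return Objects.deepEquals(min, max) && (nulls == 0 || nulls == rowCount);
            }
            // min/max absent (e.g. binary stats): single-valued only if all rows are null.
            return nulls == rowCount;
        }
        return false; // no null count recorded, so we cannot decide
    }

    public static void main(String[] args) {
        System.out.println(hasSingleValue(5, 5, 0L, 100));         // single non-null value
        System.out.println(hasSingleValue(null, null, 100L, 100)); // all rows null
        System.out.println(hasSingleValue(5, 5, 40L, 100));        // value mixed with nulls
    }
}
```

The mixed case (a value plus some nulls) is deliberately rejected: partition pruning may only substitute the constant when every row in the chunk carries the same value.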
[GitHub] drill pull request #805: Drill-4139: Exception while trying to prune partiti...
Github user jinfengni commented on a diff in the pull request: https://github.com/apache/drill/pull/805#discussion_r111501238 --- Diff: exec/java-exec/src/main/codegen/templates/CastVarBinaryInterval.java --- @@ -0,0 +1,72 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +<@pp.dropOutputFile /> + +<#list cast.types as type> +<#if type.major == "VarBinaryInterval"> <#-- Template to convert from VarBinary to Interval, IntervalYear, IntervalDay --> + +<@pp.changeOutputFile name="/org/apache/drill/exec/expr/fn/impl/gcast/Cast${type.from}To${type.to}.java" /> + +<#include "/@includes/license.ftl" /> + +package org.apache.drill.exec.expr.fn.impl.gcast; + +import io.netty.buffer.ByteBuf; + +import org.apache.drill.exec.expr.DrillSimpleFunc; +import org.apache.drill.exec.expr.annotations.FunctionTemplate; +import org.apache.drill.exec.expr.annotations.FunctionTemplate.NullHandling; +import org.apache.drill.exec.expr.annotations.Output; +import org.apache.drill.exec.expr.annotations.Param; +import org.apache.drill.exec.expr.holders.IntervalDayHolder; +import org.apache.drill.exec.expr.holders.IntervalHolder; +import org.apache.drill.exec.expr.holders.IntervalYearHolder; +import org.apache.drill.exec.expr.holders.VarBinaryHolder; + +/* + * This class is generated using freemarker and the ${.template_name} template. + */ +@SuppressWarnings("unused") +@FunctionTemplate(name = "cast${type.to?upper_case}", + scope = FunctionTemplate.FunctionScope.SIMPLE, + nulls = NullHandling.NULL_IF_NULL) +public class Cast${type.from}To${type.to} implements DrillSimpleFunc { + + @Param ${type.from}Holder in; + @Output ${type.to}Holder out; + + public void setup() { + } + + public void eval() { +final byte[] buf = new byte[in.end - in.start]; +in.buffer.getBytes(in.start, buf, 0, in.end - in.start); + <#if type.to == "Interval"> +out.months = org.apache.drill.exec.store.parquet.ParquetReaderUtility.getIntFromLEBytes(buf, 0); --- End diff -- I'm a bit confused. Are you assuming VarBinary is encoded as 3 int values: # of months, # of days and # of milliseconds? If varbinary is encoded in this way, which is what parquet uses to encode the logical type "interval", why do we have to cast?
My understanding is that varbinary looks similar to varchar, except that the former contains bytes while the latter contains characters. ---
[GitHub] drill pull request #805: Drill-4139: Exception while trying to prune partiti...
Github user jinfengni commented on a diff in the pull request: https://github.com/apache/drill/pull/805#discussion_r111505132 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java --- @@ -508,7 +511,10 @@ public void populatePruningVector(ValueVector v, int index, SchemaPath column, S NullableVarBinaryVector varBinaryVector = (NullableVarBinaryVector) v; Object s = partitionValueMap.get(f).get(column); byte[] bytes; -if (s instanceof Binary) { +if (s == null) { + varBinaryVector.getMutator().setNull(index); --- End diff -- Should we apply this setNull() to all other types? ---
[GitHub] drill pull request #805: Drill-4139: Exception while trying to prune partiti...
Github user jinfengni commented on a diff in the pull request: https://github.com/apache/drill/pull/805#discussion_r111504302 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java --- @@ -1668,9 +1666,25 @@ public ColumnMetadata_v3(String[] name, PrimitiveTypeName primitiveType, Object return nulls; } +/** + * Checks that the column chunk has single value. + * Returns true if minValue and maxValue are the same, but not null, + * and all column chunk values are not null. + * Returns true if minValue and maxValue are null and null values count in + * the column chunk is greater than 0. + * + * @return true if column has single value + */ @Override public boolean hasSingleValue() { - return (minValue !=null && maxValue != null && minValue.equals(maxValue)); + if (nulls != null) { --- End diff -- Should this apply to v1 and v2 of ColumnMetadata? ---
[GitHub] drill pull request #805: Drill-4139: Exception while trying to prune partiti...
Github user vvysotskyi commented on a diff in the pull request: https://github.com/apache/drill/pull/805#discussion_r109618786 --- Diff: exec/java-exec/src/main/codegen/templates/CastVarBinaryInterval.java --- @@ -0,0 +1,72 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +<@pp.dropOutputFile /> + +<#list cast.types as type> +<#if type.major == "VarBinaryInterval"> <#-- Template to convert from VarBinary to Interval, IntervalYear, IntervalDay --> + +<@pp.changeOutputFile name="/org/apache/drill/exec/expr/fn/impl/gcast/Cast${type.from}To${type.to}.java" /> + +<#include "/@includes/license.ftl" /> + +package org.apache.drill.exec.expr.fn.impl.gcast; + +import io.netty.buffer.ByteBuf; + +import org.apache.drill.exec.expr.DrillSimpleFunc; +import org.apache.drill.exec.expr.annotations.FunctionTemplate; +import org.apache.drill.exec.expr.annotations.FunctionTemplate.NullHandling; +import org.apache.drill.exec.expr.annotations.Output; +import org.apache.drill.exec.expr.annotations.Param; +import org.apache.drill.exec.expr.holders.IntervalDayHolder; +import org.apache.drill.exec.expr.holders.IntervalHolder; +import org.apache.drill.exec.expr.holders.IntervalYearHolder; +import org.apache.drill.exec.expr.holders.VarBinaryHolder; + +/* + * This class is generated using freemarker and the ${.template_name} template. + */ +@SuppressWarnings("unused") +@FunctionTemplate(name = "cast${type.to?upper_case}", + scope = FunctionTemplate.FunctionScope.SIMPLE, + nulls = NullHandling.NULL_IF_NULL) +public class Cast${type.from}To${type.to} implements DrillSimpleFunc { + + @Param ${type.from}Holder in; + @Output ${type.to}Holder out; + + public void setup() { + } + + public void eval() { +final byte[] buf = new byte[in.end - in.start]; +in.buffer.getBytes(in.start, buf, 0, in.end - in.start); + <#if type.to == "Interval"> +out.months = org.apache.drill.exec.store.parquet.ParquetReaderUtility.getIntFromLEBytes(buf, 0); --- End diff -- 1. We are using this function to receive int from bytes in little endian encoding. VarBinary data may come from other data sources. 2. No, we can't. Interval types are stored as three int values in binary form in varbinary vector. 3. Varchar representation of interval types contains special literals. 
Inside the UDF that casts varchar to interval, a JodaTime function parses the varchar string, so it does not make sense to apply the change to the existing function implementation. ---
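To make the three-int layout concrete: the sketch below builds a 12-byte INTERVAL value and decodes it into months/days/milliseconds. The `getIntFromLEBytes` helper here is a hand-rolled re-implementation standing in for `ParquetReaderUtility.getIntFromLEBytes`, not Drill's code; it reads a 4-byte little-endian int at a given offset.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class IntervalDecode {

    // Re-implementation of a getIntFromLEBytes-style helper:
    // assemble a 4-byte little-endian int starting at offset.
    static int getIntFromLEBytes(byte[] buf, int offset) {
        return (buf[offset] & 0xFF)
            | (buf[offset + 1] & 0xFF) << 8
            | (buf[offset + 2] & 0xFF) << 16
            | (buf[offset + 3] & 0xFF) << 24;
    }

    public static void main(String[] args) {
        // Build a 12-byte interval value: 14 months, 3 days, 500 milliseconds.
        byte[] buf = ByteBuffer.allocate(12).order(ByteOrder.LITTLE_ENDIAN)
            .putInt(14).putInt(3).putInt(500).array();

        // Decode the three ints at offsets 0, 4 and 8, as the cast template does.
        int months = getIntFromLEBytes(buf, 0);
        int days = getIntFromLEBytes(buf, 4);
        int millis = getIntFromLEBytes(buf, 8);
        System.out.println(months + " months, " + days + " days, " + millis + " ms");
    }
}
```

This is why a varchar-to-interval cast cannot be reused for this data: the varchar path parses textual interval literals, whereas this path decodes fixed-width binary ints.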
[GitHub] drill pull request #805: Drill-4139: Exception while trying to prune partiti...
Github user vvysotskyi commented on a diff in the pull request: https://github.com/apache/drill/pull/805#discussion_r109908576 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java --- @@ -1668,9 +1666,16 @@ public ColumnMetadata_v3(String[] name, PrimitiveTypeName primitiveType, Object return nulls; } +/** + * Checks that the column chunk has single value. + * Returns true if minValue and maxValue are null and + * column statistics was available. + * + * @return true if column has single value + */ @Override public boolean hasSingleValue() { - return (minValue !=null && maxValue != null && minValue.equals(maxValue)); + return minValue != null ? minValue.equals(maxValue) : (maxValue == null && isMetadataSet()); --- End diff -- 1. It is clear that we should just compare minValue and maxValue. 2. In this case we should check that nulls == 0. Fixed that; now the function returns false for this case. 3. We should additionally check that nulls > 0. Fixed that. ---
[GitHub] drill pull request #805: Drill-4139: Exception while trying to prune partiti...
Github user jinfengni commented on a diff in the pull request: https://github.com/apache/drill/pull/805#discussion_r109532543 --- Diff: exec/java-exec/src/main/codegen/templates/CastVarBinaryInterval.java --- @@ -0,0 +1,72 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +<@pp.dropOutputFile /> + +<#list cast.types as type> +<#if type.major == "VarBinaryInterval"> <#-- Template to convert from VarBinary to Interval, IntervalYear, IntervalDay --> + +<@pp.changeOutputFile name="/org/apache/drill/exec/expr/fn/impl/gcast/Cast${type.from}To${type.to}.java" /> + +<#include "/@includes/license.ftl" /> + +package org.apache.drill.exec.expr.fn.impl.gcast; + +import io.netty.buffer.ByteBuf; + +import org.apache.drill.exec.expr.DrillSimpleFunc; +import org.apache.drill.exec.expr.annotations.FunctionTemplate; +import org.apache.drill.exec.expr.annotations.FunctionTemplate.NullHandling; +import org.apache.drill.exec.expr.annotations.Output; +import org.apache.drill.exec.expr.annotations.Param; +import org.apache.drill.exec.expr.holders.IntervalDayHolder; +import org.apache.drill.exec.expr.holders.IntervalHolder; +import org.apache.drill.exec.expr.holders.IntervalYearHolder; +import org.apache.drill.exec.expr.holders.VarBinaryHolder; + +/* + * This class is generated using freemarker and the ${.template_name} template. + */ +@SuppressWarnings("unused") +@FunctionTemplate(name = "cast${type.to?upper_case}", + scope = FunctionTemplate.FunctionScope.SIMPLE, + nulls = NullHandling.NULL_IF_NULL) +public class Cast${type.from}To${type.to} implements DrillSimpleFunc { + + @Param ${type.from}Holder in; + @Output ${type.to}Holder out; + + public void setup() { + } + + public void eval() { +final byte[] buf = new byte[in.end - in.start]; +in.buffer.getBytes(in.start, buf, 0, in.end - in.start); + <#if type.to == "Interval"> +out.months = org.apache.drill.exec.store.parquet.ParquetReaderUtility.getIntFromLEBytes(buf, 0); --- End diff -- Why are we using a function from Parquet library to do the conversion from varbinary to interval? Do we assume that the varbinary data always come from a parquet data source? The current code base has implementation for cast from varchar to interval (CastVarCharInterval.java). 
Can we leverage the existing implementation here? If you feel the new approach (using the parquet library's function) is better, does it make sense to apply the change to the existing implementation of varchar -> interval? ---
[GitHub] drill pull request #805: Drill-4139: Exception while trying to prune partiti...
Github user jinfengni commented on a diff in the pull request: https://github.com/apache/drill/pull/805#discussion_r109536003 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java --- @@ -1668,9 +1666,16 @@ public ColumnMetadata_v3(String[] name, PrimitiveTypeName primitiveType, Object return nulls; } +/** + * Checks that the column chunk has single value. + * Returns true if minValue and maxValue are null and + * column statistics was available. + * + * @return true if column has single value + */ @Override public boolean hasSingleValue() { - return (minValue !=null && maxValue != null && minValue.equals(maxValue)); + return minValue != null ? minValue.equals(maxValue) : (maxValue == null && isMetadataSet()); --- End diff -- We probably need a better definition of "column has a single value" before going ahead and modifying this function. Cases we need to consider: 1) minValue / maxValue != null and there is no "null" in the column. 2) minValue / maxValue != null and there is "null" in the column. 3) minValue / maxValue == null, and there is "null" in the column ---
[GitHub] drill pull request #805: Drill 4139: Exception while trying to prune partiti...
GitHub user vvysotskyi opened a pull request: https://github.com/apache/drill/pull/805 Drill 4139: Exception while trying to prune partition. java.lang.UnsupportedOperationException: Unsupported type: BIT & Interval You can merge this pull request into a Git repository by running: $ git pull https://github.com/vvysotskyi/drill DRILL-4139 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/805.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #805 commit 14d5d50b072c891db248eba9b093ea63a32fa678 Author: Przemyslaw Maciolek Date: 2016-05-31T18:46:53Z DRILL-4139: Add missing BIT support for Parquet partition pruning commit bbf37ec96b5bd885bb69e6d47f838355b99fad23 Author: Volodymyr Vysotskyi Date: 2017-03-24T16:02:07Z DRILL-4139: Add missing Interval, VarBinary and Varchar with nulls partition pruning support, unit tests ---