[
https://issues.apache.org/jira/browse/DRILL-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16080025#comment-16080025
]
Volodymyr Vysotskyi commented on DRILL-4139:
--------------------------------------------
Drill serializes values of binary fields to the parquet metadata cache file
using the code {{new String(((Binary) bytes).getBytes())}}.
However, when the bytes are not valid text in the default charset, for example
when they hold a value in little-endian byte order, then {{new String(((Binary)
bytes).getBytes()).getBytes()}}
returns a byte array that differs from {{bytes}}.
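To make the round-trip problem concrete, here is a minimal sketch in plain Java. It is not Drill code: the class and method names are hypothetical, and UTF-8 is passed explicitly (an assumption) so that the result does not depend on the platform default charset that {{new String(byte[])}} would pick up.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class StringRoundTripDemo {

    // Decodes the bytes as text and encodes them again, like
    // new String(b).getBytes(), but with an explicit charset (assumption).
    static byte[] roundTripThroughString(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // A 12-byte value whose first byte is 0xB1, like the minValue of
        // col_intrvl_yr in this comment. 0xB1 alone is not valid UTF-8.
        byte[] interval = new byte[12];
        interval[0] = (byte) 0xB1;

        byte[] result = roundTripThroughString(interval);

        // The invalid byte is replaced during decoding, so the arrays differ.
        System.out.println("round trip preserved bytes: " + Arrays.equals(interval, result));
    }
}
```

Bytes that happen to be valid text (plain ASCII, for instance) survive the round trip, which is presumably why the problem only shows up for some column types.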
According to [Parquet Logical Type
Definitions|https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md],
big-endian byte order should be used to store DECIMAL values in a
fixed_len_byte_array or binary field. The INTERVAL type uses little-endian byte
order to store its value in a fixed_len_byte_array field.
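The difference in byte order can be sketched as follows. This is an illustration only, not Drill's reader code: the class and method names are hypothetical, the DECIMAL scale is assumed to come from the column schema, and the INTERVAL layout (three little-endian unsigned 32-bit integers for months, days and milliseconds) follows the Parquet logical type definitions.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class ByteOrderDemo {

    // DECIMAL: the unscaled value is big-endian two's complement, which is
    // exactly the layout that the BigInteger(byte[]) constructor expects.
    static BigDecimal decodeDecimal(byte[] bigEndianUnscaled, int scale) {
        return new BigDecimal(new BigInteger(bigEndianUnscaled), scale);
    }

    // INTERVAL: 12 bytes holding months, days and milliseconds, each stored
    // as a little-endian unsigned 32-bit integer.
    static long[] decodeInterval(byte[] twelveBytes) {
        ByteBuffer buf = ByteBuffer.wrap(twelveBytes).order(ByteOrder.LITTLE_ENDIAN);
        long months = buf.getInt() & 0xFFFFFFFFL;
        long days = buf.getInt() & 0xFFFFFFFFL;
        long millis = buf.getInt() & 0xFFFFFFFFL;
        return new long[] {months, days, millis};
    }

    public static void main(String[] args) {
        // 0x3039 = 12345 unscaled, scale 2
        System.out.println(decodeDecimal(new byte[] {0x30, 0x39}, 2));

        // 0x3A 0x01 0x00 0x00 read little-endian is 0x013A months
        byte[] interval = new byte[12];
        interval[0] = 0x3A;
        interval[1] = 0x01;
        System.out.println(decodeInterval(interval)[0]);
    }
}
```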
Drill correctly stores only the values of binary fields in the parquet metadata
cache file, while values of fixed_len_byte_array fields are stored as Binary
objects:
{noformat}
{
"name" : [ "col_intrvl_yr" ],
"minValue" : {
"bytesUnsafe" : "sQAAAAAAAAAAAAAA",
"bytes" : "sQAAAAAAAAAAAAAA",
"backingBytesReused" : true
},
"maxValue" : {
"bytesUnsafe" : "OgEAAAAAAAAAAAAA",
"bytes" : "OgEAAAAAAAAAAAAA",
"backingBytesReused" : true
},
"nulls" : 0
}
{noformat}
Since Drill may store some types in either binary or fixed_len_byte_array
fields, both kinds of field must be serialized and deserialized in the same way.
For example, according to [Parquet Logical Type
Definitions|https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md],
a DECIMAL field may be stored as a binary or a fixed_len_byte_array field.
The proposal is to serialize byte arrays directly by calling {{((Binary)
value.minValue).getBytes()}} and to deserialize them by calling
{{Base64.decodeBase64(((String) source).getBytes())}}.
This removes any dependence on the byte order.
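A minimal sketch of such a byte-exact round trip is shown here. It uses the JDK's {{java.util.Base64}} as a stand-in for the {{Base64.decodeBase64}} call named in the proposal, and the class and method names are hypothetical; the actual patch may wire this into the JSON serializers differently.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

final class Base64CacheValueDemo {

    // Serialize: raw bytes to Base64 text that is safe to embed in a JSON file.
    static String serialize(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    // Deserialize: Base64 text back to the original raw bytes.
    static byte[] deserialize(String text) {
        return Base64.getDecoder().decode(text.getBytes(StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) {
        // The minValue string from the col_intrvl_yr sample in this comment.
        String cached = "sQAAAAAAAAAAAAAA";

        byte[] raw = deserialize(cached);
        String again = serialize(raw);

        // No charset decoding is involved, so the bytes come back unchanged
        // whatever byte order the value was written in.
        System.out.println(Arrays.equals(raw, deserialize(again)));
    }
}
```

Because the bytes are never interpreted as text, the same code path works for big-endian DECIMAL values and little-endian INTERVAL values alike.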
Another problem is backward compatibility. When a metadata file created by a
version of Drill with these changes is read by an older Drill version, it may
lead to errors or wrong results. Updating the metadata version does not help,
since old Drill versions simply throw an exception when they try to read new
metadata cache files:
{noformat}
Error: SYSTEM ERROR: JsonMappingException: Could not resolve type id 'v4' into
a subtype of [simple type, class
org.apache.drill.exec.store.parquet.Metadata$ParquetTableMetadataBase]: known
type ids = [Metadata$ParquetTableMetadataBase, v1, v2, v3]
at [Source:
org.apache.hadoop.fs.ChecksumFileSystem$FSDataBoundedInputStream@7b609ce0;
line: 2, column: 24]
{noformat}
Metadata cache files without and with the changes for DRILL-4139 are attached
to the Jira.
The Drill version with the changes for this Jira can read parquet table
metadata cache files of version v3 and older.
Drill 1.10.0 throws an exception when it tries to read a parquet table metadata
cache file of version v4 or greater.
> Fix parquet partition pruning for BIT, INTERVAL and DECIMAL types
> -----------------------------------------------------------------
>
> Key: DRILL-4139
> URL: https://issues.apache.org/jira/browse/DRILL-4139
> Project: Apache Drill
> Issue Type: Bug
> Components: Storage - Parquet
> Affects Versions: 1.3.0
> Environment: 4 node cluster on CentOS
> Reporter: Khurram Faraaz
> Assignee: Volodymyr Vysotskyi
>
> Exception while trying to prune partition.
> java.lang.UnsupportedOperationException: Unsupported type: BIT
> is seen in drillbit.log after Functional run on 4 node cluster.
> Drill 1.3.0 sys.version => d61bb83a8
> {code}
> 2015-11-27 03:12:19,809 [29a835ec-3c02-0fb6-d3c1-bae276ef7385:foreman] INFO
> o.a.d.e.p.l.partition.PruneScanRule - Beginning partition pruning, pruning
> class: org.apache.drill.exec.planner.logical.partition.ParquetPruneScanRule$2
> 2015-11-27 03:12:19,809 [29a835ec-3c02-0fb6-d3c1-bae276ef7385:foreman] INFO
> o.a.d.e.p.l.partition.PruneScanRule - Total elapsed time to build and analyze
> filter tree: 0 ms
> 2015-11-27 03:12:19,810 [29a835ec-3c02-0fb6-d3c1-bae276ef7385:foreman] WARN
> o.a.d.e.p.l.partition.PruneScanRule - Exception while trying to prune
> partition.
> java.lang.UnsupportedOperationException: Unsupported type: BIT
> at
> org.apache.drill.exec.store.parquet.ParquetGroupScan.populatePruningVector(ParquetGroupScan.java:479)
> ~[drill-java-exec-1.3.0.jar:1.3.0]
> at
> org.apache.drill.exec.planner.ParquetPartitionDescriptor.populatePartitionVectors(ParquetPartitionDescriptor.java:96)
> ~[drill-java-exec-1.3.0.jar:1.3.0]
> at
> org.apache.drill.exec.planner.logical.partition.PruneScanRule.doOnMatch(PruneScanRule.java:235)
> ~[drill-java-exec-1.3.0.jar:1.3.0]
> at
> org.apache.drill.exec.planner.logical.partition.ParquetPruneScanRule$2.onMatch(ParquetPruneScanRule.java:87)
> [drill-java-exec-1.3.0.jar:1.3.0]
> at
> org.apache.calcite.plan.volcano.VolcanoRuleCall.onMatch(VolcanoRuleCall.java:228)
> [calcite-core-1.4.0-drill-r8.jar:1.4.0-drill-r8]
> at
> org.apache.calcite.plan.volcano.VolcanoPlanner.findBestExp(VolcanoPlanner.java:808)
> [calcite-core-1.4.0-drill-r8.jar:1.4.0-drill-r8]
> at
> org.apache.calcite.tools.Programs$RuleSetProgram.run(Programs.java:303)
> [calcite-core-1.4.0-drill-r8.jar:1.4.0-drill-r8]
> at
> org.apache.calcite.prepare.PlannerImpl.transform(PlannerImpl.java:303)
> [calcite-core-1.4.0-drill-r8.jar:1.4.0-drill-r8]
> at
> org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.logicalPlanningVolcanoAndLopt(DefaultSqlHandler.java:545)
> [drill-java-exec-1.3.0.jar:1.3.0]
> at
> org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.convertToDrel(DefaultSqlHandler.java:213)
> [drill-java-exec-1.3.0.jar:1.3.0]
> at
> org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.convertToDrel(DefaultSqlHandler.java:248)
> [drill-java-exec-1.3.0.jar:1.3.0]
> at
> org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.getPlan(DefaultSqlHandler.java:164)
> [drill-java-exec-1.3.0.jar:1.3.0]
> at
> org.apache.drill.exec.planner.sql.DrillSqlWorker.getPlan(DrillSqlWorker.java:184)
> [drill-java-exec-1.3.0.jar:1.3.0]
> at
> org.apache.drill.exec.work.foreman.Foreman.runSQL(Foreman.java:905)
> [drill-java-exec-1.3.0.jar:1.3.0]
> at org.apache.drill.exec.work.foreman.Foreman.run(Foreman.java:244)
> [drill-java-exec-1.3.0.jar:1.3.0]
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> [na:1.7.0_45]
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> [na:1.7.0_45]
> at java.lang.Thread.run(Thread.java:744) [na:1.7.0_45]
> {code}