[
https://issues.apache.org/jira/browse/PARQUET-2006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517086#comment-17517086
]
ASF GitHub Bot commented on PARQUET-2006:
-----------------------------------------
rdblue commented on code in PR #950:
URL: https://github.com/apache/parquet-mr/pull/950#discussion_r842097255
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java:
##########
@@ -878,11 +880,97 @@ public String getFile() {
return blocks;
}
- public void setRequestedSchema(MessageType projection) {
+ private boolean uniqueId(GroupType schema, HashSet<Type.ID> ids) {
+ boolean unique = true;
+ List<Type> fields = schema.getFields();
+ for (Type field : fields) {
+ if (field instanceof PrimitiveType) {
+ Type.ID id = field.getId();
+ if (id != null) {
+ if (ids.contains(id)) {
+ return false;
+ }
+ ids.add(id);
+ }
+ }
+
+ if (field instanceof GroupType) {
+ Type.ID id = field.getId();
+ if (id != null) {
+ if (ids.contains(id)) {
+ return false;
+ }
+ ids.add(id);
+ }
+ if (unique) unique = uniqueId(field.asGroupType(), ids);
+ }
+ }
+ return unique;
+ }
+
+ public MessageType setRequestedSchema(MessageType projection, boolean useColumnId) {
paths.clear();
- for (ColumnDescriptor col : projection.getColumns()) {
+ MessageType schema = null;
+ if (useColumnId) {
+ HashSet<Type.ID> ids = new HashSet<>();
+ boolean fileSchemaIdUnique = uniqueId(fileMetaData.getSchema(), ids);
+ if (!fileSchemaIdUnique) {
+ throw new RuntimeException("can't use column id resolution because there are duplicate column ids.");
+ }
+ ids = new HashSet<>();
+ boolean projectionSchemaIdUnique = uniqueId(projection, ids);
+ if (!projectionSchemaIdUnique) {
+ throw new RuntimeException("can't use column id resolution because there are duplicate column ids.");
+ }
+ schema = resetColumnNameBasedOnId(projection);
+ } else {
+ schema = projection;
+ }
+ for (ColumnDescriptor col : schema.getColumns()) {
paths.put(ColumnPath.get(col.getPath()), col);
}
+ return schema;
+ }
+
+ private MessageType resetColumnNameBasedOnId(MessageType schema) {
Review Comment:
This doesn't seem like the right class to contain these utility methods. I'd
recommend a utility class to handle this.
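For illustration only, here is a minimal sketch of what pulling the uniqueness check out into a utility class might look like. The class name `ColumnIdUtil` and the `Node` type are hypothetical: `Node` is a simplified stand-in for Parquet's `Type`/`GroupType` so that the example is self-contained, and it is not part of parquet-mr or of this PR. The sketch follows the same rule as `uniqueId` in the diff above (fields without an id are ignored, and the root's own id is not checked), but relies on `Set.add` returning `false` for an element that is already present instead of a separate `contains` call.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical utility class; the name and the Node stand-in are assumptions
// made for this sketch and do not exist in parquet-mr.
final class ColumnIdUtil {
  private ColumnIdUtil() {}

  /** Simplified stand-in for a schema node: a nullable field id plus child fields. */
  static final class Node {
    final Integer id;          // null when the field carries no id
    final List<Node> children; // empty for primitive fields

    Node(Integer id, Node... children) {
      this.id = id;
      this.children = Arrays.asList(children);
    }
  }

  /** Returns true when no non-null id occurs twice among the fields nested under root. */
  static boolean hasUniqueIds(Node root) {
    return hasUniqueIds(root, new HashSet<>());
  }

  private static boolean hasUniqueIds(Node group, Set<Integer> seen) {
    for (Node field : group.children) {
      // Set.add returns false if the id was already recorded.
      if (field.id != null && !seen.add(field.id)) {
        return false;
      }
      // Recurse into nested fields; primitives have no children.
      if (!hasUniqueIds(field, seen)) {
        return false;
      }
    }
    return true;
  }
}
```

With a helper along these lines, the reader class would only need to call the check on the file schema and on the projection, and the primitive and group branches no longer have to be duplicated.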
> Column resolution by ID
> -----------------------
>
> Key: PARQUET-2006
> URL: https://issues.apache.org/jira/browse/PARQUET-2006
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-mr
> Reporter: Xinli Shang
> Assignee: Xinli Shang
> Priority: Major
>
> Parquet resolves columns by name. In many use cases, e.g. schema resolution,
> this is a problem. Iceberg uses column IDs and stores ID/name mappings.
> This Jira is to add column ID resolution support.