[
https://issues.apache.org/jira/browse/HIVE-26657?focusedWorklogId=819259&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-819259
]
ASF GitHub Bot logged work on HIVE-26657:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 21/Oct/22 20:06
Start Date: 21/Oct/22 20:06
Worklog Time Spent: 10m
Work Description: szlta commented on code in PR #3695:
URL: https://github.com/apache/hive/pull/3695#discussion_r1002148004
##########
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveTableUtil.java:
##########
@@ -125,7 +125,7 @@ private static List<DataFile> getDataFiles(RemoteIterator<LocatedFileStatus> fil
     while (fileStatusIterator.hasNext()) {
       LocatedFileStatus fileStatus = fileStatusIterator.next();
       String fileName = fileStatus.getPath().getName();
-      if (fileName.startsWith(".") || fileName.startsWith("_")) {
+      if (fileName.startsWith(".") || fileName.startsWith("_") || fileName.endsWith("metadata.json")) {
Review Comment:
So metadata files are handled, but will the snapshot files (which are
Avro-formatted) cause similar issues?
Issue Time Tracking
-------------------
Worklog Id: (was: 819259)
Time Spent: 40m (was: 0.5h)
> [Iceberg] Filter out the metadata.json file when migrating
> -----------------------------------------------------------
>
> Key: HIVE-26657
> URL: https://issues.apache.org/jira/browse/HIVE-26657
> Project: Hive
> Issue Type: Bug
> Reporter: László Pintér
> Assignee: László Pintér
> Priority: Major
> Labels: pull-request-available
> Time Spent: 40m
> Remaining Estimate: 0h
>
> When migrating a Hive table to Iceberg, a RuntimeException is raised in
> certain cases:
> {code:java}
> ERROR : Failed
> java.lang.RuntimeException: s3a://dev-nfqe-base/cc-cdw-nfqe-q7wj9a/archive/env-8pt556/parquet/bakeoff/large/pli/metadata/00000-94fffe5c-c307-4341-9ea3-f5fa4863d301.metadata.json is not a Parquet file. Expected magic number at tail, but found [32, 93, 10, 125]
> {code}
> The Hive-to-Iceberg table migration has the following logic:
> 1. To walk through all the data files, we request a file iterator from the
> filesystem. This iterator provides the references needed to scan the data
> files.
> 2. The new Iceberg table is created: a new entry is added to the Hive
> catalog and, at the filesystem level, the metadata directory is created
> together with the first metadata file (*.metadata.json).
> 3. All the data files are scanned and the manifests are created.
> The issue occurs when there are so many data files that the file listing
> doesn't fit into memory in one go. In step 3, as we walk through the data
> file list, the iterator has to run another round of file listing, which
> picks up the contents of the metadata directory created in step 2.
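
The name checks from the diff above can be sketched in isolation as follows. This is a minimal illustration and not the actual HiveTableUtil code: it works on plain path strings instead of Hadoop's LocatedFileStatus, and the class name MigrationFileFilter, the methods isDataFile and filter, and the sample paths are all hypothetical. Only the three name conditions are taken from the PR.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; only the three name checks
// below are taken from the diff in PR #3695.
final class MigrationFileFilter {

  // Returns true when a file name should be treated as a data file,
  // i.e. it is not hidden, not underscore-prefixed, and not an
  // Iceberg metadata JSON file.
  static boolean isDataFile(String fileName) {
    return !(fileName.startsWith(".")
        || fileName.startsWith("_")
        || fileName.endsWith("metadata.json"));
  }

  // Keeps only the paths whose last segment passes isDataFile.
  static List<String> filter(List<String> paths) {
    List<String> result = new ArrayList<>();
    for (String path : paths) {
      // Take everything after the last '/' as the file name.
      String fileName = path.substring(path.lastIndexOf('/') + 1);
      if (isDataFile(fileName)) {
        result.add(path);
      }
    }
    return result;
  }
}
```

With these checks, a late listing round that returns something like `.../metadata/00000-abc.metadata.json` is skipped instead of being opened as a Parquet file. A name such as `snap-1.avro` (a made-up example) would still pass the filter, which is the case the review comment asks about.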
--
This message was sent by Atlassian Jira
(v8.20.10#820010)