[
https://issues.apache.org/jira/browse/BEAM-4861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593406#comment-16593406
]
Tim Robertson edited comment on BEAM-4861 at 8/27/18 10:03 AM:
---------------------------------------------------------------
The {{HadoopFileSystem}} has the following methods:
{code:java}
@Override
protected void copy(
    List<HadoopResourceId> srcResourceIds, List<HadoopResourceId> destResourceIds)
    throws IOException {
  for (int i = 0; i < srcResourceIds.size(); ++i) {
    // Unfortunately HDFS FileSystems don't support a native copy operation so we are forced
    // to use the inefficient implementation found in FileUtil which copies all the bytes
    // through the local machine.
    //
    // HDFS FileSystem does define a concat method but could only find the DFSFileSystem
    // implementing it. The DFSFileSystem implemented concat by deleting the srcs after which
    // is not what we want. Also, all the other FileSystem implementations I saw threw
    // UnsupportedOperationException within concat.
    FileUtil.copy(
        fileSystem,
        srcResourceIds.get(i).toPath(),
        fileSystem,
        destResourceIds.get(i).toPath(),
        false,
        true,
        fileSystem.getConf());
  }
}

@Override
protected void rename(
    List<HadoopResourceId> srcResourceIds, List<HadoopResourceId> destResourceIds)
    throws IOException {
  for (int i = 0; i < srcResourceIds.size(); ++i) {
    fileSystem.rename(srcResourceIds.get(i).toPath(), destResourceIds.get(i).toPath());
  }
}

@Override
protected void delete(Collection<HadoopResourceId> resourceIds) throws IOException {
  for (HadoopResourceId resourceId : resourceIds) {
    fileSystem.delete(resourceId.toPath(), false);
  }
}
{code}
{{FileUtil.copy}}, {{fileSystem.rename}} and {{fileSystem.delete}} can all return {{false}}, indicating that the operation was not performed.
*1. Informing the user of the unsuccessful operation*
We could either:
# Change the Beam {{FileSystem}} API to propagate the boolean result, although the [rules for HDFS rename()|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/filesystem.html#boolean_renamePath_src_Path_d] are not trivial to document and this might prove invasive in many places.
# Throw an {{IOException}} to signal that the operation was not successful whenever the response is {{false}}.
I tend towards leaving the API returning void but throwing an exception on the first error within the loops - thoughts?
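The suggested pattern (keep the void signature, fail loudly on the first {{false}}) can be sketched in plain Java. {{java.io.File.renameTo}} stands in for Hadoop's {{FileSystem.rename}} here, since both return {{false}} on failure; {{renameOrThrow}} is a hypothetical helper for illustration, not Beam API:

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class RenameCheck {

  // Hypothetical helper: same void-returning shape as HadoopFileSystem.rename,
  // but the boolean result is checked and turned into an IOException.
  public static void renameOrThrow(File src, File dest) throws IOException {
    // File.renameTo, like Hadoop's FileSystem.rename, returns false on failure
    // (e.g. when the destination's parent directory does not exist).
    if (!src.renameTo(dest)) {
      throw new IOException("Unable to rename " + src + " to " + dest);
    }
  }

  public static void main(String[] args) throws IOException {
    File dir = Files.createTempDirectory("rename-check").toFile();
    File src = new File(dir, "a.txt");
    Files.write(src.toPath(), "payload".getBytes());

    // Renaming into a missing subdirectory silently returns false today;
    // with the check it surfaces as an exception instead.
    try {
      renameOrThrow(src, new File(dir, "missing/b.txt"));
      System.out.println("unexpected success");
    } catch (IOException e) {
      System.out.println("rename failed loudly");
    }
  }
}
```

The same check would apply to the {{copy}} and {{delete}} loops, since {{FileUtil.copy}} and {{FileSystem.delete}} report failure the same way.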
*2. rename() in HDFS*
What do we believe are the expectations for {{rename()}} on HDFS?
Currently the user is not informed that nothing happens when an attempt is made to rename a file into a nonexistent directory. This is obviously bad.
We could change the behaviour to one of:
# Throw an exception if the directory does not exist.
# Create the directory where necessary, allowing files to be overwritten if it already exists (the equivalent of e.g. {{S3FileSystem}}).
# Verify that the directory does not exist, and only then create it and proceed, otherwise failing with an exception (the usual behaviour of a MapReduce {{FileOutputFormat}} at job startup, which fails fast with "directory already exists").
Note that {{S3FileSystem}} and {{GcsFileSystem}} treat a rename as a {{copy()}}
and {{delete()}} operation internally.
I tend towards creating the directory where necessary and allowing overwriting - thoughts?
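A minimal sketch of that preferred behaviour, again using {{java.io.File}} in place of the Hadoop API; {{moveCreatingParents}} is a hypothetical helper illustrating "create missing parents, overwrite an existing destination":

```java
import java.io.File;
import java.io.IOException;

public class RenameWithMkdirs {

  // Hypothetical helper: create the destination's parent directory when it is
  // missing, then rename, overwriting any file already at the destination.
  public static void moveCreatingParents(File src, File dest) throws IOException {
    File parent = dest.getParentFile();
    if (parent != null && !parent.exists() && !parent.mkdirs()) {
      throw new IOException("Unable to create directory " + parent);
    }
    // Overwrite semantics: remove an existing destination first, which is
    // roughly what a copy()+delete() based rename (as on S3/GCS) gives you.
    if (dest.exists() && !dest.delete()) {
      throw new IOException("Unable to delete existing " + dest);
    }
    if (!src.renameTo(dest)) {
      throw new IOException("Unable to rename " + src + " to " + dest);
    }
  }
}
```

In the Hadoop case the equivalent would be a {{fileSystem.mkdirs()}} on the destination's parent before the rename, with each boolean result checked as in option 1.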
CC [~reuvenlax] as relates to BEAM-5036 as well.
> Hadoop Filesystem silently fails
> --------------------------------
>
> Key: BEAM-4861
> URL: https://issues.apache.org/jira/browse/BEAM-4861
> Project: Beam
> Issue Type: Bug
> Components: io-java-hadoop
> Reporter: Jozef Vilcek
> Assignee: Chamikara Jayalath
> Priority: Major
>
> Hi,
> Beam {{FileSystem}} operations copy, rename and delete return void in the SDK.
> The Hadoop native filesystem operations are not void and return a boolean. The
> current implementation in Beam ignores the result and passes as long as no
> exception is thrown.
> I got burned by this when using 'rename' to do a 'move' operation on HDFS. If
> the target directory does not exist, the operation returns false and does not
> touch the file.
> [https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-file-system/src/main/java/org/apache/beam/sdk/io/hdfs/HadoopFileSystem.java#L148]
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)