[
https://issues.apache.org/jira/browse/BEAM-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596233#comment-16596233
]
Tim Robertson edited comment on BEAM-5036 at 8/29/18 12:03 PM:
---------------------------------------------------------------
The changes to rename() in BEAM-4861 (yet to be merged) now create directories
if missing, but also surface an exception if the underlying operation reports
that it did not complete.
This means it will fail with an exception if the target file already exists:
{code}
Caused by: java.io.IOException: Unable to rename resource
hdfs://ha-nn/tmp/delme/.temp-beam-2018-08-29_11-41-47-0/1d676ec2-787d-4357-838f-f904e8d57b3d
to hdfs://ha-nn/tmp/es-2012.txt-00000-of-00045. No further information
provided by underlying filesystem.
    at org.apache.beam.sdk.io.hdfs.HadoopFileSystem.rename(HadoopFileSystem.java:181)
    at org.apache.beam.sdk.io.FileSystems.rename(FileSystems.java:326)
    at org.apache.beam.sdk.io.FileBasedSink$WriteOperation.moveToOutputFiles(FileBasedSink.java:761)
    at org.apache.beam.sdk.io.WriteFiles$FinalizeTempFileBundles$FinalizeFn.process(WriteFiles.java:801)
{code}
The original implementation using copy() would overwrite files without warning.
Do we wish to silently overwrite files when issuing a rename()? I am used to
Hadoop operations failing if the output already exists, so to me a silent
overwrite sounds wrong. I'd rather be forced to delete manually than
accidentally be able to overwrite TBs of data.
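To make the two options concrete, here is a rough local-filesystem analogy using java.nio.file. It is not the Hadoop or Beam API, and the class and method names (RenameSemanticsSketch, strictRename, overwritingRename) are made up purely for illustration. It only sketches the difference between a rename that refuses to replace an existing target and one that silently overwrites it:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical sketch on the local filesystem; not Beam or Hadoop code. */
final class RenameSemanticsSketch {

  /** Creates the destination's parent directories if they are missing. */
  private static void ensureParent(Path dst) throws IOException {
    Path parent = dst.getParent();
    if (parent != null) {
      Files.createDirectories(parent);
    }
  }

  /** Strict rename: fails with an exception if the destination already exists. */
  static void strictRename(Path src, Path dst) throws IOException {
    ensureParent(dst);
    // Without REPLACE_EXISTING, Files.move refuses to replace an existing file.
    Files.move(src, dst);
  }

  /** Overwriting rename: silently replaces an existing destination. */
  static void overwritingRename(Path src, Path dst) throws IOException {
    ensureParent(dst);
    Files.move(src, dst, StandardCopyOption.REPLACE_EXISTING);
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("rename-sketch");
    Path temp = dir.resolve("temp-shard");
    Path output = dir.resolve("out").resolve("es-2012.txt-00000-of-00045");

    // Simulate an output file left over from a previous run.
    Files.createDirectories(output.getParent());
    Files.write(output, "old".getBytes(StandardCharsets.UTF_8));
    Files.write(temp, "new".getBytes(StandardCharsets.UTF_8));

    try {
      strictRename(temp, output);
    } catch (FileAlreadyExistsException e) {
      System.out.println("Strict rename refused to overwrite: " + e.getFile());
    }

    // The temp file is still in place, so the overwriting variant can run.
    overwritingRename(temp, output);
    System.out.println("Output now contains: "
        + new String(Files.readAllBytes(output), StandardCharsets.UTF_8));
  }
}
```

With the strict variant the user has to delete the existing output explicitly before re-running, which is the behaviour I would prefer; the overwriting variant matches what the copy()-based implementation effectively did.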
> Optimize FileBasedSink's WriteOperation.moveToOutput()
> ------------------------------------------------------
>
> Key: BEAM-5036
> URL: https://issues.apache.org/jira/browse/BEAM-5036
> Project: Beam
> Issue Type: Improvement
> Components: io-java-files
> Affects Versions: 2.5.0
> Reporter: Jozef Vilcek
> Assignee: Tim Robertson
> Priority: Major
> Time Spent: 1.5h
> Remaining Estimate: 0h
>
> The moveToOutput() methods in FileBasedSink.WriteOperation implement move by
> copy+delete. It would be better to use rename(), which can be much more
> efficient on some filesystems.
> The filesystem must support cross-directory rename. BEAM-4861 is related to
> this for the case of the HDFS filesystem.
> Feature was discussed here:
> http://mail-archives.apache.org/mod_mbox/beam-dev/201807.mbox/%3CCAF9t7_4Mp54pQ+vRrJrBh9Vx0=uaknupzd_qdh_qdm9vxll...@mail.gmail.com%3E
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)