[
https://issues.apache.org/jira/browse/FLINK-10203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16670237#comment-16670237
]
ASF GitHub Bot commented on FLINK-10203:
----------------------------------------
art4ul commented on issue #6608: [FLINK-10203]Support truncate method for old
Hadoop versions in HadoopRecoverableFsDataOutputStream
URL: https://github.com/apache/flink/pull/6608#issuecomment-434730831
@kl0u @StephanEwen
Hi guys,
Regarding your question:
> - Does HDFS permit to rename to an already existing file name (replacing
that existing file)?
I've double-checked it. HDFS has no ability to rename a file onto an existing
file (there is no overwrite-on-rename), but this pull request resolves that issue.
In case of a failure, after restarting, the 'truncate' method checks whether
the original file exists:
- If the original file exists, the process starts again from the beginning.
- If the original file does not exist but a file with the '*.truncated'
extension does, the absence of the original file tells us that the truncated
file was fully written and the original was deleted; the process crashed at
the stage of renaming the truncated file. We can therefore use the file with
the '*.truncated' extension as the result and finish the truncation process
(see the sketch below).
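To make the two recovery cases above concrete, here is a minimal sketch in
Java against the Hadoop FileSystem API. The method names, the 4 KB buffer,
and the copy-based truncation are my own illustrative assumptions, not the
exact code of this PR:

```java
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class LegacyTruncateSketch {

    /** Resumes an interrupted legacy truncate, following the two cases above. */
    static void recoverTruncate(FileSystem fs, Path original, long truncateLength)
            throws IOException {
        // Illustrative naming convention for the intermediate file.
        Path truncated = new Path(original.getParent(), original.getName() + ".truncated");

        if (fs.exists(original)) {
            // Case 1: the original still exists, so the copy phase did not
            // complete. Drop any partial copy and restart the whole truncation.
            fs.delete(truncated, false);
            legacyTruncate(fs, original, truncated, truncateLength);
        } else if (fs.exists(truncated)) {
            // Case 2: the original is gone but the '*.truncated' file exists.
            // The copy finished and the crash happened before the rename, so
            // only the rename is left to finish the truncation.
            fs.rename(truncated, original);
        } else {
            throw new IOException("Neither original nor truncated file found: " + original);
        }
    }

    /** Copy-based truncate emulation: copy the first truncateLength bytes into
     *  the '*.truncated' file, then replace the original with the copy. */
    static void legacyTruncate(FileSystem fs, Path original, Path truncated, long truncateLength)
            throws IOException {
        byte[] buffer = new byte[4096];
        try (FSDataInputStream in = fs.open(original);
             FSDataOutputStream out = fs.create(truncated, true)) {
            long remaining = truncateLength;
            while (remaining > 0) {
                int read = in.read(buffer, 0, (int) Math.min(buffer.length, remaining));
                if (read < 0) {
                    break;
                }
                out.write(buffer, 0, read);
                remaining -= read;
            }
        }
        fs.delete(original, false);
        fs.rename(truncated, original);
    }
}
```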
Also, I would like to clarify your idea regarding a recoverable writer with
the "recover for resume" property.
As far as I understand this approach: if the Hadoop version is 2.7 or later,
we instantiate a recoverable writer with the native Hadoop truncate logic, and
its supportsResume() method returns true. Otherwise, we instantiate a
recoverable writer that never uses the truncate method (it only creates new
files), and its supportsResume() method returns false. A sketch of the version
check follows below.
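Here is a minimal sketch of how that version check could look, assuming we key
off Hadoop's reported runtime version (the class name and the placeholder
writer names in the comments are made up for illustration):

```java
import org.apache.hadoop.util.VersionInfo;

class TruncateSupportCheck {

    /** Returns true if the runtime Hadoop version is 2.7 or later,
     *  i.e. the native truncate is available on the FileSystem. */
    static boolean truncateAvailable() {
        String[] version = VersionInfo.getVersion().split("\\.");
        int major = Integer.parseInt(version[0]);
        int minor = version.length > 1 ? Integer.parseInt(version[1]) : 0;
        return major > 2 || (major == 2 && minor >= 7);
    }

    // Usage sketch: pick the writer implementation based on the check.
    // (The writer class names are placeholders, not existing Flink classes.)
    //
    // RecoverableWriter writer = truncateAvailable()
    //         ? new TruncateBasedWriter(fs)   // supportsResume() -> true
    //         : new CreateOnlyWriter(fs);     // supportsResume() -> false
}
```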
If you are OK with this approach, I can prepare another pull request. But in
that case, I would need to wait until the logic that checks the
supportsResume() method is implemented.
Maybe I could help you with that?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
> Support truncate method for old Hadoop versions in
> HadoopRecoverableFsDataOutputStream
> --------------------------------------------------------------------------------------
>
> Key: FLINK-10203
> URL: https://issues.apache.org/jira/browse/FLINK-10203
> Project: Flink
> Issue Type: Bug
> Components: DataStream API, filesystem-connector
> Affects Versions: 1.6.0, 1.6.1, 1.7.0
> Reporter: Artsem Semianenka
> Assignee: Artsem Semianenka
> Priority: Major
> Labels: pull-request-available
> Attachments: legacy truncate logic.pdf
>
>
> The new StreamingFileSink (introduced in Flink 1.6) uses the
> HadoopRecoverableFsDataOutputStream wrapper to write data to HDFS.
> HadoopRecoverableFsDataOutputStream wraps FSDataOutputStream to make it
> possible to restore from a certain point of a file after a failure and
> continue writing data. To achieve this recovery functionality,
> HadoopRecoverableFsDataOutputStream uses the "truncate" method, which was
> introduced only in Hadoop 2.7.
> Unfortunately, there are a few official Hadoop distributions whose latest
> versions still use Hadoop 2.6 (these distributions include Cloudera and
> Pivotal HD). As a result, Flink's Hadoop connector can't work with these
> distributions, even though Flink declares support for Hadoop from version
> 2.4.0 upwards
> (https://ci.apache.org/projects/flink/flink-docs-release-1.6/start/building.html#hadoop-versions).
> I guess we should emulate the functionality of the "truncate" method for
> older Hadoop versions.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)