Hi Xinyu,

would it help to have a small tool that truncates the finished files
which have an associated valid-length file? That way, one could run this
tool before others consume the data further downstream.
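
A minimal sketch of what such a tool could look like, assuming the bucket
directory is reachable as a local path (e.g. via an NFS gateway or after
copying out of HDFS), the default BucketingSink marker naming (prefix "_",
suffix ".valid-length", both configurable), and that the marker stores the
valid byte count as a plain decimal string (if it is written with writeUTF,
the two length-prefix bytes would need to be skipped first):

```python
import os
import sys

VALID_LENGTH_SUFFIX = ".valid-length"

def truncate_to_valid_length(directory):
    """Truncate each finished file to the length recorded in its
    sibling valid-length marker, then delete the marker."""
    for name in os.listdir(directory):
        if not name.endswith(VALID_LENGTH_SUFFIX):
            continue
        marker = os.path.join(directory, name)
        base = name[: -len(VALID_LENGTH_SUFFIX)]
        # BucketingSink prepends "_" to marker names by default
        # (configurable via setValidLengthPrefix); strip it to find
        # the data file the marker refers to.
        if base.startswith("_"):
            base = base[1:]
        data_file = os.path.join(directory, base)
        if not os.path.exists(data_file):
            continue  # orphaned marker -- leave it for inspection
        with open(marker) as f:
            valid_length = int(f.read().strip())
        with open(data_file, "r+b") as f:
            f.truncate(valid_length)  # drop bytes past the valid length
        os.remove(marker)

if __name__ == "__main__":
    truncate_to_valid_length(sys.argv[1])
```

For files living directly in HDFS, the equivalent would use the Hadoop
FileSystem API (truncate, or copy-and-rename on pre-2.7 clusters) instead
of local file operations.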

Cheers,
Till

On Tue, May 15, 2018 at 3:05 PM, Xinyu Zhang <342689...@qq.com> wrote:

> Yes, I'm glad to do it, but I'm not sure writing a new file is a good
> solution, so I want to discuss it here.
> Do you have any ideas? @Kostas
>
>
>
>
> ------------------ Original Message ------------------
> From: "twalthr"<twal...@apache.org>;
> Date: Tuesday, May 15, 2018, 8:21 PM
> To: "Xinyu Zhang"<342689...@qq.com>;
> Cc: "dev"<dev@flink.apache.org>; "kkloudas"<kklou...@apache.org>;
> Subject: Re: Re: Rewriting a new file instead of writing a ".valid-length" file
> in BucketingSink when restoring
>
>
>
> As far as I know, the bucketing sink is currently also limited by
> relying on Hadoop's file system abstraction. It is planned to switch to
> Flink's file system abstraction, which might also improve this situation.
> Kostas (in CC) might know more about it.
>
> But I think we can discuss whether another behavior should be configurable
> as well. Would you be willing to contribute?
>
> Regards,
> Timo
>
>
> On 15.05.18 at 14:01, Xinyu Zhang wrote:
> > Thanks for your reply.
> > Indeed, if a file is very large, it will take a long time. However,
> > the ".valid-length" file is not convenient for others who use the
> > data in HDFS.
> > Maybe we should provide a configuration for users to choose which
> > strategy they prefer.
> > Do you have any ideas?
> >
> >
> > ------------------ Original Message ------------------
> > *From:* "Timo Walther"<twal...@apache.org>;
> > *Date:* Tuesday, May 15, 2018, 7:30 PM
> > *To:* "dev"<dev@flink.apache.org>;
> > *Subject:* Re: Rewriting a new file instead of writing a ".valid-length"
> > file in BucketingSink when restoring
> >
> > I guess writing a new file would take much longer than just using the
> > .valid-length file, especially if the files are very large. The
> > restore time should be kept as short as possible to ensure little
> > downtime on restarts.
> >
> > Regards,
> > Timo
> >
> >
> > On 15.05.18 at 09:31, Gary Yao wrote:
> > > Hi,
> > >
> > > The BucketingSink truncates the file if the Hadoop FileSystem
> > supports this
> > > operation (Hadoop 2.7 and above) [1]. What version of Hadoop are you
> > using?
> > >
> > > Best,
> > > Gary
> > >
> > > [1]
> > > https://github.com/apache/flink/blob/bcd028d75b0e5c5c691e24640a2196b2fdaf85e0/flink-connectors/flink-connector-filesystem/src/main/java/org/apache/flink/streaming/connectors/fs/bucketing/BucketingSink.java#L301
> > >
> > > On Mon, May 14, 2018 at 1:37 PM, Zhang Xinyu <342689...@qq.com> wrote:
> > >
> > >> Hi
> > >>
> > >>
> > >> I'm trying to copy data from Kafka to HDFS. The data in HDFS is
> > >> then used by others for further computations in map/reduce.
> > >> If some tasks fail, the ".valid-length" file is created for lower
> > >> Hadoop versions. The problem is that other people must know how to
> > >> deal with the ".valid-length" file; otherwise, the data may not be
> > >> exactly-once.
> > >> Hence, why not rewrite a new file when restoring instead of writing a
> > >> ".valid-length" file? That way, others who use the data in HDFS
> > >> don't need to know how to deal with the ".valid-length" file.
> > >>
> > >>
> > >> Thanks!
> > >>
> > >>
> > >> Zhang Xinyu
> >
>