Re: Rewriting a new file instead of writing a ".valid-length" file inBucketSink when restoring
Hi Till Thanks for your suggestion. A small tool can work lightly and asynchronously. However, I don't know when others will use the data, so I should use the tool to check and truncate the finished file once a valid-length file is found. I think it's hard to maintain it and it shouldn't be maintained by users (just like the current implementation of BucketingSink with truncate function). Regards, Zhang Xinyu -- -- ??: "Till Rohrmann"; : 2018??5??15??(??) 11:27 ??: "dev" ; : "kkloudas" ; : Re: ?? Rewriting a new file instead of writing a ".valid-length" file inBucketSink when restoring Hi Xinyu, would it help to have a small tool which can truncate the finished files which have a valid-length file associated? That way, one could use this tool before others are using the data farther down stream. Cheers, Till On Tue, May 15, 2018 at 3:05 PM, Xinyu Zhang <342689...@qq.com> wrote: > Yes, I'm glad to do it. but I'm not sure writing a new file is a good > solution. So I want to discuss it here. > Do you have any ideas? @Kostas > > > > > -- -- > ??: "twalthr" ; > : 2018??5??15??(??) 8:21 > ??: "Xinyu Zhang"<342689...@qq.com>; > : "dev" ; "kkloudas" ; > : Re: ?? Rewriting a new file instead of writing a ".valid-length" > file > inBucketSink when restoring > > > > As far as I know, the bucketing sink is currenlty also limited by > relying on Hadoops file system abstraction. It is planned to switch to > Flink's file system abstraction which might also improve this situation. > Kostas (in CC) might know more about it. > > But I think we can discuss if an other behavior should be configurable > as well. Would you be willing to contribute? > > Regards, > Timo > > > Am 15.05.18 um 14:01 schrieb Xinyu Zhang: > > Thanks for your reply. > > Indeed, if a file is very large, it will take a long time. However, > > the the ??.valid-length?? file is not convenient for others who use the > > data in HDFS. > > Maybe we should provide a configuration for users to choose which > > strategy they prefer. > > Do you have any ideas? > > > > > > -- -- > > *??:* "Timo Walther" ; > > *:* 2018??5??15??(??) 7:30 > > *??:* "dev" ; > > *:* Re: Rewriting a new file instead of writing a ".valid-length" > > file inBucketSink when restoring > > > > I guess writing a new file would take much longer than just using the > > .valid-length file, especially if the files are very large. The > > restoring time should be as minimal as possible to ensure little > > downtime on restarts. > > > > Regards, > > Timo > > > > > > Am 15.05.18 um 09:31 schrieb Gary Yao: > > > Hi, > > > > > > The BucketingSink truncates the file if the Hadoop FileSystem > > supports this > > > operation (Hadoop 2.7 and above) [1]. What version of Hadoop are you > > using? > > > > > > Best, > > > Gary > > > > > > [1] > > > > > https://github.com/apache/flink/blob/bcd028d75b0e5c5c691e24640a2196 > b2fdaf85e0/flink-connectors/flink-connector-filesystem/ > src/main/java/org/apache/flink/streaming/connectors/fs/ > bucketing/BucketingSink.java#L301 > > > > > > On Mon, May 14, 2018 at 1:37 PM, ?? <342689...@qq.com> wrote: > > > > > >> Hi > > >> > > >> > > >> I'm trying to copy data from kafka to HDFS . The data in HDFS is > > used to > > >> do other computations by others in map/reduce. > > >> If some tasks failed, the ".valid-length" file is created for the low > > >> version hadoop. The problem is other people must know how to deal > > with the > > >> ".valid-length" file, otherwise, the data may be not exactly-once. > > >> Hence, why not rewrite a new file when restoring instead of writing a > > >> ".valid-length" file. In this way, others who use the data in HDFS > > don't > > >> need to know how to deal with the ".valid-length" file. > > >> > > >> > > >> Thanks! > > >> > > >> > > >> Zhang Xinyu > > >
?????? Rewriting a new file instead of writing a ".valid-length" file inBucketSink when restoring
Hi gary Our Hadoop version is very low and does not support truncate function. Some people are working on upgrading it. I just think the ".valid-length" file is not friendly enough for users. Maybe it should be configurable to choose whether using the ".valid-length" file at least. Regards, Xinyu Zhang -- -- ??: "Gary Yao"; : 2018??5??15??(??) 3:31 ??: "dev" ; : "Xinyu Zhang"<342689...@qq.com>; : Re: Rewriting a new file instead of writing a ".valid-length" file inBucketSink when restoring Hi, The BucketingSink truncates the file if the Hadoop FileSystem supports this operation (Hadoop 2.7 and above) [1]. What version of Hadoop are you using? Best, Gary [1] https://github.com/apache/flink/blob/bcd028d75b0e5c5c691e24640a2196b2fdaf85e0/flink-connectors/flink-connector-filesystem/src/main/java/org/apache/flink/streaming/connectors/fs/bucketing/BucketingSink.java#L301 On Mon, May 14, 2018 at 1:37 PM, ?? <342689...@qq.com> wrote: > Hi > > > I'm trying to copy data from kafka to HDFS . The data in HDFS is used to > do other computations by others in map/reduce. > If some tasks failed, the ".valid-length" file is created for the low > version hadoop. The problem is other people must know how to deal with the > ".valid-length" file, otherwise, the data may be not exactly-once. > Hence, why not rewrite a new file when restoring instead of writing a > ".valid-length" file. In this way, others who use the data in HDFS don't > need to know how to deal with the ".valid-length" file. > > > Thanks! > > > Zhang Xinyu
Re: 回复: Rewriting a new file instead of writing a ".valid-length" file inBucketSink when restoring
Hi Xinyu, would it help to have a small tool which can truncate the finished files which have a valid-length file associated? That way, one could use this tool before others are using the data farther down stream. Cheers, Till On Tue, May 15, 2018 at 3:05 PM, Xinyu Zhang <342689...@qq.com> wrote: > Yes, I'm glad to do it. but I'm not sure writing a new file is a good > solution. So I want to discuss it here. > Do you have any ideas? @Kostas > > > > > -- 原始邮件 -- > 发件人: "twalthr"; > 发送时间: 2018年5月15日(星期二) 晚上8:21 > 收件人: "Xinyu Zhang"<342689...@qq.com>; > 抄送: "dev" ; "kkloudas" ; > 主题: Re: 回复: Rewriting a new file instead of writing a ".valid-length" file > inBucketSink when restoring > > > > As far as I know, the bucketing sink is currenlty also limited by > relying on Hadoops file system abstraction. It is planned to switch to > Flink's file system abstraction which might also improve this situation. > Kostas (in CC) might know more about it. > > But I think we can discuss if an other behavior should be configurable > as well. Would you be willing to contribute? > > Regards, > Timo > > > Am 15.05.18 um 14:01 schrieb Xinyu Zhang: > > Thanks for your reply. > > Indeed, if a file is very large, it will take a long time. However, > > the the “.valid-length” file is not convenient for others who use the > > data in HDFS. > > Maybe we should provide a configuration for users to choose which > > strategy they prefer. > > Do you have any ideas? > > > > > > -- 原始邮件 -- > > *发件人:* "Timo Walther" ; > > *发送时间:* 2018年5月15日(星期二) 晚上7:30 > > *收件人:* "dev" ; > > *主题:* Re: Rewriting a new file instead of writing a ".valid-length" > > file inBucketSink when restoring > > > > I guess writing a new file would take much longer than just using the > > .valid-length file, especially if the files are very large. The > > restoring time should be as minimal as possible to ensure little > > downtime on restarts. > > > > Regards, > > Timo > > > > > > Am 15.05.18 um 09:31 schrieb Gary Yao: > > > Hi, > > > > > > The BucketingSink truncates the file if the Hadoop FileSystem > > supports this > > > operation (Hadoop 2.7 and above) [1]. What version of Hadoop are you > > using? > > > > > > Best, > > > Gary > > > > > > [1] > > > > > https://github.com/apache/flink/blob/bcd028d75b0e5c5c691e24640a2196 > b2fdaf85e0/flink-connectors/flink-connector-filesystem/ > src/main/java/org/apache/flink/streaming/connectors/fs/ > bucketing/BucketingSink.java#L301 > > > > > > On Mon, May 14, 2018 at 1:37 PM, 张馨予 <342689...@qq.com> wrote: > > > > > >> Hi > > >> > > >> > > >> I'm trying to copy data from kafka to HDFS . The data in HDFS is > > used to > > >> do other computations by others in map/reduce. > > >> If some tasks failed, the ".valid-length" file is created for the low > > >> version hadoop. The problem is other people must know how to deal > > with the > > >> ".valid-length" file, otherwise, the data may be not exactly-once. > > >> Hence, why not rewrite a new file when restoring instead of writing a > > >> ".valid-length" file. In this way, others who use the data in HDFS > > don't > > >> need to know how to deal with the ".valid-length" file. > > >> > > >> > > >> Thanks! > > >> > > >> > > >> Zhang Xinyu > > >
?????? ?????? Rewriting a new file instead of writing a ".valid-length" file inBucketSink when restoring
Yes, I'm glad to do it. but I'm not sure writing a new file is a good solution. So I want to discuss it here. Do you have any ideas? @Kostas -- -- ??: "twalthr"; : 2018??5??15??(??) 8:21 ??: "Xinyu Zhang"<342689...@qq.com>; : "dev" ; "kkloudas" ; : Re: ?? Rewriting a new file instead of writing a ".valid-length" file inBucketSink when restoring As far as I know, the bucketing sink is currenlty also limited by relying on Hadoops file system abstraction. It is planned to switch to Flink's file system abstraction which might also improve this situation. Kostas (in CC) might know more about it. But I think we can discuss if an other behavior should be configurable as well. Would you be willing to contribute? Regards, Timo Am 15.05.18 um 14:01 schrieb Xinyu Zhang: > Thanks for your reply. > Indeed, if a file is very large, it will take a long time. However, > the the ??.valid-length?? file is not convenient for others who use the > data in HDFS. > Maybe we should provide a configuration for users to choose which > strategy they prefer. > Do you have any ideas? > > > -- -- > *??:* "Timo Walther" ; > *:* 2018??5??15??(??) 7:30 > *??:* "dev" ; > *:* Re: Rewriting a new file instead of writing a ".valid-length" > file inBucketSink when restoring > > I guess writing a new file would take much longer than just using the > .valid-length file, especially if the files are very large. The > restoring time should be as minimal as possible to ensure little > downtime on restarts. > > Regards, > Timo > > > Am 15.05.18 um 09:31 schrieb Gary Yao: > > Hi, > > > > The BucketingSink truncates the file if the Hadoop FileSystem > supports this > > operation (Hadoop 2.7 and above) [1]. What version of Hadoop are you > using? > > > > Best, > > Gary > > > > [1] > > > https://github.com/apache/flink/blob/bcd028d75b0e5c5c691e24640a2196b2fdaf85e0/flink-connectors/flink-connector-filesystem/src/main/java/org/apache/flink/streaming/connectors/fs/bucketing/BucketingSink.java#L301 > > > > On Mon, May 14, 2018 at 1:37 PM, ?? <342689...@qq.com> wrote: > > > >> Hi > >> > >> > >> I'm trying to copy data from kafka to HDFS . The data in HDFS is > used to > >> do other computations by others in map/reduce. > >> If some tasks failed, the ".valid-length" file is created for the low > >> version hadoop. The problem is other people must know how to deal > with the > >> ".valid-length" file, otherwise, the data may be not exactly-once. > >> Hence, why not rewrite a new file when restoring instead of writing a > >> ".valid-length" file. In this way, others who use the data in HDFS > don't > >> need to know how to deal with the ".valid-length" file. > >> > >> > >> Thanks! > >> > >> > >> Zhang Xinyu >
Re: 回复: Rewriting a new file instead of writing a ".valid-length" file inBucketSink when restoring
As far as I know, the bucketing sink is currenlty also limited by relying on Hadoops file system abstraction. It is planned to switch to Flink's file system abstraction which might also improve this situation. Kostas (in CC) might know more about it. But I think we can discuss if an other behavior should be configurable as well. Would you be willing to contribute? Regards, Timo Am 15.05.18 um 14:01 schrieb Xinyu Zhang: Thanks for your reply. Indeed, if a file is very large, it will take a long time. However, the the ??.valid-length?? file is not?0?2convenient for others who use the data in HDFS. Maybe we should provide a configuration for users to choose which strategy they prefer. Do you have any ideas? --?0?2?0?2-- *??:*?0?2"Timo Walther"; *:*?0?22018??5??15??(??) 7:30 *??:*?0?2"dev" ; *:*?0?2Re: Rewriting a new file instead of writing a ".valid-length" file inBucketSink when restoring I guess writing a new file would take much longer than just using the .valid-length file, especially if the files are very large. The restoring time should be as minimal as possible to ensure little downtime on restarts. Regards, Timo Am 15.05.18 um 09:31 schrieb Gary Yao: > Hi, > > The BucketingSink truncates the file if the Hadoop FileSystem supports this > operation (Hadoop 2.7 and above) [1]. What version of Hadoop are you using? > > Best, > Gary > > [1] > https://github.com/apache/flink/blob/bcd028d75b0e5c5c691e24640a2196b2fdaf85e0/flink-connectors/flink-connector-filesystem/src/main/java/org/apache/flink/streaming/connectors/fs/bucketing/BucketingSink.java#L301 > > On Mon, May 14, 2018 at 1:37 PM, ?? <342689...@qq.com> wrote: > >> Hi >> >> >> I'm trying to copy data from kafka to HDFS . The data in HDFS is used to >> do other computations by others in map/reduce. >> If some tasks failed, the ".valid-length" file is created for the low >> version hadoop. The problem is other people must know how to deal with the >> ".valid-length" file, otherwise, the data may be not exactly-once. >> Hence, why not rewrite a new file when restoring instead of writing a >> ".valid-length" file. In this way, others who use the data in HDFS don't >> need to know how to deal with the ".valid-length" file. >> >> >> Thanks! >> >> >> Zhang Xinyu
?????? Rewriting a new file instead of writing a ".valid-length" file inBucketSink when restoring
Thanks for your reply. Indeed, if a file is very large, it will take a long time. However, the the ??.valid-length?? file is not convenient for others who use the data in HDFS. Maybe we should provide a configuration for users to choose which strategy they prefer. Do you have any ideas? -- -- ??: "Timo Walther"; : 2018??5??15??(??) 7:30 ??: "dev" ; : Re: Rewriting a new file instead of writing a ".valid-length" file inBucketSink when restoring I guess writing a new file would take much longer than just using the .valid-length file, especially if the files are very large. The restoring time should be as minimal as possible to ensure little downtime on restarts. Regards, Timo Am 15.05.18 um 09:31 schrieb Gary Yao: > Hi, > > The BucketingSink truncates the file if the Hadoop FileSystem supports this > operation (Hadoop 2.7 and above) [1]. What version of Hadoop are you using? > > Best, > Gary > > [1] > https://github.com/apache/flink/blob/bcd028d75b0e5c5c691e24640a2196b2fdaf85e0/flink-connectors/flink-connector-filesystem/src/main/java/org/apache/flink/streaming/connectors/fs/bucketing/BucketingSink.java#L301 > > On Mon, May 14, 2018 at 1:37 PM, ?? <342689...@qq.com> wrote: > >> Hi >> >> >> I'm trying to copy data from kafka to HDFS . The data in HDFS is used to >> do other computations by others in map/reduce. >> If some tasks failed, the ".valid-length" file is created for the low >> version hadoop. The problem is other people must know how to deal with the >> ".valid-length" file, otherwise, the data may be not exactly-once. >> Hence, why not rewrite a new file when restoring instead of writing a >> ".valid-length" file. In this way, others who use the data in HDFS don't >> need to know how to deal with the ".valid-length" file. >> >> >> Thanks! >> >> >> Zhang Xinyu