LadyForest commented on code in PR #24492:
URL: https://github.com/apache/flink/pull/24492#discussion_r1524695213


##########
flink-architecture-tests/flink-architecture-tests-production/archunit-violations/f7a4e6fa-e7de-48c9-a61e-c13e83f0c72e:
##########
@@ -679,17 +679,18 @@ Method 
<org.apache.flink.connector.file.table.DynamicPartitionWriter.write(java.
 Method 
<org.apache.flink.connector.file.table.FileInfoExtractorBulkFormat.wrapReader(org.apache.flink.connector.file.src.reader.BulkFormat$Reader,
 org.apache.flink.connector.file.src.FileSourceSplit)> calls method 
<org.apache.flink.table.utils.PartitionPathUtils.convertStringToInternalValue(java.lang.String,
 org.apache.flink.table.types.DataType)> in 
(FileInfoExtractorBulkFormat.java:156)
 Method 
<org.apache.flink.connector.file.table.FileInfoExtractorBulkFormat.wrapReader(org.apache.flink.connector.file.src.reader.BulkFormat$Reader,
 org.apache.flink.connector.file.src.FileSourceSplit)> calls method 
<org.apache.flink.table.utils.PartitionPathUtils.extractPartitionSpecFromPath(org.apache.flink.core.fs.Path)>
 in (FileInfoExtractorBulkFormat.java:140)
 Method 
<org.apache.flink.connector.file.table.FileSystemCommitter.commitPartitionsWithFiles(java.util.Map)>
 calls method 
<org.apache.flink.table.utils.PartitionPathUtils.extractPartitionSpecFromPath(org.apache.flink.core.fs.Path)>
 in (FileSystemCommitter.java:146)
-Method 
<org.apache.flink.connector.file.table.FileSystemOutputFormat$Builder.build()> 
calls method 
<org.apache.flink.util.Preconditions.checkNotNull(java.lang.Object, 
java.lang.String)> in (FileSystemOutputFormat.java:288)
-Method 
<org.apache.flink.connector.file.table.FileSystemOutputFormat$Builder.build()> 
calls method 
<org.apache.flink.util.Preconditions.checkNotNull(java.lang.Object, 
java.lang.String)> in (FileSystemOutputFormat.java:289)
-Method 
<org.apache.flink.connector.file.table.FileSystemOutputFormat$Builder.build()> 
calls method 
<org.apache.flink.util.Preconditions.checkNotNull(java.lang.Object, 
java.lang.String)> in (FileSystemOutputFormat.java:290)
-Method 
<org.apache.flink.connector.file.table.FileSystemOutputFormat$Builder.build()> 
calls method 
<org.apache.flink.util.Preconditions.checkNotNull(java.lang.Object, 
java.lang.String)> in (FileSystemOutputFormat.java:291)
-Method 
<org.apache.flink.connector.file.table.FileSystemOutputFormat$Builder.build()> 
calls method 
<org.apache.flink.util.Preconditions.checkNotNull(java.lang.Object, 
java.lang.String)> in (FileSystemOutputFormat.java:292)
+Method 
<org.apache.flink.connector.file.table.FileSystemOutputFormat$Builder.build()> 
calls method 
<org.apache.flink.util.Preconditions.checkNotNull(java.lang.Object, 
java.lang.String)> in (FileSystemOutputFormat.java:324)
+Method 
<org.apache.flink.connector.file.table.FileSystemOutputFormat$Builder.build()> 
calls method 
<org.apache.flink.util.Preconditions.checkNotNull(java.lang.Object, 
java.lang.String)> in (FileSystemOutputFormat.java:325)
+Method 
<org.apache.flink.connector.file.table.FileSystemOutputFormat$Builder.build()> 
calls method 
<org.apache.flink.util.Preconditions.checkNotNull(java.lang.Object, 
java.lang.String)> in (FileSystemOutputFormat.java:326)
+Method 
<org.apache.flink.connector.file.table.FileSystemOutputFormat$Builder.build()> 
calls method 
<org.apache.flink.util.Preconditions.checkNotNull(java.lang.Object, 
java.lang.String)> in (FileSystemOutputFormat.java:327)
+Method 
<org.apache.flink.connector.file.table.FileSystemOutputFormat$Builder.build()> 
calls method 
<org.apache.flink.util.Preconditions.checkNotNull(java.lang.Object, 
java.lang.String)> in (FileSystemOutputFormat.java:328)
 Method 
<org.apache.flink.connector.file.table.FileSystemOutputFormat$Builder.setOutputFileConfig(org.apache.flink.streaming.api.functions.sink.filesystem.OutputFileConfig)>
 has parameter of type 
<org.apache.flink.streaming.api.functions.sink.filesystem.OutputFileConfig> in 
(FileSystemOutputFormat.java:0)
-Method 
<org.apache.flink.connector.file.table.FileSystemTableSink$TableBucketAssigner.getBucketId(org.apache.flink.table.data.RowData,
 
org.apache.flink.streaming.api.functions.sink.filesystem.BucketAssigner$Context)>
 calls method 
<org.apache.flink.table.utils.PartitionPathUtils.generatePartitionPath(java.util.LinkedHashMap)>
 in (FileSystemTableSink.java:566)
+Method 
<org.apache.flink.connector.file.table.FileSystemOutputFormat$Builder.setStagingPath(org.apache.flink.core.fs.Path)>
 is annotated with <org.apache.flink.annotation.VisibleForTesting> in 
(FileSystemOutputFormat.java:291)
+Method 
<org.apache.flink.connector.file.table.FileSystemOutputFormat.createStagingDirectory(org.apache.flink.core.fs.Path)>
 calls method <org.apache.flink.util.Preconditions.checkState(boolean, 
java.lang.String, [Ljava.lang.Object;)> in (FileSystemOutputFormat.java:109)

Review Comment:
   > I don't know why my comment didn't make it into the previous approval. 🤔
   > 
   > > I'm wondering why these rule updates are necessary. The 
`fileSystemOutputFormat` class is annotated as `@Internal` API. Do you 
understand why we need add these exclusions? Or is it an indication that we 
have too strict ArchUnit rules apply here? 🤔
   
   This is exactly the point I wish to discuss with you. I think there are some areas where the ArchUnit rules could be optimized.
   
   Firstly, the constraint on connectors exists because connector codebases such as Kafka and Elasticsearch have been removed from Flink into external repositories. To prevent internal code changes in Flink from breaking the compilation of connectors in those repositories, the rule `CONNECTOR_CLASSES_ONLY_DEPEND_ON_PUBLIC_API` was added. However, since the filesystem connector lives in the same repository as Flink, perhaps we could exclude it from this rule. On the other hand, most of the violations come from `Preconditions#checkXXX`, which makes me wonder whether we should promote `Preconditions` to a `Public` or `PublicEvolving` interface.
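
   For context, the violations above are all triggered by the plain null-check pattern in `FileSystemOutputFormat$Builder.build()`. A minimal self-contained sketch of that pattern (using a local stand-in for `org.apache.flink.util.Preconditions`, since the real class is `@Internal` and that is exactly what the rule flags):

   ```java
   // Local stand-in for org.apache.flink.util.Preconditions, so this sketch
   // compiles on its own; the real class lives in flink-core and is @Internal.
   final class Preconditions {
       static <T> T checkNotNull(T reference, String errorMessage) {
           if (reference == null) {
               throw new NullPointerException(errorMessage);
           }
           return reference;
       }
   }

   public class BuilderSketch {
       public static void main(String[] args) {
           // The shape of call that ArchUnit records as a connector-to-internal
           // dependency: connector builder code validating its required fields.
           String stagingPath = "/tmp/staging";
           String checked =
                   Preconditions.checkNotNull(stagingPath, "Staging path must not be null.");
           System.out.println(checked); // prints "/tmp/staging"
       }
   }
   ```

   Promoting `Preconditions` to `PublicEvolving` would make this usage legal under `CONNECTOR_CLASSES_ONLY_DEPEND_ON_PUBLIC_API` without any exclusions.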
   
   Secondly, I find updating the violation file quite challenging. According to the README, setting `freeze.refreeze=true` in the `archunit.properties` file and re-running the tests should regenerate the violation file. However, the recorded violations include line numbers, so submitting the regenerated file wholesale would produce a diff that is very painful to review. To spare the reviewer, I have to compare the diff line by line and revert the entries where only the line number changed. Therefore, I suggest that line numbers not be printed in this file.
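
   For reference, the refreeze flow above is driven by a single flag in the module's `archunit.properties` (key taken from the README as quoted here; exact file location per module):

   ```properties
   # Regenerate the frozen violation store on the next test run,
   # then revert this to false before committing.
   freeze.refreeze=true
   ```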



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
