[GitHub] [seatunnel] yangzhiyuss commented on a diff in pull request #5428: [FixBug][HdfsSource]Filter out empty and dirty files

via GitHub Tue, 05 Sep 2023 18:23:52 -0700


yangzhiyuss commented on code in PR #5428:
URL: https://github.com/apache/seatunnel/pull/5428#discussion_r1316574239



##########
seatunnel-connectors-v2/connector-file/connector-file-base-hadoop/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/hdfs/source/BaseHdfsFileSource.java:
##########
@@ -110,13 +112,26 @@ public void prepare(Config pluginConfig) throws PrepareFailException {
                             "SeaTunnel does not supported this file format");
             }
         } else {
-            try {
-                rowType = readStrategy.getSeaTunnelRowTypeInfo(hadoopConf, filePaths.get(0));
-            } catch (FileConnectorException e) {
+            FileConnectorException fileConnectorException = null;

Review Comment:
   During data migration, empty or dirty files are sometimes generated because 
of network problems or the Hadoop system itself, and Hadoop does not clean them 
up on its own initiative.
   For example:
   
![image](https://github.com/apache/seatunnel/assets/28888024/3db4af69-b520-4284-b5d2-047f1056c112)
   
![image](https://github.com/apache/seatunnel/assets/28888024/b16ace4b-5844-464c-ba70-5484cd357b33)
   When a dirty or empty file is present, the HDFS file source fails to get the 
row type, because the original code only parses the first file, which may be an 
empty temporary file or a dirty file. However, these files have no impact on 
Hive, and Hive can still run queries with them present:
   
![image](https://github.com/apache/seatunnel/assets/28888024/c4afd817-fe2e-47f8-899f-87ac8b88c712)
   After the code change, these files are filtered out, the data is pulled 
successfully, and no data is lost:
   
![image](https://github.com/apache/seatunnel/assets/28888024/2713b019-8344-47b1-a74c-375fab1f59fe)
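
The idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: `RowTypeProbe`, `firstReadableSchema`, and the `reader` function are hypothetical names, with `reader` standing in for a call such as `readStrategy.getSeaTunnelRowTypeInfo(hadoopConf, path)`. Instead of parsing only the first file, it tries each candidate in turn, skips files whose schema cannot be read, and fails only if no file is usable.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical helper, not part of SeaTunnel.
final class RowTypeProbe {

    private RowTypeProbe() {}

    /**
     * Returns the schema of the first file that can be parsed.
     * `reader` is an assumed stand-in for the real schema lookup and is
     * expected to throw a RuntimeException for empty or dirty files.
     */
    static <T> T firstReadableSchema(List<String> filePaths, Function<String, T> reader) {
        RuntimeException lastError = null;
        for (String path : filePaths) {
            try {
                return reader.apply(path);
            } catch (RuntimeException e) {
                // Empty or dirty file: remember the error and try the next one.
                lastError = e;
            }
        }
        // No candidate could be parsed; surface the last failure as the cause.
        throw new IllegalStateException(
                "No file with a readable schema among " + filePaths.size() + " candidates",
                lastError);
    }
}
```

Keeping the last exception mirrors the `fileConnectorException` variable introduced in the diff: if every file is unreadable, the caller still sees a real cause instead of a silent fallback.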
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
