gaborgsomogyi commented on issue #23782: [SPARK-26875][SQL] Add an option on FileStreamSource to include modified files
URL: https://github.com/apache/spark/pull/23782#issuecomment-463606075
 
 
   The question is: why does the producer generate the same file again at all?
   
   From a data source perspective I see mainly 2 patterns actually implemented
(the first is sketched after this list):
   * Atomic move into a directory (several engines do this, but Spark does it
differently because, for example, S3 implements a move as a copy)
   * Write the file in a non-atomic way, then update a metadata file with the
name of the fully written file. Here the available files come from the
metadata and everything else is considered junk.
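   
   A minimal sketch of the first pattern using the Hadoop FileSystem API; the
`AtomicPublish`/`publish` names are illustrative, not from the PR. It assumes a
filesystem with atomic rename (e.g. HDFS), which is exactly what S3 lacks:
   
   ```scala
   import java.nio.charset.StandardCharsets
   import org.apache.hadoop.conf.Configuration
   import org.apache.hadoop.fs.{FileSystem, Path}
   
   object AtomicPublish {
     // Write the payload to a hidden temp path first, then rename it into the
     // directory the stream is watching. On HDFS rename is atomic, so a reader
     // never observes a half-written file; on S3 a "rename" is a copy + delete,
     // which is why Spark has to treat S3 differently.
     def publish(targetDir: Path, name: String, payload: String): Unit = {
       val fs = targetDir.getFileSystem(new Configuration())
       val tmp = new Path(targetDir, s".$name.tmp")
       val out = fs.create(tmp, /* overwrite = */ true)
       try out.write(payload.getBytes(StandardCharsets.UTF_8))
       finally out.close()
       if (!fs.rename(tmp, new Path(targetDir, name))) {
         throw new java.io.IOException(s"rename of $tmp failed")
       }
     }
   }
   ```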
   
   +1 @HeartSaVioR and I'm worried about this patch as well.
   * Take any filesystem, append to a file 10k times, and then close it. Is it
guaranteed that the timestamp is updated only after the last append, and that
no internal OS flush touches it in between? Without that guarantee the SQL
engine can throw a random exception because only half of a row has been
written out (a small probe sketching this follows the list).
   * Take S3 as another example. Even with S3Guard, when the file is modified
the metadata shows that the file is there, but because overwrites are only
eventually consistent the file content a reader sees can be
     * The original one
     * The new one
     * An empty file
   This change may make such behaviour more likely.
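   
   A tiny local-filesystem probe of the first concern (illustrative only, not
part of the PR): the OS can bump the modification time on every flush, so a
reader that keys on mtime may pick the file up while it is still half-written.
   
   ```scala
   import java.io.{File, FileOutputStream}
   
   object MtimeProbe {
     def main(args: Array[String]): Unit = {
       val f = new File("probe.txt")
       val out = new FileOutputStream(f)
       try {
         for (i <- 1 to 5) {
           out.write(s"half-written row $i\n".getBytes("UTF-8"))
           out.flush()                                    // a flush alone...
           println(s"flush $i: mtime=${f.lastModified()}") // ...can bump mtime
           Thread.sleep(1100)                             // mtime granularity is often 1s
         }
       } finally out.close()
       // A reader treating mtime as "this file is complete" could have read
       // the file after any of the intermediate flushes above.
     }
   }
   ```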
   
   All in all, with my current understanding, I would change the producer instead.
   
