Github user HeartSaVioR commented on a diff in the pull request:
https://github.com/apache/spark/pull/22952#discussion_r231717554
--- Diff: docs/structured-streaming-programming-guide.md ---
@@ -530,6 +530,8 @@ Here are the details of all the sources in Spark.
"s3://a/dataset.txt"<br/>
"s3n://a/b/dataset.txt"<br/>
"s3a://a/b/c/dataset.txt"<br/>
+ <br/>
+ <code>renameCompletedFiles</code>: whether to rename files completed
in the previous batch (default: false). If the option is enabled, input files
will be renamed with the additional suffix "_COMPLETED_". This is useful for cleaning
up old input files to save storage space.
--- End diff ---
Totally agreed, and that matches option 3 as I've proposed it. Also, option 1
would not affect the critical path of a batch much, since rename operations would
be enqueued and a background thread would take care of them.
For option 1, providing a guarantee is what complicates things. If we are OK
with NOT guaranteeing that all processed files are renamed, we can do the renaming
in the background (as in option 1) without handling backpressure, and simply drop
requests from the queue (with logging) if the queue size exceeds a threshold or the JVM
is shutting down.
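A minimal sketch of the best-effort background rename described above, assuming a bounded in-memory queue whose `submit` drops (and logs) when full. All names here (`RenameWorker`, `COMPLETED_SUFFIX`, the capacity) are illustrative, not from the PR; a real Spark implementation would go through the Hadoop `FileSystem` API rather than `java.nio.file`:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative best-effort renamer: a bounded queue drained by one background
// thread. No backpressure; overflow requests are dropped with a log line.
class RenameWorker implements Runnable {
    static final String COMPLETED_SUFFIX = "_COMPLETED_";
    private final BlockingQueue<Path> queue;

    RenameWorker(int capacity) {
        this.queue = new LinkedBlockingQueue<>(capacity);
    }

    /** Enqueue a rename request; returns false (after logging) if dropped. */
    boolean submit(Path file) {
        boolean accepted = queue.offer(file);  // non-blocking, so no backpressure
        if (!accepted) {
            System.err.println("Rename queue full; dropping " + file);
        }
        return accepted;
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            Path src;
            try {
                src = queue.take();
            } catch (InterruptedException e) {
                // JVM shutting down or worker stopped: remaining requests are dropped
                Thread.currentThread().interrupt();
                return;
            }
            try {
                Path dst = src.resolveSibling(src.getFileName() + COMPLETED_SUFFIX);
                Files.move(src, dst);
            } catch (IOException e) {
                // Best effort: log and keep draining the queue
                System.err.println("Rename failed for " + src + ": " + e);
            }
        }
    }
}
```

Because `submit` uses `offer` rather than `put`, the caller on the batch's critical path never blocks; the trade-off is exactly the relaxed guarantee discussed above, that some processed files may never be renamed.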
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]