Github user MaxGekk commented on a diff in the pull request:
https://github.com/apache/spark/pull/23052#discussion_r236659185
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala
---
@@ -169,13 +169,18 @@ private[csv] class CsvOutputWriter(
context: TaskAttemptContext,
params: CSVOptions) extends OutputWriter with Logging {
- private val charset = Charset.forName(params.charset)
+ private var univocityGenerator: Option[UnivocityGenerator] = None
--- End diff ---
> ... but that it could create many generators and writers that aren't closed.

Writers/generators are created inside tasks:
https://github.com/apache/spark/blob/ab1650d2938db4901b8c28df945d6a0691a19d31/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L228-L256
where `dataWriter.commit()` and `dataWriter.abort()` close the writers/generators.
So, at any moment the number of unclosed generators is at most the size of the
task thread pool on the executors.
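To illustrate the pattern the diff introduces, here is a minimal sketch of lazy generator initialization. `LazyWriter` and the `StringBuilder` stand-in for `UnivocityGenerator` are illustrative names, not the actual Spark classes: the generator is allocated on the first write, so a task that produces no rows never creates one, and `close()` releases it at most once from the commit/abort path.

```scala
// Sketch only: names are illustrative, not the real Spark CSV classes.
class LazyWriter {
  // Created lazily; a task that writes no rows never allocates a generator.
  private var generator: Option[StringBuilder] = None

  private def getOrCreate(): StringBuilder = generator.getOrElse {
    val g = new StringBuilder() // real code: new UnivocityGenerator(...)
    generator = Some(g)
    g
  }

  def write(row: String): Unit = getOrCreate().append(row).append('\n')

  // Visible for the example; lets callers observe whether init happened.
  def generatorCreated: Boolean = generator.isDefined

  // Invoked from the task's commit()/abort() path; a no-op when no row
  // was ever written, so nothing is left unclosed.
  def close(): Unit = {
    generator = None // real code: generator.foreach(_.close())
  }
}
```

Since each task owns its writer and the commit/abort path always calls `close()`, the unclosed-generator count stays bounded by the number of concurrently running tasks, which is the point made above.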
> Unless we know writes will only happen in one thread ...
According to the comments at the link below, this is our assumption:
https://github.com/apache/spark/blob/e8167768cfebfdb11acd8e0a06fe34ca43c14648/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriter.scala#L33-L37
---