Github user MaxGekk commented on a diff in the pull request:
https://github.com/apache/spark/pull/23052#discussion_r236659185
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala
---
@@ -169,13 +169,18 @@ private[csv] class CsvOutputWriter(
context: TaskAttemptContext,
params: CSVOptions) extends OutputWriter with Logging {
- private val charset = Charset.forName(params.charset)
+ private var univocityGenerator: Option[UnivocityGenerator] = None
--- End diff ---
> ... but that it could create many generators and writers that aren't closed.

Writers/generators are created inside tasks:
https://github.com/apache/spark/blob/ab1650d2938db4901b8c28df945d6a0691a19d31/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L228-L256
where `dataWriter.commit()` and `dataWriter.abort()` close the writers/generators.
So, at any moment the number of unclosed generators is at most the size of the
task thread pool on the executors.
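To illustrate the pattern the diff introduces, here is a minimal sketch of lazy generator initialization. `LazyWriter` and the `StringBuilder` stand-in for `UnivocityGenerator` are illustrative names, not the actual Spark classes: the generator is allocated on the first write, so a task that produces no rows never creates one, and `close()` releases it at most once from the commit/abort path.

```scala
// Sketch only: names are illustrative, not the real Spark CSV classes.
class LazyWriter {
  // Created lazily; a task that writes no rows never allocates a generator.
  private var generator: Option[StringBuilder] = None

  private def getOrCreate(): StringBuilder = generator.getOrElse {
    val g = new StringBuilder() // real code: new UnivocityGenerator(...)
    generator = Some(g)
    g
  }

  def write(row: String): Unit = getOrCreate().append(row).append('\n')

  // Visible for the example; lets callers observe whether init happened.
  def generatorCreated: Boolean = generator.isDefined

  // Invoked from the task's commit()/abort() path; a no-op when no row
  // was ever written, so nothing is left unclosed.
  def close(): Unit = {
    generator = None // real code: generator.foreach(_.close())
  }
}
```

Since each task owns its writer and the commit/abort path always calls `close()`, the unclosed-generator count stays bounded by the number of concurrently running tasks, which is the point made above.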
> Unless we know writes will only happen in one thread ...
According to the comments at the link below, this is our assumption:
https://github.com/apache/spark/blob/e8167768cfebfdb11acd8e0a06fe34ca43c14648/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriter.scala#L33-L37
---