The issue is memory overhead. Writing files creates a lot of buffers (especially in columnar formats like Parquet/ORC). Even a few open file handles and buffers per task can easily OOM the entire process.
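As a rough illustration of how quickly this adds up, here is a back-of-the-envelope sketch. All numbers are hypothetical: the 128 MiB per-writer buffer is only an assumed worst case, on the order of one Parquet row group, and the function name is made up for this example. Real usage depends on the format and its configuration.

```python
MIB = 1024 * 1024


def estimated_write_buffer_bytes(open_writers_per_task, concurrent_tasks, buffer_bytes_per_writer):
    """Upper-bound estimate of write-buffer memory held in one executor process."""
    return open_writers_per_task * concurrent_tasks * buffer_bytes_per_writer


# Assumed values for illustration only, not measurements.
ASSUMED_BUFFER = 128 * MIB   # assumed worst-case buffer per open writer
ASSUMED_TASKS = 8            # assumed concurrent tasks in one executor

# Current behaviour: one open file per task.
single_writer = estimated_write_buffer_bytes(1, ASSUMED_TASKS, ASSUMED_BUFFER)

# One open file per partition value, assuming 50 distinct partitions per task.
multi_writer = estimated_write_buffer_bytes(50, ASSUMED_TASKS, ASSUMED_BUFFER)

print(single_writer // MIB, "MiB vs", multi_writer // MIB, "MiB")
```

Under these assumptions the buffer footprint grows linearly with the number of partitions each task touches, which is what the sort avoids.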
On Fri, Sep 04, 2020 at 5:51 AM, XIMO GUANTER GONZALBEZ <joaquin.guantergonzal...@telefonica.com> wrote:

> Hello,
>
> I have observed that if a DataFrame is saved with partitioning columns in
> Parquet, then a sort is performed in FileFormatWriter (see
> https://github.com/apache/spark/blob/v3.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L152
> ) because DynamicPartitionDataWriter only supports having a single file
> open at a time (see
> https://github.com/apache/spark/blob/v3.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriter.scala#L170-L171
> ). I think it would be possible to avoid this sort (which is a major
> bottleneck for some of my scenarios) if DynamicPartitionDataWriter could
> have multiple files open at the same time and write each piece of data
> to its corresponding file.
>
> Would that change be a welcome PR for the project, or is there any major
> problem that I am not considering that would prevent removing this sort?
>
> Thanks,
>
> Ximo.
>
> Some more detail about the problem, in case I didn't explain myself
> correctly: suppose we have a dataframe which we want to partition by
> column A:
>
>     Column A    Column B
>     4           A
>     1           B
>     2           C
>
> The current behavior will first sort the dataframe:
>
>     Column A    Column B
>     1           B
>     2           C
>     4           A
>
> This way DynamicPartitionDataWriter can have a single file open, since all
> the data for a single partition will be adjacent and can be iterated over
> sequentially. In order to process the first row,
> DynamicPartitionDataWriter will open a file in
> /columnA=1/part-r-00000-<uuid>.parquet and write the data. When processing
> the second row it will see that it belongs to a different partition, close
> the first file and open a new file in
> /columnA=2/part-r-00000-<uuid>.parquet, and so on.
>
> My proposed change would involve changing DynamicPartitionDataWriter to
> have as many open files as partitions, and close them all once all data
> has been processed.
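To make the two strategies in the quoted message concrete, here is a toy simulation using the three example rows. It is only a sketch: Python lists stand in for files, and the function names are hypothetical and are not Spark APIs. It shows the trade-off being discussed, namely a sort plus one open file versus no sort plus one open file per partition.

```python
from collections import defaultdict


def write_sorted_single_writer(rows):
    """Sort by partition key first, then keep at most one 'file' open at a time."""
    files = defaultdict(list)
    current_key = None
    open_now = 0
    max_open = 0
    for key, value in sorted(rows, key=lambda row: row[0]):
        if key != current_key:
            # Partition changed: close the previous file (if any), open the next.
            if current_key is not None:
                open_now -= 1
            open_now += 1
            current_key = key
            max_open = max(max_open, open_now)
        files[key].append(value)
    return dict(files), max_open


def write_unsorted_multi_writer(rows):
    """No sort: keep one 'file' open per partition until all rows are processed."""
    files = {}
    for key, value in rows:
        if key not in files:
            files[key] = []  # open a new file and leave it open
        files[key].append(value)
    # Every file stays open until the end, so the peak equals the partition count.
    return files, len(files)


# The example rows from the message: (Column A, Column B).
rows = [(4, "A"), (1, "B"), (2, "C")]

sorted_files, sorted_max_open = write_sorted_single_writer(rows)
multi_files, multi_max_open = write_unsorted_multi_writer(rows)

print(sorted_max_open, multi_max_open)
```

Both strategies produce the same per-partition contents. They differ only in whether the cost is paid as a sort or as simultaneously open writers and their buffers.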