Hello,

I have observed that if a DataFrame is saved with partitioning columns in 
Parquet, then a sort is performed in FileFormatWriter (see 
https://github.com/apache/spark/blob/v3.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L152)
 because DynamicPartitionDataWriter only supports having a single file open at 
a time (see 
https://github.com/apache/spark/blob/v3.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriter.scala#L170-L171).
I think it would be possible to avoid this sort (which is a major bottleneck 
for some of my scenarios) if DynamicPartitionDataWriter could keep multiple 
files open at the same time and write each row to its corresponding file.

Would that change be a welcome PR for the project or is there any major problem 
that I am not considering that would prevent removing this sort?

Thanks,
Ximo.




Some more detail about the problem, in case I didn't explain myself correctly: 
suppose we have a dataframe which we want to partition by column A:

Column A | Column B
---------+---------
4        | A
1        | B
2        | C

The current behavior will first sort the dataframe:

Column A | Column B
---------+---------
1        | B
2        | C
4        | A

The sort ensures that DynamicPartitionDataWriter only needs a single file open 
at a time, since all the data for a given partition is adjacent and can be 
iterated over sequentially. To process the first row, 
DynamicPartitionDataWriter opens a file at 
/columnA=1/part-r-00000-<uuid>.parquet and writes the data. When it processes 
the second row, it sees that the row belongs to a different partition, closes 
the first file, and opens a new file at 
/columnA=2/part-r-00000-<uuid>.parquet, and so on.
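To make the sort-then-write behavior concrete, here is a minimal Python sketch. It is only an illustration, not Spark code: the function name, the in-memory dict standing in for files, and the simplified paths (no uuid) are all assumptions of mine.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Row = Tuple[int, str]  # (Column A, Column B)


def write_sorted_single_file(rows: Iterable[Row]) -> Tuple[Dict[str, List[Row]], int]:
    """Hypothetical model: sort by the partition column, keep one file open."""
    files: Dict[str, List[Row]] = {}  # path -> rows written (stands in for real files)
    files_opened = 0
    current_path: Optional[str] = None  # the single file that is currently open

    for row in sorted(rows, key=lambda r: r[0]):
        path = f"/columnA={row[0]}/part-r-00000.parquet"
        if path != current_path:
            # Partition changed: the previous file is implicitly closed here
            # and a new one is opened, so at most one file is open at a time.
            current_path = path
            files[path] = []
            files_opened += 1
        files[path].append(row)

    return files, files_opened


files, files_opened = write_sorted_single_file([(4, "A"), (1, "B"), (2, "C")])
```

Because the rows are sorted first, each partition path is opened exactly once, at the cost of sorting the whole input.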

My proposed change would involve changing DynamicPartitionDataWriter to keep 
one file open per partition encountered, and to close them all once all the 
data has been processed.
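
A rough Python sketch of the proposed behavior, to show what I mean. Again this is not Spark code: the function name, the dict standing in for open files, and the simplified paths (no uuid) are assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Tuple

Row = Tuple[int, str]  # (Column A, Column B)


def write_unsorted_many_files(rows: Iterable[Row]) -> Tuple[Dict[str, List[Row]], int]:
    """Hypothetical model: no sort, one open file per partition value seen."""
    open_files: Dict[str, List[Row]] = {}  # path -> rows written so far
    max_open = 0

    for row in rows:  # rows are consumed in their original order
        path = f"/columnA={row[0]}/part-r-00000.parquet"
        if path not in open_files:
            # First row for this partition: open a new file and keep it open.
            open_files[path] = []
        open_files[path].append(row)
        max_open = max(max_open, len(open_files))

    # Only at this point, after all rows are processed, would every file be closed.
    return open_files, max_open


files, max_open = write_unsorted_many_files([(4, "A"), (1, "B"), (2, "C")])
```

The sketch also shows the cost of the change: the number of files open at the same time grows with the number of distinct partition values the task sees, instead of staying at one.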
