Re: [PR] [SPARK-20144] Allow reading files in order with spark.sql.files.allowReordering=false [spark]

via GitHub Wed, 24 Jul 2024 06:06:16 -0700


darabos commented on PR #22673:
URL: https://github.com/apache/spark/pull/22673#issuecomment-2247876095

> I know this is an old issue, but does anyone know if this has changed in
> more recent versions of Spark? Is sorted data read in by Spark in the
> same order? @darabos or @dgrnbrg do either of you know?

Sorry, I haven't been keeping up. But if this is still an issue, one thing
you could try is a custom datasource. I wrote one for a different issue: Spark
automatically changes the number of partitions when it loads a Parquet file.
This was a problem for us because we hash-partitioned the data before saving,
so any change in the number of partitions broke that layout. The code should be
portable, and it's available under the AGPL and Apache 2 licenses:
https://github.com/lynxkite/lynxkite/blob/5.4.1/app/com/lynxanalytics/biggraph/partitioned_parquet/PartitionedParquet.scala
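
To illustrate why a changed partition count matters for hash-partitioned
data, here is a minimal sketch in plain Python. It is not Spark code and does
not use the datasource linked above. The helper names (`partition_of`,
`hash_partition`, `regroup`, `layout_is_valid`) are hypothetical, `zlib.crc32`
is only a stand-in for whatever hash the real partitioner uses, and `regroup`
is an assumed, simplified model of a reader that re-chunks rows without
regard to the saved layout.

```python
import zlib


def partition_of(key: str, num_partitions: int) -> int:
    """Deterministic stand-in for a hash partitioner (not Spark's own hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def hash_partition(keys, num_partitions):
    """Place each key in the partition selected by its hash."""
    parts = [[] for _ in range(num_partitions)]
    for key in keys:
        parts[partition_of(key, num_partitions)].append(key)
    return parts


def regroup(parts, new_count):
    """Assumed model of a loader that ignores the saved layout and
    re-chunks the rows, in order, into new_count roughly equal partitions."""
    rows = [key for part in parts for key in part]
    size = -(-len(rows) // new_count)  # ceiling division
    return [rows[i * size:(i + 1) * size] for i in range(new_count)]


def layout_is_valid(parts):
    """True if every key sits in the partition its hash selects."""
    n = len(parts)
    return all(
        partition_of(key, n) == index
        for index, part in enumerate(parts)
        for key in part
    )


keys = [f"user-{i}" for i in range(100)]
saved = hash_partition(keys, 4)   # layout as written to disk
reloaded = regroup(saved, 3)      # layout after a loader changed the count

print("saved layout valid:   ", layout_is_valid(saved))
print("reloaded layout valid:", layout_is_valid(reloaded))
```

In this toy model the row order survives the reload, but the guarantee that
partition `i` holds exactly the keys hashing to `i` does not, so anything
that relies on co-partitioning would have to shuffle again.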

This is basically the same as Spark's Parquet datasource, just with all the
intelligent bits stripped out. 😄 I think it's worth a try — maybe it keeps the
order too. The last Spark version I tested it with is Spark 3.3.2.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

