darabos commented on PR #22673: URL: https://github.com/apache/spark/pull/22673#issuecomment-2247876095
> I know this is an old issue but does anyone know if this has changed in more recent versions of Spark? Is reading sorted data read in by spark in the same order? @darabos or @dgrnbrg do either of you know?

Sorry, I haven't been keeping up. But if this is still an issue, one thing you could try is a custom datasource. I wrote one for a different issue: Spark automatically revises the number of partitions when it loads a Parquet file. This was a problem for us because we hash-partitioned the data before saving, so any change in the number of partitions broke that partitioning.

This code should be portable, and it's available under the AGPL and Apache 2 licenses: https://github.com/lynxkite/lynxkite/blob/5.4.1/app/com/lynxanalytics/biggraph/partitioned_parquet/PartitionedParquet.scala

This is basically the same as Spark's Parquet datasource, just with all the intelligent bits stripped out. 😄 I think it's worth a try; maybe it keeps the order too. The last Spark version I tested it with is Spark 3.3.2.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
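
For readers wondering why a changed partition count matters for hash-partitioned data, here is a minimal sketch in plain Scala (no Spark dependency). It is not taken from the linked `PartitionedParquet.scala`; the object name `HashPartitionDemo`, the helper `partitionOf`, and the sample keys are hypothetical. It assumes the usual rule of taking the key's `hashCode` modulo the partition count and making the result non-negative, which is how Spark's `HashPartitioner` assigns non-null keys.

```scala
// Illustrative sketch only; names and sample keys are made up for this example.
object HashPartitionDemo {
  // Non-negative modulo of the key's hashCode (the rule assumed in the lead-in).
  def partitionOf(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  def main(args: Array[String]): Unit = {
    val keys = Seq(1, 5, 8, 13)
    for (k <- keys) {
      val saved = partitionOf(k, 4)    // partition the key was written to with 4 partitions
      val expected = partitionOf(k, 3) // partition a 3-way hash partitioner would look in
      println(s"key=$k saved=$saved expected-after-reload=$expected")
    }
  }
}
```

Whenever the two numbers differ for a key, code that relies on the saved layout (for example, a join that skips the shuffle) would look for that key in the wrong partition. That is why a loader that keeps the partition count fixed can be useful.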
