darabos commented on PR #22673:
URL: https://github.com/apache/spark/pull/22673#issuecomment-2247876095

   > I know this is an old issue, but does anyone know if this has changed in 
more recent versions of Spark? Is sorted data read in by Spark in the 
same order? @darabos or @dgrnbrg do either of you know?
   
   Sorry, I haven't been keeping up. But if this is still an issue, one thing 
you could try is a custom datasource. I wrote one for a different issue: Spark 
automatically revises the number of partitions when it loads a Parquet file. 
This was a problem for us, because we hash-partitioned the data before saving, 
so any change in the number of partitions messed that up. This code should be 
portable and it's available under the AGPL and Apache 2 licenses: 
https://github.com/lynxkite/lynxkite/blob/5.4.1/app/com/lynxanalytics/biggraph/partitioned_parquet/PartitionedParquet.scala
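
   To illustrate why a changed partition count is a problem for hash-partitioned 
data, here is a minimal sketch in plain Python. It is not Spark code and not 
taken from the datasource linked above: the function names are made up for 
illustration, and `resplit` is only a crude stand-in for a reader that ignores 
the saved layout, not a description of what Spark actually does.

```python
# Illustration only: plain Python, no Spark. All names here are hypothetical.

def hash_partition(keys, num_partitions):
    """Place each integer key in partition `key % num_partitions`."""
    partitions = [[] for _ in range(num_partitions)]
    for key in keys:
        partitions[key % num_partitions].append(key)
    return partitions


def resplit(partitions, new_num_partitions):
    """Mimic a reader that re-chunks the rows evenly by position,
    ignoring which partition each row was saved in (an assumption)."""
    rows = [row for part in partitions for row in part]
    size = -(-len(rows) // new_num_partitions)  # ceiling division
    return [rows[i * size:(i + 1) * size] for i in range(new_num_partitions)]


def is_hash_partitioned(partitions):
    """True if every key sits in the partition its hash says it should."""
    n = len(partitions)
    return all(key % n == i for i, part in enumerate(partitions) for key in part)


saved = hash_partition(range(12), 4)   # layout at save time
loaded = resplit(saved, 3)             # layout after the count was revised

print(is_hash_partitioned(saved))
print(is_hash_partitioned(loaded))
```

   All rows survive the reload, but they no longer sit where the hash function 
says they should, so anything that relies on that placement breaks.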
   
   This is basically the same as Spark's Parquet datasource, just with all the 
intelligent bits stripped out. 😄 I think it's worth a try — maybe it keeps the 
order too. The last Spark version I tested it with is Spark 3.3.2.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

