Github user igozali commented on the issue:
https://github.com/apache/spark/pull/16724
My original use case was to sort the output files by timestamp with Spark so that they could be consumed by another machine learning framework, such as TensorFlow or Theano, which might not readily handle very large data files. The benefit I was after was offloading the sorting to Spark: even if I ended up with large CSV files, I could potentially mmap them for use with the downstream frameworks (TF/Theano).
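To make the consumption side concrete, here is a minimal, hedged sketch using only the Python standard library; the file name and contents are hypothetical stand-ins for a timestamp-sorted CSV part file that Spark might produce:

```python
import csv
import mmap
import os
import tempfile

# Hypothetical stand-in for a Spark output part file, already sorted
# by the timestamp in the first column (assumption for illustration).
path = os.path.join(tempfile.mkdtemp(), "part-00000.csv")
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows([["1000", "a"], ["1001", "b"], ["1002", "c"]])

# Memory-map the file so a downstream framework can scan rows lazily
# without loading the whole file into memory.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    first_line = mm.readline().decode().strip()
    mm.close()

print(first_line)
```

Because the file is sorted on disk, the first mapped line is the earliest-timestamp row, which is exactly the property being discussed here.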
I thought this would be a relatively common use case, but from the impressions I'm getting in this discussion, I wonder whether this is a paradigm that Spark supports or encourages at all?