Github user igozali commented on the issue:

    https://github.com/apache/spark/pull/16724
  
    My original use case for sorting the output files by timestamp with Spark 
was to feed them into another machine learning framework, such as TensorFlow 
or Theano, that might not readily handle very large data files. The benefit 
I was after was offloading the sorting to Spark: even if I ended up with 
large CSV files, I could potentially mmap them for use with the downstream 
frameworks (TF/Theano).
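    To sketch the downstream side of this use case: once the output CSV is 
sorted by timestamp, a consumer can memory-map the file and scan it 
sequentially without loading it all into memory. This is only an 
illustrative sketch using the Python standard library; the file name, 
schema, and values are made up, not taken from the PR.

    ```python
    import csv
    import mmap
    import os
    import tempfile

    # Hypothetical timestamp-sorted CSV, as Spark might have written it
    # (the part file name and columns are illustrative).
    rows = [("1000", "a"), ("1005", "b"), ("1010", "c")]
    path = os.path.join(tempfile.mkdtemp(), "part-00000.csv")
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)

    # Memory-map the file: the OS pages it in lazily, so even a very
    # large sorted file can be scanned without reading it all upfront.
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            first_line = mm.readline().decode().strip()

    print(first_line)  # earliest-timestamp record comes first
    ```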
    
    I thought this would be a relatively common use case, but from this 
discussion I'm getting the impression that it may not be a pattern that 
Spark supports or encourages?

