xor007 edited a comment on issue #14918: [SPARK-17360][PYSPARK] Support generator in createDataFrame
URL: https://github.com/apache/spark/pull/14918#issuecomment-487821340

> Do we have any usecases or benchmarks for cases where this would be helpful?

Yes. My main use case, which I am surprised more people in industry don't seem to have, is **massive data mining**:

- You have a lot of files on the internet (for instance, text from a large collection of webpages).
- You are able to write a Python generator that goes through the files to find and output sentences containing the word "covfefe". I have seen a Python generator go through 90G of such a real collection of 11000 files within minutes (they were downloaded).
- You want to create a DataFrame of all those sentences, and the actual collection of those sentences ends up being less than 20 MB.

If only you could create a Dataset from the generator.

Now that I have written this, it seems I could run `flatMap` on the list of files, with what the generator does as the transformation. But something like `DataFrame.from_generator` in Spark would be nice.
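
For what it's worth, the `flatMap` idea from the end of the comment might look roughly like the sketch below. This is only an illustration under assumptions, not what the PR implements: `matching_sentences` and `sentences_dataframe` are hypothetical names, the sentence splitting is a naive regex, and it assumes every file path is readable from the executors (for example on a shared filesystem).

```python
import re
from typing import Iterator

# Naive sentence boundary: whitespace that follows '.', '!' or '?'.
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")


def matching_sentences(path: str, word: str = "covfefe") -> Iterator[str]:
    """Generator: yield each sentence in the file at `path` containing `word`."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        text = handle.read()
    for sentence in SENTENCE_SPLIT.split(text):
        if word in sentence:
            yield sentence.strip()


def sentences_dataframe(spark, paths):
    """Hypothetical helper: distribute the file list, then let flatMap
    drain the generator on the executors.

    `spark` is an existing SparkSession; `paths` is a list of file paths
    that the executors are assumed to be able to open.
    """
    rdd = spark.sparkContext.parallelize(paths)
    # flatMap accepts any function returning an iterable, so a generator works.
    rows = rdd.flatMap(matching_sentences).map(lambda s: (s,))
    return spark.createDataFrame(rows, schema="sentence string")
```

Usage would be something like `sentences_dataframe(spark, list_of_paths)`. Only the small set of matching sentences ends up in the DataFrame, and the generator itself never has to be materialized on the driver, which is the main point of the use case.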
