[GitHub] [spark] xor007 edited a comment on issue #14918: [SPARK-17360][PYSPARK] Support generator in createDataFrame

GitBox Mon, 29 Apr 2019 23:24:01 -0700

xor007 edited a comment on issue #14918: [SPARK-17360][PYSPARK] Support 
generator in createDataFrame
URL: https://github.com/apache/spark/pull/14918#issuecomment-487821340
 
 
   > Do we have any usecases or benchmarks for cases where this would be 
helpful?
   
   Yes. My main use case, which I am surprised more people in industry don't 
seem to share, is **massive data mining**:
   
   - You have a lot of files from the internet (for instance, text from a large 
collection of web pages).
   - You are able to write a Python generator that goes through the files to 
find and output sentences containing the word "covfefe". I have seen a Python 
generator go through 90 GB of such a real collection of 11,000 files within 
minutes (they were already downloaded).
   - You want to create a DataFrame of all those sentences, and the actual 
collection of those sentences ends up being less than 20 MB.
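
For concreteness, the kind of generator described in the list above might look roughly like the sketch below. The function name `sentences_with_word`, the naive regex-based sentence splitting, and reading each file fully into memory are all assumptions made for illustration, not the code that was actually run on the 90 GB collection.

```python
import re
from pathlib import Path
from typing import Iterable, Iterator

# Naive sentence splitter (an assumption): break after ., ! or ? followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_with_word(paths: Iterable[str], word: str) -> Iterator[str]:
    """Lazily yield every sentence in the given files that contains `word`."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    for path in paths:
        # One file is held in memory at a time; the matches are streamed out.
        text = Path(path).read_text(encoding="utf-8", errors="ignore")
        for sentence in _SENTENCE_END.split(text):
            if pattern.search(sentence):
                yield sentence.strip()
```

Because the generator yields matches one at a time, only the small set of matching sentences ever needs to be collected, not the full input.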
   
   If only you could create a Dataset from the generator. 
   
   Now that I have written this, it seems I can run `flatMap` on the list of 
files, with what the generator does as the transformation.
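
A rough sketch of that `flatMap` idea follows. It assumes an existing `SparkSession` passed in as `spark`, and that every file path is readable from the executors (for example on a shared filesystem). The helper names `matching_sentences` and `sentences_dataframe`, the `num_slices` default, and the one-column schema are hypothetical choices for illustration only.

```python
import re
from typing import Iterator, List


def matching_sentences(path: str, word: str = "covfefe") -> Iterator[str]:
    """Per-file generator; flatMap accepts any iterable, including a generator."""
    with open(path, encoding="utf-8", errors="ignore") as handle:
        # Naive sentence split (an assumption): break after ., ! or ? plus whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", handle.read()):
            if word.lower() in sentence.lower():
                yield sentence.strip()


def sentences_dataframe(spark, paths: List[str], num_slices: int = 64):
    """Distribute the file list and build a one-column DataFrame of matches."""
    # Imported lazily so matching_sentences can be used without Spark installed.
    from pyspark.sql.types import StringType, StructField, StructType

    schema = StructType([StructField("sentence", StringType(), nullable=False)])
    rdd = spark.sparkContext.parallelize(paths, num_slices)
    # Each executor opens its share of the files and emits only matching sentences.
    rows = rdd.flatMap(matching_sentences).map(lambda s: (s,))
    return spark.createDataFrame(rows, schema)
```

The explicit schema avoids relying on schema inference, which would have nothing to infer from if no sentence matched.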
   
   But something like `DataFrame.from_generator` in Spark would be nice.


