[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...

frreiss Mon, 15 Aug 2016 11:10:49 -0700

Github user frreiss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14151#discussion_r74805700
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/text/TextSuite.scala ---
    @@ -39,6 +39,11 @@ class TextSuite extends QueryTest with SharedSQLContext {
         verifyFrame(spark.read.text(testFile))
       }
     
    +  test("reading text file with wholetext option on") {
    --- End diff --
    
    As far as I'm aware, the most common use case for reading entire files is 
using a glob to read a directory or directory tree containing many files. 
For example, one might download the Enron corpus (see 
https://www.cs.cmu.edu/~./enron/), which comes packaged with one file per 
email message. With a large number of input files, it's important that 
the work of processing them be split among many cores. So the test for the 
`wholetext` option really should use multiple input files and verify that 
different files end up in different partitions of the resulting RDD or 
DataFrame.
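
    A minimal sketch of the kind of check described above, written as a 
standalone program rather than as a `TextSuite` case. It assumes the option is 
enabled via `.option("wholetext", "true")` on the text reader, as this pull 
request proposes. The object name, file names, file count, and the `local[4]` 
master are hypothetical choices made only for illustration.

```scala
import java.io.File
import java.nio.charset.StandardCharsets
import java.nio.file.Files

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.spark_partition_id

// Hypothetical standalone check; not part of the pull request.
object WholetextPartitionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]") // assumption: several local cores available
      .appName("wholetext-partition-sketch")
      .getOrCreate()

    try {
      // Write several small multi-line files into a temporary directory.
      val dir = Files.createTempDirectory("wholetext").toFile
      val numFiles = 8
      (0 until numFiles).foreach { i =>
        val content = s"first line of message $i\nsecond line of message $i\n"
        Files.write(new File(dir, s"msg-$i.txt").toPath,
          content.getBytes(StandardCharsets.UTF_8))
      }

      // Assumption: the option name and value follow this PR's proposal.
      val df = spark.read.option("wholetext", "true").text(dir.getAbsolutePath)

      // With wholetext on, each file should produce exactly one row.
      assert(df.count() == numFiles)

      // The rows should be spread over more than one partition.
      val partitionsUsed = df.select(spark_partition_id()).distinct().count()
      assert(partitionsUsed > 1)
    } finally {
      spark.stop()
    }
  }
}
```

    How files are packed into partitions depends on settings such as 
`spark.sql.files.maxPartitionBytes` and `spark.sql.files.openCostInBytes`, so 
the sketch only asserts that more than one partition is used, not that every 
file gets its own partition.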



