[ https://issues.apache.org/jira/browse/SPARK-22225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16198326#comment-16198326 ]
sam commented on SPARK-22225:
-----------------------------

Thanks [~srowen] and [~hyukjin.kwon], I wasn't aware of either of these approaches, and indeed they suffice for the community's needs. This ticket will make them easy for others to find by googling in future.

> wholeTextFilesIterators
> -----------------------
>
>                 Key: SPARK-22225
>                 URL: https://issues.apache.org/jira/browse/SPARK-22225
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>    Affects Versions: 2.2.0
>            Reporter: sam
>
> It is a very common use case to want to preserve a path -> file mapping in an
> RDD, or to read an entire file in one go, especially for unstructured data and
> ETL.
> Currently wholeTextFiles is the go-to method for this, but it reads the entire
> file into memory, which is sometimes an issue (see SPARK-18965). It also
> precludes the option to lazily process files.
> It would be nice to have a method with the following signature:
> {code}
> def wholeTextFilesIterators(
>     path: String,
>     minPartitions: Int = defaultMinPartitions,
>     delimiter: String = "\n"): RDD[(String, Iterator[String])]
> {code}
> Here each `Iterator[String]` is a lazy iterator over one file, yielding the
> strings separated by the `delimiter` argument.
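
For anyone landing here from a search, the lazy, delimiter-separated iterator described in the ticket can be sketched in plain Scala. This is only an illustration under stated assumptions: `LazyDelimited` and `delimitedIterator` are hypothetical names, not Spark APIs, and the sketch uses `java.util.Scanner`, which reads its input through a buffer instead of loading the whole stream at once.

```scala
import java.io.Reader
import java.util.Scanner
import java.util.regex.Pattern

// Hypothetical helper, not part of Spark.
object LazyDelimited {
  // Returns an iterator that pulls tokens from `reader` on demand.
  // The delimiter is treated as a literal string, not as a regex.
  def delimitedIterator(reader: Reader, delimiter: String = "\n"): Iterator[String] = {
    val scanner = new Scanner(reader).useDelimiter(Pattern.quote(delimiter))
    new Iterator[String] {
      def hasNext: Boolean = scanner.hasNext
      def next(): String = scanner.next()
    }
  }
}
```

In Spark, an iterator like this could presumably be built over the stream returned by `PortableDataStream.open()` for each entry of `sc.binaryFiles(path)`. The comment above does not say which approaches were suggested, so that pairing is an assumption and not a record of what was recommended.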