[ https://issues.apache.org/jira/browse/SPARK-17633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15513125#comment-15513125 ]
Sean Owen commented on SPARK-17633:
-----------------------------------

Yeah, I can reproduce that. It is odd behavior, but it stems from the fact that the Spark/HDFS APIs generally assume the input file does not change. The file is re-read in this scenario, so the count does reflect changes: adding a line increases the count, and removing it brings the count back to 2. However, adding more than one line still increases the count by only 1. Creating a new RDD and counting again works as expected. Removing lines after counting causes an exception, because the reader finds the file is shorter than it expected. I think this falls into the category of unsupported usage, but it is surprising to a user that it sort of works, just not quite.

> textFile() and wholeTextFiles() count difference
> ------------------------------------------------
>
>                 Key: SPARK-17633
>                 URL: https://issues.apache.org/jira/browse/SPARK-17633
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 1.6.2
>       Environment: Unix/Linux
>             Reporter: Anshul
>
> sc.textFile() creates an RDD of strings from a text file.
> When count() is first performed, the line count is correct, but if more
> than one line is then appended to the file manually, counting the same RDD
> increments the result by only 1.
> With sc.wholeTextFiles(), however, the result is correct.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
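The core issue above is that an RDD created by sc.textFile() records metadata about the file (its Hadoop input splits) once, at creation time, and later actions re-read the file against that stale metadata. The following is a minimal, self-contained Python simulation of that idea, not Spark itself: the hypothetical `FileLineRDD` class stands in for sc.textFile(), capturing the file's length at construction the way a HadoopRDD fixes its split boundaries. It does not model the "+1" quirk, only the general pattern that re-counting a mutated file is unsupported and that shrinking the file below the recorded length fails, as Sean describes.

```python
# Toy model (NOT Spark) of split metadata captured at RDD creation time.
import os
import tempfile

class FileLineRDD:
    """Hypothetical stand-in for sc.textFile(): remembers the file
    length at creation, like a HadoopRDD fixes its split boundaries."""

    def __init__(self, path):
        self.path = path
        self.split_length = os.path.getsize(path)  # fixed at creation

    def count(self):
        # Each action re-reads the file, but against the stale metadata.
        if os.path.getsize(self.path) < self.split_length:
            # Mirrors the reader failing when the file is now shorter
            # than the recorded split.
            raise IOError("file is shorter than the recorded split")
        with open(self.path, "rb") as f:
            return f.read().count(b"\n")

path = tempfile.mkstemp()[1]
with open(path, "wb") as f:
    f.write(b"line1\nline2\n")

rdd = FileLineRDD(path)
print(rdd.count())  # 2

# Appending after creation is unsupported usage; what a later count
# sees depends on how the reader interacts with the stale split.
with open(path, "ab") as f:
    f.write(b"line3\n")
print(rdd.count())

# Truncating below the recorded split length raises, as in the report.
with open(path, "wb") as f:
    f.write(b"x\n")
try:
    rdd.count()
except IOError as e:
    print("error:", e)
os.remove(path)
```

Creating a fresh FileLineRDD after the file changes recaptures the length, which is why making a new RDD and counting again works as expected.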