Victor Bashurov created SPARK-7001:
--------------------------------------

             Summary: Partitions for a long single line file
                 Key: SPARK-7001
                 URL: https://issues.apache.org/jira/browse/SPARK-7001
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 1.3.0, 1.2.1, 1.2.0
            Reporter: Victor Bashurov


Here is the issue from  stackoverflow.com 
(http://stackoverflow.com/questions/29689175/spark-partitions-processing-the-file)

I am using Spark 1.2.1 (local mode) to extract and process log information from 
file. The size of the file could be more than 100Mb. The file contains a very 
long single line so I'm using regular expression to split this file into log 
data rows.

MyApp.java
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> txtFileRdd = sc.textFile(filename);
JavaRDD<MyLog> logRDD = txtFileRdd.flatMap(LogParser::parseFromLogLine).cache();

LogParser.java
public static Iterable<MyLog> parseFromLogLine(String logline) {
        List<MyLog> logs = new LinkedList<MyLog>();
        Matcher m = PATTERN.matcher(logline);
        while (m.find()) {          
            logs.add(new MyLog(m.group(0)));            
        }   
        System.out.println("Logs detected " + logs.size());
        return logs;
}

Actual size of the processed file is about 100 Mb and it actually contains 
323863 log items. When I use Spark to extract my log items from file I get 
455651 [logRDD.count()] log items which is not correct.
I think it happens because of file partitions, checking the output I see the 
following:
Logs detected 18694
Logs detected 113104
Logs detected 323863
And the total sum is 455651! So I see that my partitions are merged with each 
other keeping duplicate items and I need to prevent that behavior. The 
workaround is the following method:

txtFileRdd.repartition(1).flatMap(LogParser::parseFromLogLine).cache();

And it'll give me the desired result 323863, but I don't think that it's good 
for performance




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to