[ 
https://issues.apache.org/jira/browse/SPARK-7001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-7001.
------------------------------
    Resolution: Not A Problem

... though this may well BeAProblem in TextInputFormat itself in this extreme 
corner case.

> Partitions for a long single line file
> --------------------------------------
>
>                 Key: SPARK-7001
>                 URL: https://issues.apache.org/jira/browse/SPARK-7001
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.2.0, 1.2.1, 1.3.0
>            Reporter: Victor Bashurov
>
> Here is the issue from stackoverflow.com 
> (http://stackoverflow.com/questions/29689175/spark-partitions-processing-the-file)
> I am using Spark 1.2.1 (local mode) to extract and process log information 
> from a file. The file can be larger than 100 MB and consists of a single 
> very long line, so I am using a regular expression to split it into log 
> data rows.
> MyApp.java
> JavaSparkContext sc = new JavaSparkContext(conf);
> JavaRDD<String> txtFileRdd = sc.textFile(filename);
> JavaRDD<MyLog> logRDD = txtFileRdd.flatMap(LogParser::parseFromLogLine).cache();
> LogParser.java
> public static Iterable<MyLog> parseFromLogLine(String logline) {
>     List<MyLog> logs = new LinkedList<MyLog>();
>     Matcher m = PATTERN.matcher(logline);
>     while (m.find()) {
>         logs.add(new MyLog(m.group(0)));
>     }
>     System.out.println("Logs detected " + logs.size());
>     return logs;
> }
> The actual size of the processed file is about 100 MB, and it contains 
> 323863 log items. When I use Spark to extract the log items from the file I 
> get 455651 [logRDD.count()], which is not correct.
> I think this happens because of file partitioning; checking the output I 
> see the following:
> Logs detected 18694
> Logs detected 113104
> Logs detected 323863
> And the total sum is 455651! So I see that my partitions overlap each 
> other, producing duplicate items, and I need to prevent that behavior. The 
> workaround is the following:
> txtFileRdd.repartition(1).flatMap(LogParser::parseFromLogLine).cache();
> It gives me the desired result of 323863, but I don't think it is good for 
> performance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
