[
https://issues.apache.org/jira/browse/SPARK-7001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean Owen resolved SPARK-7001.
------------------------------
Resolution: Not A Problem
... though this may well BeAProblem in TextInputFormat itself in this extreme
corner case.
> Partitions for a long single line file
> --------------------------------------
>
> Key: SPARK-7001
> URL: https://issues.apache.org/jira/browse/SPARK-7001
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.2.0, 1.2.1, 1.3.0
> Reporter: Victor Bashurov
>
> Here is the issue from stackoverflow.com
> (http://stackoverflow.com/questions/29689175/spark-partitions-processing-the-file)
> I am using Spark 1.2.1 (local mode) to extract and process log information
> from a file. The file can be larger than 100 MB and contains a single very
> long line, so I use a regular expression to split this file into log data
> rows.
> MyApp.java
> JavaSparkContext sc = new JavaSparkContext(conf);
> JavaRDD<String> txtFileRdd = sc.textFile(filename);
> JavaRDD<MyLog> logRDD =
>     txtFileRdd.flatMap(LogParser::parseFromLogLine).cache();
> LogParser.java
> public static Iterable<MyLog> parseFromLogLine(String logline) {
>     List<MyLog> logs = new LinkedList<MyLog>();
>     Matcher m = PATTERN.matcher(logline);
>     while (m.find()) {
>         logs.add(new MyLog(m.group(0)));
>     }
>     System.out.println("Logs detected " + logs.size());
>     return logs;
> }
> The actual size of the processed file is about 100 MB, and it actually
> contains 323863 log items. When I use Spark to extract the log items from
> the file, logRDD.count() returns 455651, which is not correct.
> I think this happens because of file partitioning; checking the output I see
> the following:
> Logs detected 18694
> Logs detected 113104
> Logs detected 323863
> And the total sum is 455651! So I see that my partitions overlap each other,
> keeping duplicate items, and I need to prevent that behavior. The workaround
> is the following:
> txtFileRdd.repartition(1).flatMap(LogParser::parseFromLogLine).cache();
> This gives the desired result, 323863, but I don't think it's good for
> performance.
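The inflated count can be reproduced without Spark at all. A minimal plain-Java sketch (hypothetical pattern and data; SplitDuplication, countMatches, and the LOG\d+ regex are illustrative stand-ins, not the reporter's actual code): if two tasks both scan the same stretch of the single line, every match in that stretch is counted twice, so the flatMap total exceeds the true total.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplitDuplication {
    // Hypothetical log pattern, standing in for the reporter's PATTERN.
    static final Pattern PATTERN = Pattern.compile("LOG\\d+");

    // Same counting loop as parseFromLogLine, minus the MyLog objects.
    static int countMatches(String chunk) {
        Matcher m = PATTERN.matcher(chunk);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // One long "line": ten concatenated log entries, LOG0 .. LOG9.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10; i++) {
            sb.append("LOG").append(i);
        }
        String line = sb.toString();

        // Scanning the whole line exactly once gives the true total.
        int trueTotal = countMatches(line);                      // 10

        // If a second task re-reads a prefix of the same line, its matches
        // are duplicates and the summed count is inflated.
        int inflated = countMatches(line)
                     + countMatches(line.substring(0, 20));      // 10 + 5

        System.out.println("true=" + trueTotal + " inflated=" + inflated);
    }
}
```

Whether the overlapping reads originate in TextInputFormat's split handling is exactly what the resolution comment leaves open; what is certain is that any scheme in which two tasks scan overlapping byte ranges of the same record will double-count. Reading the file as a single record (the repartition(1) workaround quoted above, or sc.wholeTextFiles) avoids the overlap at the cost of parallelism.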
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]