[
https://issues.apache.org/jira/browse/SPARK-7001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502314#comment-14502314
]
Sean Owen commented on SPARK-7001:
----------------------------------
First, it's probably important to note that this is not a reasonable usage of
the textFile method. You are creating an RDD with a single element in it, so it
is not distributed at all, and the entire file has to fit in memory.
That said, the behavior is weird, although it's very likely not a Spark issue,
since the logic in question comes from Hadoop MapReduce.
(You aren't actually pointing this at a directory of files and picking up
others, are you? And you are seeing 455651 as the result of count() directly,
not just inferring it from your sum, right?)
CC [~tomwhite] since this might be an interesting question of what happens in
TextInputFormat in this situation. It looks like your one file spans 3 blocks.
TextInputFormat can't find any record boundary here. Normally it has no problem
stitching together a record/line that spans 2 blocks. Here it seems to end up
treating blocks 3, 2+3, and 1+2+3 as independent records, maybe because it sees
a record begin in each block but never terminate? Just guessing at anything
that would match the observations.
But I think this is probably at least not a Spark issue, and something you
would want to design differently anyway. For example, configure TextInputFormat
to split records on the same delimiter your regex is already keying on, along
the lines of the sketch below.
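A minimal, untested sketch of that idea: it assumes Java 8, the sc, filename,
LogParser and MyLog from the snippet in the description, a Hadoop version whose
LineRecordReader honors textinputformat.record.delimiter, and a hypothetical
separator "|LOG|" standing in for whatever actually delimits your log items.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.spark.api.java.JavaRDD;

Configuration hadoopConf = new Configuration();
// Make the (new-API) TextInputFormat treat the hypothetical "|LOG|" marker
// as the record delimiter instead of the default newline.
hadoopConf.set("textinputformat.record.delimiter", "|LOG|");

// Each record is now one delimited chunk of the long line, not an arbitrary
// byte range, so parsing can run per record and in parallel across blocks.
JavaRDD<String> records = sc
    .newAPIHadoopFile(filename, TextInputFormat.class,
                      LongWritable.class, Text.class, hadoopConf)
    .map(pair -> pair._2().toString());   // copy out of the reused Text object

JavaRDD<MyLog> logRDD = records.flatMap(LogParser::parseFromLogLine).cache();

That way split boundaries fall on record boundaries that the record reader
knows how to stitch across blocks, so no item should be lost or counted twice,
and there is no need to collapse everything into one partition.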
> Partitions for a long single line file
> --------------------------------------
>
> Key: SPARK-7001
> URL: https://issues.apache.org/jira/browse/SPARK-7001
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.2.0, 1.2.1, 1.3.0
> Reporter: Victor Bashurov
>
> Here is the issue from stackoverflow.com
> (http://stackoverflow.com/questions/29689175/spark-partitions-processing-the-file)
> I am using Spark 1.2.1 (local mode) to extract and process log information
> from a file. The size of the file can be more than 100 MB. The file contains
> a single very long line, so I'm using a regular expression to split it into
> log data rows.
> MyApp.java
> JavaSparkContext sc = new JavaSparkContext(conf);
> JavaRDD<String> txtFileRdd = sc.textFile(filename);
> JavaRDD<MyLog> logRDD = txtFileRdd.flatMap(LogParser::parseFromLogLine).cache();
> LogParser.java
> public static Iterable<MyLog> parseFromLogLine(String logline) {
>     List<MyLog> logs = new LinkedList<MyLog>();
>     Matcher m = PATTERN.matcher(logline);
>     while (m.find()) {
>         logs.add(new MyLog(m.group(0)));
>     }
>     System.out.println("Logs detected " + logs.size());
>     return logs;
> }
> The actual size of the processed file is about 100 MB and it contains 323863
> log items. When I use Spark to extract the log items from the file, I get
> 455651 [logRDD.count()] items, which is not correct.
> I think this happens because of file partitioning; checking the output I see
> the following:
> Logs detected 18694
> Logs detected 113104
> Logs detected 323863
> And the total sum is 455651! So I see that my partitions overlap each other,
> producing duplicate items, and I need to prevent that behavior. The
> workaround is the following:
> txtFileRdd.repartition(1).flatMap(LogParser::parseFromLogLine).cache();
> It gives me the desired result, 323863, but I don't think that is good
> for performance.