[
https://issues.apache.org/jira/browse/SPARK-7001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502335#comment-14502335
]
Victor Bashurov commented on SPARK-7001:
----------------------------------------
Thanks for your answer, please see my comments below:
>>> this is not a reasonable usage of the textFile method. You are creating an
>>> RDD with one object in it
sc.textFile(filename).flatMap(LogParser::parseFromLogLine).cache()
This line of code creates an RDD with more than one object in it (provided
parseFromLogLine returns a sequence with more than one value), doesn't it?
Having multiple values in the RDD, I'd like to iterate through all of them
searching for specific scenarios (the Spark SQL context is a nice fit there).
I'm just curious: what am I supposed to use to implement my feature if the
textFile method is not reasonable here?
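If a plain textFile read is the wrong tool for a single-line file, one
alternative I can think of (a sketch on my side, assuming the ~100 MB file
fits comfortably in memory as one String) is wholeTextFiles, which reads each
file as a single (path, content) record, so the line is never split across
partitions:

    // Sketch only: wholeTextFiles yields one (path, content) pair per file,
    // so the whole line reaches parseFromLogLine exactly once.
    JavaSparkContext sc = new JavaSparkContext(conf);
    JavaRDD<MyLog> logRDD = sc.wholeTextFiles(filename)
            .values()                             // keep the content, drop the path
            .flatMap(LogParser::parseFromLogLine) // same parser as in MyApp.java
            .cache();

The obvious downside is that the entire file content becomes a single
in-memory String, so this only holds while the files stay around their
current size.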
>>> You aren't actually pointing this at a directory of files and maybe picking
>>> up others?
filename doesn't point at a directory but at a single file
> Partitions for a long single line file
> --------------------------------------
>
> Key: SPARK-7001
> URL: https://issues.apache.org/jira/browse/SPARK-7001
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.2.0, 1.2.1, 1.3.0
> Reporter: Victor Bashurov
>
> Here is the issue from stackoverflow.com
> (http://stackoverflow.com/questions/29689175/spark-partitions-processing-the-file).
> I am using Spark 1.2.1 (local mode) to extract and process log information
> from a file. The size of the file could be more than 100 MB. The file
> contains a very long single line, so I'm using a regular expression to split
> this file into log data rows.
> MyApp.java
>     JavaSparkContext sc = new JavaSparkContext(conf);
>     JavaRDD<String> txtFileRdd = sc.textFile(filename);
>     JavaRDD<MyLog> logRDD = txtFileRdd.flatMap(LogParser::parseFromLogLine).cache();
> LogParser.java
>     import java.util.LinkedList;
>     import java.util.List;
>     import java.util.regex.Matcher;
>     // PATTERN is the log-entry regex; its definition is not shown here.
>     public static Iterable<MyLog> parseFromLogLine(String logline) {
>         List<MyLog> logs = new LinkedList<MyLog>();
>         Matcher m = PATTERN.matcher(logline);
>         while (m.find()) {
>             logs.add(new MyLog(m.group(0)));
>         }
>         System.out.println("Logs detected " + logs.size());
>         return logs;
>     }
> The actual size of the processed file is about 100 MB, and it contains
> 323863 log items. When I use Spark to extract my log items from the file, I
> get 455651 [logRDD.count()] log items, which is not correct.
> I think it happens because of file partitioning; checking the output I see
> the following:
> Logs detected 18694
> Logs detected 113104
> Logs detected 323863
> And the total sum is 455651! So I see that my partitions overlap each other,
> keeping duplicate items, and I need to prevent that behavior. The workaround
> is the following:
>     txtFileRdd.repartition(1).flatMap(LogParser::parseFromLogLine).cache();
> It gives me the desired result of 323863, but I don't think it's good for
> performance.
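To check where the duplicates actually come from, here is a diagnostic sketch
from my side (mapPartitionsWithIndex is in the Spark 1.x Java API; the
variable names are mine) that reports how many characters each partition of
txtFileRdd really receives:

    // Needs java.util.Collections and java.util.List.
    // If the per-partition character counts sum to more than the file size,
    // the input splits themselves overlap and the duplicates arise at read
    // time, not in flatMap.
    List<String> perPartition = txtFileRdd
        .mapPartitionsWithIndex((idx, it) -> {
            long chars = 0;
            while (it.hasNext()) {
                chars += it.next().length();
            }
            return Collections.singletonList(
                "partition " + idx + ": " + chars + " chars").iterator();
        }, true)
        .collect();
    perPartition.forEach(System.out::println);

If the counts do sum to more than the file size, that would point at how the
underlying text input format handles a file with no newlines, rather than at
the flatMap itself.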