What's the format of the file header? Is it possible to filter them out by
prefix string matching or regex?
On Wed, Jul 30, 2014 at 1:39 PM, Fengyun RAO raofeng...@gmail.com wrote:
It will certainly cause bad performance, since it reads the whole content
of a large file into one value.
Of course we can filter them out. A typical file head is as below:
#Software: Microsoft Internet Information Services 7.5
#Version: 1.0
#Date: 2013-07-04 20:00:00
#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port
cs-username c-ip cs(User-Agent) sc-status sc-substatus
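Since every header line in the sample above starts with '#', the prefix-matching idea should work; a minimal sketch (the isHeader helper and the sample data are illustrative, not from the original logs):

```scala
// Header lines in IIS logs all begin with '#'; filter them out by prefix.
def isHeader(line: String): Boolean = line.startsWith("#")

// With Spark this would be: sc.textFile(path).filter(line => !isHeader(line))
// Illustrated here on a plain List:
val sample = List(
  "#Software: Microsoft Internet Information Services 7.5",
  "#Fields: date time s-ip",
  "2013-07-04 20:00:01 10.0.0.1"
)
val records = sample.filterNot(isHeader)
// records == List("2013-07-04 20:00:01 10.0.0.1")
```

Note that this discards the #Fields: line along with the rest of the head, so the schema would have to be captured separately if it is needed downstream.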
Hi, all
We are migrating from MapReduce to Spark, and have encountered a problem.
Our input files are IIS logs with a file head. It's easy to get the file head
if we process only one file, e.g.
val lines = sc.textFile("hdfs://*/u_ex14073011.log")
val head = lines.take(4)
Then we can write our map accordingly.
This is an interesting question. I’m curious to know as well how this
problem can be approached.
Is there a way, perhaps, to ensure that each input file matching the glob
expression gets mapped to exactly one partition? Then you could probably
get what you want using RDD.mapPartitions().
Nick
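Nick's idea could be sketched like this; the dropHead helper below strips the leading '#' lines from one partition's iterator, which is what you would pass to RDD.mapPartitions() under the assumption (not guaranteed by Spark in general) that each input file maps to exactly one partition:

```scala
// Drop the leading header lines ('#'-prefixed) from one partition's iterator.
// This only removes the file head if the head sits at the start of the
// partition, i.e. one file == one partition as discussed above.
def dropHead(it: Iterator[String]): Iterator[String] =
  it.dropWhile(_.startsWith("#"))

// With an RDD this would be: rdd.mapPartitions(dropHead)
val cleaned = dropHead(Iterator("#Version: 1.0", "#Fields: date", "2013-07-04 20:00:01")).toList
```

Unlike a global filter, this keeps any '#'-prefixed lines that appear mid-file, since dropWhile stops at the first non-header record.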
It will certainly cause bad performance, since it reads the whole content
of a large file into one value, instead of splitting it into partitions.
Typically one file is 1 GB. Suppose we have 3 large files: in this way,
there would be only 3 key-value pairs, and thus at most 3 tasks.
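A minimal illustration of why this caps parallelism, under the assumption that each file arrives as a single (path, content) pair (as whole-file reading would produce; the file names and contents below are made up):

```scala
// Each file arrives as one (path, content) pair, so the number of pairs --
// and hence the maximum number of tasks -- equals the number of files,
// regardless of how large each file is.
val files = Seq(
  "a.log" -> "#Fields: date\n2013-07-04 line1\n2013-07-04 line2",
  "b.log" -> "#Fields: date\n2013-07-05 line1",
  "c.log" -> "#Fields: date\n2013-07-06 line1"
)
val maxTasks = files.size // 3 files -> at most 3 tasks

// Stripping the head then has to happen inside each whole-file value:
val records = files.flatMap { case (_, content) =>
  content.split("\n").filterNot(_.startsWith("#"))
}
```

By contrast, line-based reading splits each 1 GB file into many partitions, so the task count is not tied to the file count.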
2014-07-30