Re: Is it possible to read file head in each partition?

2014-07-30 Thread Cheng Lian
What's the format of the file header? Is it possible to filter them out by prefix string matching or regex? On Wed, Jul 30, 2014 at 1:39 PM, Fengyun RAO raofeng...@gmail.com wrote: It will certainly cause bad performance, since it reads the whole content of a large file into one value,

Re: Is it possible to read file head in each partition?

2014-07-30 Thread Fengyun RAO
Of course we can filter them out. A typical file head is as below:

#Software: Microsoft Internet Information Services 7.5
#Version: 1.0
#Date: 2013-07-04 20:00:00
#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) sc-status sc-substatus
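Since every header line begins with `#`, the prefix-matching approach Cheng Lian suggests could be a one-line filter. A minimal sketch, assuming `lines` is the RDD of raw log lines:

```scala
// Drop IIS header lines; they all start with '#'
val records = lines.filter(line => !line.startsWith("#"))
```

Note that this throws away the #Fields line along with the rest of the head, and that line carries the column names needed to parse each record — which is presumably why the head is wanted in every partition rather than simply discarded.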

Is it possible to read file head in each partition?

2014-07-29 Thread Fengyun RAO
Hi, all

We are migrating from MapReduce to Spark, and have encountered a problem. Our input files are IIS logs with a file head. It's easy to get the file head if we process only one file, e.g.

val lines = sc.textFile("hdfs://*/u_ex14073011.log")
val head = lines.take(4)

Then we can write our map

Re: Is it possible to read file head in each partition?

2014-07-29 Thread Nicholas Chammas
This is an interesting question. I’m curious to know as well how this problem can be approached. Is there a way, perhaps, to ensure that each input file matching the glob expression gets mapped to exactly one partition? Then you could probably get what you want using RDD.mapPartitions(). Nick
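Under that assumption (one file per partition), the head could be peeled off inside mapPartitions. This is only a sketch, since the assumption is exactly what is in question — sc.textFile splits large files across partitions, so only the first partition of each file would actually start with the head:

```scala
// Sketch: assumes the first 4 lines of every partition are the file head,
// which only holds if each file maps to exactly one partition
val parsed = lines.mapPartitions { iter =>
  val head = iter.take(4).toVector            // consume the header lines
  val fields = head.collectFirst {
    case l if l.startsWith("#Fields:") =>
      l.stripPrefix("#Fields:").trim.split(' ')
  }
  iter.map(record => (fields, record))        // attach column names to each record
}
```

(Strictly, reusing a Scala iterator after calling take on it is fragile; a production version would buffer the partition's leading lines explicitly.)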

Re: Is it possible to read file head in each partition?

2014-07-29 Thread Fengyun RAO
It will certainly cause bad performance, since it reads the whole content of a large file into one value instead of splitting it into partitions. Typically one file is 1 GB. Suppose we have 3 large files; in this way there would be only 3 key-value pairs, and thus at most 3 tasks.
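The slow alternative being ruled out here is presumably sc.wholeTextFiles, which reads each file as a single (path, content) pair. A sketch, with an illustrative path:

```scala
// One (path, content) pair per file: a 1 GB file becomes one value,
// so 3 files yield only 3 pairs and at most 3 tasks
val files = sc.wholeTextFiles("hdfs:///logs/*.log")
val split = files.mapValues { content =>
  val (head, body) = content.split("\n").span(_.startsWith("#"))
  (head.toList, body.toList)                  // header lines vs. data records
}
```

This keeps the head next to its records, but gives up the parallelism that makes splitting large files into partitions worthwhile in the first place.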