Yes, this is much easier. Let Nutch crawl the pages, parse them with parse-html 
or parse-tika, and add a custom ParseFilter plugin. In the plugin you can walk 
the DOM via the DocumentFragment object that is passed in, which makes it very 
easy to look up the HTML elements of interest. One example is the headings 
plugin that ships with Nutch: it does exactly that and can serve as a template 
for you to work on.
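
Something along these lines, as a rough, untested sketch against the Nutch 1.x 
HtmlParseFilter interface (the class name and the "unit-price" CSS class are 
just placeholders; see the headings plugin for the real details):

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class PriceParseFilter implements HtmlParseFilter {

  private Configuration conf;

  @Override
  public ParseResult filter(Content content, ParseResult parseResult,
      HTMLMetaTags metaTags, DocumentFragment doc) {
    // Walk the DOM handed to us by parse-html/parse-tika and pull out
    // the text of the element we care about.
    String price = findTextByClass(doc, "unit-price");   // placeholder class
    Parse parse = parseResult.get(content.getUrl());
    if (price != null && parse != null) {
      parse.getData().getParseMeta().set("unitPrice", price);
    }
    return parseResult;
  }

  // Simple recursive walk; returns the text of the first element whose
  // class attribute contains the given value.
  private String findTextByClass(Node node, String cssClass) {
    if (node.getNodeType() == Node.ELEMENT_NODE) {
      Node attr = node.getAttributes().getNamedItem("class");
      if (attr != null && attr.getNodeValue().contains(cssClass)) {
        return node.getTextContent().trim();
      }
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      String text = findTextByClass(children.item(i), cssClass);
      if (text != null) {
        return text;
      }
    }
    return null;
  }

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  @Override
  public Configuration getConf() {
    return conf;
  }
}

Register it against the HtmlParseFilter extension point in the plugin's 
plugin.xml and add the plugin to plugin.includes, the same way the headings 
plugin is wired up.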

Also, I'd advise moving these discussions to the user list so more users can 
benefit from them.

Cheers,
Markus

-----Original message-----
From: Tejas Patil<[email protected]>
Sent: Friday 3rd January 2014 5:53
To: [email protected]
Subject: Re: use <Map Reduce + Jsoup> to parse big Nutch/Content file

Here is what I would do:

If you are running a crawl, let it run with the default parser. Write a Nutch 
plugin with your customized parse implementation. To evaluate your parse logic, 
get some real segments (with a subset of those million pages) and run only the 
bin/nutch parse command, as sketched below; it will run your parser over the 
segment so you can see how good the output is. Repeat until you have a 
satisfactory parser implementation.
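
For example (the segment path is a placeholder; check the usage strings of your 
Nutch version for the exact options):

bin/nutch parse crawl/segments/<your_segment>
bin/nutch readseg -dump crawl/segments/<your_segment> parse_dump

Then eyeball the dump to see how well the parse logic did.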

~tejas

On Thu, Jan 2, 2014 at 2:48 PM, Bin Wang <[email protected]> wrote:

Hi,

I have a robot that scrapes a website daily and has been storing the HTML 
locally so far (in Nutch binary format, in the segment/content folder).

The scrape is fairly big: a million pages per day.

One thing about the HTML pages themselves: they all follow exactly the same 
format, so I can write a parser in Java to pull out the info I want (say unit 
price, part number, etc.) for one page, and that parser will work for most of 
the pages.

I am wondering whether there is a MapReduce template already written, so I can 
just drop in my customized parser and easily start a Hadoop MapReduce job. 
(Actually, there doesn't have to be a reduce phase at all: in this case we map 
every page to its parsed result and that is it.)

I was looking at the MapReduce example here: 
https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html

But I am having some trouble translating that into my real-world Nutch problem.

I know that running MapReduce against a Nutch binary file will be a bit 
different from word count. I looked at the Nutch source code and, to me, it 
looks like the files are sequence files of records, where each record is a 
key/value pair with a key of type Text and a value of type 
org.apache.nutch.protocol.Content. How should I configure the map job so that 
it reads the big raw content file, computes the InputSplits correctly, and runs 
the mappers?
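
To make the question concrete, here is a rough, untested sketch of what I have 
in mind (I am assuming the content part files are <Text, Content> MapFiles that 
SequenceFileInputFormat can read, and the ".unit-price" selector is just a 
placeholder for my real extraction logic):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.nutch.protocol.Content;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SegmentParseJob {

  public static class ParseMapper extends Mapper<Text, Content, Text, Text> {
    @Override
    protected void map(Text url, Content content, Context context)
        throws IOException, InterruptedException {
      // Raw page bytes as fetched by Nutch (assuming UTF-8 pages here).
      String html = new String(content.getContent(), "UTF-8");
      Document doc = Jsoup.parse(html);
      String price = doc.select(".unit-price").text();   // placeholder selector
      context.write(url, new Text(price));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "parse nutch segment content");
    job.setJarByClass(SegmentParseJob.class);

    // args[0] = segment dir; the <Text, Content> MapFiles live under content/
    job.setInputFormatClass(SequenceFileInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0], "content"));

    job.setMapperClass(ParseMapper.class);
    job.setNumReduceTasks(0);                 // map-only, no reduce needed
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

(I would bundle the Nutch and Jsoup jars with the job jar.)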

Thanks a lot!

/usr/bin

(Some explanation of why I decided not to write a Java plugin):

I was thinking about writing a Nutch plugin so it would be handy to parse the 
scraped data with a Nutch command. However, the problem is that it is hard to 
write a perfect parser in one go; that probably rings true for people who deal 
with parsers a lot. You locate your HTML tag by some specific features that you 
think will be general (CSS class, id, etc., maybe even combined with regular 
expressions), but when you apply that logic to all the pages, it won't hold 
true for every page. So you need to write many different parsers, run them 
against the whole dataset (a million pages) in one go, and see which one 
performs best. Then you run that parser against all your snapshots, 
days * a million pages, to get the new dataset.)

