Hi,

I have a robot that scrapes a website daily and, so far, stores the HTML
locally (in Nutch's binary format, in the segment/content folder).

The scrape is fairly big: about a million pages per day.
One thing about the HTML pages themselves is that they all follow exactly
the same format, so I can write a Java parser that pulls out the info I
want (say, unit price, part number, etc.) from one page, and that parser
will work for most of the pages.
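
(For concreteness, here is roughly the kind of parser I mean. I happen to
use jsoup here, and the CSS selectors are made up, since the real ones
depend on the site's markup:)

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ProductPageParser {
    // Placeholder selectors: the real ones depend on the site's markup.
    public String parse(String html) {
        Document doc = Jsoup.parse(html);
        String partNumber = doc.select("span.part-number").text();
        String unitPrice  = doc.select("td.unit-price").text();
        return partNumber + "\t" + unitPrice;
    }
}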

I am wondering whether there is an existing MapReduce template so that I
can just drop in my customized parser and easily start a Hadoop MapReduce
job. (Actually, there doesn't need to be a reduce phase at all: in this
case we simply map every page to its parsed result and that is it.)
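
To show what I have in mind, here is my rough, untested guess at the
driver for such a map-only job (the class names and arguments are just my
own invention, not taken from any existing template):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class ParseSegmentJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "parse nutch segment");
        job.setJarByClass(ParseSegmentJob.class);

        // Map-only: every page is parsed independently, so no reduce phase.
        job.setNumReduceTasks(0);

        job.setMapperClass(ParseMapper.class);  // sketched further down
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // args[0]: the segment content data, args[1]: an empty output dir.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}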

I was looking at the MapReduce example here:
https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
but I am having trouble translating it to my real-world Nutch problem.

I know that running MapReduce against Nutch's binary files will be a bit
different from word count. From looking at the Nutch source code, it
appears that the content files are SequenceFiles of records, where each
record is a key/value pair with a Text key and an
org.apache.nutch.protocol.Content value. How should I configure the job so
that it reads these large binary content files, computes the InputSplits
correctly, and runs the map tasks?
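
For reference, this is my current untested guess at the mapper, assuming
the content files really are SequenceFiles of Text / Content pairs (in
which case, as far as I understand, SequenceFileInputFormat would take
care of the InputSplits at record boundaries, and the Nutch jar just needs
to be on the job classpath):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.nutch.protocol.Content;

public class ParseMapper extends Mapper<Text, Content, Text, Text> {
    private final ProductPageParser parser = new ProductPageParser();

    @Override
    protected void map(Text url, Content content, Context context)
            throws IOException, InterruptedException {
        // Content.getContent() gives the raw page bytes fetched by Nutch.
        String html = new String(content.getContent(), "UTF-8");

        // The site-specific parser turns the page into a tab-separated
        // line of the fields I care about.
        context.write(url, new Text(parser.parse(html)));
    }
}

Is that roughly the right direction, and is pointing the job directly at
the data files under segment/content the right way to feed it the input?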

Thanks a lot!

/usr/bin


( Some explanation of why I decided not to write a Nutch plugin:
I was thinking about writing a Nutch plugin so that the scraped data could
be parsed conveniently with the Nutch command. The problem, however, is
that it is hard to write a perfect parser in one go. This will probably
make a lot of sense to people who deal with parsers a lot: you locate your
HTML tag by some specific features that you think are general (a CSS
class, an id, etc., maybe combined with a regular expression), but when
you apply that logic to all the pages, it won't hold true for every one of
them. So you end up writing several different parsers, running them
against the whole dataset (a million pages) in one go, and seeing which
one performs best. Then you run the winning parser against all your
snapshots (days * a million pages) to produce the new dataset. )
