Yes, this is much easier. Let Nutch crawl the pages, parse them with parse-html or parse-tika, and add a custom ParseFilter plugin. In the filter you can walk the DOM via the DocumentFragment object that is passed in, which makes it very easy to look up the HTML elements of interest. One example is the headings plugin that ships with Nutch; it does exactly that and can serve as a template for you to work from.
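For reference, here is a rough sketch of such a filter, modeled loosely on the headings plugin. It assumes the Nutch 1.x HtmlParseFilter interface; the class name, the "unit-price" CSS class and the "product.price" metadata key are made-up examples for the kind of field mentioned below (unit price), not anything Nutch defines:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NodeWalker;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical filter that pulls a unit price out of the parsed DOM and
// stores it in the parse metadata. Modeled on the headings plugin.
public class UnitPriceParseFilter implements HtmlParseFilter {

  private Configuration conf;

  @Override
  public ParseResult filter(Content content, ParseResult parseResult,
      HTMLMetaTags metaTags, DocumentFragment doc) {
    Parse parse = parseResult.get(content.getUrl());
    Metadata parseMeta = parse.getData().getParseMeta();

    // Walk every node of the DOM fragment handed over by parse-html/parse-tika
    // and look for the element that carries the field of interest.
    NodeWalker walker = new NodeWalker(doc);
    while (walker.hasNext()) {
      Node node = walker.nextNode();
      if (node.getNodeType() == Node.ELEMENT_NODE
          && "span".equalsIgnoreCase(node.getNodeName())) {
        String cssClass = ((Element) node).getAttribute("class");
        if ("unit-price".equals(cssClass)) {          // made-up markup
          parseMeta.add("product.price", node.getTextContent().trim());
        }
      }
    }
    return parseResult;
  }

  @Override
  public void setConf(Configuration conf) { this.conf = conf; }

  @Override
  public Configuration getConf() { return conf; }
}

As with any parse filter, the plugin also needs its plugin.xml descriptor and an entry in plugin.includes in nutch-site.xml before bin/nutch parse will pick it up.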
Also, I'd advise moving these discussions to the user list so more users can benefit from them.

Cheers,
Markus

-----Original message-----
From: Tejas Patil <[email protected]>
Sent: Friday 3rd January 2014 5:53
To: [email protected]
Subject: Re: use <Map Reduce + Jsoup> to parse big Nutch/Content file

Here is what I would do: if you are running a crawl, let it run with the default parser. Write a Nutch plugin with your customized parse implementation to evaluate your parse logic. Then get some real segments (with a subset of those million pages) and run only the bin/nutch parse command to see how good it is. That command will run your parser over the segment. Repeat until you get a satisfactory parser implementation.

~tejas

On Thu, Jan 2, 2014 at 2:48 PM, Bin Wang <[email protected]> wrote:

Hi,

I have a robot that scrapes a website daily and stores the HTML locally (in Nutch binary format, in the segment/content folder). The scrape is fairly big: a million pages per day.

The HTML pages all follow exactly the same format, so I can write a Java parser that extracts the info I want (say unit price, part number, etc.) from one page, and that parser will work for most of the pages.

I am wondering whether there is a MapReduce template already written, so I can just plug in my customized parser and easily start a Hadoop MapReduce job. (Actually, there doesn't have to be any reduce step; in this case we map every page to its parsed result and that is it.)

I was looking at the MapReduce example here:
https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
but I have some trouble translating that into my real-world Nutch problem. I know that running MapReduce against a Nutch binary file will be a bit different from word count. I looked at the Nutch source code, and it looks like the files are sequence files of records, where each record is a key/value pair with a Text key and an org.apache.nutch.protocol.Content value. How should I configure the map job so it can read the raw content binary file, do the InputSplit correctly, and run the map tasks?

Thanks a lot!

/usr/bin

(Some explanation of why I decided not to write a Java plugin: I was thinking about writing a Nutch plugin so it would be handy to parse the scraped data with a Nutch command. However, the problem is that it is hard to write a perfect parser in one go. That probably makes a lot of sense to people who deal with parsers a lot: you locate your HTML tag by some specific features that you think will be general (CSS class, type, id, etc., even combined with regular expressions), but when you apply your logic to all the pages, it won't hold true for every page. So you need to write many different parsers, run them against the whole dataset (a million pages) in one go, and see which one performs best. Then you run that parser against all your snapshots (days * a million pages) to get the new dataset.)
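(To make the MapReduce part of the question concrete, below is a minimal, untested sketch of the kind of map-only job described above: it reads segment/content as a sequence file of Text keys and org.apache.nutch.protocol.Content values and parses each page with Jsoup. The class name, input/output paths and the CSS selector are placeholders, and details may need adjusting for the Hadoop and Nutch versions in use.)

import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.nutch.protocol.Content;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SegmentJsoupParseJob {

  // segment/content stores Text (URL) -> Content records, so those are the
  // mapper's input key/value types. The job is map-only.
  public static class ParseMapper extends Mapper<Text, Content, Text, Text> {
    @Override
    protected void map(Text url, Content content, Context context)
        throws IOException, InterruptedException {
      String type = content.getContentType();
      if (type == null || !type.contains("html")) {
        return;                                    // skip non-HTML records
      }
      String html = new String(content.getContent(), StandardCharsets.UTF_8);
      Document doc = Jsoup.parse(html, url.toString());
      // "span.unit-price" is a placeholder selector for the field to extract.
      String price = doc.select("span.unit-price").text();
      if (!price.isEmpty()) {
        context.write(url, new Text(price));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "parse segment content with jsoup");
    job.setJarByClass(SegmentJsoupParseJob.class);
    job.setMapperClass(ParseMapper.class);
    job.setNumReduceTasks(0);                      // map-only, as discussed above
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    // e.g. crawl/segments/<timestamp>/content/part-* as the input path
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Pointing the input at the segment's content data and using SequenceFileInputFormat lets Hadoop handle the InputSplits; with zero reducers the mapper output is written directly, one parsed record per page.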

