Here is what I would do: if you are running a crawl, let it run with the default parser. Write a Nutch plugin with your customized parse implementation to evaluate your parsing logic. Then take some real segments (with a subset of those million pages) and run only the 'bin/nutch parse' command to see how well it does; that command will run your parser over the segment. Repeat this until you have a satisfactory parser implementation.
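Just to make the plugin route concrete, here is a rough sketch of what the core parser class could look like. The package name, class name, the unitPrice/partNumber fields and the extract* helpers are made up; the Parser, ParseResult, ParseData and ParseImpl classes are the standard Nutch 1.x ones, so double-check against the version you are on:

// Sketch only: a custom Parser plugin class. The extraction logic and
// metadata field names are placeholders for your own page-specific code.
package com.example.parse;   // hypothetical package

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.parse.Outlink;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.parse.ParseImpl;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.parse.ParseStatus;
import org.apache.nutch.parse.Parser;
import org.apache.nutch.protocol.Content;

public class ProductParser implements Parser {

  private Configuration conf;

  @Override
  public ParseResult getParse(Content content) {
    String html = new String(content.getContent());   // raw page bytes

    // Your extraction logic goes here (regex, jsoup, etc.); these are stubs.
    String unitPrice = extractUnitPrice(html);
    String partNumber = extractPartNumber(html);

    // Store extracted fields as parse metadata so they survive into the parse data.
    Metadata parseMeta = new Metadata();
    parseMeta.add("unitPrice", unitPrice);
    parseMeta.add("partNumber", partNumber);

    ParseData parseData = new ParseData(ParseStatus.STATUS_SUCCESS,
        "", new Outlink[0], content.getMetadata(), parseMeta);
    return ParseResult.createParseResult(content.getUrl(),
        new ParseImpl(html, parseData));
  }

  // Hypothetical helpers -- replace with your real selectors/regexes.
  private String extractUnitPrice(String html)  { return ""; }
  private String extractPartNumber(String html) { return ""; }

  @Override
  public void setConf(Configuration conf) { this.conf = conf; }

  @Override
  public Configuration getConf() { return conf; }
}

You would still need the usual plugin wiring: a plugin.xml that registers the class against the org.apache.nutch.parse.Parser extension point, a mapping for text/html in parse-plugins.xml, and adding your plugin id to plugin.includes in nutch-site.xml.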
~tejas

On Thu, Jan 2, 2014 at 2:48 PM, Bin Wang <binwang...@gmail.com> wrote:
> Hi,
>
> I have a robot that scrapes a website daily and stores the HTML locally
> (in Nutch binary format in the segment/content folder).
>
> The scrape is fairly big: a million pages per day. The HTML pages
> themselves all follow exactly the same format, so I can write a parser
> in Java to parse out the info I want (say unit price, part number, etc.)
> for one page, and that parser will work for most of the pages.
>
> I am wondering whether there is some map reduce template already
> written, so I can just plug in my customized parser and easily start a
> Hadoop mapreduce job. (Actually, there doesn't have to be any reduce
> step... in this case, we map every page to its parsed result and that
> is it.)
>
> I was looking at the map reduce example here:
> https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
> but I have some trouble translating that into my real-world Nutch
> problem.
>
> I know running map reduce against Nutch binary files will be a bit
> different from word count. I looked at the source code of Nutch, and it
> looks like the files are sequence files of records, where each record
> is a key/value pair whose key is Text and whose value is
> org.apache.nutch.protocol.Content. How should I configure the map job
> so it can read in the raw content binary files, do the InputSplit
> correctly, and run the map tasks?
>
> Thanks a lot!
>
> /usr/bin
>
>
> (Some explanation of why I decided not to write a Java plugin):
> I was thinking about writing a Nutch plugin so it would be handy to
> parse the scraped data using a Nutch command. However, the problem is
> that "it is hard to write a perfect parser" in one go. That probably
> makes a lot of sense to people who deal with parsers a lot. You locate
> your HTML tag by some specific features that you think will be
> general... CSS class, id, etc., even combined with regular expressions.
> However, when you apply your logic to all the pages, it won't hold true
> for every page. Then you need to write many different parsers, run them
> against the whole dataset (a million pages) in one go, and see which
> one performs best. Then you run your parser against all your snapshots
> (days * million pages) to get the new dataset.
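If you do want the raw map-only route instead, a minimal sketch along the lines of the question above could look roughly like this. It uses the old mapred API from the tutorial you linked; the class names, paths and the extract() helper are just placeholders, not something Nutch ships:

// Sketch only: reads a segment's content folder as (Text url, Content page)
// records and writes one (url, extracted-fields) line per page.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

public class SegmentParseJob {

  // Hypothetical extraction helper -- replace with your real parsing logic.
  static String extract(String html) {
    return Integer.toString(html.length());
  }

  public static class ParseMapper extends MapReduceBase
      implements Mapper<Text, Content, Text, Text> {
    public void map(Text url, Content content,
        OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      String html = new String(content.getContent());
      out.collect(url, new Text(extract(html)));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    JobConf job = new JobConf(conf, SegmentParseJob.class);
    job.setJobName("segment-parse");

    // args[0] = segment dir; its content subfolder holds the <Text, Content>
    // records, and SequenceFileInputFormat takes care of the splits.
    FileInputFormat.addInputPath(job, new Path(args[0], "content"));
    job.setInputFormat(SequenceFileInputFormat.class);

    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setOutputFormat(TextOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    job.setMapperClass(ParseMapper.class);
    job.setNumReduceTasks(0);   // map-only: one parsed record per page

    JobClient.runJob(job);
  }
}

The Nutch jars (and your parsing dependencies) need to be on the job classpath so the Content writable can be deserialized; otherwise it is an ordinary map-only job.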