Hi Tejas,
-- Nutch Plugin --
I got a bit confused here. Both you and Markus Jelsma are recommending
writing a Nutch plugin. Does the "bin/nutch parse" step run in
distributed mode?
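
Just to make sure we are talking about the same thing: my understanding is
that such a plugin would implement the Parser extension point of Nutch 1.x,
roughly like the untested skeleton below (the package/class names and the
parse-myshop plugin directory are made up by me, and I have not verified
every constructor signature against 1.7):

package org.example.nutch.parse;                    // hypothetical package

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.Outlink;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.parse.ParseImpl;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.parse.ParseStatus;
import org.apache.nutch.parse.Parser;
import org.apache.nutch.protocol.Content;

// Skeleton of a Nutch 1.x Parser plugin; it would live under
// src/plugin/parse-myshop/ with its own plugin.xml and build.xml.
public class MyShopParser implements Parser {

  private Configuration conf;

  public ParseResult getParse(Content content) {
    String url = content.getUrl();
    String html = new String(content.getContent());  // charset handling elided

    // ... site-specific extraction: unit price, part number, etc. ...
    String title = "";                                // extracted page title
    String text = html;                               // or just the fields we need

    ParseData parseData = new ParseData(ParseStatus.STATUS_SUCCESS, title,
        new Outlink[0], content.getMetadata());
    return ParseResult.createParseResult(url, new ParseImpl(text, parseData));
  }

  public void setConf(Configuration conf) { this.conf = conf; }

  public Configuration getConf() { return conf; }
}

Once something like that is built under src/plugin and activated via
plugin.includes (and mapped in parse-plugins.xml), "bin/nutch parse" should
pick it up -- at least that is my reading of the docs.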
Correct me if I am wrong; here is my understanding of how Nutch divides
the work.
<Fetching>: both Nutch 1.7 and Nutch 2.X will run this on one node, so even
if you have a cluster, only one of the nodes will actually be used to make
the HTTP requests.
<Storing>: Nutch 1.7 stores the content/HTML on local disk by default,
while Nutch 2.X can store the data in Accumulo/HBase, i.e. a
"big-data-like" distributed store (it is also capable of storing locally,
in MySQL etc.).
Will the <Parsing> part actually run in distributed mode if you are
using Nutch 2.X?
In other words, when you run bin/nutch parse...., does it actually kick
off a map-reduce job whose tasks run across the whole cluster to parse the data?
So each node in your cluster would parse a fragment of the complete
dataset when you decide to reparse everything?
Otherwise, if the parsing runs on only one node, it will take an
extremely long time to reparse all the data even once you finally have your
perfect parser.
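
If it turns out the parsing does not distribute the way I hope, my fallback
plan is a standalone map-only job over the segment/content directory, along
the lines of the rough, untested sketch below (the class names and the
extract() helper are placeholders for my site-specific logic, and the Nutch
job jar would need to be on the classpath so that Content deserializes):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.nutch.protocol.Content;

// Map-only job over <segment>/content, which (as far as I can tell) holds
// <Text url, Content> records in sequence/map files.
public class SegmentParseJob {

  public static class ParseMapper extends Mapper<Text, Content, Text, Text> {
    @Override
    protected void map(Text url, Content content, Context ctx)
        throws IOException, InterruptedException {
      String html = new String(content.getContent()); // charset handling elided
      ctx.write(url, new Text(extract(html)));
    }

    // Placeholder for the site-specific extraction (unit price, part number...)
    private String extract(String html) {
      return Integer.toString(html.length());
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "segment-parse");
    job.setJarByClass(SegmentParseJob.class);
    job.setMapperClass(ParseMapper.class);
    job.setNumReduceTasks(0);                         // map-only, no reduce step
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0], "content"));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Setting the number of reduce tasks to zero should give exactly the "map
every page to its parsed result" behavior I described below.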
/usr/bin
On Thu, Jan 2, 2014 at 9:52 PM, Tejas Patil <[email protected]> wrote:
> Here is what I would do:
> If you are running a crawl, let it run with the default parser. Write a Nutch
> plugin with your customized parse implementation to evaluate your parsing
> logic. Now get some real segments (with a subset of those million pages)
> and run only the 'bin/nutch parse' command and see how good it is. That
> command will run your parser over the segment. Do this until you get a
> satisfactory parser implementation.
>
> ~tejas
>
>
> On Thu, Jan 2, 2014 at 2:48 PM, Bin Wang <[email protected]> wrote:
>
>> Hi,
>>
>> I have a robot that scrapes a website daily and stores the HTML locally so
>> far (in Nutch binary format, in the segment/content folder).
>>
>> The size of the scrape is fairly big: about a million pages per day.
>> One thing about the HTML pages themselves is that they all follow exactly
>> the same format, so I can write a parser in Java to parse out the info I
>> want (say unit price, part number, etc.) for one page, and that parser will
>> work for most of the pages.
>>
>> I am wondering whether there is a map-reduce template already written so I
>> can just plug in my customized parser and easily start a Hadoop map-reduce
>> job. (Actually, there doesn't have to be any reduce step; in this case we
>> just map every page to its parsed result and that is it.)
>>
>> I was looking at the map reduce example here:
>> https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
>> But I have some trouble translating that into my real-world Nutch
>> problem.
>>
>> I know running map-reduce against the Nutch binary files will be a bit
>> different from word count. I looked at the source code of Nutch and, to me,
>> it looks like the files are sequence files of records, where each record is
>> a key/value pair whose key is of Text type and whose value is of
>> org.apache.nutch.protocol.Content type. So how should I configure the map
>> job so that it can read in the raw content binary files, do the InputSplit
>> correctly, and run the map tasks?
>>
>> Thanks a lot!
>>
>> /usr/bin
>>
>>
>> (Some explanation of why I decided not to write a Java plugin):
>> I was thinking about writing a Nutch plugin so it would be handy to parse
>> the scraped data using the Nutch command. However, the problem is that "it
>> is hard to write a perfect parser" in one go. That probably makes a lot of
>> sense to people who deal with parsers a lot. You locate your HTML tag by
>> some specific features that you think will be general (CSS class, id, etc.,
>> possibly combined with regular expressions). However, when you apply your
>> logic to all the pages, it won't hold true for every page. So you need to
>> run many different parsers against the whole dataset (a million pages) in
>> one go and see which one performs best. Then you run the winning parser
>> against all your snapshots, days * million pages, to get the new dataset.)
>>
>
>