Thanks for the advice. But my plan is to crawl news RSS feeds every 30
minutes, so I'd be downloading at most 5 to 10 news articles per map task
(since news articles aren't published that often). So I guess I won't have
to worry too much about the crawling delay.
I thought it would be a good idea to build a dictionary during the crawling
process, because I will need a dictionary to calculate tf-idf and I didn't
want to go through the whole repository every time a news article is added.
If I crawl and build the dictionary at the same time, all I need to do is
merge the new dictionaries (which are generated every 30 minutes) into the
existing one, which I guess will be computationally cheap.
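Something like this minimal sketch is what I have in mind for the merge,
assuming the dictionary is just a term -> count map held in memory (the
class and method names are placeholders):

    import java.util.Map;

    public class DictionaryMerge {

        // Fold the counts from one 30-minute crawl batch into the master
        // dictionary. The cost is proportional to the size of the new
        // batch, not the size of the whole repository.
        public static void merge(Map<String, Long> master, Map<String, Long> batch) {
            for (Map.Entry<String, Long> e : batch.entrySet()) {
                Long prev = master.get(e.getKey());
                master.put(e.getKey(), prev == null ? e.getValue() : prev + e.getValue());
            }
        }
    }

Each crawl run would produce one such batch map, and the merge just replays
it into the master copy.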

Ed

From mp2893's iPhone

On 2010. 12. 11., at 3:42 AM, Ted Dunning <[email protected]> wrote:

> Regarding the idea of doing word counts during the crawl, I think you are
> motivated by the best of principles (read input only once), but in
> practice, you will be doing many small crawls and saving the content.
> Word counting should probably not be tied too closely to the crawl
> because the crawl can be delayed arbitrarily.  Better to have a good
> content repository that is updated as often as crawls complete and run
> other processing against the repository whenever it seems like a good
> idea.
> 
> 2010/12/10 Edward Choi <[email protected]>
> 
>> Thanks for the tip. I guess it's a somewhat different project from Nutch.
>> My understanding is that while Nutch tries to implement a whole web search
>> package, Bixo focuses on the crawling part. I should look into both
>> projects more deeply. Thanks again!!
>> 
>> Ed
>> 
>> From mp2893's iPhone
>> 
>> On 2010. 12. 11., at 1:15 AM, Ted Dunning <[email protected]> wrote:
>> 
>>> That is definitely possible, but may not be very desirable.
>>> 
>>> Take a look at the Bixo project for a full-scale crawler.  There is a
>>> lot of subtlety in the fetching of URLs due to the varying quality of
>>> different sites and the interaction with crawl choking due to robots.txt
>>> considerations.
>>> 
>>> http://bixo.101tec.com/
>>> 
>>> On Thu, Dec 9, 2010 at 11:27 PM, edward choi <[email protected]> wrote:
>>> 
>>>> So my design is:
>>>> Map phase ==> crawl news articles, process text, write the result to a
>>>> file.
>>>>      II
>>>>      II     pass (term, term_frequency) pairs to the Reducer
>>>>      II
>>>>      V
>>>> Reduce phase ==> Merge the (term, term_frequency) pairs and create a
>>>> dictionary
>>>> 
>>>> Is this at all possible? Or is it inherently impossible due to the
>>>> structure of Hadoop?
>>>> 
>> 
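To make the quoted design concrete, the reduce phase I have in mind would
look roughly like this (standard Hadoop 0.20 API; the class name is a
placeholder and the crawling/tokenizing mapper is omitted):

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class DictionaryReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable total = new IntWritable();

        // Mappers crawl and tokenize the articles, emitting
        // (term, term_frequency) pairs; this merges those pairs into one
        // total count per term, i.e. one dictionary entry per term.
        @Override
        protected void reduce(Text term, Iterable<IntWritable> frequencies,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable f : frequencies) {
                sum += f.get();
            }
            total.set(sum);
            context.write(term, total);
        }
    }

The output of each run would then be one of the small dictionaries that gets
merged into the master copy every 30 minutes.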
