Hi Andrzej,
        Thanks for the reply. I have a couple more questions that I am not 
quite sure about. Does MapFile.Reader[] represent the individual readers for 
each piece of a MapFile, so that part-00000, part-00001, etc. are each 
represented by one reader? In that case, is the correct path to the segment 
something like "crawl/segments/<some segment>", and is that the path that I 
should pass? Currently it is returning 0 readers. 

Also, on PageRank generally: I implemented a version in .NET on MapReduce for 
another project I was working on. However, that was at my last job, and I have 
since started a new company that is developing a vertical search engine on 
Nutch/Hadoop. My basic idea of how to implement PageRank for Nutch is as follows:

Step 1: Build the basic data
        I have created a PageRankDatum class to hold the information that 
PageRank requires for its computation: the PageRank value and the number of 
outbound links. This enables the key/value pair to be <Url,PageRankDatum>
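A minimal sketch of what I have in mind (field and method names are illustrative; in a real Nutch/Hadoop job this would also implement Writable so it can be serialized as a map-reduce value):

```java
// Sketch of the per-page record used by the PageRank computation.
// In practice this would implement org.apache.hadoop.io.Writable
// (readFields/write) to serve as the value in a <Url, PageRankDatum> pair.
class PageRankDatum {
    private float score;            // current PageRank value
    private final int outlinkCount; // number of outbound links on the page

    PageRankDatum(float score, int outlinkCount) {
        this.score = score;
        this.outlinkCount = outlinkCount;
    }

    /** Rank mass this page passes to each of its outlinks. */
    float contribution() {
        return outlinkCount == 0 ? 0f : score / outlinkCount;
    }

    float getScore() { return score; }
    int getOutlinkCount() { return outlinkCount; }
    void setScore(float score) { this.score = score; }
}
```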

Step 2: Compute the ranks
        Collect the resulting ranks and write them out; reduce would in effect 
be an identity function, I think. For this step we need to look up the inbound 
links for a URL, and then how many outbound links each of those linking pages 
has. That was the purpose of storing the outbound link count in addition to the 
page rank. If I have a Hadoop cluster (currently I am running this on my dev 
machine; more machines are on the way for testing), is the LinkDb accessible 
from all nodes? I am thinking that the PageRankDb will work basically the same 
way: after step 1, write it out so that it will be accessible. Also, several 
papers have shown that in parallel computation of PageRank, being able to look 
up the ranks that have already been computed on other nodes can lead to faster 
convergence. Is this possible in the map-reduce model?
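To make the per-iteration arithmetic concrete, here is a single-machine sketch of one PageRank update over an in-memory link graph (the graph, damping factor, and class name are just for illustration; in the map-reduce version, map would emit each page's contribution keyed by target URL, and reduce would sum the contributions per URL):

```java
import java.util.*;

// One PageRank iteration over a tiny in-memory link graph.
// Map-reduce analogue: map emits (targetUrl, score/outlinkCount) for every
// outlink of a page; reduce sums the contributions arriving at each URL.
class PageRankStep {
    static final float DAMPING = 0.85f;

    static Map<String, Float> iterate(Map<String, List<String>> outlinks,
                                      Map<String, Float> scores) {
        Map<String, Float> next = new HashMap<>();
        // Base score every page receives regardless of inlinks.
        for (String url : scores.keySet()) {
            next.put(url, 1 - DAMPING);
        }
        // Distribute each page's score evenly across its outlinks.
        for (Map.Entry<String, List<String>> e : outlinks.entrySet()) {
            List<String> targets = e.getValue();
            if (targets.isEmpty()) continue;
            float share = scores.get(e.getKey()) / targets.size();
            for (String t : targets) {
                next.merge(t, DAMPING * share, Float::sum);
            }
        }
        return next;
    }
}
```

Calling iterate repeatedly, feeding each result back in as the next round's scores, converges toward the stationary ranks; that per-round feedback is exactly the data that would have to flow through the PageRankDb between map-reduce jobs.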

Regards,

Steve
-----Original Message-----
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] 
Sent: Thursday, March 08, 2007 4:43 PM
To: nutch-dev@lucene.apache.org
Subject: Re: How to read data from segments

Steve Severance wrote:
> I am trying to learn the internals of Nutch, and by extension Hadoop, right
> now. I am implementing an algorithm that processes link and content data. I
> am stuck on how to open the ParseData contained in the segments. Each
> subdir of a segment (crawl_generate, etc...) contains a subdir part-00000,
> which, if I understand correctly, means that if I had more computers in a
> Hadoop cluster there would also be part-00001 and so on. 
>   

Correct.

> When I try to open them with an ArrayFile.Reader, it cannot find the file. I
> know that the Path class is working properly, since it can enumerate
> subdirectories. I tried hard-coding the part-00000 into the path, but that
> did not work either. 
>
> The code is as follows:
>
> Path segmentDir = new Path(args[0]);
> Path pageRankDir = new Path(args[1]);
>   

Ah-ha, pageRankDir .. ;)

>               
> Path segmentPath = new Path(segmentDir, "parse_data/part-00000");
>   

Please take a look at the class MapFileOutputFormat and 
SequenceFileOutputFormat. Both support this nested dir structure which 
is a by-product of producing the data via map-reduce, and offer methods 
for getting MapFile.Reader[] or SequenceFile.Reader[], and then getting 
a selected entry.

Cf. also the code attached to HADOOP-175 issue in JIRA.


> One more thing. As a new nutch developer I am keeping a running list of
> problems/questions that I have and their solutions. A lot of questions arise
> from not understanding how to work with the internals, specifically
> understanding the building blocks of Hadoop such as filetypes and why there
> are custom types that Hadoop uses, e.g. why Text instead of String. I
> noticed that in a mailing list post earlier this year the lack of detailed
> information for new developers was cited as a barrier to more involvement. I
> would be happy to contribute this back to the wiki if there is interest.
>   

Definitely, you are welcome to contribute in this area - this is always 
needed. Although this particular information might be more suitable for 
the Hadoop wiki ...

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
