Steve Severance wrote:
> I am trying to learn the internals of Nutch and by extension Hadoop right
> now. I am implementing an algorithm that processes link and content data. I
> am stuck on how to open the ParseDatas contained in the segments. Each
> subdir of a segment (crawl_generate, etc.) contains a subdir part-00000.
> If I understand correctly, with more computers in a Hadoop cluster there
> would also be part-00001 and so on.
>   

Correct.

> When I try to open them with an ArrayFile.Reader it cannot find the file. I
> know that the Path class is working properly since it can enumerate
> subdirectories. I tried hard-coding the part-00000 into the path but that
> did not work either.
>
> The code is as follows:
>
> Path segmentDir = new Path(args[0]);
> Path pageRankDir = new Path(args[1]);
>   

Ah-ha, pageRankDir .. ;)

>               
> Path segmentPath = new Path(segmentDir, "parse_data/part-00000");
>   

Please take a look at the class MapFileOutputFormat and 
SequenceFileOutputFormat. Both support this nested dir structure which 
is a by-product of producing the data via map-reduce, and offer methods 
for getting MapFile.Reader[] or SequenceFile.Reader[], and then getting 
a selected entry.
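
To make the layout concrete, here is a rough, library-free sketch of what
those helpers have to do: enumerate the part-NNNNN entries under a segment
subdir, then pick the one part that can hold a given key. It deliberately
does not call Hadoop. PartLocator and its methods are hypothetical names
used only for illustration, and hashing with hashCode() is an assumption
standing in for whatever Partitioner produced the data (Hadoop's Text, for
example, computes its hash differently from java.lang.String). For the real
thing, check the Javadoc of MapFileOutputFormat and SequenceFileOutputFormat
in your Hadoop version for the exact reader and entry-lookup signatures.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Illustrative only: mimics how partitioned map-reduce output is laid out. */
final class PartLocator {
    private PartLocator() {}

    /** Name of the i-th partition, zero-padded to five digits. */
    static String partName(int index) {
        return String.format(Locale.ROOT, "part-%05d", index);
    }

    /** Lists the part-* entries directly under dir, sorted by name. */
    static List<File> listParts(File dir) {
        File[] found = dir.listFiles((d, name) -> name.startsWith("part-"));
        if (found == null) {
            // dir does not exist or is not a directory
            return new ArrayList<>();
        }
        Arrays.sort(found);
        return new ArrayList<>(Arrays.asList(found));
    }

    /**
     * Hash partitioning (assumed scheme): clear the sign bit so the
     * result is never negative, then take the remainder.
     */
    static int partitionFor(Object key, int numParts) {
        return (key.hashCode() & Integer.MAX_VALUE) % numParts;
    }
}
```

With this, listParts(new File(segmentDir, "parse_data")) gives the parts in
order, and partitionFor(key, parts.size()) gives the index of the part to
open. This only works if the lookup uses the same partitioning as the job
that wrote the data, which is why the entry lookup needs a partitioner.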

Cf. also the code attached to HADOOP-175 issue in JIRA.


> One more thing. As a new nutch developer I am keeping a running list of
> problems/questions that I have and their solutions. A lot of questions arise
> from not understanding how to work with the internals, specifically
> understanding the building blocks of Hadoop such as filetypes and why there
> are custom types that Hadoop uses, e.g. why Text instead of String. I
> noticed that in a mailing list post earlier this year the lack of detailed
> information for new developers was cited as a barrier to more involvement. I
> would be happy to contribute this back to the wiki if there is interest.
>   

Definitely, you are welcome to contribute in this area - this is always 
needed. Although this particular information might be more suitable for 
the Hadoop wiki ...

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
