Steve Severance wrote:
> I am trying to learn the internals of Nutch and by extension Hadoop right
> now. I am implementing an algorithm that processes link and content data. I
> am stuck on how to open the ParseData objects contained in the segments. Each
> subdirectory of a segment (crawl_generate, etc.) contains a part-00000,
> which, if I understand correctly, would be joined by part-00001 and so on if
> I had more computers as part of a Hadoop cluster.
There is one directory for each split. One interesting thing to note is that
multiple writers (i.e. map and reduce tasks) can't write to the same file on
the DFS at the same time, so each reduce task writes its own split to its own
directory.

> When I try to open them with an ArrayFile.Reader it cannot find the file. I
> know that the Path class is working properly since it can enumerate
> subdirectories. I tried hard-coding the part-00000 into the path but that
> did not work either.
>
> The code is as follows:
>
>     Path segmentDir = new Path(args[0]);
>     Path pageRankDir = new Path(args[1]);
>
>     Path segmentPath = new Path(segmentDir, "parse_data/part-00000");
>     ArrayFile.Reader parses = null;
>     try {
>         parses = new ArrayFile.Reader(segmentPath.getFileSystem(config),
>             segmentPath.toString(), config);
>     } catch (IOException ex) {
>         System.out.println("An error occurred while opening the segment."
>             + " Message: " + ex.getMessage());
>     }
>
> The exception reports that it cannot open the file. I also tried merging the
> segments but that did not work either. Any help would be greatly
> appreciated.

Just like Andrzej said, it is in the output formats, and they have getReaders
and getEntry methods. I have a little tool that is a MapFileExplorer; if you
want it, let me know and I will send you a copy.

> One more thing. As a new Nutch developer I am keeping a running list of
> problems/questions that I have and their solutions. A lot of questions arise
> from not understanding how to work with the internals, specifically
> understanding the building blocks of Hadoop such as file types and why there
> are custom types that Hadoop uses, e.g. why Text instead of String. I
> noticed that in a mailing list post earlier this year the lack of detailed
> information for new developers was cited as a barrier to more involvement. I
> would be happy to contribute this back to the wiki if there is interest.

Absolutely.
The more documentation we have, especially for new developers, the better. If
you need any questions answered in doing this, give me a shout and I will help
as much as I can.

Dennis Kubes

> Regards,
>
> Steve
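For reference, the pointer above to the output formats' getReaders/getEntry methods can be sketched roughly like this. It is only a sketch against the Hadoop 0.x / Nutch 0.8-era APIs (MapFileOutputFormat, HashPartitioner, ParseData); the segment path argument and the example URL key are placeholders, not taken from this thread. The underlying point is that parse_data/part-00000 is a MapFile directory (containing data and index files), not a flat ArrayFile, which is why ArrayFile.Reader cannot open it.

```java
// Sketch: reading ParseData entries through the output-format readers,
// as suggested above.  Class and method names follow the Hadoop 0.x /
// Nutch 0.8-era APIs; the URL key below is a made-up example.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapFileOutputFormat;
import org.apache.hadoop.mapred.lib.HashPartitioner;
import org.apache.nutch.parse.ParseData;

public class ParseDataDump {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // <segment>/parse_data -- getReaders() opens one MapFile.Reader per
    // part-xxxxx directory underneath it, so you never hard-code part-00000.
    Path parseData = new Path(args[0], ParseData.DIR_NAME);
    MapFile.Reader[] readers =
        MapFileOutputFormat.getReaders(fs, parseData, conf);

    // Option 1: look up a single URL.  getEntry() uses the partitioner to
    // pick the same part file the reduce task originally wrote the key to.
    Text key = new Text("http://example.com/");  // placeholder URL
    ParseData value = new ParseData();
    ParseData hit = (ParseData) MapFileOutputFormat.getEntry(
        readers, new HashPartitioner<Text, ParseData>(), key, value);
    if (hit != null) {
      System.out.println(key + ":\n" + hit);
    }

    // Option 2: scan every entry in every part.
    for (MapFile.Reader reader : readers) {
      Text url = new Text();
      ParseData data = new ParseData();
      while (reader.next(url, data)) {
        System.out.println(url);
      }
      reader.close();
    }
  }
}
```

Run against a live segment as e.g. `ParseDataDump crawl/segments/20061019123456` (hypothetical path); the same pattern works for crawl_fetch and crawl_parse by swapping the directory name and value class.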