I am trying to learn the internals of Nutch and by extension Hadoop right
now. I am implementing an algorithm that processes link and content data. I
am stuck on how to open the ParseData objects contained in the segments. Each
subdir of a segment (crawl_generate, etc.) contains a subdir part-00000,
and, if I understand correctly, if I had more computers as part of a Hadoop
cluster there would also be part-00001 and so on.
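
To make the layout I am describing concrete, here is a minimal sketch that
lists the part-NNNNN entries under one segment subdir. It uses plain
java.nio on the local filesystem rather than Hadoop's FileSystem API, and
the class name PartLister and method listParts are hypothetical names made
up for this illustration.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Nutch or Hadoop.
public class PartLister {

    // Returns the names of all entries in dir named "part-" followed by
    // exactly five digits, sorted so part-00000 comes first.
    public static List<String> listParts(Path dir) throws IOException {
        List<String> parts = new ArrayList<>();
        // The glob is matched against each entry's file name only.
        try (DirectoryStream<Path> stream =
                 Files.newDirectoryStream(dir, "part-[0-9][0-9][0-9][0-9][0-9]")) {
            for (Path p : stream) {
                parts.add(p.getFileName().toString());
            }
        }
        Collections.sort(parts);
        return parts;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is assumed to be a segment subdir such as <segment>/parse_data
        for (String name : listParts(Paths.get(args[0]))) {
            System.out.println(name);
        }
    }
}
```

Run against a single-machine crawl, I would expect this to print only
part-00000 for each segment subdir.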

When I try to open them with an ArrayFile.Reader it cannot find the file. I
know that the Path class is working properly since it can enumerate
subdirectories. I tried hard-coding part-00000 into the path, but that did
not work either.

The code is as follows:

Path segmentDir = new Path(args[0]);
Path pageRankDir = new Path(args[1]);

Path segmentPath = new Path(segmentDir, "parse_data/part-00000");
ArrayFile.Reader parses = null;
try
{
        parses = new ArrayFile.Reader(segmentPath.getFileSystem(config),
                                      segmentPath.toString(), config);
}
catch(IOException ex){
        System.out.println("An error occurred while opening the segment. Message: "
                           + ex.getMessage());
}

The exception reports that it cannot open the file. I also tried merging the
segments but that did not work either. Any help would be greatly
appreciated.

One more thing. As a new Nutch developer I am keeping a running list of
problems/questions that I have and their solutions. A lot of questions arise
from not understanding how to work with the internals, specifically
understanding the building blocks of Hadoop such as file types and why there
are custom types that Hadoop uses, e.g. why Text instead of String. I
noticed that in a mailing list post earlier this year the lack of detailed
information for new developers was cited as a barrier to more involvement. I
would be happy to contribute this back to the wiki if there is interest.

Regards,

Steve


_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
