I am trying to learn the internals of Nutch, and by extension Hadoop. I am implementing an algorithm that processes link and content data, and I am stuck on how to open the ParseData entries stored in the segments. Each subdir of a segment (crawl_generate, etc.) contains a subdir part-00000. If I understand correctly, with more machines in the Hadoop cluster there would also be part-00001 and so on.
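For reference, here is a minimal sketch of the layout described above. It assumes the segment sits on the local filesystem and uses only java.nio rather than the Hadoop API. The class name SegmentParts and both helper methods are hypothetical. The check for "data" and "index" files is an assumption based on how Hadoop lays out a MapFile directory; it has not been confirmed for every segment subdir.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Nutch or Hadoop.
public class SegmentParts {

    /** Lists the part-* entries under one segment subdirectory, sorted by name. */
    static List<Path> listParts(Path segmentSubdir) throws IOException {
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> stream =
                 Files.newDirectoryStream(segmentSubdir, "part-*")) {
            for (Path p : stream) {
                parts.add(p);
            }
        }
        Collections.sort(parts);
        return parts;
    }

    /** Assumption: a MapFile-style part is a directory holding "data" and "index" files. */
    static boolean looksLikeMapFile(Path part) {
        return Files.isDirectory(part)
            && Files.isRegularFile(part.resolve("data"))
            && Files.isRegularFile(part.resolve("index"));
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: SegmentParts <segmentDir>");
            return;
        }
        Path parseData = Paths.get(args[0], "parse_data");
        for (Path part : listParts(parseData)) {
            System.out.println(part + " mapfile-layout=" + looksLikeMapFile(part));
        }
    }
}
```

Run against a segment directory, this prints each part under parse_data and whether it has the two-file layout. A part that is a plain file rather than a directory is reported as false.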
When I try to open the ParseData files with an ArrayFile.Reader, it cannot find the file. I know that the Path class is working properly, since it can enumerate the subdirectories. I tried hard-coding part-00000 into the path, but that did not work either. The code is as follows:

    // config is created earlier (not shown)
    Path segmentDir = new Path(args[0]);
    Path pageRankDir = new Path(args[1]);
    Path segmentPath = new Path(segmentDir, "parse_data/part-00000");

    ArrayFile.Reader parses = null;
    try {
        parses = new ArrayFile.Reader(segmentPath.getFileSystem(config),
                                      segmentPath.toString(), config);
    } catch (IOException ex) {
        System.out.println("An error occurred while opening the segment. Message: "
                           + ex.getMessage());
    }

The exception reports that it cannot open the file. I also tried merging the segments, but that did not work either. Any help would be greatly appreciated.

One more thing. As a new Nutch developer, I am keeping a running list of the problems and questions I run into, along with their solutions. Many of the questions come from not understanding how to work with the internals, specifically the building blocks of Hadoop such as its file types and why it uses custom types, e.g. why Text instead of String. I noticed that in a mailing list post earlier this year, the lack of detailed information for new developers was cited as a barrier to more involvement. I would be happy to contribute this list back to the wiki if there is interest.

Regards,
Steve

_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers