Steve Severance wrote:
> I am trying to learn the internals of Nutch and by extension Hadoop right
> now. I am implementing an algorithm that processes link and content data. I
> am stuck on how to open the ParseData objects contained in the segments. Each
> subdirectory of a segment (crawl_generate, etc.) contains a part-00000,
> which, if I understand correctly, would be joined by part-00001 and so on if
> I had more computers as part of a Hadoop cluster.
There is one directory for each split. One interesting thing to note is that
multiple writers (i.e. map and reduce tasks) can't write to the same file on
the DFS at the same time, so each reduce task writes its own split to its own
directory.

> When I try to open them with an ArrayFile.Reader it cannot find the file. I
> know that the Path class is working properly since it can enumerate
> subdirectories. I tried hard-coding the part-00000 into the path but that
> did not work either.
>
> The code is as follows:
>
>     Path segmentDir = new Path(args[0]);
>     Path pageRankDir = new Path(args[1]);
>
>     Path segmentPath = new Path(segmentDir, "parse_data/part-00000");
>     ArrayFile.Reader parses = null;
>     try {
>         parses = new ArrayFile.Reader(segmentPath.getFileSystem(config),
>             segmentPath.toString(), config);
>     } catch (IOException ex) {
>         System.out.println("An error occurred while opening the segment."
>             + " Message: " + ex.getMessage());
>     }
>
> The exception reports that it cannot open the file. I also tried merging the
> segments but that did not work either. Any help would be greatly
> appreciated.

Just like Andrzej said, it is in the output formats, and they have getReaders
and getEntry methods. I have a little tool that is a MapFileExplorer; if you
want it, let me know and I will send you a copy.

> One more thing. As a new Nutch developer I am keeping a running list of
> problems/questions that I have and their solutions. A lot of questions arise
> from not understanding how to work with the internals, specifically
> understanding the building blocks of Hadoop such as file types and why there
> are custom types that Hadoop uses, e.g. why Text instead of String. I
> noticed that in a mailing list post earlier this year the lack of detailed
> information for new developers was cited as a barrier to more involvement. I
> would be happy to contribute this back to the wiki if there is interest.

Absolutely.
The more documentation we have, especially for new developers, the better. If
you need any questions answered in doing this, give me a shout and I will help
as much as I can.

Dennis Kubes

> Regards,
>
> Steve
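For reference, the pointer above to the output formats' getReaders/getEntry methods can be sketched roughly like this. It is only a sketch against the Hadoop 0.x / Nutch 0.8-era APIs (MapFileOutputFormat, HashPartitioner, ParseData); the segment path argument and the example URL key are placeholders, not taken from this thread. The underlying point is that parse_data/part-00000 is a MapFile directory (containing data and index files), not a flat ArrayFile, which is why ArrayFile.Reader cannot open it.

```java
// Sketch: reading ParseData entries through the output-format readers,
// as suggested above.  Class and method names follow the Hadoop 0.x /
// Nutch 0.8-era APIs; the URL key below is a made-up example.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapFileOutputFormat;
import org.apache.hadoop.mapred.lib.HashPartitioner;
import org.apache.nutch.parse.ParseData;

public class ParseDataDump {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // <segment>/parse_data -- getReaders() opens one MapFile.Reader per
    // part-xxxxx directory underneath it, so you never hard-code part-00000.
    Path parseData = new Path(args[0], ParseData.DIR_NAME);
    MapFile.Reader[] readers =
        MapFileOutputFormat.getReaders(fs, parseData, conf);

    // Option 1: look up a single URL.  getEntry() uses the partitioner to
    // pick the same part file the reduce task originally wrote the key to.
    Text key = new Text("http://example.com/");  // placeholder URL
    ParseData value = new ParseData();
    ParseData hit = (ParseData) MapFileOutputFormat.getEntry(
        readers, new HashPartitioner<Text, ParseData>(), key, value);
    if (hit != null) {
      System.out.println(key + ":\n" + hit);
    }

    // Option 2: scan every entry in every part.
    for (MapFile.Reader reader : readers) {
      Text url = new Text();
      ParseData data = new ParseData();
      while (reader.next(url, data)) {
        System.out.println(url);
      }
      reader.close();
    }
  }
}
```

Run against a live segment as e.g. `ParseDataDump crawl/segments/20061019123456` (hypothetical path); the same pattern works for crawl_fetch and crawl_parse by swapping the directory name and value class.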