Steve Severance wrote:
> I am trying to learn the internals of Nutch and by extension Hadoop right
> now. I am implementing an algorithm that processes link and content data. I
> am stuck on how to open the ParseDatas contained in the segments. Each
> subdir of a segment (crawl_generate, etc...) contains a subdir part-00000,
> which, if I understand correctly, means that if I had more computers as part
> of a Hadoop cluster there would also be part-00001 and so on.
>

Correct.

> When I try to open them with an ArrayFile.Reader it cannot find the file. I
> know that the Path class is working properly since it can enumerate sub
> directories. I tried hard coding the part-00000 into the path but that did
> not work either.
>
> The code is as follows:
>
> Path segmentDir = new Path(args[0]);
> Path pageRankDir = new Path(args[1]);
>

Ah-ha, pageRankDir .. ;)

>
> Path segmentPath = new Path(segmentDir, "parse_data/part-00000");
>

Please take a look at the classes MapFileOutputFormat and
SequenceFileOutputFormat. Both support this nested dir structure, which is a
by-product of producing the data via map-reduce, and both offer methods for
getting a MapFile.Reader[] or SequenceFile.Reader[], and then getting a
selected entry. Cf. also the code attached to the HADOOP-175 issue in JIRA.

> One more thing. As a new nutch developer I am keeping a running list of
> problems/questions that I have and their solutions. A lot of questions arise
> from not understanding how to work with the internals, specifically
> understanding the building blocks of Hadoop such as filetypes and why there
> are custom types that Hadoop uses, e.g. why Text instead of String. I
> noticed that in a mailing list post earlier this year the lack of detailed
> information for new developers was cited as a barrier to more involvement. I
> would be happy to contribute this back to the wiki if there is interest.
>

Definitely, you are welcome to contribute in this area - this is always
needed. Although this particular information might be more suitable for the
Hadoop wiki ...

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
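For readers following the thread, here is a minimal sketch of the nested layout discussed above: each segment subdirectory (for example parse_data) holds one part-NNNNN entry per reduce task. It uses only the JDK on a local filesystem and is an illustration of the directory structure, not the Hadoop API. The class name SegmentParts and the method listParts are hypothetical, and java.nio.file.Path here is not Hadoop's own Path class. To actually read the records, use the reader methods on MapFileOutputFormat or SequenceFileOutputFormat mentioned in the reply.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, for illustration only (local filesystem, no Hadoop).
class SegmentParts {

    /**
     * Returns every part-* entry under segmentDir/subdir, sorted by name,
     * e.g. parse_data/part-00000, parse_data/part-00001, ...
     */
    static List<Path> listParts(Path segmentDir, String subdir) throws IOException {
        Path dir = segmentDir.resolve(subdir);
        List<Path> parts = new ArrayList<>();
        // The glob is matched against each entry's file name, so anything
        // that is not named part-* is skipped.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "part-*")) {
            for (Path p : stream) {
                parts.add(p);
            }
        }
        Collections.sort(parts);
        return parts;
    }
}
```

Calling SegmentParts.listParts(segmentDir, "parse_data") on a segment produced by a single-node run would be expected to return just the part-00000 entry. On a larger cluster there would be one entry per part, which is why the Hadoop helpers return an array of readers rather than a single one.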