Steve Severance wrote:
>> -----Original Message-----
>> From: Dennis Kubes [mailto:[EMAIL PROTECTED]
>> Sent: Friday, March 09, 2007 9:47 AM
>> To: nutch-dev@lucene.apache.org
>> Subject: Re: How to read data from segments
>>
>> Steve Severance wrote:
>>> I am trying to learn the internals of Nutch, and by extension Hadoop,
>>> right now. I am implementing an algorithm that processes link and
>>> content data. I am stuck on how to open the ParseDatas contained in
>>> the segments. Each subdir of a segment (crawl_generate, etc...)
>>> contains a subdir part-00000, which, if I understand correctly, means
>>> that if I had more computers as part of a Hadoop cluster there would
>>> also be part-00001 and so on.
>>
>> There is one directory for each split. One interesting thing to note
>> is that multiple writers (i.e. map and reduce tasks) can't write to
>> the same file on the DFS at the same time. So each reduce task writes
>> out its own split to its own directory.
>
> Does this mean that there might be some parts that are Map outputs and
> others that are Reduce outputs?
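The one-split-per-reduce-task convention above can be made concrete: a job with N reduce tasks produces N outputs, conventionally named with a zero-padded five-digit partition index. A minimal sketch in plain Java, with no Hadoop dependency; `partitionFileName` is a hypothetical helper, not a Hadoop API:

```java
public class PartNaming {
    // Each reduce task owns one partition; reducer i writes its split
    // under a name like part-00000, part-00001, ... so writers never
    // contend for the same file on the DFS.
    static String partitionFileName(int partition) {
        return String.format("part-%05d", partition);
    }

    public static void main(String[] args) {
        // A job with three reduce tasks yields three splits
        // in each output directory.
        for (int i = 0; i < 3; i++) {
            System.out.println(partitionFileName(i));
        }
    }
}
```

This is why a single-machine setup only ever shows part-00000: there is only one reduce task, hence only one split.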
Sorry, no, I should have been more specific. The results such as part-xxx are reduce results only.

>>> When I try to open them with an ArrayFile.Reader it cannot find the
>>> file. I know that the Path class is working properly since it can
>>> enumerate subdirectories. I tried hard-coding the part-00000 into the
>>> path, but that did not work either.
>>>
>>> The code is as follows:
>>>
>>> Path segmentDir = new Path(args[0]);
>>> Path pageRankDir = new Path(args[1]);
>>>
>>> Path segmentPath = new Path(segmentDir, "parse_data/part-00000");
>>> ArrayFile.Reader parses = null;
>>> try {
>>>   parses = new ArrayFile.Reader(segmentPath.getFileSystem(config),
>>>       segmentPath.toString(), config);
>>> } catch (IOException ex) {
>>>   System.out.println("An error occurred while opening the segment."
>>>       + " Message: " + ex.getMessage());
>>> }
>>>
>>> The exception reports that it cannot open the file. I also tried
>>> merging the segments, but that did not work either. Any help would be
>>> greatly appreciated.
>>
>> Just like Andrzej said. It is in the output formats and they have
>> getReaders and getEntry methods. I have a little tool that is a
>> MapFileExplorer; if you want it let me know and I will send you a copy.
>
> Yes, that would be great if you are willing to share it. I was already
> thinking about writing something similar.

Will do.

>>> One more thing. As a new Nutch developer I am keeping a running list
>>> of problems/questions that I have and their solutions. A lot of
>>> questions arise from not understanding how to work with the internals,
>>> specifically understanding the building blocks of Hadoop such as file
>>> types and why there are custom types that Hadoop uses, e.g. why Text
>>> instead of String. I noticed that in a mailing list post earlier this
>>> year the lack of detailed information for new developers was cited as
>>> a barrier to more involvement.
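On the "why Text instead of String" question: one common reason given for Hadoop's Writable types is that they are mutable, so a single instance can be reused across millions of records instead of allocating a new immutable String per record during (de)serialization. The sketch below illustrates only that reuse idea; it is NOT Hadoop's actual Text implementation, and the class name is made up for illustration:

```java
import java.nio.charset.StandardCharsets;

// Conceptual sketch: a mutable, reusable UTF-8 text holder.
// Hadoop's real Text does much more (Writable serialization,
// comparators, etc.); this only shows the buffer-reuse idea.
public class MutableText {
    private byte[] bytes = new byte[0];
    private int length = 0;

    // Reuse the backing buffer when it is large enough,
    // instead of allocating a fresh object per record.
    public void set(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        if (utf8.length > bytes.length) {
            bytes = new byte[utf8.length];
        }
        System.arraycopy(utf8, 0, bytes, 0, utf8.length);
        length = utf8.length;
    }

    public int getLength() { return length; }

    @Override
    public String toString() {
        return new String(bytes, 0, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        MutableText t = new MutableText();
        t.set("hello");
        System.out.println(t); // hello
        t.set("hi");           // same object, buffer reused
        System.out.println(t); // hi
    }
}
```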
>>> I would be happy to contribute this back to the wiki if there is
>>> interest.
>>
>> Absolutely. The more documentation we have, especially for new
>> developers, the better. If you need any questions answered in doing
>> this, give me a shout and I will help as much as I can.
>
> What is the best way to proceed with this? Should I make a new wiki
> page? Here is what I am thinking: have an overview of Nutch and Hadoop.
> This will include code samples of basic tasks like getting data. And by
> overview I mean a detailed overview, so that someone without
> distributed computing or search experience will be able to understand.
> It will not include IR basics, as those are fairly well documented
> elsewhere. The Hadoop one might want to live on its own wiki. I am also
> going to write up my implementation of PageRank as a tutorial, since it
> will cover, I think, a lot of Hadoop and Nutch basics, including Hadoop
> types, using Hadoop files, and MapReduce.

Yes, the wiki is the best place for this currently. I think a detailed
overview would be great.

Dennis Kubes

>> Dennis Kubes
>>> Regards,
>>>
>>> Steve
>
> Steve

_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
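For readers waiting on the PageRank tutorial mentioned above, the core computation can be sketched in-memory first. A MapReduce version distributes the same per-iteration logic: map emits rank/outdegree to each outlink, reduce sums the contributions and applies the damping factor. This is an illustrative sketch, not Steve's implementation; the graph, damping factor, and iteration count are arbitrary, and dangling nodes are simply skipped:

```java
import java.util.*;

// In-memory PageRank power iteration. A Hadoop job splits each
// iteration into map (emit rank share to each outlink) and
// reduce (sum shares, apply damping).
public class PageRankSketch {
    public static Map<String, Double> pageRank(Map<String, List<String>> links,
                                               double damping, int iterations) {
        int n = links.size();
        Map<String, Double> rank = new HashMap<>();
        for (String page : links.keySet()) rank.put(page, 1.0 / n);

        for (int it = 0; it < iterations; it++) {
            // Every page starts each round with the teleport term.
            Map<String, Double> next = new HashMap<>();
            for (String page : links.keySet()) next.put(page, (1 - damping) / n);
            for (Map.Entry<String, List<String>> e : links.entrySet()) {
                List<String> outs = e.getValue();
                if (outs.isEmpty()) continue; // dangling node: ignored in this sketch
                double share = rank.get(e.getKey()) / outs.size();
                for (String target : outs) {
                    next.merge(target, damping * share, Double::sum);
                }
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = new HashMap<>();
        links.put("a", Arrays.asList("b", "c"));
        links.put("b", Arrays.asList("c"));
        links.put("c", Arrays.asList("a"));
        System.out.println(pageRank(links, 0.85, 30));
    }
}
```

Since every page in this example graph has outlinks, rank mass is conserved and the scores sum to 1.0 after each iteration.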