> -----Original Message-----
> From: Dennis Kubes [mailto:[EMAIL PROTECTED]
> Sent: Friday, March 09, 2007 9:47 AM
> To: nutch-dev@lucene.apache.org
> Subject: Re: How to read data from segments
> 
> 
> 
> Steve Severance wrote:
> > I am trying to learn the internals of Nutch and by extension Hadoop right
> > now. I am implementing an algorithm that processes link and content data.
> > I am stuck on how to open the ParseDatas contained in the segments. Each
> > subdir of a segment (crawl_generate, etc...) contains a subdir part-00000,
> > which, if I understand correctly, means that if I had more computers as
> > part of a Hadoop cluster there would also be part-00001 and so on.
> 
> There is one directory for each split.  One interesting thing to note is
> that multiple writers (i.e. map and reduce tasks) can't write to the same
> file on the DFS at the same time.  So each reduce task writes out its own
> split to its own directory.
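
The one-part-per-split layout described above can be inspected with a few lines of plain Java. This is only a minimal sketch: it assumes the segment has been copied to the local filesystem (it does not go through the Hadoop FileSystem API), and the class name PartLister and its method are hypothetical names made up for illustration.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustration only: lists the part-NNNNN entries under one segment
// subdirectory (for example <segment>/parse_data) on the LOCAL filesystem.
// PartLister is a hypothetical helper, not part of Nutch or Hadoop.
class PartLister {

    // Returns the names of all entries matching "part-*", sorted by name.
    static List<String> listParts(Path segmentSubdir) throws IOException {
        List<String> parts = new ArrayList<>();
        try (DirectoryStream<Path> stream =
                 Files.newDirectoryStream(segmentSubdir, "part-*")) {
            for (Path p : stream) {
                parts.add(p.getFileName().toString());
            }
        }
        Collections.sort(parts);
        return parts;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is assumed to be a path such as <segment>/parse_data
        for (String part : listParts(Paths.get(args[0]))) {
            System.out.println(part);
        }
    }
}
```

With a single reduce task the listing would be expected to contain only part-00000; more reduce tasks would add part-00001 and so on.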

Does this mean that there might be some parts that are Map outputs and
others that are Reduce outputs? 

> > When I try to open them with an ArrayFile.Reader it cannot find the file.
> > I know that the Path class is working properly since it can enumerate
> > subdirectories. I tried hard coding the part-00000 into the path but that
> > did not work either.
> >
> > The code is as follows:
> >
> > Path segmentDir = new Path(args[0]);
> > Path pageRankDir = new Path(args[1]);
> >
> > Path segmentPath = new Path(segmentDir, "parse_data/part-00000");
> > ArrayFile.Reader parses = null;
> > try
> > {
> >     parses = new ArrayFile.Reader(segmentPath.getFileSystem(config),
> >             segmentPath.toString(), config);
> > }
> > catch(IOException ex){
> >     System.out.println("An Error Occurred while opening the segment. Message: "
> >             + ex.getMessage());
> > }
> >
> > The exception reports that it cannot open the file. I also tried merging
> > the segments but that did not work either. Any help would be greatly
> > appreciated.
> 
> Just like Andrzej said.  It is in the outputformats and they have
> getReaders and getEntry methods.  I have a little tool that is a
> MapFileExplorer, if you want it let me know and I will send you a copy.
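
To make the getReaders/getEntry pattern mentioned above more concrete: a keyed lookup has to open every part, because a given key lives in exactly one of them, and which one depends on how keys were partitioned when the data was written. The sketch below shows only that routing step in plain Java. It assumes hash partitioning of the form (hash & Integer.MAX_VALUE) % numParts and uses String.hashCode() as a stand-in, so the part it picks will generally differ from what Hadoop picks for a Text key. PartLookup and its methods are hypothetical names, not Hadoop or Nutch API.

```java
// Illustration only: routes a key to a part-NNNNN name using hash
// partitioning. PartLookup is a hypothetical helper; real lookups should go
// through the output format's own getReaders/getEntry methods.
class PartLookup {

    // Masking with Integer.MAX_VALUE clears the sign bit, so the result of
    // the modulo is always in the range [0, numParts).
    static int partitionFor(Object key, int numParts) {
        return (key.hashCode() & Integer.MAX_VALUE) % numParts;
    }

    // Part names are the partition number zero-padded to five digits.
    static String partName(int partition) {
        return String.format("part-%05d", partition);
    }

    public static void main(String[] args) {
        String url = "http://example.com/";  // example key, assumed
        int numParts = 4;                    // assumed number of reduce tasks
        System.out.println(url + " -> " + partName(partitionFor(url, numParts)));
    }
}
```

This is also one reason why hard coding part-00000 only works when the data was written by a single reduce task.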

Yes, that would be great if you are willing to share it. I was already
thinking about writing something similar.

> >
> > One more thing. As a new Nutch developer I am keeping a running list of
> > problems/questions that I have and their solutions. A lot of questions
> > arise from not understanding how to work with the internals, specifically
> > understanding the building blocks of Hadoop such as filetypes and why
> > there are custom types that Hadoop uses, e.g. why Text instead of String.
> > I noticed that in a mailing list post earlier this year the lack of
> > detailed information for new developers was cited as a barrier to more
> > involvement. I would be happy to contribute this back to the wiki if there
> > is interest.
> 
> Absolutely.  The more documentation we have, especially for new
> developers, the better.  If you need any questions answered in doing
> this, give me a shout and I will help as much as I can.

What is the best way to proceed with this? Should I make a new wiki page?
Here is what I am thinking:
Have an overview of Nutch and Hadoop. This will include code samples of
basic tasks like getting data. And by overview I mean a detailed overview, so
that someone without distributed computing or search experience will be able
to understand it. It will not include IR basics, as those are fairly well
documented elsewhere. The Hadoop one might want to live on its own wiki. I
am also going to write up my implementation of PageRank as a tutorial, since
I think it will cover a lot of Hadoop and Nutch basics, including Hadoop
types, using Hadoop files, and MapReduce.

> 
> Dennis Kubes
> >
> > Regards,
> >
> > Steve
> >

Steve


