Steve Severance wrote:
>> -----Original Message-----
>> From: Dennis Kubes [mailto:[EMAIL PROTECTED]
>> Sent: Friday, March 09, 2007 9:47 AM
>> To: nutch-dev@lucene.apache.org
>> Subject: Re: How to read data from segments
>>
>> Steve Severance wrote:
>>> I am trying to learn the internals of Nutch, and by extension Hadoop,
>>> right now. I am implementing an algorithm that processes link and
>>> content data. I am stuck on how to open the ParseDatas contained in
>>> the segments. Each subdir of a segment (crawl_generate, etc...)
>>> contains a subdir part-00000, which, if I understand correctly, means
>>> that if I had more computers as part of a Hadoop cluster there would
>>> also be part-00001 and so on.
>>
>> There is one directory for each split. One interesting thing to note
>> is that multiple writers (i.e. map and reduce tasks) can't write to
>> the same file on the DFS at the same time. So each reduce task writes
>> out its own split to its own directory.
>
> Does this mean that there might be some parts that are Map outputs and
> others that are Reduce outputs?
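The one-split-per-reduce-task convention above can be made concrete: a job with N reduce tasks produces N outputs, conventionally named with a zero-padded five-digit partition index. A minimal sketch in plain Java, with no Hadoop dependency; `partitionFileName` is a hypothetical helper, not a Hadoop API:

```java
public class PartNaming {
    // Each reduce task owns one partition; reducer i writes its split
    // under a name like part-00000, part-00001, ... so writers never
    // contend for the same file on the DFS.
    static String partitionFileName(int partition) {
        return String.format("part-%05d", partition);
    }

    public static void main(String[] args) {
        // A job with three reduce tasks yields three splits
        // in each output directory.
        for (int i = 0; i < 3; i++) {
            System.out.println(partitionFileName(i));
        }
    }
}
```

This is why a single-machine setup only ever shows part-00000: there is only one reduce task, hence only one split.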
Sorry, no, I should have been more specific. The results such as part-xxx are reduce results only.

>>> When I try to open them with an ArrayFile.Reader it cannot find the
>>> file. I know that the Path class is working properly since it can
>>> enumerate subdirectories. I tried hard-coding the part-00000 into the
>>> path, but that did not work either.
>>>
>>> The code is as follows:
>>>
>>> Path segmentDir = new Path(args[0]);
>>> Path pageRankDir = new Path(args[1]);
>>>
>>> Path segmentPath = new Path(segmentDir, "parse_data/part-00000");
>>> ArrayFile.Reader parses = null;
>>> try {
>>>   parses = new ArrayFile.Reader(segmentPath.getFileSystem(config),
>>>       segmentPath.toString(), config);
>>> } catch (IOException ex) {
>>>   System.out.println("An error occurred while opening the segment."
>>>       + " Message: " + ex.getMessage());
>>> }
>>>
>>> The exception reports that it cannot open the file. I also tried
>>> merging the segments, but that did not work either. Any help would be
>>> greatly appreciated.
>>
>> Just like Andrzej said. It is in the output formats and they have
>> getReaders and getEntry methods. I have a little tool that is a
>> MapFileExplorer; if you want it let me know and I will send you a copy.
>
> Yes, that would be great if you are willing to share it. I was already
> thinking about writing something similar.

Will do.

>>> One more thing. As a new Nutch developer I am keeping a running list
>>> of problems/questions that I have and their solutions. A lot of
>>> questions arise from not understanding how to work with the internals,
>>> specifically understanding the building blocks of Hadoop such as file
>>> types and why there are custom types that Hadoop uses, e.g. why Text
>>> instead of String. I noticed that in a mailing list post earlier this
>>> year the lack of detailed information for new developers was cited as
>>> a barrier to more involvement.
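On the "why Text instead of String" question: one common reason given for Hadoop's Writable types is that they are mutable, so a single instance can be reused across millions of records instead of allocating a new immutable String per record during (de)serialization. The sketch below illustrates only that reuse idea; it is NOT Hadoop's actual Text implementation, and the class name is made up for illustration:

```java
import java.nio.charset.StandardCharsets;

// Conceptual sketch: a mutable, reusable UTF-8 text holder.
// Hadoop's real Text does much more (Writable serialization,
// comparators, etc.); this only shows the buffer-reuse idea.
public class MutableText {
    private byte[] bytes = new byte[0];
    private int length = 0;

    // Reuse the backing buffer when it is large enough,
    // instead of allocating a fresh object per record.
    public void set(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        if (utf8.length > bytes.length) {
            bytes = new byte[utf8.length];
        }
        System.arraycopy(utf8, 0, bytes, 0, utf8.length);
        length = utf8.length;
    }

    public int getLength() { return length; }

    @Override
    public String toString() {
        return new String(bytes, 0, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        MutableText t = new MutableText();
        t.set("hello");
        System.out.println(t); // hello
        t.set("hi");           // same object, buffer reused
        System.out.println(t); // hi
    }
}
```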
>>> I would be happy to contribute this back to the wiki if there is
>>> interest.
>>
>> Absolutely. The more documentation we have, especially for new
>> developers, the better. If you need any questions answered in doing
>> this, give me a shout and I will help as much as I can.
>
> What is the best way to proceed with this? Should I make a new wiki
> page? Here is what I am thinking: have an overview of Nutch and Hadoop.
> This will include code samples of basic tasks like getting data. And by
> overview I mean a detailed overview, so that someone without
> distributed computing or search experience will be able to understand.
> It will not include IR basics, as those are fairly well documented
> elsewhere. The Hadoop one might want to live on its own wiki. I am also
> going to write up my implementation of PageRank as a tutorial, since it
> will cover, I think, a lot of Hadoop and Nutch basics, including Hadoop
> types, using Hadoop files, and MapReduce.

Yes, the wiki is the best place for this currently. I think a detailed
overview would be great.

Dennis Kubes

>> Dennis Kubes
>>> Regards,
>>>
>>> Steve
>
> Steve

_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
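For readers waiting on the PageRank tutorial mentioned above, the core computation can be sketched in-memory first. A MapReduce version distributes the same per-iteration logic: map emits rank/outdegree to each outlink, reduce sums the contributions and applies the damping factor. This is an illustrative sketch, not Steve's implementation; the graph, damping factor, and iteration count are arbitrary, and dangling nodes are simply skipped:

```java
import java.util.*;

// In-memory PageRank power iteration. A Hadoop job splits each
// iteration into map (emit rank share to each outlink) and
// reduce (sum shares, apply damping).
public class PageRankSketch {
    public static Map<String, Double> pageRank(Map<String, List<String>> links,
                                               double damping, int iterations) {
        int n = links.size();
        Map<String, Double> rank = new HashMap<>();
        for (String page : links.keySet()) rank.put(page, 1.0 / n);

        for (int it = 0; it < iterations; it++) {
            // Every page starts each round with the teleport term.
            Map<String, Double> next = new HashMap<>();
            for (String page : links.keySet()) next.put(page, (1 - damping) / n);
            for (Map.Entry<String, List<String>> e : links.entrySet()) {
                List<String> outs = e.getValue();
                if (outs.isEmpty()) continue; // dangling node: ignored in this sketch
                double share = rank.get(e.getKey()) / outs.size();
                for (String target : outs) {
                    next.merge(target, damping * share, Double::sum);
                }
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = new HashMap<>();
        links.put("a", Arrays.asList("b", "c"));
        links.put("b", Arrays.asList("c"));
        links.put("c", Arrays.asList("a"));
        System.out.println(pageRank(links, 0.85, 30));
    }
}
```

Since every page in this example graph has outlinks, rank mass is conserved and the scores sum to 1.0 after each iteration.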