Thanks Enis,
I am getting some idea from that.
Can you tell me in which class I should implement that?
I don't have Hadoop installed on my box.
Awaiting your reply,
Cha
Enis Soztutar wrote:
>
> the crawldb is a serialization of Hadoop's
> org.apache.hadoop.io.MapFile structure. A MapFile contains two
> SequenceFiles, one for data and one for the index. This is an excerpt
> from the javadoc of the MapFile class:
>
> * A file-based map from keys to values.
> *
> * <p>A map is a directory containing two files, the <code>data</code> file,
> * containing all keys and values in the map, and a smaller <code>index</code>
> * file, containing a fraction of the keys. The fraction is determined by
> * {@link Writer#getIndexInterval()}.
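>
> In Nutch terms, a crawldb on disk therefore typically looks something
> like this (part-00000 is just the usual default part name):
>
>     crawl/crawldb/current/part-00000/
>         data     (SequenceFile holding all the <url, CrawlDatum> entries)
>         index    (smaller SequenceFile holding a fraction of the keys)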
>
> The MapFile.Reader class is for reading the contents of a map file. By
> using this class you can enumerate all the entries of the map file, and
> since the keys of the crawldb are Text objects containing URLs, you can
> just dump the keys one by one to another file. Try the following:
>
>
> import java.io.PrintStream;
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.io.MapFile;
> import org.apache.hadoop.io.Writable;
> import org.apache.hadoop.io.WritableComparable;
>
> Configuration conf = new Configuration();
> FileSystem fs = FileSystem.get(conf);
> PrintStream out = System.out;
>
> // point this at one part directory of the crawldb, e.g.:
> String seqFile = "crawl/crawldb/current/part-00000";
>
> MapFile.Reader reader = new MapFile.Reader(fs, seqFile, conf);
>
> Class keyC = reader.getKeyClass();
> Class valueC = reader.getValueClass();
>
> while (true) {
>   WritableComparable key = null;
>   Writable value = null;
>   try {
>     // instantiate a fresh key (Text) and value (CrawlDatum) to read into
>     key = (WritableComparable) keyC.newInstance();
>     value = (Writable) valueC.newInstance();
>   } catch (Exception ex) {
>     ex.printStackTrace();
>     System.exit(-1);
>   }
>
>   try {
>     if (!reader.next(key, value)) {
>       break; // reached the end of the map file
>     }
>
>     out.println(key);
>     out.println(value);
>   } catch (Exception e) {
>     e.printStackTrace();
>     out.println("Exception occurred. " + e);
>     break;
>   }
> }
>
> reader.close();
>
> This code is just for demonstration; of course you can customize it for
> your needs, for example to print in XML format. You can check the
> javadocs of the CrawlDatum, CrawlDb, Text, MapFile and SequenceFile
> classes for further insight.
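>
> For instance, here is a minimal sketch of XML output: replace the two
> println calls inside the loop above with something like the following
> (the element names are just an illustration, and a real XML writer
> should also escape characters such as & and <):
>
> out.println("<record>");
> out.println("  <url>" + key + "</url>");
> out.println("  <datum><![CDATA[" + value + "]]></datum>");
> out.println("</record>");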
>
>
> cha wrote:
>> Hi Enis,
>>
>> I still can't figure out how it can be done. Can you explain it in
>> more detail, please?
>>
>> Regards,
>> Chandresh
>>
>> Enis Soztutar wrote:
>>
>>> cha wrote:
>>>
>>>> Hi Sagar,
>>>>
>>>> Thanks for the reply.
>>>>
>>>> Actually, I am trying to dig into the code in the same class, but I
>>>> am not able to figure out where the URLs are read from.
>>>>
>>>> When you dump the database, the file contains:
>>>>
>>>> http://blog.cha.com/ Version: 4
>>>> Status: 2 (DB_fetched)
>>>> Fetch time: Fri Apr 13 15:58:28 IST 2007
>>>> Modified time: Thu Jan 01 05:30:00 IST 1970
>>>> Retries since fetch: 0
>>>> Retry interval: 30.0 days
>>>> Score: 0.062367838
>>>> Signature: 2b4e94ff83b8a4aa6ed061f607683d2e
>>>> Metadata: null
>>>>
>>>> I figured out the rest of the fields, but I am not sure how the URL
>>>> name is read.
>>>>
>>>> I just want the plain URLs in the text file. Is it possible to write
>>>> the URLs in some XML format? If yes, then how?
>>>>
>>>> Awaiting your reply,
>>>>
>>>> Chandresh
>>>>
>>>>
>>>>
>>> Hi, the crawldb is actually a map file, which has URLs as keys (Text
>>> class) and CrawlDatum objects as values. You can write a generic map
>>> file reader which extracts the keys and dumps them to a file.